Unit 1 - Notes
INT323
Unit 1: Digital Data and Business Intelligence (BI)
1. Introduction to Data
Data is a collection of distinct pieces of information, usually formatted in a specific way for storage or processing. In the context of computing and business, data consists of raw facts, figures, statistics, and qualitative or quantitative measures that have not yet been processed to reveal meaning.
The DIKW Pyramid
To understand the value of data, it is essential to distinguish it within the Data-Information-Knowledge-Wisdom (DIKW) hierarchy:
- Data: Raw, unorganized facts (e.g., `100,Red,True`).
- Information: Data processed, organized, structured, or presented in a given context so as to make it useful (e.g., `100 units of Red paint sold`).
- Knowledge: Information combined with experience, context, interpretation, and reflection (e.g., `Red paint sells best in Q3`).
- Wisdom: The ability to utilize knowledge to make sound decisions (e.g., `Increase inventory of Red paint in July`).
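The first steps of the hierarchy can be sketched in code. The following is a minimal illustration only: the records, the `sold` flag, and the function names `units_sold_by_color` and `best_seller` are hypothetical, invented to mirror the Red paint example. It turns raw tuples (data) into per-color totals (information) and then reads off a simple observation that a person could combine with experience to form knowledge.

```python
from collections import defaultdict

# Hypothetical raw records: (units, color, sold_flag). Values are illustrative only.
raw_data = [
    (100, "Red", True),
    (40, "Blue", True),
    (25, "Red", False),
    (60, "Red", True),
]


def units_sold_by_color(records):
    """Data -> Information: total units per color, counting only completed sales."""
    totals = defaultdict(int)
    for units, color, sold in records:
        if sold:
            totals[color] += units
    return dict(totals)


def best_seller(totals):
    """Information -> a simple observation: the color with the most units sold."""
    return max(totals, key=totals.get)


information = units_sold_by_color(raw_data)
print(information)
print(best_seller(information))
```

Wisdom, the decision to stock more Red paint in July, is deliberately left out of the code: it depends on judgment and context that the records alone do not contain.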
Digital Data
Digital data represents information using the binary number system (0s and 1s). It is discrete and discontinuous. In the modern era, digital data is the fuel for Business Intelligence and ETL (Extract, Transform, Load) processes managed by tools like Informatica.
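To make the binary representation concrete, here is a small sketch that shows the bits behind a short piece of text. The helper names `to_bits` and `from_bits` are hypothetical, and UTF-8 is assumed as the text encoding.

```python
def to_bits(text, encoding="utf-8"):
    """Return one 8-character bit string per byte of the encoded text."""
    return [format(byte, "08b") for byte in text.encode(encoding)]


def from_bits(bits, encoding="utf-8"):
    """Rebuild the original text from a list of 8-character bit strings."""
    return bytes(int(b, 2) for b in bits).decode(encoding)


bits = to_bits("BI")
print(bits)
print(from_bits(bits))
```

Each character becomes a fixed number of discrete 0/1 values, which is what "discrete and discontinuous" means in practice.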
2. Types of Data
Data is categorized based on how it is organized and stored.
A. Structured Data
Data that resides in a fixed field within a record or file. This is the traditional data type found in relational databases (RDBMS).
- Characteristics: Highly organized, clearly defined data types (integer, varchar, date), easily searchable by algorithms.
- Storage: Relational Databases (Oracle, SQL Server, MySQL), Spreadsheets.
- Access: Managed using Structured Query Language (SQL).
- Examples:
  - Customer names and addresses.
  - Bank transaction logs.
  - Inventory lists.
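As an illustration of fixed fields, declared data types, and SQL access, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `inventory` table, its columns, and the sample rows are hypothetical; a production system would more likely use one of the RDBMS products listed above.

```python
import sqlite3


def build_inventory_db():
    """Create an in-memory table with a fixed schema and a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " item_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " quantity INTEGER NOT NULL,"
        " restocked_on TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO inventory (name, quantity, restocked_on) VALUES (?, ?, ?)",
        [
            ("Red paint", 100, "2024-01-15"),
            ("Blue paint", 40, "2024-02-01"),
            ("Brush", 5, "2024-02-10"),
        ],
    )
    conn.commit()
    return conn


def low_stock(conn, threshold):
    """Return (name, quantity) for items below the threshold, lowest first."""
    rows = conn.execute(
        "SELECT name, quantity FROM inventory WHERE quantity < ? ORDER BY quantity",
        (threshold,),
    )
    return rows.fetchall()


conn = build_inventory_db()
print(low_stock(conn, 50))
```

Because every row has the same columns and types, the query can filter and sort without inspecting each record's shape first.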
B. Semi-Structured Data
Data that does not reside in a relational database but has some organizational properties that make it easier to analyze. It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields.
- Characteristics: Self-describing, lacks a rigid schema, contains metadata tags.
- Storage: NoSQL databases (MongoDB), File systems.
- Examples:
  - XML (eXtensible Markup Language): Used for data interchange.
  - JSON (JavaScript Object Notation): Common in web APIs.
  - HTML: Web page structures.
  - Emails: (Header contains structured metadata like sender/date; body is unstructured).
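The "self-describing, no rigid schema" property can be seen in a short JSON example. In the sketch below, the order records and their field names are hypothetical; note that the second record lacks an email and adds a `coupon` key, which a fixed table schema would not allow without changes. The function `flatten_orders` (also a hypothetical name) shows the kind of flattening an ETL step might perform before loading such data into tables.

```python
import json

# Hypothetical API payload: the two records do not share exactly the same keys.
orders_json = """
[
  {"id": 1,
   "customer": {"name": "Asha", "email": "asha@example.com"},
   "items": [{"sku": "RED-1", "qty": 2}]},
  {"id": 2,
   "customer": {"name": "Ben"},
   "items": [{"sku": "BLU-1", "qty": 1}, {"sku": "RED-1", "qty": 3}],
   "coupon": "SAVE10"}
]
"""


def flatten_orders(text):
    """Flatten nested, variably shaped records into fixed-column rows."""
    rows = []
    for order in json.loads(text):
        customer = order.get("customer", {})
        for item in order.get("items", []):
            rows.append({
                "order_id": order["id"],
                "customer_name": customer.get("name"),
                "customer_email": customer.get("email"),  # None when absent
                "sku": item["sku"],
                "qty": item["qty"],
                "coupon": order.get("coupon"),  # None when absent
            })
    return rows


for row in flatten_orders(orders_json):
    print(row)
```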
C. Unstructured Data
Data that has no predefined data model and is not organized in a predefined manner. It is commonly estimated to account for roughly 80-90% of enterprise data.
- Characteristics: Text-heavy, rich media, difficult for traditional programs to digest without conversion/parsing.
- Storage: Data Lakes, Blob storage, Hadoop Distributed File System (HDFS).
- Examples:
  - Text documents (Word, PDF).
  - Multimedia (Images, Audio, Video).
  - Social media feeds (Tweets, Facebook posts).
  - Server logs and sensor data (IoT).
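Unstructured text has no fields to query, so some parsing has to happen before it can be analyzed. The sketch below is a deliberately simple stand-in for text mining: it tokenizes free text with a regular expression and counts the most frequent terms. The review text, the stop-word list, and the function name `top_terms` are all hypothetical; real NLP pipelines do considerably more.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; purely illustrative.
STOP_WORDS = {"the", "a", "is", "was", "and", "but", "to", "it", "i", "of"}


def top_terms(text, n=3):
    """Return the n most frequent non-stop-words as (term, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


review = "The delivery was late. Late delivery again, and the paint was damaged."
print(top_terms(review, 2))
```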
| Feature | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Rigid/Fixed | Flexible/Dynamic | None |
| Format | Tables (Rows/Cols) | XML, JSON, Trees | Binary, Text, Media |
| Querying | SQL (Easy) | XQuery, JSONPath | NLP, Text Mining (Complex) |
| Scalability | Vertical (expensive) | Horizontal | Horizontal |
3. Limitations of Databases in the Real World
While Relational Database Management Systems (RDBMS) are excellent for transaction processing, they face significant limitations in the modern "Big Data" era:
- Scalability Issues: Traditional databases usually scale vertically (adding more CPU/RAM to a single server). They struggle to scale horizontally (distributing data across multiple cheap servers) compared to NoSQL or Data Lake solutions.
- Schema Rigidity: Changing the schema of a massive RDBMS (e.g., adding a column to a table with 1 billion rows) is slow, resource-intensive, and can cause downtime.
- Handling Unstructured Data: RDBMS are designed for structured data. Storing video or large text blobs reduces performance and increases cost.
- Velocity/Latency: Real-time ingestion of millions of data points per second (e.g., IoT sensors) can overwhelm the locking and logging mechanisms that standard databases use to enforce ACID properties.
- Cost: Enterprise-grade RDBMS licenses (like Oracle or DB2) can be prohibitively expensive for storing petabytes of historical archival data.
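The horizontal scaling mentioned in the first bullet usually relies on partitioning (sharding) records across servers. The sketch below shows one common approach, hash-based partitioning, under simplifying assumptions: the shard count is fixed, the sensor keys are hypothetical, and `shard_for` and `partition` are illustrative names rather than any product's API.

```python
import hashlib
from collections import defaultdict


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash.

    SHA-256 is used instead of Python's built-in hash() because hash()
    is randomized per process for strings and would not be repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = defaultdict(list)
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return dict(shards)


sensor_keys = [f"sensor-{i}" for i in range(100)]
for shard, members in sorted(partition(sensor_keys, 4).items()):
    print(shard, len(members))
```

A known trade-off of this simple modulo scheme is that changing the number of shards moves most keys, which is why real systems often use consistent hashing or range partitioning instead.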
4. Introduction to OLTP and OLAP
Understanding the distinction between transactional and analytical processing is crucial for Informatica workflows, which typically move data from OLTP to OLAP.
OLTP (Online Transaction Processing)
- Focus: Managing day-to-day operational data.
- Goal: Efficiency, data integrity, and fast processing of atomic transactions.
- Data Source: Original source of data.
- Design: Normalized (3NF) to reduce redundancy.
- Operations: Heavy `INSERT`, `UPDATE`, `DELETE`.
- Example: An ATM withdrawal, an e-commerce checkout.
OLAP (Online Analytical Processing)
- Focus: Analysis, reporting, and planning.
- Goal: Fast response time for complex queries and aggregation of historical data.
- Data Source: Data Warehouse (data comes from OLTP).
- Design: Dimensional (Star Schema or Snowflake Schema), typically denormalized to improve read performance; a Snowflake Schema keeps its dimension tables normalized.
- Operations: Heavy `SELECT` (read-only for users).
- Example: Generating a "Sales per Region per Quarter" report.
Comparison Summary
| Parameter | OLTP | OLAP |
|---|---|---|
| User | Front-line workers, Clients | Data Analysts, Executives |
| Function | Day-to-day operations | Decision support |
| Data Volume | Gigabytes | Terabytes to Petabytes |
| Data History | Current data only | Historical and current data |
| Metric | Transactions per second (TPS) | Query response time |
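The two workloads can be contrasted in a few lines. The sketch below uses Python's built-in `sqlite3` purely for illustration; in practice OLTP and OLAP run on separate systems, and the `accounts` and `sales` tables, their rows, and the function names are hypothetical. `withdraw` is an OLTP-style operation (a short, atomic write), while `sales_per_region_quarter` is an OLAP-style operation (a read-only aggregation).

```python
import sqlite3


def make_db():
    """Create an in-memory database holding both illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 100)])
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("West", "Q1", 120), ("West", "Q1", 80), ("West", "Q2", 50), ("East", "Q1", 200)],
    )
    conn.commit()
    return conn


def withdraw(conn, account_id, amount):
    """OLTP-style: one small atomic update; returns True if it was applied."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient funds or unknown account")
        return True
    except ValueError:
        return False


def sales_per_region_quarter(conn):
    """OLAP-style: read-only aggregation over many rows."""
    return conn.execute(
        "SELECT region, quarter, SUM(amount) FROM sales "
        "GROUP BY region, quarter ORDER BY region, quarter"
    ).fetchall()


conn = make_db()
print(withdraw(conn, 1, 200))
print(sales_per_region_quarter(conn))
```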
5. Introduction to Business Intelligence (BI)
Business Intelligence (BI) is a technology-driven process for analyzing data and delivering actionable information that helps executives, managers, and workers make informed business decisions.
- Core Concept: Turning data into insights.
- Informatica's Role: Informatica acts as the data integration backbone, extracting data from various sources, cleaning it, and loading it into a BI-ready Data Warehouse.
Key Components of BI Architecture
- Data Sources: Operational databases, CRMs, flat files.
- ETL (Extract, Transform, Load): Tools like Informatica PowerCenter that move and clean data.
- Data Warehouse/Mart: The central repository for organized historical data.
- BI Tools: Software for visualization (Tableau, Power BI, Qlik, Cognos).
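The Extract, Transform, Load flow between these components can be sketched in plain Python. This is not Informatica PowerCenter's interface, which is configured through mappings and workflows rather than written this way; it is only a minimal stand-in to show the three stages. The CSV content, the `fact_sales` table, and the function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source with inconsistent casing, stray spaces,
# and one invalid amount.
RAW_CSV = """order_id,region,amount
1, west ,120.50
2,EAST,200
3,West,not_a_number
4,east,75.25
"""


def extract(text):
    """Extract: read source rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize region names and drop rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows that fail validation
        clean.append((int(row["order_id"]), row["region"].strip().title(), amount))
    return clean


def load(conn, rows):
    """Load: write cleaned rows into the warehouse table; returns rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    with conn:  # commit all inserts together
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    return len(rows)


warehouse = sqlite3.connect(":memory:")
print(load(warehouse, transform(extract(RAW_CSV))))
```

Once loaded, the table is in the consistent shape a BI tool expects, for example for a per-region total.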
6. Evolution of BI: EIS, MIS, and Digital Dashboards
BI has evolved from static paper reports to dynamic, predictive systems.
1. MIS (Management Information Systems)
- Era: 1970s - 1980s.
- Target: Middle Management.
- Function: Produced periodic, standardized reports (e.g., Monthly Sales Report).
- Limitation: Static; if a manager wanted to know why sales dropped, they had to request a new report from IT, which took time.
2. EIS (Executive Information Systems)
- Era: 1980s - 1990s.
- Target: Senior Executives (CEOs, CFOs).
- Function: Provided a summarized view of internal and external information critical to meeting strategic goals.
- Features: "Drill-down" capabilities (clicking a summary number to see details) and graphical user interfaces.
3. Digital Dashboards
- Era: 2000s - Present.
- Target: All levels of the organization.
- Function: A visual interface that provides at-a-glance views of Key Performance Indicators (KPIs) relevant to a particular objective or business process.
- Features: Real-time data updates, interactive visualizations (charts, gauges, heat maps), and mobile accessibility.
7. Need for BI at All Levels
BI is no longer just for the CEO; it is required across the organizational hierarchy to align actions with strategy.
A. Strategic Level (Upper Management/CxO)
- Focus: Long-term goals, market direction, yearly performance.
- BI Need: Unstructured/External data, high-level summaries, trend forecasting.
- Example: "Should we acquire Competitor X?" or "Which markets should we enter in 2025?"
B. Tactical Level (Middle Management)
- Focus: Implementing strategy, resource allocation, weekly/monthly targets.
- BI Need: Comparative analysis, variance analysis (Actual vs. Budget).
- Example: "Why did the Western region miss the sales target this month?"
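Variance analysis (Actual vs. Budget) is simple arithmetic: variance = actual - budget, often also expressed as a percentage of budget. The sketch below assumes hypothetical regional figures and a hypothetical function name `variance_report`.

```python
def variance_report(actual, budget):
    """Return variance and variance percentage per region.

    A negative variance means the region fell short of its budget.
    """
    report = {}
    for region, planned in budget.items():
        achieved = actual.get(region, 0.0)
        variance = achieved - planned
        # Percentage is undefined when nothing was budgeted.
        pct = (variance * 100.0 / planned) if planned else None
        report[region] = {"variance": variance, "variance_pct": pct}
    return report


# Hypothetical monthly figures, for illustration only.
budget = {"West": 1000, "East": 800}
actual = {"West": 850, "East": 880}

for region, figures in variance_report(actual, budget).items():
    print(region, figures)
```

The report only shows that the Western region missed its target; explaining why still requires drilling into the underlying data.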
C. Operational Level (Line Workers/Supervisors)
- Focus: Day-to-day execution, immediate tasks.
- BI Need: Real-time data, specific detailed reports.
- Example: A call center agent seeing a customer's churn risk score in real-time during a call.
8. BI for Past, Present, and Future
Modern Business Intelligence covers the entire temporal spectrum of data analysis.
1. Descriptive Analytics (The Past)
- Question: "What happened?"
- Method: Reporting, historical data analysis.
- Tools: Standard reports, scorecards.
- Example: A report showing total revenue for the year 2023.
2. Diagnostic/Operational Analytics (The Present)
- Question: "Why did it happen?" or "What is happening now?"
- Method: Data discovery, drill-down, correlations, real-time monitoring.
- Tools: Interactive Dashboards.
- Example: Identifying that revenue dropped in 2023 because of a supply chain failure in Q2.
3. Predictive and Prescriptive Analytics (The Future)
- Predictive:
  - Question: "What will happen?"
  - Method: Statistical modeling, machine learning, forecasting.
  - Example: Forecasting sales for Q4 2024 based on current trends.
- Prescriptive:
  - Question: "How can we make it happen?"
  - Method: Optimization algorithms, simulation.
  - Example: Suggesting the optimal price point to maximize profit in the upcoming holiday season.
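Both ideas can be shown with toy versions. In the sketch below, the predictive part fits a straight-line trend by ordinary least squares and extrapolates one period ahead; the prescriptive part searches a list of candidate prices for the one with the highest profit. The quarterly sales figures, the linear demand curve, the unit cost, and all function names are assumptions made for illustration; real forecasting and optimization models are far richer.

```python
def fit_line(ys):
    """Ordinary least squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_next(ys):
    """Predictive: extrapolate the fitted trend one period ahead."""
    intercept, slope = fit_line(ys)
    return intercept + slope * len(ys)


def best_price(candidate_prices, unit_cost, demand_at):
    """Prescriptive: pick the candidate price with the highest profit."""
    return max(candidate_prices, key=lambda p: (p - unit_cost) * demand_at(p))


def assumed_demand(price):
    """Hypothetical linear demand curve: fewer units sell as price rises."""
    return max(0, 200 - 10 * price)


quarterly_sales = [100, 110, 120, 130]  # hypothetical Q1..Q4 figures
print(forecast_next(quarterly_sales))
print(best_price(range(5, 16), unit_cost=4, demand_at=assumed_demand))
```

The prescriptive answer is only as good as the demand assumption behind it, which is why such models are usually validated against historical data before anyone acts on them.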