Unit 3 - Notes

INT323

Unit 3: Data Warehouse

1. Overview of Data Warehouse

A Data Warehouse (DW) is a specialized database optimized to analyze relational data coming from transactional systems and line-of-business applications. It acts as a central repository where information arrives from one or more data sources to support business intelligence (BI) activities, particularly analytics.

Key Definitions

  • Bill Inmon (Top-down approach): Defines a data warehouse as a "subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
  • Ralph Kimball (Bottom-up approach): Defines it as "a copy of transaction data specifically structured for query and analysis."

Data Warehouse vs. OLTP (Operational Database)

  Feature      | OLTP (Operational System)      | Data Warehouse (OLAP)
  -------------|--------------------------------|---------------------------------------
  Focus        | Daily operations, transactions | Data analysis, decision support
  Data Content | Current, up-to-the-minute data | Historical, archived data
  Design       | Entity-Relationship (ER) Model | Dimensional Modeling (Star/Snowflake)
  Operation    | Read/Write (frequent updates)  | Read-only (mostly)
  Size         | Megabytes to Gigabytes         | Terabytes to Petabytes

2. Features of Data Warehouse

According to Bill Inmon, a data warehouse allows for the integration of data from various sources to provide a unified view to the user. The four key characteristics are:

A. Subject-Oriented

  • The data warehouse is organized around major subjects of the enterprise (e.g., Customer, Product, Sales) rather than day-to-day operations or applications (e.g., Stock listing, Order processing).
  • It focuses on the modeling and analysis of data for decision-makers.

B. Integrated

  • Data is gathered from various heterogeneous sources (RDBMS, flat files, APIs).
  • Data Cleaning: Inconsistencies in naming conventions, measurement attributes, and physical attributes are removed.
    • Example: If Source A uses "M/F" and Source B uses "1/0" for gender, the DW standardizes this to a single format (e.g., "Male/Female").
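The standardization step above can be sketched in a few lines of Python. This is a minimal illustration, not a real integration tool: the mapping table and record layout are hypothetical, chosen only to mirror the M/F vs. 1/0 example.

```python
# Sketch of gender-code standardization during integration.
# The source conventions and mapping below are illustrative only.

GENDER_MAP = {
    "M": "Male", "F": "Female",   # Source A convention
    "1": "Male", "0": "Female",   # Source B convention
}

def standardize_gender(value):
    """Map a source-specific gender code to the warehouse standard."""
    return GENDER_MAP.get(str(value).strip().upper(), "Unknown")

records = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "0"}]
cleaned = [{**r, "gender": standardize_gender(r["gender"])} for r in records]
# cleaned now holds a single, consistent gender format for both sources
```

In practice an ETL tool (e.g., an Informatica expression transformation) performs this lookup, but the logic is the same: one canonical code set, applied before loading.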

C. Time-Variant

  • Data in a warehouse is identified with a particular time period.
  • It maintains historical data (often 5–10 years), whereas operational systems usually maintain current data.
  • Every key structure in the data warehouse contains an element of time (e.g., Day, Week, Month).

D. Non-Volatile

  • Data is strictly read-only for end users: once data is loaded into the warehouse, it should not change.
  • Insert/update/delete operations do not occur in real time; instead, data is refreshed via scheduled batch loads (ETL).

3. Applications of Data Warehouse

Data warehouses serve as the backbone for Business Intelligence. Key industry applications include:

  1. Banking & Finance: Risk management, credit card fraud detection, customer profitability analysis.
  2. Retail: Inventory management, market basket analysis (identifying products bought together), sales forecasting.
  3. Healthcare: Predicting patient outcomes, analyzing treatment effectiveness, hospital resource management.
  4. Telecommunications: Churn analysis (why customers leave), network utilization analysis, call detail record analysis.
  5. Government: Tax compliance, audit selection, demographic trend analysis.

4. Types of Data Warehouse

A. Enterprise Data Warehouse (EDW)

  • A centralized warehouse that provides a unified view of the entire organization.
  • It typically follows the "Top-Down" approach (Inmon).

B. Operational Data Store (ODS)

  • An intermediate staging area that holds current or near real-time data.
  • It is often used when operational reporting requires fresher data than the main DW provides.
  • Data in ODS is volatile (can be overwritten).

C. Data Mart

  • A subset of the data warehouse, usually oriented to a specific business line or team (e.g., Marketing, Sales).
  • Dependent Data Mart: Sourced directly from the EDW.
  • Independent Data Mart: Sourced directly from operational sources (discouraged due to data silos).

5. Functions of Data Warehouse

The fundamental functions involve the movement and transformation of data:

  1. Data Extraction: Gathering data from external sources (databases, XML, CSV).
  2. Data Cleaning: Correcting errors, filling missing values, and resolving conflicts.
  3. Data Transformation: Converting data into the data warehouse format (aggregations, summation, data type conversion).
  4. Data Loading: Physically loading the data into the warehouse tables.
  5. Refreshing: Updating the warehouse with new data periodically (incremental loading).
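The five functions above can be sketched as a tiny end-to-end pipeline. This is a toy illustration assuming a CSV source and an in-memory "warehouse" dictionary; real pipelines use an ETL tool and a database, but the stages map one-to-one.

```python
import csv
import io

def extract(csv_text):
    """Extraction: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def clean(rows):
    """Cleaning: drop rows with a missing amount, fill a missing region."""
    out = []
    for r in rows:
        if not r["amount"]:
            continue                       # unrecoverable: skip the row
        r["region"] = r["region"] or "UNKNOWN"  # fill missing value
        out.append(r)
    return out

def transform(rows):
    """Transformation: convert types and aggregate amounts by region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + float(r["amount"])
    return totals

def load(totals, warehouse):
    """Loading: write the aggregated facts into the warehouse table."""
    warehouse.update(totals)

source = "region,amount\nNorth,100\nSouth,\nNorth,50\n,25\n"
warehouse = {}
load(transform(clean(extract(source))), warehouse)
```

A periodic Refresh would simply rerun this pipeline against only the new source rows (incremental loading) rather than the full history.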

6. Tools and Utilities

In the context of Informatica and general warehousing, tools are categorized by their function:

A. ETL Tools (Extraction, Transformation, Loading)

  • Informatica PowerCenter: The industry standard for enterprise ETL.
  • Microsoft SSIS: Integration services for SQL Server.
  • Talend: Open-source ETL solution.
  • IBM InfoSphere DataStage: High-volume data integration.

B. Data Modeling Tools

  • ERwin: For creating logical and physical data models.
  • Toad Data Modeler: Database design tool.

C. BI and Reporting Tools (Front-end)

  • Tableau: Visualization.
  • Power BI: Microsoft's analytics service.
  • SAP BusinessObjects: Enterprise reporting.

7. Data Warehouse Architecture

A standard data warehouse adopts a Three-Tier Architecture:

Tier 1: Bottom Tier (The Database Server)

  • This is the relational database system (RDBMS) that physically stores the data.
  • Examples: Oracle, Teradata, SQL Server, Snowflake.
  • Includes the Back-end tools for feeding data (ETL process).

Tier 2: Middle Tier (The OLAP Server)

  • An intermediary that converts the relational data in the bottom tier into a structure suited for multidimensional analysis.
  • It maps the operations on multidimensional data to standard relational operations.
  • It improves query performance through caching and pre-aggregation.

Tier 3: Top Tier (The Front-End Client)

  • The interface layer where users interact with the data.
  • Includes query tools, reporting tools, analysis tools, and data mining tools.

8. OLAP (Online Analytical Processing)

OLAP refers to the technology that allows users to analyze data from multiple database systems at the same time. It typically presents data as a Multidimensional Cube.

Common OLAP Operations

  1. Roll-up (Consolidation): Moving from detailed data to summarized data (e.g., Daily Sales → Monthly Sales).
  2. Drill-down: Moving from summarized data to detailed data (e.g., Yearly Sales → Quarterly Sales).
  3. Slice: Selecting a single value for one dimension to produce a sub-cube (e.g., Sales for "2023" only).
  4. Dice: Selecting a sub-cube by specifying ranges on multiple dimensions (e.g., Sales for "2023" AND "Region=North").
  5. Pivot (Rotate): Rotating the data axes to view the data from a different perspective.
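Three of these operations can be demonstrated on a toy cube held as plain tuples. The facts and dimension names (year, quarter, region) are made up for illustration; a real OLAP server would do this over a cube store or SQL.

```python
# Toy cube: each fact is (year, quarter, region, sales).
facts = [
    (2023, "Q1", "North", 10), (2023, "Q2", "North", 20),
    (2023, "Q1", "South", 5),  (2024, "Q1", "North", 8),
]

def roll_up(facts):
    """Roll-up: consolidate quarterly detail into yearly totals."""
    totals = {}
    for year, _quarter, _region, sales in facts:
        totals[year] = totals.get(year, 0) + sales
    return totals

def slice_cube(facts, year):
    """Slice: fix one dimension (year) to a single value."""
    return [f for f in facts if f[0] == year]

def dice(facts, years, regions):
    """Dice: restrict several dimensions at once to form a sub-cube."""
    return [f for f in facts if f[0] in years and f[2] in regions]
```

Drill-down is simply the inverse walk: starting from `roll_up`'s yearly totals back to the quarterly facts; Pivot only changes which dimension appears on which axis of the report.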

Types of OLAP

  1. ROLAP (Relational OLAP):
    • Works directly with relational databases.
    • Data is stored in tables (Rows/Columns).
    • Pros: Scalable, handles large data volumes.
    • Cons: Slower query performance compared to MOLAP.
  2. MOLAP (Multidimensional OLAP):
    • Stores data in optimized multi-dimensional array storage (Cubes).
    • Pros: Extremely fast query performance (pre-calculated aggregates).
    • Cons: Data redundancy, limited scalability, difficult to update.
  3. HOLAP (Hybrid OLAP):
    • Combines ROLAP and MOLAP.
    • Summarized data is kept in MOLAP (for speed) and detailed data in ROLAP (for storage efficiency).

9. Schemas in Data Warehouse

Schemas define how the database is structured to support analysis. The core components are Fact Tables (measurements/numbers) and Dimension Tables (descriptions).

A. Star Schema

  • The simplest and most common style.
  • A central Fact Table is connected directly to multiple Dimension Tables.
  • Dimension tables are denormalized (contain redundancy but are faster to read).
  • Visual resemblance to a star.
  • Performance: High (fewer joins required).
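A minimal star schema can be built and queried with SQLite to show why reads are fast: the fact table reaches every dimension in exactly one join. Table and column names here are illustrative, not a prescribed design.

```python
import sqlite3

# Minimal star schema: one fact table, two denormalized dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (1, 'Jan'), (2, 'Feb');
    INSERT INTO dim_product VALUES (10, 'Books'), (20, 'Games');
    INSERT INTO fact_sales VALUES (1, 10, 100), (1, 20, 50), (2, 10, 30);
""")

# A typical star query: the fact joins each dimension directly (one join apiece).
rows = con.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()
```

In a snowflake schema, `dim_product` would itself split into (say) product and category tables, adding a second join to the same query.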

B. Snowflake Schema

  • A variation of the Star Schema.
  • Dimension tables are normalized, meaning a dimension table splits into further tables to remove redundancy.
  • Example: In Star schema, "City" and "Country" are in the Location table. In Snowflake, "Location" links to a "Country" table.
  • Performance: Slower than Star Schema (more joins required).
  • Storage: Uses less space.

C. Fact Constellation (Galaxy Schema)

  • Contains multiple Fact Tables that share Dimension Tables.
  • Used for complex enterprise-wide systems.
  • Example: A "Sales Fact" and a "Shipping Fact" both sharing a "Time Dimension."

10. Meta-data

Meta-data is defined as "data about data." In a data warehouse, it acts as a directory or roadmap.

Types of Meta-data

  1. Technical Meta-data:
    • Used by warehouse designers and administrators.
    • Includes information about data sources, table structures, data types, transformation rules, indexes, and mapping logic (e.g., Informatica Mappings).
  2. Business Meta-data:
    • Used by end-users.
    • Includes business definitions, glossaries, report descriptions, and information on data ownership.

Role of Meta-data Repository

  • It facilitates data extraction and loading.
  • It helps in query optimization.
  • It ensures semantic consistency across the organization.

11. Data-marting

Data-marting is the process of creating and managing Data Marts.

Why Data Marts?

  • Cost: Building a full EDW is expensive; a mart is a cheaper starting point.
  • Speed: Faster implementation time.
  • Access: Improved response time for specific departmental queries.

Approaches

  1. Top-Down (Inmon): Build the Enterprise Data Warehouse first, then derive Data Marts from it. Ensures consistency (single version of truth).
  2. Bottom-Up (Kimball): Build Data Marts first based on department needs, then integrate them via "Conformed Dimensions" to create the warehouse.

12. Security and Backup

Security Measures

Data warehouses contain highly sensitive historical data, making security paramount.

  1. Authentication: Verifying user identity (LDAP, Active Directory).
  2. Authorization: Determining what resources a user can access.
    • Row-Level Security: A Sales Manager for the "North" region should only see rows where Region = 'North'.
  3. Encryption:
    • Data at Rest: Encrypting the physical database files.
    • Data in Motion: Using SSL/TLS for data moving between ETL tools and the database.
  4. Auditing: Logging who accessed what data and when.
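The row-level security rule above (a "North" manager sees only `Region = 'North'` rows) can be sketched as a per-user filter applied at query time. The user-to-region mapping and table layout are hypothetical; production systems push this policy into the database (views, security predicates) rather than application code.

```python
import sqlite3

# Sketch of row-level security via a per-user region filter (illustrative data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("North", 100), ("South", 200), ("North", 50)])

# Authorization metadata: which region each authenticated user may see.
USER_REGION = {"alice": "North", "bob": "South"}

def sales_for(user):
    """Return only the rows the user's region entitles them to see."""
    region = USER_REGION[user]  # raises KeyError for unknown users
    return con.execute(
        "SELECT region, amount FROM sales WHERE region = ?", (region,)
    ).fetchall()
```

Note the parameterized query: authorization filters must never be string-concatenated into SQL, or they become an injection vector.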

Backup and Recovery Strategies

Since DWs are massive, standard backups can take too long.

  1. Full Backup: Backing up the entire warehouse (usually done weekly).
  2. Incremental Backup: Backing up only data changed since the last backup (daily).
  3. Differential Backup: Backing up data changed since the last full backup.
  4. Partitioning: Backing up specific partitions (e.g., backing up the "2023" data partition separately from older, static years).
  5. Disaster Recovery (DR): Maintaining a mirror site or a warm standby database in a different geographical location.
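The incremental strategy above hinges on identifying only the rows changed since the last run. A common (assumed, not prescribed) mechanism is a last-modified timestamp column, sketched here with in-memory records.

```python
from datetime import datetime

# Sketch of selecting rows for an incremental backup: only records
# modified since the previous run. Timestamps and rows are illustrative.
last_backup = datetime(2024, 1, 10)
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 5)},   # unchanged since backup
    {"id": 2, "updated_at": datetime(2024, 1, 12)},  # changed -> include
]

def incremental_batch(rows, since):
    """Pick only the rows changed after the last backup timestamp."""
    return [r for r in rows if r["updated_at"] > since]
```

A differential backup uses the same predicate but always compares against the last *full* backup's timestamp instead of the most recent backup of any kind.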