Unit 4 - Notes

INT323

Unit 4: Multi-dimensional data

1. ETL Overview

ETL stands for Extract, Transform, and Load. It is the backbone process of Data Warehousing and Business Intelligence. It involves moving data from various operational source systems to a centralized data warehouse or data mart for analysis, reporting, and data mining.

The Purpose of ETL

Operational systems (OLTP - Online Transaction Processing) are optimized for fast read/write of individual transactions. Analytical systems (OLAP - Online Analytical Processing) require historical, aggregated, and read-optimized data. ETL bridges this gap by:

  • Integrating data: Combining data from heterogeneous sources (SQL databases, mainframes, flat files, cloud APIs).
  • Improving quality: Cleansing and standardizing data before it reaches decision-makers.
  • Structuring data: Converting complex schemas into star or snowflake schemas suitable for dimensional modeling.

The Three Stages

  1. Extract: Reading data from specific source systems.
  2. Transform: Converting the extracted data into the structure and format required by the target system, by applying rules, using lookup tables, or combining the data with other data.
  3. Load: Writing the data into the target database.

2. Extracting Data

Extraction is the first phase where data is acquired from the source. The primary objective is to retrieve the data with minimal impact on the performance of the source systems.

Types of Extraction

  1. Full Extraction:

    • The entire dataset is extracted from the source.
    • Use Case: Initial load of a Data Warehouse or when the source system has no way to identify changed records.
    • Drawback: High resource consumption and time-intensive.
  2. Incremental Extraction (Change Data Capture - CDC):

    • Only data that has changed (inserted, updated, or deleted) since the last extraction is retrieved.
    • Mechanisms:
      • Timestamps: Querying records where Last_Updated > Last_Run_Date (see the sketch after this list).
      • Triggers: Database triggers populate a separate log table when changes occur.
      • Log Scanning: Reading the database transaction logs (least intrusive).
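
A minimal sketch of both extraction styles, assuming a SQL source table named orders with a Last_Updated column stored as ISO date strings and a watermark kept in a local file; all of these names are illustrative.

PYTHON
import sqlite3
from pathlib import Path

WATERMARK_FILE = Path("watermark.txt")   # holds Last_Run_Date between runs (illustrative)

def full_extract(conn):
    # Full extraction: pull every row (initial load, or no change tracking available).
    return conn.execute("SELECT * FROM orders").fetchall()

def incremental_extract(conn):
    # Incremental extraction: pull only rows changed since the last run.
    last_run = WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1900-01-01"
    rows = conn.execute(
        "SELECT * FROM orders WHERE Last_Updated > ?", (last_run,)
    ).fetchall()
    # Advance the watermark only after the batch has been handed to the next stage.
    new_watermark = conn.execute("SELECT MAX(Last_Updated) FROM orders").fetchone()[0]
    if new_watermark:
        WATERMARK_FILE.write_text(new_watermark)
    return rows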

Common Data Sources

  • RDBMS: Oracle, SQL Server, MySQL, DB2.
  • Flat Files: CSV, Excel, XML, JSON, Fixed-width text files.
  • Legacy Systems: Mainframes (COBOL copybooks).
  • SaaS/APIs: Salesforce, REST APIs, SOAP web services.

Validation during Extraction

  • Confirming data types match expectations.
  • Checking for dropped records or source connection failures (a basic validation sketch follows this list).
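
A basic validation sketch, assuming the extracted rows arrive as a list of dictionaries; the expected column names and types below are illustrative.

PYTHON
EXPECTED = {"Store_ID": int, "Qty": int, "Price": float}   # illustrative schema

def validate_batch(rows, expected_count=None):
    # Collect validation errors: row-count mismatches, missing columns, wrong types.
    errors = []
    if expected_count is not None and len(rows) != expected_count:
        errors.append(f"row count mismatch: got {len(rows)}, expected {expected_count}")
    for i, row in enumerate(rows):
        for col, col_type in EXPECTED.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col}")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: {col} is not {col_type.__name__}")
    return errors

print(validate_batch([{"Store_ID": 1, "Qty": 5, "Price": 9.99}], expected_count=1))  # -> []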

3. Transformations

Transformation is the "heart" of the ETL process. It involves applying a set of business rules to the extracted data to ensure conformity, accuracy, and consistency in the target system.

Core Transformation Categories (Informatica Context)

In Informatica and general ETL theory, transformations are often categorized as:

  • Active Transformations: Change the number of rows passing through the transformation (e.g., Filtering rows, Aggregating data).
  • Passive Transformations: The number of rows entering equals the number of rows leaving (e.g., Upper-casing a string, Mathematical calculation).

Common Transformation Techniques

  1. Cleaning/Scrubbing:

    • Mapping NULL to 0 or "Unknown".
    • Trimming whitespace.
    • Standardizing formats (e.g., converting "M", "Male", "1" to a standard "M").
  2. Filtration:

    • Loading only sales data for the current fiscal year.
    • Removing test data or incomplete records.
  3. Derivation/Calculation:

    • Creating new columns based on existing ones.
    • Example: Total_Cost = Quantity * Unit_Price.
  4. Joining/Lookup:

    • Combining data from two or more sources.
    • Example: Looking up Customer_Name from a master table using a Customer_ID found in transaction data.
  5. Aggregation:

    • Summarizing data.
    • Example: Summing daily sales to create monthly sales figures.
  6. Splitting/Routing:

    • Sending data to different targets based on conditions (e.g., Valid records go to Target A, Error records go to an Error Log).
  7. Key Generation:

    • Generating Surrogate Keys (synthetic primary keys) for the Data Warehouse to maintain historical accuracy independent of source system keys. Several of these techniques are combined in the sketch after this list.
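
A short sketch combining several of these techniques (filtration, cleaning, derivation, lookup, and key generation) on in-memory records; the lookup table and all field names are illustrative.

PYTHON
import itertools

# Illustrative lookup table (Joining/Lookup) and surrogate-key counter (Key Generation).
CUSTOMER_MASTER = {"C001": "Alice Ltd", "C002": "Bob GmbH"}
surrogate_keys = itertools.count(start=1)

def transform(record):
    cust_id = record.get("Customer_ID", "")
    # Filtration: drop test data and incomplete records (an active transformation).
    if cust_id.startswith("TEST") or record.get("Qty") is None:
        return None
    # Cleaning/Scrubbing: NULL price becomes 0; gender codes are standardized.
    price = record.get("Unit_Price") or 0
    gender = {"M": "M", "Male": "M", "1": "M", "F": "F", "Female": "F", "2": "F"}.get(
        str(record.get("Gender", "")).strip(), "U")
    # Derivation: Total_Cost = Quantity * Unit_Price (a passive transformation).
    total_cost = record["Qty"] * price
    # Lookup + Key Generation: resolve the customer name and assign a surrogate key.
    return {
        "Sale_Key": next(surrogate_keys),
        "Customer_Name": CUSTOMER_MASTER.get(cust_id, "Unknown"),
        "Gender": gender,
        "Total_Cost": total_cost,
    }

print(transform({"Customer_ID": "C001", "Qty": 3, "Unit_Price": 10.0, "Gender": "Male"}))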

4. Loading Data

Loading is the final step where the transformed data is written into the target data warehouse (usually a dimensional model consisting of Fact and Dimension tables).

Loading Strategies

  1. Full Load (Destructive Load):

    • The target table is truncated (wiped clean) and fully repopulated with fresh data.
    • Common for small dimension tables.
  2. Incremental Load:

    • Only new records are appended to the target.
  3. Upsert (Update/Insert), illustrated in the sketch after this list:

    • The system checks if a record exists.
    • If Yes: Update the existing record.
    • If No: Insert a new record.
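
A minimal upsert sketch using Python's built-in sqlite3 module, whose INSERT ... ON CONFLICT syntax performs the update-or-insert check in a single statement; table and column names are illustrative.

PYTHON
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id TEXT PRIMARY KEY, city TEXT)")

def upsert_customer(customer_id, city):
    # If the key already exists, update the row; otherwise insert a new one.
    conn.execute(
        """INSERT INTO dim_customer (customer_id, city) VALUES (?, ?)
           ON CONFLICT(customer_id) DO UPDATE SET city = excluded.city""",
        (customer_id, city),
    )

upsert_customer("C001", "Delhi")    # insert
upsert_customer("C001", "Mumbai")   # update
print(conn.execute("SELECT * FROM dim_customer").fetchall())  # [('C001', 'Mumbai')]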

Slowly Changing Dimensions (SCD)

When loading dimension tables (e.g., Customer Address), we must decide how to handle history:

  • Type 1 (Overwrite): Update the address. History is lost.
  • Type 2 (Add Row): Create a new record with the new address and mark the old one as inactive (preserves history); see the sketch after this list.
  • Type 3 (Add Column): Keep a "Current" and "Previous" column in the same row.
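
A Type 2 sketch in the same sqlite3 style: the dimension table keeps a surrogate key, an effective date, and an is_current flag, so the old address survives as an inactive row. All names are illustrative.

PYTHON
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT,                               -- natural key from the source
    address      TEXT,
    valid_from   TEXT,
    is_current   INTEGER)""")

def scd_type2_update(customer_id, new_address):
    current = conn.execute(
        "SELECT address FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,)).fetchone()
    if current and current[0] == new_address:
        return  # nothing changed, nothing to do
    # Expire the current row (if any), then add a new current row: history is preserved.
    conn.execute("UPDATE dim_customer SET is_current = 0 WHERE customer_id = ? AND is_current = 1",
                 (customer_id,))
    conn.execute("INSERT INTO dim_customer (customer_id, address, valid_from, is_current) VALUES (?, ?, ?, 1)",
                 (customer_id, new_address, date.today().isoformat()))

scd_type2_update("C001", "12 Park St")
scd_type2_update("C001", "99 Lake Rd")   # the old address is kept as an inactive row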

Performance Considerations

  • Bulk Loading: Using the database's bulk-load path, which minimizes transaction logging, to insert data faster.
  • Indexing: Indexes usually slow down the loading process. It is common practice to drop indexes before loading and rebuild them afterward, as in the sketch below.
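
A small sketch of the drop-load-rebuild pattern, using sqlite3 and executemany inside a single transaction as a stand-in for a real bulk loader; all names and data are illustrative.

PYTHON
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (store_id INTEGER, revenue REAL)")
conn.execute("CREATE INDEX idx_store ON fact_sales (store_id)")

rows = [(s, s * 10.0) for s in range(100_000)]   # made-up data

# Drop the index, bulk-insert inside one transaction, then rebuild the index.
conn.execute("DROP INDEX idx_store")
with conn:  # a single transaction keeps per-row commit overhead down
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_store ON fact_sales (store_id)")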

5. Simple ETL Processing

An ETL process flows logically from Source to Target. Below is a conceptual example of a simple ETL workflow.

Scenario

A retail company needs to move daily sales data from a CSV file into a SQL Database for reporting.

Logic Flow

  1. Source Definition: Connect to Daily_Sales.csv.
    • Columns: Date, Store_ID, Product_ID, Qty, Price.
  2. Filter Transformation:
    • Rule: Discard rows where Qty < 0 (returns).
  3. Expression Transformation (Calculation):
    • Action: Create new field Revenue.
    • Logic: Revenue = Qty * Price.
    • Action: Format Date to YYYY-MM-DD.
  4. Lookup Transformation:
    • Action: Get Store_Region by matching Store_ID against the Stores_Master database table.
  5. Target Definition: Map fields to Fact_Sales table in the Data Warehouse.
  6. Load: Execute the insertion.

Pseudocode Representation

TEXT
FOR EACH row IN Source_File:
    IF row.Qty >= 0 THEN
        calculated_revenue = row.Qty * row.Price
        formatted_date = FORMAT(row.Date, 'YYYY-MM-DD')
        region = LOOKUP(Store_Database, row.Store_ID)
        
        INSERT INTO Target_Table (Date, Region, Product, Revenue)
        VALUES (formatted_date, region, row.Product_ID, calculated_revenue)
    ELSE
        WRITE TO Error_Log(row)
    END IF
NEXT
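
The same flow as a runnable Python sketch, using the standard csv and sqlite3 modules. File, table, and column names follow the scenario above; the source date format (assumed here to be DD/MM/YYYY) and the in-memory lookup table are illustrative assumptions.

PYTHON
import csv
import sqlite3
from datetime import datetime

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS Fact_Sales
                (Sale_Date TEXT, Region TEXT, Product_ID TEXT, Revenue REAL)""")

# Stand-in for the Stores_Master lookup table (illustrative values).
STORE_REGION = {"S01": "North", "S02": "South"}

error_log = []
with open("Daily_Sales.csv", newline="") as f:
    for row in csv.DictReader(f):            # columns: Date, Store_ID, Product_ID, Qty, Price
        qty, price = int(row["Qty"]), float(row["Price"])
        if qty >= 0:
            revenue = qty * price                                     # Expression: derive Revenue
            sale_date = datetime.strptime(row["Date"], "%d/%m/%Y").strftime("%Y-%m-%d")
            region = STORE_REGION.get(row["Store_ID"], "Unknown")     # Lookup: Store_Region
            conn.execute("INSERT INTO Fact_Sales VALUES (?, ?, ?, ?)",
                         (sale_date, region, row["Product_ID"], revenue))
        else:
            error_log.append(row)                                     # route returns to the error log
conn.commit()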


6. ETL Tools

While ETL can be coded manually (using Python, PL/SQL, or Java), specialized ETL tools provide visual interfaces, metadata management, and performance optimization.

1. Informatica PowerCenter

The market leader in enterprise ETL and the focus of this subject.

  • Architecture: Client-server based.
  • Client Tools:
    • Repository Manager: Manages folders, users, and permissions.
    • Designer: Used to create Mappings (the logical flow of data).
    • Workflow Manager: Used to create workflows (execution sequence) and sessions.
    • Workflow Monitor: Monitors the execution logs and status.
  • Strengths: High performance, handles massive data volumes, extensive connectivity options, reusable transformations (Mapplets).

2. Microsoft SSIS (SQL Server Integration Services)

  • Integrated deeply with the Microsoft ecosystem.
  • Visual interface similar to Flowcharts.
  • Excellent for environments already using SQL Server.

3. Talend

  • Open-source options available.
  • Generates Java code in the background.
  • Strong focus on Big Data and Cloud integration.

4. Cloud-Native Tools

  • AWS Glue, Azure Data Factory, Google Dataflow: Managed services that handle ETL without requiring server maintenance.

7. Data Cube

A Data Cube is a multi-dimensional array of values used to model and analyze numerical measures across several business dimensions. It is the core concept of OLAP (Online Analytical Processing).

Concept

In a standard spreadsheet (2D), you have Rows and Columns (e.g., Product and Sales). A Data Cube adds a third (or fourth/fifth) dimension, such as Time, Region, or Supplier. This allows data to be modeled as an n-dimensional object.

Components

  1. Dimensions: The perspectives or entities with respect to which an organization wants to keep records (e.g., Time, Location, Product).
  2. Measures (Facts): The numerical data being analyzed (e.g., Revenue, Units Sold, Profit).
  3. Cells: The intersection of dimensions containing the specific measure value (see the sketch after this list).
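
A tiny sketch of these components: dimension values index the cells, and each cell holds a measure. A plain dictionary keyed by (Time, Region, Product) tuples is enough to illustrate the idea; the values are made up.

PYTHON
# Cells of a 3-D cube keyed by (Time, Region, Product); the measure is Revenue.
cube = {
    ("2024-Q1", "North", "Laptop"): 120_000,
    ("2024-Q1", "South", "Laptop"):  95_000,
    ("2024-Q2", "North", "Tablet"):  40_000,
}
# One cell = one intersection of dimension values.
print(cube[("2024-Q1", "North", "Laptop")])   # 120000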

OLAP Operations on Data Cubes

To analyze multi-dimensional data, users perform specific navigation operations (several are illustrated in the sketch after this list):

  1. Slice:

    • Selecting a specific value for one dimension.
    • Analogy: Cutting one slice of bread from a loaf.
    • Example: "Show me sales for Region = North across all products and times."
  2. Dice:

    • Selecting a range of values for multiple dimensions.
    • Analogy: Creating a smaller sub-cube.
    • Example: "Show me sales for Product = 'Laptop' AND 'Tablet' in Time = Q1 and Q2."
  3. Drill-down:

    • Navigating from summary data to detailed data (stepping down the concept hierarchy).
    • Example: Moving from "Yearly Sales" to "Quarterly Sales" to "Monthly Sales".
  4. Roll-up (Drill-up):

    • Summarizing detailed data into higher-level aggregates.
    • Example: Moving from "City-level" data to "Country-level" data.
  5. Pivot (Rotate):

    • Rotating the data axes to view the data from a different perspective.
    • Example: Swapping Rows (Products) and Columns (Months) in a report.
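
Reusing the dictionary cube from the earlier sketch (values are made up), slice, dice, and roll-up can be expressed in ordinary Python.

PYTHON
cube = {
    ("2024-Q1", "North", "Laptop"): 120_000,
    ("2024-Q1", "South", "Laptop"):  95_000,
    ("2024-Q1", "North", "Tablet"):  40_000,
    ("2024-Q2", "North", "Laptop"): 130_000,
}

# Slice: fix one dimension (Region = North).
north = {k: v for k, v in cube.items() if k[1] == "North"}

# Dice: restrict several dimensions to sets of values (a smaller sub-cube).
sub_cube = {k: v for k, v in cube.items()
            if k[0] in {"2024-Q1", "2024-Q2"} and k[2] in {"Laptop", "Tablet"}}

# Roll-up: aggregate away the Product dimension to get Time x Region totals.
rollup = {}
for (time, region, _product), revenue in cube.items():
    rollup[(time, region)] = rollup.get((time, region), 0) + revenue

print(rollup[("2024-Q1", "North")])   # 160000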

Types of OLAP Storage

  • MOLAP (Multidimensional OLAP): Stores data physically in multi-dimensional arrays (cubes). Fast retrieval but limited scalability.
  • ROLAP (Relational OLAP): Virtual cubes. Queries are translated to standard SQL against a relational database. Slower but handles more data.
  • HOLAP (Hybrid OLAP): Combines both; keeps summary data in MOLAP and detailed data in ROLAP.