INT323 Notes
Unit 2: Basics of Data Integration
1. Introduction to Data Integration
Data Integration involves combining data residing in different sources and providing users with a unified view of them. This process becomes significant in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).
In the context of Informatica and similar ETL tools, data integration is the core process of moving data from source systems (databases, flat files, SaaS applications) to a target system (usually a Data Warehouse).
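A minimal sketch of this source-to-target flow in Python is shown below, assuming a hypothetical `customers.csv` flat file as the source and a local SQLite file standing in for the target warehouse; the table and column names are illustrative only (an Informatica mapping expresses the same extract/transform/load steps graphically).

```python
# Minimal ETL sketch: extract from a flat file, transform, load into a target table.
# Assumes a hypothetical customers.csv with columns: id, name, country (illustrative only).
import csv
import sqlite3

# Extract: read rows from the source flat file.
with open("customers.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: trim whitespace and standardize country codes.
for row in rows:
    row["name"] = row["name"].strip()
    row["country"] = row["country"].strip().upper()

# Load: write the cleansed rows into the target warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO dim_customer (id, name, country) VALUES (:id, :name, :country)",
    rows,
)
conn.commit()
conn.close()
```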
1.1 The Need for Data Integration
Enterprises accumulate data in various formats and systems. The primary drivers for data integration include:
- Breaking Down Data Silos: Departments (Sales, HR, Finance) often store data in isolated systems. Integration bridges these gaps.
- Business Intelligence (BI): To perform accurate analytics, reporting tools require a consolidated view of data rather than fragmented pieces.
- Mergers and Acquisitions: When companies merge, their IT systems (ERP, CRM) must be integrated to function as a single entity.
- Legacy System Migration: Moving data from mainframe or outdated systems to modern cloud infrastructure requires robust integration pipelines.
- 360-Degree Customer View: Combining support tickets, sales history, and marketing interactions to understand customer behavior fully.
1.2 Advantages of Data Integration
- Improved Decision Making: Provides holistic, accurate, and real-time data to decision-makers.
- Data Integrity and Quality: Centralized integration rules reduce duplication and inconsistencies.
- Operational Efficiency: Automating data flows reduces manual data entry and human error.
- Connectivity: Enables communication between heterogeneous systems (e.g., connecting an Oracle Database with a Salesforce cloud instance).
1.3 Approaches to Data Integration
There are several architectural approaches to integrating data, ranging from manual to fully automated.
A. Manual Data Integration
- Description: Users manually collect data from different systems and merge it (often in Excel).
- Pros: Low cost for very small datasets.
- Cons: High error rate, not scalable, no real-time capability.
B. Middleware Data Integration
- Description: Use of middleware software that acts as a translator between different applications.
- Pros: Facilitates communication between legacy and modern systems.
- Cons: Limited overall functionality; integrations are often point-to-point rather than holistic.
C. Data Virtualization (Uniform Access Integration)
- Description: Data remains in the source systems. The integration layer retrieves data on-demand and presents a unified view without physically moving the data.
- Pros: Real-time access, no storage costs for a warehouse.
- Cons: Performance impact on source systems; not suitable for heavy historical analysis.
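As a rough illustration of virtualization-style access, the sketch below (Python, with hypothetical `sales.db` and `shipping.db` SQLite files and illustrative `customers` and `addresses` tables) builds the combined view at query time; nothing is copied into a warehouse.

```python
# Data virtualization sketch: federate two live sources at query time,
# returning a unified view without physically moving the data.
import sqlite3

conn = sqlite3.connect("sales.db")                          # hypothetical sales source
conn.execute("ATTACH DATABASE 'shipping.db' AS shipping")   # hypothetical shipping source

# Ensure the illustrative tables exist so the sketch runs standalone.
conn.execute("CREATE TABLE IF NOT EXISTS customers (customer_id INTEGER, state TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS shipping.addresses (customer_id INTEGER, state TEXT)")

# The unified view is produced on demand and never persisted.
query = """
    SELECT c.customer_id, c.state AS billing_state, s.state AS shipping_state
    FROM customers AS c
    JOIN shipping.addresses AS s ON s.customer_id = c.customer_id
"""
for row in conn.execute(query):
    print(row)
conn.close()
```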
D. Data Warehousing (Physical Data Integration)
- Description: Data is physically extracted, transformed, and loaded (ETL) into a central repository.
- Pros: Best for complex analytics and historical trending; does not impact source system performance during querying.
- Cons: High storage cost; data latency (data is only as fresh as the last load).
1.4 Technologies Used in Data Integration
- ETL Tools (Extract, Transform, Load):
  - Informatica PowerCenter / IICS: Industry standard for batch processing and heavy data transformation.
  - Talend / Microsoft SSIS: Other common market alternatives.
- ELT (Extract, Load, Transform):
  - Modern cloud approach (Snowflake, BigQuery) where raw data is loaded first and transformed later using the power of the target database.
- Enterprise Service Bus (ESB):
  - Message-based integration used for communication between applications (e.g., TIBCO, MuleSoft).
- CDC (Change Data Capture):
  - Technologies that identify and capture only data that has changed in the source system to enable real-time replication.
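As a rough illustration of the CDC/incremental idea, the sketch below (Python, with a hypothetical `orders` table and `last_updated` column) pulls only rows changed since the previous load; production CDC tools usually read the database transaction log rather than relying on a timestamp column.

```python
# Incremental extraction sketch: fetch only rows changed since the last load.
# A timestamp-based approximation of CDC; log-based CDC avoids even this query.
import sqlite3

def extract_changes(conn, last_load_ts):
    """Return rows from a hypothetical orders table updated after last_load_ts."""
    cursor = conn.execute(
        "SELECT order_id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_load_ts,),
    )
    return cursor.fetchall()

conn = sqlite3.connect("source_system.db")  # hypothetical operational source
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, last_updated TEXT)"
)
changed_rows = extract_changes(conn, "2024-01-01 00:00:00")
print(f"{len(changed_rows)} changed row(s) to replicate to the target")
conn.close()
```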
2. Introduction to Data Warehouse (DW)
A Data Warehouse is a large, centralized repository of data integrated from disparate sources. It is designed primarily for query and analysis (OLAP - Online Analytical Processing) rather than for transaction processing (OLTP).
2.1 Bill Inmon’s Definition (The 4 Characteristics)
According to W.H. Inmon (the "Father of Data Warehousing"), a Data Warehouse is:
- Subject-Oriented: Organized around major subjects such as Customer, Product, Sales, rather than application logic (like "Invoicing Process").
- Integrated: Data from different sources is made consistent. Naming conventions, encoding structures (e.g., Male/Female vs. 1/0), and measurement units are standardized.
- Time-Variant: Data is stored with a time element (historical perspective). While operational systems show only the current value, a DW holds data covering the last 5–10 years.
- Non-Volatile: Once data enters the warehouse, it is not updated or deleted in real time. It is loaded and then accessed, mostly read-only.
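A small sketch of the time-variant, non-volatile idea is given below (Python, illustrative table and column names): the operational table overwrites the address in place, while the warehouse table appends a dated row and keeps the old one.

```python
# Contrast: update-in-place (operational) vs. append-only history (warehouse).
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oltp_customer (id INTEGER PRIMARY KEY, address TEXT)")
conn.execute("CREATE TABLE dw_customer (id INTEGER, address TEXT, load_date TEXT)")

# Operational system: the old address is lost after the update.
conn.execute("INSERT INTO oltp_customer VALUES (1, 'NY')")
conn.execute("UPDATE oltp_customer SET address = 'CA' WHERE id = 1")

# Warehouse: each load appends a dated row, so history is preserved
# (non-volatile, time-variant).
conn.execute("INSERT INTO dw_customer VALUES (1, 'NY', '2023-01-01')")
conn.execute("INSERT INTO dw_customer VALUES (1, 'CA', ?)", (date.today().isoformat(),))

print(conn.execute("SELECT * FROM oltp_customer").fetchall())  # only the current value
print(conn.execute("SELECT * FROM dw_customer").fetchall())    # both historical rows
```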
2.2 OLTP vs. OLAP
| Feature | OLTP (Operational System) | OLAP (Data Warehouse) |
|---|---|---|
| Focus | Day-to-day operations (Entry) | Decision support (Analysis) |
| Data | Current, up-to-the-minute | Historical, summarized |
| Access | Read/Write (INSERT, UPDATE, DELETE) | Mostly Read (SELECT) |
| Users | Clerks, DBAs, Applications | Managers, Executives, Analysts |
| Size | Megabytes to Gigabytes | Terabytes to Petabytes |
| Optimization | Fast transaction processing | Complex query performance |
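To make the contrast concrete, the sketch below (Python, hypothetical `orders` table) places an OLTP-style single-row write next to an OLAP-style aggregation over many historical rows.

```python
# OLTP vs. OLAP access patterns against a hypothetical orders table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)")

# OLTP: short, single-row transaction (write-heavy, current data).
conn.execute("INSERT INTO orders VALUES (101, 'Acme Corp', 250.0, '2024-03-01')")
conn.commit()

# OLAP: read-only aggregation scanning many historical rows.
summary = conn.execute(
    """
    SELECT customer, SUM(amount) AS total_spend, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC
    """
).fetchall()
print(summary)
```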
3. Data Mart
A Data Mart is a simple form of a Data Warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. It is often a subset of the data warehouse.
3.1 Why use a Data Mart?
- Specific Business Focus: Tailored to the needs of a specific department.
- Performance: Queries run faster because the volume of data is smaller than the full Enterprise Data Warehouse (EDW).
- Security: Limits access. For example, the Marketing team doesn't need to see HR payroll data.
- Cost/Time: Faster and cheaper to implement than a full EDW.
3.2 Types of Data Marts
- Dependent Data Mart:
  - Sourced directly from the Enterprise Data Warehouse.
  - Advantage: Ensures a "Single Version of Truth" (consistent data).
  - Flow: Source → ETL → Data Warehouse → Data Mart.
- Independent Data Mart:
  - Sourced directly from operational systems without a central warehouse.
  - Disadvantage: Creates "Islands of Information" (data silos) and inconsistent metrics across the company.
  - Flow: Source → ETL → Data Mart.
- Hybrid Data Mart:
  - Combines input from a data warehouse and other operational sources.
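As a sketch of the dependent pattern (Python, with hypothetical `fact_transactions` table and file names), the snippet below carves a Sales-only subset out of the central warehouse.

```python
# Dependent data mart sketch: derive a Sales-only subset from the central warehouse.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")   # hypothetical EDW
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS fact_transactions "
    "(txn_id INTEGER, department TEXT, amount REAL, txn_date TEXT)"
)

# The data mart is just the Sales slice, refreshed from the EDW
# (Source → ETL → Data Warehouse → Data Mart).
mart = sqlite3.connect("sales_mart.db")
mart.execute(
    "CREATE TABLE IF NOT EXISTS sales_transactions (txn_id INTEGER, amount REAL, txn_date TEXT)"
)

sales_rows = warehouse.execute(
    "SELECT txn_id, amount, txn_date FROM fact_transactions WHERE department = 'Sales'"
).fetchall()
mart.executemany("INSERT INTO sales_transactions VALUES (?, ?, ?)", sales_rows)
mart.commit()
warehouse.close()
mart.close()
```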
4. Data Quality
Data Quality refers to the state of qualitative or quantitative pieces of information. In the context of Informatica and Data Warehousing, data quality ensures that data is "fit for use" in operations, decision-making, and planning.
4.1 The "Garbage In, Garbage Out" Principle
If the source data extracted during the ETL process is flawed, the resulting analytics in the Data Warehouse will be flawed, leading to poor business decisions.
4.2 Dimensions of Data Quality
To assess quality, data is measured against the following dimensions:
- Accuracy: Does the data reflect the real-world object or event?
  - Example: Is the customer's age actually 35, or was it entered incorrectly?
- Completeness: Is all requisite data available?
  - Example: Are there null values in mandatory fields like "SSN" or "Last Name"?
- Consistency: Does the data match across different systems?
  - Example: Does the Sales DB show a customer in "NY" while the Shipping DB shows "CA"?
- Uniqueness: Are there duplicates?
  - Example: Is the same customer listed three times with slightly different name spellings?
- Timeliness: Is the data available when needed?
  - Example: Is the stock market data current, or is it 15 minutes delayed?
- Validity: Does the data conform to defined rules or constraints?
  - Example: A ZIP code must be numeric; a date of birth cannot be in the future.
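The snippet below sketches how a few of these dimensions might be checked programmatically (Python, with illustrative field names and deliberately bad sample records).

```python
# Simple data quality checks: completeness, uniqueness, and validity.
from datetime import date

customers = [
    {"id": 1, "last_name": "Smith", "zip": "10001", "dob": "1990-05-01"},
    {"id": 2, "last_name": None,    "zip": "1000A", "dob": "2091-01-01"},  # fails several checks
    {"id": 2, "last_name": "Jones", "zip": "94105", "dob": "1985-07-20"},  # duplicate id
]

# Completeness: mandatory fields must not be null.
missing_last_name = [c["id"] for c in customers if not c["last_name"]]

# Uniqueness: no duplicate customer ids.
ids = [c["id"] for c in customers]
duplicate_ids = {i for i in ids if ids.count(i) > 1}

# Validity: ZIP must be numeric, and a date of birth cannot be in the future.
invalid_zip = [c["id"] for c in customers if not c["zip"].isdigit()]
future_dob = [c["id"] for c in customers if date.fromisoformat(c["dob"]) > date.today()]

print(missing_last_name, duplicate_ids, invalid_zip, future_dob)
```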
4.3 Data Quality Management in ETL (Informatica Context)
Data Quality is not a one-time fix; it is a continuous cycle handled within the transformation layer:
- Data Profiling: Analyzing source data to understand structure, content, and relationships before building mappings.
- Data Cleansing (Scrubbing): Correcting errors (e.g., standardizing "St.", "Street", and "Str" to "St").
- Data Matching: Identifying and merging duplicate records.
- Data Monitoring: Continuously tracking data quality metrics over time.
Informatica provides specialized tools (such as Informatica Data Quality, or IDQ) specifically to parse, standardize, and validate addresses and other critical data elements before they are loaded into the Data Warehouse.
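A toy cleansing and matching pass in plain Python is sketched below; the suffix rules, match key, and records are illustrative only, and IDQ itself provides far richer parsing, standardization, and fuzzy matching.

```python
# Toy cleansing (standardize street suffixes) and matching (group likely duplicates).
import re

records = [
    {"name": "John Smith",  "address": "12 Main Street"},
    {"name": "John  Smith", "address": "12 Main St."},  # likely the same customer
]

SUFFIXES = {"street": "St", "str": "St", "st": "St"}

def cleanse(address):
    """Standardize the street suffix, e.g. 'Street' / 'Str' / 'St.' -> 'St'."""
    words = re.sub(r"[.,]", "", address).split()
    words[-1] = SUFFIXES.get(words[-1].lower(), words[-1])
    return " ".join(words)

def match_key(record):
    """Crude match key: collapsed whitespace in the name plus the cleansed address."""
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    return (name, cleanse(record["address"]).lower())

# Group records sharing a match key; groups with more than one record are duplicate candidates.
groups = {}
for rec in records:
    groups.setdefault(match_key(rec), []).append(rec)

duplicates = [grp for grp in groups.values() if len(grp) > 1]
print(duplicates)  # both sample records land in the same group
```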