Unit 6 - Notes

INT323

Unit 6: Cloud Databases and Master Data Management(MDM)

Part 1: Cloud Databases

1. Introduction to Cloud Databases

A Cloud Database is a database that typically runs on a cloud computing platform. Access to the database is provided as a service (DBaaS - Database as a Service). It serves the same function as a traditional database—storing and organizing data—but provides added flexibility regarding infrastructure provisioning and maintenance.

There are two primary deployment models for cloud databases:

  1. Virtual Machine Image (IaaS): Users purchase virtual machine space from a cloud provider and run a database on it. The user is responsible for managing the database software, while the provider manages the physical hardware.
  2. Database as a Service (DBaaS/PaaS): The cloud provider manages the hardware, operating system, and database software (updates, backups, scaling). The user simply consumes the database service via an API or console.

2. Cloud Databases Vs Traditional Database

The shift from on-premises (traditional) databases to cloud databases represents a fundamental change in how data infrastructure is managed.

Feature Traditional Database (On-Premises) Cloud Database (DBaaS)
Infrastructure Hardware must be purchased, installed, and housed physically by the organization. Hosted on the provider's infrastructure; virtualized and abstracted from the user.
Cost Model CapEx (Capital Expenditure): High upfront costs for servers, cooling, and space. OpEx (Operational Expenditure): Subscription or pay-per-use models.
Scalability Difficult. Requires buying new hardware and manual migration (Vertical Scaling is limited). High. Can scale up (power) or out (capacity) almost instantly with a few clicks.
Maintenance IT team is responsible for patches, backups, upgrades, and disaster recovery. Vendor handles maintenance, automated backups, and software patching.
Accessibility Limited to the local network or VPN. Accessible from anywhere with an internet connection (secured via IP whitelisting/VPC).
Availability Dependent on local hardware health. Redundancy requires buying double the hardware. High availability built-in (99.99% SLAs) via replication across availability zones.

3. Benefits and Features of Cloud Databases

  • Elasticity and Scalability:
    • Auto-scaling: The ability to automatically add resources (CPU/RAM/Storage) during peak loads and reduce them during off-peak times.
    • Sharding: Distributing data across multiple machines effectively.
  • Cost Efficiency: Eliminated upfront hardware costs. Organizations pay only for the storage and compute resources actually consumed.
  • High Availability and Disaster Recovery: Cloud providers replicate data across different geographic regions. If one data center fails, traffic is automatically rerouted to a healthy one.
  • Agility and Speed: A new database instance can be spun up in minutes rather than the weeks required to procure and install physical servers.
  • Automated Management: Tasks such as backups, software patching, and failure detection are handled by the provider, allowing DBAs to focus on schema optimization and query performance.
  • Security Integration: Built-in encryption at rest and in transit, integrated Identity and Access Management (IAM), and compliance certifications (SOC2, HIPAA).

4. Considerations for Cloud Databases

Before migrating to the cloud, organizations must evaluate several critical factors:

  • Data Sovereignty and Compliance: Laws like GDPR require data to reside in specific geographic locations. You must ensure the cloud provider has data centers in the required regions.
  • Vendor Lock-in: Migrating data into a specific cloud provider's proprietary format (e.g., AWS DynamoDB or Google Spanner) can make it difficult to switch providers later due to technical incompatibilities or high egress fees.
  • Latency: While cloud databases are fast, accessing a database over the internet introduces latency compared to a local LAN. Application servers should ideally be hosted in the same cloud region as the database.
  • Shared Responsibility Model: While the provider secures the cloud (hardware, physical network), the customer is responsible for security in the cloud (data encryption settings, user access controls, weak passwords).
  • Migration Strategy: Moving terabytes of data requires robust ETL (Extract, Transform, Load) strategies—often involving tools like Informatica—to minimize downtime during the switch.

5. Cloud Providers

The market is dominated by major hyperscalers and specialized database vendors.

  • Amazon Web Services (AWS):
    • Amazon RDS: Managed relational databases (PostgreSQL, MySQL, Oracle, SQL Server).
    • Amazon Aurora: A cloud-native relational DB compatible with MySQL/PostgreSQL but with higher performance.
    • Amazon Redshift: Data warehousing service.
    • Amazon DynamoDB: NoSQL key-value store.
  • Microsoft Azure:
    • Azure SQL Database: Fully managed SQL Server engine.
    • Azure Cosmos DB: Multi-model NoSQL database.
    • Azure Synapse Analytics: Integrated analytics and warehousing.
  • Google Cloud Platform (GCP):
    • Cloud SQL: Managed MySQL/PostgreSQL/SQL Server.
    • Cloud Spanner: Relational database with unlimited scale and consistency.
    • BigQuery: Serverless, highly scalable data warehouse.
  • Specialized Cloud Data Platforms:
    • Snowflake: A cloud-native data warehouse/data lake platform that runs on top of AWS, Azure, or GCP.
    • MongoDB Atlas: Managed service for MongoDB (NoSQL).

Part 2: Master Data Management (MDM)

6. Introduction to Master Data

Master Data represents the business objects that contain the most valuable, agreed-upon information shared across an organization. It provides the context for business transactions.

Types of Data

To understand Master Data, one must distinguish it from Transactional Data:

  1. Master Data (Nouns): Slow-moving, non-transactional data.
    • Examples: Customers, Products, Employees, Vendors, Locations, Assets.
    • Change Rate: Low (A customer's name changes rarely).
  2. Transactional Data (Verbs): Events that occur involving master data.
    • Examples: Sales orders, Invoices, Web logs, Payments.
    • Change Rate: High/Continuous.

Analogy: In a receipt reading "John Doe bought an iPhone on Tuesday," "John Doe" and "iPhone" are Master Data, while the specific purchase event (Time, Price, Date) is Transactional Data.

7. Need of Managing Master Data

Organizations often run multiple applications (ERP, CRM, E-commerce, Supply Chain), each creating its own version of data. This leads to:

  • Data Silos: The Sales department (CRM) knows the customer as "J. Smith," but the Billing department (ERP) knows them as "John Smith."
  • Inconsistent Data: Product pricing or descriptions may differ between the warehouse system and the online store.
  • Duplication: Marketing might send the same email three times to the same person because they exist three times in the database under slightly different names.
  • Inaccurate Reporting: Executives cannot get a "Single View" of total sales per customer if the customer ID varies across systems.
  • Compliance Risks: If a customer requests data deletion (GDPR), the company might fail to delete it from all systems if the data isn't linked.

8. Introduction to Master Data Management (MDM)

Master Data Management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise’s official shared master data assets.

The Core Concept: The "Golden Record"
MDM creates a "Golden Record" (or Single Version of the Truth - SVOT). This is the best, most consolidated, and trusted version of a data entity, created by merging data from various sources.

MDM Architecture Styles:

  • Registry Style: The MDM hub holds only the index/links to where data resides in source systems (lightweight).
  • Centralized/Hub Style: Data is moved from source systems into a central MDM hub where it is mastered (Informatica MDM typically supports this).
  • Coexistence Style: Master data is synchronized between source systems and the central hub.

9. Key Functions and Benefits of MDM

Key Functions

  1. Data Ingestion: Gathering data from heterogeneous sources (CRM, ERP, spreadsheets).
  2. Data Cleansing (Data Quality): Standardizing formats (e.g., converting "Calif." and "CA" to "California"), fixing typos, and validating addresses.
  3. Match and Merge:
    • Matching: Algorithms (fuzzy logic) identify that "Bill Gates at Microsoft" and "William Gates at MSFT" are the same entity.
    • Merging: Survivorship rules dictate which data points are kept (e.g., "Always trust the phone number from the CRM, but trust the billing address from the ERP").
  4. Relationship Management: Defining hierarchies (e.g., Corporate Parent vs. Subsidiary) or groupings (Householding: linking family members living at the same address).
  5. Data Stewardship: providing a user interface for human experts (Data Stewards) to manually resolve duplicate records that algorithms couldn't handle automatically.

Benefits of MDM

  1. 360-Degree View of Business: Provides a complete view of the Customer (Customer 360), Product, or Supplier. This enables better cross-selling and up-selling.
  2. Operational Efficiency: Eliminates manual reconciliation of data between departments. Reduces shipping errors caused by bad address data.
  3. Improved Analytics (BI): Business Intelligence tools produce accurate reports because they are fed by clean, de-duplicated data.
  4. Regulatory Compliance: Essential for banking (KYC - Know Your Customer) and privacy laws (GDPR/CCPA) by ensuring distinct individuals are correctly identified and their consent is managed centrally.
  5. Successful Mergers & Acquisitions: When two companies merge, MDM helps map and integrate their disparate customer and product databases quickly.