Unit 4 - Notes

INT364

Unit 4: Databases and Data Management

1. Database Layer Considerations and Use Cases

The database layer is critical for state management in cloud applications. Choosing the right database technology depends on data structure, access patterns, scalability requirements, and consistency needs.

1.1 Relational vs. Non-Relational Databases

Relational Databases (SQL)

  • Structure: Data is stored in tables with rows and columns. Adheres to a rigid schema.
  • Transactions: Follows ACID properties (Atomicity, Consistency, Isolation, Durability).
  • Scaling: Typically vertical scaling (adding more CPU/RAM to a single instance). Read scaling is possible via replicas.
  • Query Language: Structured Query Language (SQL).
  • Use Cases: ERP systems, CRM, Financial ledgers, E-commerce inventory (where referential integrity is crucial).
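
The atomicity and consistency guarantees listed above can be tried locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for a managed SQL engine; the accounts table, the CHECK constraint, and the transfer helper are illustrative assumptions, not part of any AWS service.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # This statement violates the CHECK constraint on an overdraft,
            # which also undoes the credit that was applied just above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

A failed transfer leaves both rows exactly as they were, which is the behaviour a financial ledger relies on.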

Non-Relational Databases (NoSQL)

  • Structure: Flexible schemas (Key-Value, Document, Graph, Columnar).
  • Transactions: Often follows the BASE model (Basically Available, Soft state, Eventual consistency), although some NoSQL services (DynamoDB, for example) also offer ACID transactions.
  • Scaling: Designed for horizontal scaling (sharding data across multiple servers).
  • Query Language: API-based or specialized query languages (e.g., PartiQL).
  • Use Cases: Real-time bidding, social media feeds, content management, IoT sensor data, gaming leaderboards.
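
Horizontal scaling by sharding can be illustrated with a toy key-value store. This is a minimal sketch, assuming simple hash-modulo placement; the ShardedStore class and shard_for function are hypothetical names, and real services use more elaborate partitioning schemes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys across in-memory 'servers'."""

    def __init__(self, num_shards: int):
        # Each dict stands in for one server holding a slice of the data.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)
```

Note that plain modulo placement moves most keys when the shard count changes, which is one reason production systems tend to use consistent hashing or managed partitioning instead.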

1.2 Key Considerations for Selection

  1. Data Shape: Is the data structured (SQL) or semi-structured/unstructured (NoSQL)?
  2. Query Patterns: Do you need complex joins (SQL) or simple Get/Put by key (NoSQL)?
  3. Scale: Do you anticipate terabytes or petabytes of data requiring distributed processing?
  4. Consistency: Do you need strong consistency immediately (SQL) or is eventual consistency acceptable for higher performance (NoSQL)?

2. Amazon RDS Features and Connection Management

Amazon Relational Database Service (Amazon RDS) is a managed web service that makes it easy to set up, operate, and scale a relational database in the AWS Cloud.

2.1 Amazon RDS Key Features

  • Managed Service: AWS handles patching, backups, failure detection, and recovery.
  • Supported Engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, SQL Server, and IBM Db2.
  • Scalability:
    • Storage Auto Scaling: Automatically increases storage capacity when free space is low.
    • Instance Scaling: Change instance types (vertical scaling) with minimal downtime.
  • High Availability: Multi-AZ deployments (synchronous replication).
  • Security: Encryption at rest (KMS) and in transit (SSL/TLS). IAM integration for authentication (MySQL/PostgreSQL).
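
Several of these features correspond to parameters of the RDS CreateDBInstance API. The sketch below only builds the parameter dictionary, so it runs without AWS credentials; the identifier, instance class, storage sizes, and username are placeholder assumptions, and build_rds_instance_params is a hypothetical helper.

```python
def build_rds_instance_params(
    identifier,
    engine="postgres",
    instance_class="db.t4g.micro",  # assumed small instance class for a demo
    storage_gib=20,
    max_storage_gib=100,
    master_username="appadmin",  # placeholder name
):
    """Return keyword arguments for the RDS create_db_instance call."""
    if max_storage_gib <= storage_gib:
        raise ValueError("max_storage_gib must exceed storage_gib for autoscaling")
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,
        "DBInstanceClass": instance_class,
        "AllocatedStorage": storage_gib,
        "MaxAllocatedStorage": max_storage_gib,  # upper bound for storage autoscaling
        "MultiAZ": True,  # synchronous standby in another AZ
        "StorageEncrypted": True,  # encryption at rest via KMS
        "EnableIAMDatabaseAuthentication": True,
        "BackupRetentionPeriod": 7,  # days of automated backups
        "MasterUsername": master_username,
        "ManageMasterUserPassword": True,  # let RDS keep the password in Secrets Manager
    }


# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   boto3.client("rds").create_db_instance(**build_rds_instance_params("notes-demo-db"))
```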

2.2 Amazon Aurora

A proprietary database engine compatible with MySQL and PostgreSQL built for the cloud.

  • Storage: Distributed across 3 Availability Zones (AZs) with 6 copies of data.
  • Performance: AWS states up to 5x the throughput of standard MySQL and up to 3x the throughput of standard PostgreSQL; actual gains depend on the workload.
  • Aurora Serverless: Automatically starts up, shuts down, and scales capacity based on application demand.

2.3 Connection Management and RDS Proxy

Opening and closing database connections consumes significant memory and CPU. In serverless architectures (e.g., Lambda), thousands of concurrent functions can overwhelm an RDS instance with connections.

Amazon RDS Proxy:

  • A fully managed, highly available database proxy.
  • Connection Pooling: Pools and shares established database connections to improve efficiency.
  • Failover: Reduces failover times for Aurora and RDS databases by bypassing DNS cache propagation delays.
  • Security: Can enforce IAM authentication and retrieves database credentials from AWS Secrets Manager, eliminating the need to hardcode database credentials in application code.
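
The idea behind connection pooling can be shown with a small in-process pool. This is only a conceptual sketch: RDS Proxy pools connections outside the application, and the ConnectionPool class here (with sqlite3 standing in for a real database driver) is a hypothetical illustration.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections once and lend them out on demand."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        self.created = 0
        for _ in range(size):
            self._idle.put(factory())  # pay the connection cost up front
            self.created += 1

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks (up to timeout seconds) when every connection is in use,
        # instead of opening yet another one against the database.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse


def open_connection():
    """Stand-in for an expensive network connection to a database."""
    return sqlite3.connect(":memory:")
```

However many requests pass through the pool, the database only ever sees `size` connections, which is the effect a proxy gives to a fleet of Lambda functions.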

3. Automated Backups and Read Replicas in RDS

3.1 Automated Backups

RDS creates a storage volume snapshot of the database instance, backing up the entire DB instance.

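  • Retention Period: Configurable from 0 to 35 days; setting it to 0 disables automated backups. The default is 7 days when the instance is created in the console and 1 day when it is created through the API or CLI.
  • Transaction Logs: RDS uploads transaction logs to S3 every 5 minutes.
  • Point-in-Time Recovery (PITR): You can restore to any second within the retention period, up to the latest restorable time (typically within the last 5 minutes). A restore always creates a new DB instance rather than overwriting the existing one.
  • Manual Snapshots: User-initiated backups that are retained until explicitly deleted (unlike automated backups, which by default are deleted when the DB instance is deleted).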
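
A point-in-time restore request can be sketched as a check against the restore window, followed by the parameters for the RestoreDBInstanceToPointInTime API. The helper name and instance identifiers are hypothetical, and the window boundaries are assumed to be supplied by the caller.

```python
from datetime import datetime, timezone


def build_pitr_request(source_id, target_id, restore_time, earliest, latest):
    """Validate a restore time and return kwargs for a point-in-time restore.

    earliest and latest are the bounds of the restore window; in practice
    they would be read from RDS rather than passed in by hand.
    """
    if not (earliest <= restore_time <= latest):
        raise ValueError(
            f"{restore_time.isoformat()} is outside the restore window "
            f"[{earliest.isoformat()}, {latest.isoformat()}]"
        )
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # the restore creates a NEW instance
        "RestoreTime": restore_time,
    }


# With boto3 and credentials available, the call would look like:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**request)
```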

3.2 Read Replicas

Read Replicas are designed for performance and scalability, not primarily for disaster recovery.

  • Mechanism: Asynchronous replication from the primary instance.
  • Use Case: Offload read-heavy traffic (analytics, reporting) from the primary database to improve application performance.
  • Promotion: A read replica can be promoted to a standalone primary database.
  • Cross-Region: Replicas can be created in different AWS Regions for disaster recovery or to lower latency for global users.
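
Offloading reads usually means the application (or a driver layer) decides which endpoint each statement goes to. The sketch below is one simple way to do that, assuming round-robin across replicas; the ReadWriteRouter class and the endpoint names in the usage note are hypothetical.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # cycle() yields the replicas in turn, forever (round-robin).
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql, require_fresh=False):
        is_read = sql.lstrip().lower().startswith("select")
        # Replication is asynchronous, so a replica may lag behind the primary.
        # Reads that must see the latest write are sent to the primary instead.
        if is_read and self._replicas is not None and not require_fresh:
            return next(self._replicas)
        return self.primary
```

For example, `ReadWriteRouter("primary.example.internal", ["replica-1.example.internal"])` would route a reporting SELECT to the replica and an INSERT to the primary.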

3.3 Multi-AZ Deployment (vs. Read Replicas)

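  • Purpose: High Availability and Disaster Recovery within a Region (protects against instance or AZ failure, not the loss of a whole Region).
  • Mechanism: Synchronous replication to a standby instance in a different AZ.
  • Failover: Automatic DNS failover occurs if the primary fails.
  • Usage: In a Multi-AZ DB instance deployment, the standby is not accessible for reads or writes; it is purely a passive failover target. (Multi-AZ DB cluster deployments, by contrast, provide readable standbys.)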
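
During a failover, existing connections drop and the endpoint briefly points nowhere useful, so clients normally retry. The following is a minimal retry-with-backoff sketch; call_with_retry is a hypothetical helper, and the choice of ConnectionError as the retryable error is an assumption that depends on the database driver in use.

```python
import time


def call_with_retry(
    operation,
    attempts=5,
    base_delay=0.5,
    sleep=time.sleep,
    retry_on=(ConnectionError,),
):
    """Run operation(), retrying with exponential backoff on connection errors.

    operation should open a fresh connection on each call so that the
    endpoint name is resolved again after the failover completes.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```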

4. Amazon DynamoDB and Purpose-Built Databases

4.1 Amazon DynamoDB

A fast, flexible NoSQL database service for single-digit millisecond performance at any scale.

Core Concepts

  • Tables: The collection of data.
  • Items: Individual records (rows) in the table. Max 400KB per item.
  • Attributes: Data elements inside an item (columns).
  • Primary Key:
    • Partition Key (PK): Hash value determines physical storage location.
    • Composite Key (PK + Sort Key): Allows sorting of items with the same partition key.
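
The effect of a composite key can be modelled in memory: items are grouped by partition key and kept ordered by sort key, so a range query within one partition is cheap. This is a conceptual sketch only; CompositeKeyTable is a hypothetical class and does not reflect how DynamoDB stores data internally.

```python
import bisect
from collections import defaultdict


class CompositeKeyTable:
    """Toy table keyed by (partition key, sort key)."""

    def __init__(self):
        # partition key -> list of (sort key, item), kept sorted by sort key
        self._partitions = defaultdict(list)

    def put_item(self, pk, sk, item):
        rows = self._partitions[pk]
        index = bisect.bisect_left([key for key, _ in rows], sk)
        if index < len(rows) and rows[index][0] == sk:
            rows[index] = (sk, item)  # same primary key: overwrite the item
        else:
            rows.insert(index, (sk, item))

    def query(self, pk, sk_from=None, sk_to=None):
        """Return items in one partition, ordered by sort key, within a range."""
        rows = self._partitions.get(pk, [])
        return [
            item
            for sk, item in rows
            if (sk_from is None or sk >= sk_from) and (sk_to is None or sk <= sk_to)
        ]
```

With a user ID as the partition key and an order date as the sort key, "all orders for this user since February" becomes a single query on one partition.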

Key Features

  • Capacity Modes:
    • On-Demand: Pay-per-request. Best for unpredictable workloads.
    • Provisioned: Specify Read/Write Capacity Units (RCU/WCU). Cheaper for predictable loads. Auto-scaling supported.
  • Global Tables: Multi-Region, fully replicated tables for global applications.
  • DAX (DynamoDB Accelerator): In-memory cache for DynamoDB reducing response times to microseconds.
  • Streams: Captures time-ordered sequence of item-level modifications. Useful for event-driven architectures (triggering Lambda).
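
For provisioned mode, capacity can be estimated with simple arithmetic: one RCU covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half), and one WCU covers one write per second of an item up to 1 KB. The sketch below applies those rules; the function names are hypothetical, and transactional requests, which cost more, are left out.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate the RCUs needed for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / 4096)  # item size rounds up to 4 KB blocks
    total = units_per_read * reads_per_second
    # Eventually consistent reads consume half the capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate the WCUs needed for a steady write rate."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second  # rounds up to 1 KB blocks
```

For example, reading a 6 KB item 10 times per second needs 20 RCUs with strong consistency and 10 RCUs with eventual consistency.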

4.2 Other Purpose-Built Databases

AWS advocates for "The Right Tool for the Right Job."

Database Service   | Type                          | Use Case
-------------------+-------------------------------+---------------------------------------------------------------
Amazon ElastiCache | In-memory (Redis/Memcached)   | Caching, session management, leaderboards, real-time analytics.
Amazon Redshift    | Data Warehouse (OLAP)         | Complex analytics on petabytes of structured data. Columnar storage.
Amazon Neptune     | Graph                         | Social networks, recommendation engines, fraud detection (relationships between data).
Amazon DocumentDB  | Document (MongoDB compatible) | Content management, catalogs, JSON data storage.
Amazon Timestream  | Time Series                   | IoT sensor data, DevOps logs, application telemetry.
Amazon QLDB        | Ledger                        | Supply chain, banking transactions requiring immutable, cryptographically verifiable history.
Amazon Keyspaces   | Wide Column (Cassandra)       | High-scale industrial apps using Cassandra workloads.

5. Migrating Databases to AWS

Database migration involves moving data from on-premises or other clouds to AWS.

5.1 Migration Strategies

  1. Rehost (Lift and Shift): Moving the database "as-is" to EC2 (e.g., installing Oracle on EC2).
  2. Replatform (Lift and Reshape): Moving to a managed service like RDS (e.g., Oracle on-prem to RDS for Oracle).
  3. Refactor (Re-architect): Changing the engine entirely (e.g., Oracle to Aurora PostgreSQL or DynamoDB).

5.2 AWS Database Migration Service (DMS)

A service to migrate databases easily and securely. The source database remains fully operational during the migration.

  • Homogeneous Migrations: Same engine (e.g., Oracle to Oracle). Simple schema copy.
  • Heterogeneous Migrations: Different engines (e.g., SQL Server to Aurora). Requires schema conversion.
  • Continuous Replication: DMS can perform one-time loads or continuous replication (CDC - Change Data Capture) for near-zero downtime migrations.
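
The full-load-plus-CDC approach can be pictured as "copy a snapshot, then replay the changes that happened since". The sketch below models that with plain dictionaries; the change-log format (sequence number, operation, key, row) is an assumption made for illustration and is not the DMS wire format.

```python
def full_load(source):
    """One-time copy of every row from the source into a new target."""
    return {key: dict(row) for key, row in source.items()}


def apply_changes(target, change_log, start_after=0):
    """Replay captured changes onto the target, in sequence order.

    Returns the last sequence number applied so a later run can resume
    from that point without re-applying older changes.
    """
    last_applied = start_after
    for seq, op, key, row in change_log:
        if seq <= start_after:
            continue  # already applied in an earlier run
        if op == "delete":
            target.pop(key, None)
        else:  # "insert" or "update"
            target[key] = dict(row)
        last_applied = seq
    return last_applied
```

Because the source keeps taking writes while changes are replayed, the cutover window shrinks to the time needed to drain the last few changes.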

5.3 AWS Schema Conversion Tool (SCT)

Used specifically for heterogeneous migrations.

  • Function: Automatically converts the source database schema (views, stored procedures, functions) to a format compatible with the target AWS database.
  • Assessment Report: Generates a report highlighting items that cannot be converted automatically and require manual code changes.

6. Applying Well-Architected Principles to the Database Layer

The AWS Well-Architected Framework provides design guidance across six pillars to help keep cloud workloads secure, high-performing, resilient, and efficient.

6.1 Operational Excellence

  • Monitoring: Use Amazon CloudWatch to monitor CPU, memory, connections, and disk IOPS.
  • Infrastructure as Code: Deploy databases using CloudFormation or Terraform to ensure consistent configurations.
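
Monitoring typically ends in an alarm. The sketch below builds the parameters for a CloudWatch PutMetricAlarm call on the RDS CPUUtilization metric; the 80% threshold, the 3 x 5-minute evaluation window, the alarm name, and the build_cpu_alarm helper are illustrative assumptions.

```python
def build_cpu_alarm(db_instance_id, threshold_percent=80.0, sns_topic_arn=None):
    """Return kwargs for a CloudWatch alarm on sustained high RDS CPU."""
    params = {
        "AlarmName": f"{db_instance_id}-high-cpu",  # naming scheme is an assumption
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds per datapoint
        "EvaluationPeriods": 3,  # alarm after 3 consecutive breaching periods
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]  # notify this topic when in ALARM
    return params


# With boto3 and credentials available, the call would look like:
#   boto3.client("cloudwatch").put_metric_alarm(**build_cpu_alarm("notes-demo-db"))
```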

6.2 Security

  • Encryption: Enable encryption at rest (AWS KMS) and in transit (TLS/SSL).
  • Access Control: Use IAM policies for authentication (where supported) and Security Groups to restrict network access (e.g., allow port 3306 only from the App Server Security Group).
  • Secrets Management: Use AWS Secrets Manager to rotate database credentials automatically.
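
Reading credentials at runtime instead of hardcoding them might look like the sketch below. It assumes the secret is stored as a JSON string with username, password, host, and port fields; the secret name is hypothetical, and StubSecretsClient exists only so the example runs without AWS access.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch and parse database credentials from a Secrets Manager client."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])  # assumed to be a JSON document
    return {
        "user": secret["username"],
        "password": secret["password"],
        "host": secret.get("host"),
        "port": secret.get("port"),
    }


class StubSecretsClient:
    """Offline stand-in for boto3.client("secretsmanager")."""

    def __init__(self, secrets):
        self._secrets = secrets

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps(self._secrets[SecretId])}
```

In a real application the stub would be replaced by `boto3.client("secretsmanager")`; because the secret is fetched on each start (or cached briefly), automatic rotation does not require a code change.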

6.3 Reliability

  • High Availability: Use Multi-AZ deployments for RDS to handle infrastructure failures.
  • Backups: Enable automated backups and test restoration procedures regularly.
  • Self-Healing: Use services like DynamoDB or Aurora Serverless that handle maintenance and recovery automatically.

6.4 Performance Efficiency

  • Read Scaling: Use Read Replicas for RDS to offload read traffic from the primary, and Global Tables for DynamoDB to serve reads from a Region close to the user.
  • Caching: Implement ElastiCache or DynamoDB DAX to serve frequently accessed data from memory.
  • Selection: Choose the right engine (e.g., don't use a relational DB for time-series data; use Timestream).
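
The caching bullet above is commonly implemented with the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below keeps the cache in a local dictionary; CacheAside is a hypothetical class, and in practice the dictionary would be replaced by a client for a cache such as ElastiCache.

```python
import time


class CacheAside:
    """Serve repeated reads from memory, loading from the database on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader  # function that reads from the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: read from the database and cache the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example right after writing to the database."""
        self._entries.pop(key, None)
```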

6.5 Cost Optimization

  • Right-Sizing: Monitor usage and downgrade instance types if over-provisioned.
  • Purchasing Options: Use Reserved Instances (RIs) for steady-state RDS workloads (1 or 3-year commitment).
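  • Stop/Start: Stop RDS instances used for development/testing during non-business hours. A stopped instance is restarted automatically after 7 days, and storage is still billed while it is stopped.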
  • Auto-Scaling: Use DynamoDB Auto Scaling or Aurora Serverless to match cost to actual usage.

6.6 Sustainability

  • Managed Services: Managed services (RDS, DynamoDB) optimize hardware utilization better than self-managed EC2 databases.
  • Hardware: Utilize ARM-based Graviton2/3 instances for RDS, which offer better performance-per-watt.