Unit 3 - Notes

INT364

Unit 3: Storage and Compute Services

1. Amazon S3 (Simple Storage Service)

Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. Unlike block storage (which manages data as blocks within sectors and tracks) or file storage (which manages data as files in a hierarchy), S3 manages data as objects.

Fundamentals of S3

  • Object-Based Storage: Data is stored as objects. Each object consists of:
    • Key: The unique identifier (name) of the object.
    • Value: The data itself (sequence of bytes).
    • Version ID: Identifies a specific version of the object when versioning is enabled.
    • Metadata: Data about data (e.g., content-type, custom tags).
  • Buckets:
    • Buckets are containers for objects.
    • Global Namespace: Bucket names must be globally unique across all AWS accounts (DNS-compliant).
    • Regional: While the namespace is global, the bucket resides in a specific AWS Region selected by the user.
  • Durability and Availability:
    • Designed for 99.999999999% (11 9s) of durability.
    • Data is redundantly stored across a minimum of three Availability Zones (AZs) in a Region (except for One Zone classes).
  • Consistency Model: S3 provides strong consistency for all GET, PUT, and LIST operations. A read after a write receives the latest version of the object.
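The object model above (key, value, metadata) and the strong read-after-write guarantee can be illustrated with a minimal in-memory sketch. This is plain Python with no AWS calls; the class and bucket names are illustrative only:

```python
class MiniObjectStore:
    """Toy model of S3's object abstraction: key -> (value, metadata)."""

    def __init__(self, bucket_name, region):
        self.bucket_name = bucket_name  # must be globally unique in real S3
        self.region = region            # the bucket's data lives in one Region
        self._objects = {}

    def put_object(self, key, value, metadata=None):
        # Strong consistency: once the write returns, every read sees it.
        self._objects[key] = {"value": value, "metadata": metadata or {}}

    def get_object(self, key):
        # A read after a write returns the latest version of the object.
        return self._objects[key]

store = MiniObjectStore("example-notes-bucket", "us-east-1")
store.put_object("notes/unit3.txt", b"storage and compute",
                 metadata={"content-type": "text/plain"})
obj = store.get_object("notes/unit3.txt")  # sees the write immediately
```

In real S3 the key is the object's full name within the bucket (including any "folder/" prefix), and metadata travels with the object rather than in a separate database.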

S3 Storage Classes

AWS offers various storage classes optimized for different access patterns and costs.

  1. S3 Standard:
    • General-purpose storage for frequently accessed data.
    • Low latency and high throughput.
    • Use cases: Cloud applications, dynamic websites, content distribution.
  2. S3 Intelligent-Tiering:
    • Automatically moves data between access tiers based on changing access patterns.
    • No retrieval fees.
    • Use cases: Data with unknown or changing access patterns.
  3. S3 Standard-Infrequent Access (S3 Standard-IA):
    • For data that is accessed less frequently but requires rapid access when needed.
    • Lower storage cost than Standard, but higher retrieval cost.
    • Use cases: Long-term storage, backups, disaster recovery.
  4. S3 One Zone-Infrequent Access (S3 One Zone-IA):
    • Stores data in a single AZ (lower availability/resilience).
    • 20% cheaper than Standard-IA.
    • Use cases: Secondary backup copies, easily recreatable data.
  5. Amazon S3 Glacier Types (Archival Storage):
    • Glacier Instant Retrieval: Millisecond access for rarely accessed data.
    • Glacier Flexible Retrieval: Retrieval times from minutes to hours; free bulk retrievals.
    • Glacier Deep Archive: Lowest cost; retrieval takes 12-48 hours.
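One way to internalize the trade-offs in the list above is a toy decision helper. The thresholds here are illustrative study-note values, not AWS guidance, and the sketch deliberately omits Intelligent-Tiering, One Zone-IA, and Glacier Instant Retrieval to keep the logic short:

```python
def suggest_storage_class(accesses_per_month, acceptable_wait_hours):
    """Illustrative mapping from access pattern to an S3 storage class."""
    if accesses_per_month >= 1:
        return "STANDARD"          # frequently accessed, low latency
    if acceptable_wait_hours < 1:
        return "STANDARD_IA"       # rare access, still needs rapid retrieval
    if acceptable_wait_hours < 12:
        return "GLACIER"           # Flexible Retrieval: minutes to hours
    return "DEEP_ARCHIVE"          # lowest cost; 12-48 hour retrieval

tier = suggest_storage_class(accesses_per_month=0, acceptable_wait_hours=48)
```

A real decision would also weigh retrieval fees and minimum storage durations, which is exactly why Intelligent-Tiering exists for unknown access patterns.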

S3 Versioning

Versioning is a means of keeping multiple variants of an object in the same bucket.

  • Functionality: Once enabled on a bucket, it cannot be disabled (only suspended).
  • Benefits:
    • Unintended Deletes: If an object is deleted, S3 inserts a "Delete Marker." The object becomes hidden but is not permanently removed.
    • Overwrites: If an object is overwritten, S3 stores the new version while retaining the old version ID.
  • Cost Implication: You pay for storage of every version of the object retained.
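The delete-marker and overwrite behaviour above can be sketched as a tiny in-memory model (illustrative Python, not an AWS API):

```python
class VersionedBucket:
    """Toy model of S3 versioning: each key maps to a stack of versions."""

    def __init__(self):
        self._versions = {}   # key -> list of versions (last = current)
        self._next_id = 0

    def put(self, key, value):
        self._next_id += 1
        # An overwrite appends a new version; the old one is retained
        # (and you pay to store every retained version).
        self._versions.setdefault(key, []).append(
            {"version_id": self._next_id, "value": value, "delete_marker": False})

    def delete(self, key):
        self._next_id += 1
        # A delete inserts a marker; older versions remain recoverable.
        self._versions[key].append(
            {"version_id": self._next_id, "value": None, "delete_marker": True})

    def get(self, key):
        current = self._versions[key][-1]
        if current["delete_marker"]:
            raise KeyError(key)   # object appears gone, but history survives
        return current["value"]

b = VersionedBucket()
b.put("report.csv", "v1")
b.put("report.csv", "v2")   # overwrite: old version retained
b.delete("report.csv")      # hidden behind a delete marker
```

Removing the delete marker (in real S3, deleting that specific version) would make the previous version current again, which is how versioning protects against unintended deletes.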

Lifecycle Policies

To manage costs, S3 Lifecycle policies automate the transition of objects between storage classes or deletion.

  • Transition Actions: Define when objects transition to another storage class (e.g., move to Standard-IA after 30 days, then Glacier after 90 days).
  • Expiration Actions: Define when objects expire and should be permanently deleted.
  • Versioning Integration: Policies can apply specifically to current versions or previous versions of objects.
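The example transitions above (Standard-IA at 30 days, Glacier at 90) can be written out as the rule document S3 expects. The rule ID and prefix below are placeholders; the dict follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`:

```python
# One lifecycle rule: tier down at 30 and 90 days, delete after a year.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",          # rule name (illustrative)
            "Filter": {"Prefix": "logs/"},     # apply only under this prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},       # permanent delete after a year
        }
    ]
}
```

On a versioned bucket, the analogous keys for previous versions are `NoncurrentVersionTransitions` and `NoncurrentVersionExpiration`, which is how a policy targets old versions without touching the current one.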

Data Transfer Mechanisms

  • S3 Transfer Acceleration: Uses Amazon CloudFront’s globally distributed Edge Locations. Data enters the AWS network at the nearest edge location and travels over the optimized AWS internal network to the S3 bucket.
  • AWS Snow Family: Physical devices (Snowcone, Snowball, Snowmobile) used to migrate petabytes or exabytes of data into S3 where internet bandwidth is insufficient.
  • Multipart Upload: Allows uploading a single object as a set of parts. Recommended for files >100MB; required for files >5GB.
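The multipart thresholds above can be turned into a small planning helper. The 100 MiB part size below is just the recommended cutoff reused as a part size; real uploads must also respect S3's limits of a 5 MiB minimum part size (except the last part) and at most 10,000 parts:

```python
import math

MAX_SINGLE_PUT = 5 * 1024**3          # 5 GiB: largest single PUT in S3
MULTIPART_THRESHOLD = 100 * 1024**2   # ~100 MiB: recommended multipart cutoff

def plan_upload(object_size, part_size=100 * 1024**2):
    """Return (use_multipart, number_of_parts) for a given object size."""
    if object_size <= MULTIPART_THRESHOLD:
        return False, 1                       # a single PUT is fine
    return True, math.ceil(object_size / part_size)

# A 5 GiB object split into 100 MiB parts needs 52 parts.
use_mp, parts = plan_upload(5 * 1024**3)
```

Parts can be uploaded in parallel and retried independently, which is why multipart upload also improves throughput and resilience, not just the size limit.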

2. Amazon EC2 (Elastic Compute Cloud)

Amazon EC2 provides scalable computing capacity in the AWS Cloud. It allows users to launch virtual servers, known as instances.

Launching EC2 Instances and Choosing AMIs

To launch an instance, you must define the Amazon Machine Image (AMI). An AMI provides the information required to launch an instance.

  • Components of an AMI:
    • A template for the root volume (OS, application server, applications).
    • Launch permissions (control which AWS accounts can use the AMI).
    • Block Device Mapping (specifies volumes to attach to the instance).
  • Sources of AMIs:
    • Quick Start: AWS-provided base images (Amazon Linux 2, Ubuntu, Windows Server).
    • My AMIs: Custom images created from previous instances.
    • AWS Marketplace: Paid/Free images provided by third-party vendors (e.g., firewall appliances, hardened OS).
    • Community AMIs: Images shared by the user community.

EC2 Instance Types

Instances are grouped into families based on target use cases. Naming convention example: m5.large (Family: m, Generation: 5, Size: large).

  1. General Purpose (T, M): Balanced compute, memory, and networking resources. Good for web servers and code repositories.
  2. Compute Optimized (C): High performance processors. Ideal for batch processing, media transcoding, and scientific modeling.
  3. Memory Optimized (R, X, Z): Fast performance for workloads that process large data sets in memory (RAM). Ideal for databases and in-memory caches.
  4. Accelerated Computing (P, G, F): Hardware accelerators (GPUs, FPGAs). Used for machine learning, graphics processing.
  5. Storage Optimized (I, D, H): High sequential read/write access to very large data sets on local storage. Ideal for NoSQL databases and data warehousing.
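The m5.large naming convention above can be decoded mechanically. This is an illustrative parser; the trailing letters after the generation (e.g. the "g" in m6g for Graviton) are simplified here into a single "attributes" field:

```python
import re

def parse_instance_type(name):
    """Split an EC2 instance type like 'm5.large' into its parts."""
    family_gen, size = name.split(".")
    match = re.match(r"([a-z]+)(\d+)([a-z]*)", family_gen)
    family, generation, attributes = match.groups()
    return {
        "family": family,              # e.g. 'm' = general purpose
        "generation": int(generation), # higher is newer hardware
        "attributes": attributes,      # e.g. 'g' for Graviton (AWS ARM CPUs)
        "size": size,                  # large, xlarge, 2xlarge, ...
    }

parsed = parse_instance_type("m5.large")
```

So c6g.xlarge reads as: compute-optimized family (c), sixth generation, Graviton processor (g), xlarge size.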

User Data and Configuration

  • User Data: A script passed to the instance that runs automatically only during the first boot cycle.
    • Used for Bootstrapping: Installing software, applying updates, and configuring the environment immediately after launch without manual intervention.
    • Formats: Shell scripts (Linux) or PowerShell scripts (Windows).
  • Key Pairs: AWS uses public-key cryptography to secure login information. You store the Private Key (.pem or .ppk) locally; AWS stores the Public Key on the instance.
  • Security Groups: A virtual stateful firewall that controls inbound and outbound traffic.
    • Stateful: If an inbound request is allowed, the outbound response is automatically allowed (and vice versa).
  • IAM Roles: An identity with specific permissions attached to the EC2 instance, allowing the instance to access other AWS services (like S3 or DynamoDB) without storing long-term credentials on the server.
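The "stateful" behaviour of Security Groups described above can be modelled as simple connection tracking. This is a toy sketch (documentation IP addresses, no real networking) showing why a reply needs no matching outbound rule:

```python
class StatefulFirewall:
    """Toy model of a security group's stateful behaviour."""

    def __init__(self, inbound_rules):
        self.inbound_rules = inbound_rules   # set of allowed (source_ip, port)
        self.connections = set()             # tracked flows already admitted

    def inbound(self, source_ip, port):
        if (source_ip, port) in self.inbound_rules:
            self.connections.add((source_ip, port))  # remember the flow
            return True
        return False                                 # no rule: dropped

    def outbound_reply(self, dest_ip, port):
        # Replies to tracked connections are allowed automatically;
        # no outbound rule has to match them.
        return (dest_ip, port) in self.connections

sg = StatefulFirewall(inbound_rules={("203.0.113.10", 443)})
sg.inbound("203.0.113.10", 443)                     # allowed by rule
reply_ok = sg.outbound_reply("203.0.113.10", 443)   # allowed: stateful
```

Network ACLs, by contrast, are stateless: return traffic must be explicitly allowed by a rule in the other direction.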

3. Storage Options for EC2

When running an EC2 instance, you need storage for the OS and data. The two primary persistent storage options are EBS and EFS.

Amazon EBS (Elastic Block Store)

EBS provides block-level storage volumes for use with EC2 instances. It behaves like a physical hard drive attached to a server.

  • Characteristics:
    • Availability Zone Locked: An EBS volume is created in a specific AZ and can only be attached to instances in that same AZ.
    • Persistence: Data persists independently of the instance's life (unless "Delete on Termination" is selected).
    • Snapshots: Point-in-time backups of EBS volumes stored in S3. Snapshots are incremental (only changed blocks are saved).
  • Volume Types:
    • General Purpose SSD (gp2/gp3): Balances price and performance. Used for boot volumes and dev/test.
    • Provisioned IOPS SSD (io1/io2): High performance for mission-critical, I/O intensive transactional workloads (databases).
    • Throughput Optimized HDD (st1): Low cost, throughput-intensive workloads (Big Data, log processing).
    • Cold HDD (sc1): Lowest cost for infrequently accessed data.
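The incremental-snapshot idea above (only changed blocks are saved) can be sketched as a block diff. This is an illustrative model, not the EBS mechanism; real snapshots also chain, so a restore merges all snapshots back to the first full one:

```python
def snapshot(volume_blocks, previous_snapshot=None):
    """Toy incremental snapshot: store only blocks that changed."""
    previous = previous_snapshot or {}
    return {idx: data for idx, data in volume_blocks.items()
            if previous.get(idx) != data}

volume = {0: b"boot", 1: b"data-a", 2: b"data-b"}   # block index -> contents
snap1 = snapshot(volume)                            # first snapshot: all blocks
volume[1] = b"data-a2"                              # one block changes
snap2 = snapshot(volume, previous_snapshot=snap1)   # only block 1 is saved
```

Because each later snapshot stores only deltas, deleting an old snapshot is safe in EBS: blocks still referenced by newer snapshots are retained automatically.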

Amazon EFS (Elastic File System)

EFS provides a simple, scalable, fully managed elastic NFS file system.

  • Characteristics:
    • Shared Storage: Can be mounted by hundreds of EC2 instances simultaneously across multiple Availability Zones.
    • Protocol: Supports Network File System version 4 (NFSv4).
    • Elasticity: Automatically grows and shrinks as files are added or removed; no pre-provisioning required.
    • OS Support: Linux only (Not supported on Windows).
  • Storage Classes:
    • Standard: Multi-AZ redundancy.
    • One Zone: Single AZ redundancy (lower cost).
    • Infrequent Access (IA): Cheaper storage for files not accessed daily (Lifecycle management moves files automatically).

Comparison: EBS vs. EFS

Feature       | EBS                                                              | EFS
Type          | Block Storage                                                    | File Storage
Scope         | Single AZ                                                        | Regional (Multi-AZ access)
Attachability | Usually one instance at a time (Multi-Attach exists for io1/io2) | Thousands of instances concurrently
Scaling       | Manual (must resize volume)                                      | Automatic
Performance   | Lowest latency                                                   | Higher latency than EBS

4. Applying the Well-Architected Framework to Storage and Compute

The AWS Well-Architected Framework describes key concepts for designing and running workloads in the cloud.

1. Operational Excellence

  • Compute: Use Infrastructure as Code (CloudFormation) to deploy EC2 instances, ensuring consistency. Automate patching using AWS Systems Manager.
  • Storage: Automate S3 bucket creation and policy application. Use S3 Access Logging to monitor access patterns.

2. Security

  • Compute:
    • Apply the principle of least privilege using IAM Roles for EC2.
    • Harden operating systems (use custom secured AMIs).
    • Minimize attack surface using tight Security Groups (allow specific IPs/ports).
  • Storage:
    • Enable Encryption at Rest (KMS) for EBS volumes and S3 buckets.
    • Block Public Access on S3 buckets by default.
    • Use S3 Object Lock for WORM (Write Once Read Many) compliance.

3. Reliability

  • Compute:
    • Deploy instances across multiple Availability Zones using Auto Scaling Groups.
    • Implement Health Checks to automatically replace failed instances.
  • Storage:
    • Use S3 Versioning to recover from accidental deletions.
    • Enable Cross-Region Replication (CRR) for S3 for disaster recovery.
    • Use EFS for workloads requiring shared state across AZs to survive a single AZ failure.

4. Performance Efficiency

  • Compute:
    • Right-sizing: Select the correct Instance Type (e.g., Compute Optimized for transcoding) based on workload metrics.
    • Use Auto Scaling to match capacity to demand dynamically.
  • Storage:
    • Choose the correct EBS volume type (e.g., Provisioned IOPS for Databases).
    • Use S3 Transfer Acceleration or CloudFront for global content delivery.

5. Cost Optimization

  • Compute:
    • Use Spot Instances for fault-tolerant, flexible workloads (up to 90% discount).
    • Use Savings Plans or Reserved Instances for steady-state workloads.
    • Stop instances when not in use (e.g., dev environments after hours).
  • Storage:
    • Implement S3 Lifecycle Policies to move old data to Glacier or Deep Archive.
    • Delete unattached EBS volumes and old EBS snapshots.

6. Sustainability

  • Compute: Maximize utilization (do not leave idle resources running). Use ARM-based Graviton processors for better performance-per-watt.
  • Storage: Use compression technologies to store less data. Use archival storage (Glacier Deep Archive), which consumes less energy than keeping data on constantly spinning disks.