Unit 6 - Notes

INT312 8 min read

Unit 6: Introduction to Apache Cassandra

1. Installation of Apache Cassandra

Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers.

Prerequisites

  • Java Runtime Environment (JRE/JDK): Cassandra is written in Java. Java 8 or Java 11 is strictly required for Cassandra 3.x and 4.x.
  • Python: Required to run cqlsh, the Cassandra Query Language shell (Python 3.6+ is recommended).

General Installation Steps (Linux/Unix-based)

  1. Download Cassandra: Download the latest tarball from the official Apache Cassandra website.
    BASH
        wget https://downloads.apache.org/cassandra/4.0.4/apache-cassandra-4.0.4-bin.tar.gz
        
  2. Extract the Archive:
    BASH
        tar -xzvf apache-cassandra-4.0.4-bin.tar.gz
        cd apache-cassandra-4.0.4
        
  3. Configuration (Optional but recommended):
    • Modify conf/cassandra.yaml to set cluster names, node IP addresses (listen_address), and directory paths for data, commit logs, and saved caches.
  4. Start Cassandra:
    BASH
        bin/cassandra -f # The -f flag runs it in the foreground
        
  5. Verify Installation: Run the node tool status command to check if the node is up.
    BASH
        bin/nodetool status
        
  6. Access CQL Shell:
    BASH
        bin/cqlsh
        

2. Cassandra Architecture

Cassandra’s architecture is designed specifically to prevent single points of failure and to handle massive, continuous data streams.

Peer-to-Peer Architecture

  • Ring Topology: Cassandra operates on a distributed, peer-to-peer (P2P) ring architecture.
  • Masterless: Unlike Hadoop HDFS or MongoDB (which use master-slave architectures), every node in a Cassandra cluster is identical. There is no master node.
  • Any Node can handle requests: A client can connect to any node (acting as the "coordinator" node) to read or write data. If the coordinator does not hold the requested data, it routes the request to the nodes that do.

Gossip Protocol

  • Concept: A peer-to-peer communication protocol used by nodes to exchange state and location information about themselves and other nodes.
  • Mechanism: Every second, each node initiates a Gossip exchange with one to three other random nodes. They exchange versioned information.
  • Fault Detection: Gossip is the underlying mechanism for fault detection. If a node stops gossiping, other nodes will eventually mark it as "Down" (dead), routing traffic away from it.

Replication and Consistency Levels

Replication

Cassandra replicates data across multiple nodes to ensure fault tolerance.

  • Replication Strategy: Defined at the Keyspace level.
    • SimpleStrategy: Used for single datacenter deployments.
    • NetworkTopologyStrategy: Used for multiple datacenters. Allows specifying how many replicas should be in each datacenter.
  • Replication Factor (RF): The total number of copies of the data across the cluster. An RF of 3 means three nodes will hold a copy of a specific row.

Consistency Levels (Tunable Consistency)

Cassandra offers tunable consistency, meaning you can balance strong consistency and high availability per operation.

  • Consistency Level (CL): Defines how many replica nodes must acknowledge a read or write operation before it is considered successful.
  • Common Write/Read CLs:
    • ONE: Only one replica needs to respond. (High availability, low consistency).
    • QUORUM: A majority of replicas must respond (RF / 2) + 1. (Strong consistency balance).
    • ALL: All replicas must respond. (Highest consistency, lowest availability—if one node is down, the operation fails).
  • Note: Strong consistency is achieved when Write CL + Read CL > Replication Factor.

Read and Write Paths

Write Path

Cassandra’s write path is highly optimized for write-heavy workloads.

  1. CommitLog: Data is first written to the CommitLog on disk. This ensures durability (crash recovery).
  2. Memtable: Simultaneously, data is written to an in-memory data structure called the Memtable. Once written to both, the write is acknowledged as successful.
  3. Flush to SSTable: When the Memtable fills up, its contents are flushed to disk into an immutable file called an SSTable (Sorted String Table).

Read Path

Reads in Cassandra are more complex because data for a single row might be fragmented across multiple SSTables and the Memtable.

  1. Memtable Check: Cassandra checks the active Memtable.
  2. Row Cache: Checks the row cache (if enabled).
  3. Bloom Filter: An in-memory, probabilistic data structure. It quickly tells Cassandra if an SSTable might contain the requested data, or definitely does not. This saves unnecessary disk I/O.
  4. Key Cache: An in-memory cache holding partition key locations.
  5. Partition Summary & Index: If not in the key cache, Cassandra reads the Partition Summary (in memory) to find the location in the Partition Index (on disk), which points to the exact byte offset in the SSTable.
  6. SSTable: Data is fetched from the SSTable, merged with Memtable data, and returned.

3. Cassandra Data Model

Cassandra is a partitioned row store. Data modeling in Cassandra is query-driven (you must model your data based on the queries you intend to run).

Keyspace

  • The outermost container for data in Cassandra.
  • Analogous to a Database in an RDBMS.
  • This is where replication strategy and replication factor are defined.

Table (formerly Column Family)

  • A container for rows, similar to a table in an RDBMS.
  • Enforces a schema using CQL, defining the primary key and data types.

Primary Key

The Primary Key in Cassandra uniquely identifies a row, but it also dictates exactly how and where the data is stored on the cluster. It has two components:
PRIMARY KEY (Partition Key, Clustering Columns)

Partition Key

  • The first part of the Primary Key.
  • Cassandra hashes the Partition Key to determine exactly which node in the cluster will store the row.
  • Ensures that data is evenly distributed across the cluster.

Clustering Columns

  • The second part of the Primary Key (optional).
  • Determines the on-disk sorting order of data within a single partition.
  • Crucial for range queries (e.g., finding events between two dates).

Example:
PRIMARY KEY ((user_id), created_at)

  • user_id is the Partition Key (all data for a user is kept together on one node).
  • created_at is the Clustering Column (data within that user's partition is sorted by time).

Wide Rows

  • A "wide row" occurs when a single partition key contains a massive number of clustering columns (and therefore, rows).
  • Cassandra is designed to handle wide rows (up to 2 billion cells per partition), making it excellent for time-series data where a device ID is the partition key and timestamps are clustering columns.

4. CQL (Cassandra Query Language)

CQL is the primary language for communicating with Cassandra. It looks similar to SQL but lacks complex relational features like JOINs or subqueries, due to the distributed nature of the data.

Data Types

  • Numeric: int, bigint, float, double, decimal, varint.
  • Text/Strings: text, varchar, ascii.
  • Time: timestamp, date, time, timeuuid.
  • Identifiers: uuid.
  • Collections: list, set, map (useful for denormalizing data).

Creating Keyspaces and Tables

Creating a Keyspace:

SQL
CREATE KEYSPACE IF NOT EXISTS store
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

Creating a Table:

SQL
USE store;

CREATE TABLE IF NOT EXISTS user_activity (
    user_id uuid,
    activity_date date,
    activity_time timestamp,
    action text,
    PRIMARY KEY ((user_id, activity_date), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);

(Here, (user_id, activity_date) is a composite partition key, and activity_time is the clustering column sorted descending).

CRUD Operations

Insert:

SQL
INSERT INTO user_activity (user_id, activity_date, activity_time, action) 
VALUES (123e4567-e89b-12d3-a456-426614174000, '2023-10-25', '2023-10-25 10:00:00', 'login');

Update:
In Cassandra, UPDATE is essentially an "Upsert". If the primary key exists, it updates; if not, it inserts.

SQL
UPDATE user_activity 
SET action = 'logout' 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000 
  AND activity_date = '2023-10-25' 
  AND activity_time = '2023-10-25 10:00:00';

Delete:
Deletions in Cassandra do not immediately remove data. Instead, they insert a marker called a Tombstone.

SQL
DELETE FROM user_activity 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000 
  AND activity_date = '2023-10-25';

Select:

SQL
SELECT * FROM user_activity 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000 
  AND activity_date = '2023-10-25';

Filtering and Indexing

Filtering (ALLOW FILTERING):
If you try to query a column that is not part of the Primary Key, Cassandra will reject it because it requires scanning all partitions across the cluster. You can bypass this using ALLOW FILTERING, but it is highly discouraged in production due to massive performance hits.

SQL
SELECT * FROM user_activity WHERE action = 'login' ALLOW FILTERING; 
-- WARNING: Scans the entire cluster!

Secondary Indexing:
Secondary indexes allow querying on non-primary key columns without ALLOW FILTERING. However, they are still distributed across the cluster and can be slow for high-cardinality data.

SQL
CREATE INDEX action_idx ON user_activity (action);
SELECT * FROM user_activity WHERE action = 'login'; -- Now works without ALLOW FILTERING


5. Cassandra Administration - Compaction

Because Cassandra SSTables are immutable (read-only), updates and deletes result in new data being written to new SSTables. Over time, this leads to data fragmentation across many files, slowing down read performance.

Purpose of Compaction

Compaction is a background administrative process that merges multiple SSTables into a single new SSTable.

  • Merges Data: Combines fragments of a row spread across multiple SSTables.
  • Evicts Tombstones: Physically removes deleted data (tombstones) once they expire (past the gc_grace_seconds threshold).
  • Improves Read Speed: Reduces the number of SSTables a read operation must consult.

Compaction Strategies

Cassandra provides different strategies optimized for different workloads:

  1. SizeTieredCompactionStrategy (STCS):

    • Default strategy.
    • Merges SSTables of similar sizes.
    • Best for: Write-heavy workloads.
    • Drawback: Can cause space spikes during compaction (requires up to 50% free disk space).
  2. LeveledCompactionStrategy (LCS):

    • Organizes SSTables into levels of exponentially increasing sizes.
    • Guarantees that reads will have to check a very small number of SSTables.
    • Best for: Read-heavy workloads or workloads with many updates/deletes.
    • Drawback: More I/O intensive during writes.
  3. TimeWindowCompactionStrategy (TWCS):

    • Groups SSTables into time windows (e.g., 1 day). Compaction only occurs within that window.
    • Best for: Time-series data (where old data is rarely updated and eventually expires via TTL - Time To Live).