Unit 6 - Notes
Unit 6: Introduction to Apache Cassandra
1. Installation of Apache Cassandra
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers.
Prerequisites
- Java Runtime Environment (JRE/JDK): Cassandra is written in Java. Java 8 or Java 11 is strictly required for Cassandra 3.x and 4.x.
- Python: Required to run
cqlsh, the Cassandra Query Language shell (Python 3.6+ is recommended).
General Installation Steps (Linux/Unix-based)
- Download Cassandra: Download the latest tarball from the official Apache Cassandra website.
BASHwget https://downloads.apache.org/cassandra/4.0.4/apache-cassandra-4.0.4-bin.tar.gz - Extract the Archive:
BASHtar -xzvf apache-cassandra-4.0.4-bin.tar.gz cd apache-cassandra-4.0.4 - Configuration (Optional but recommended):
- Modify
conf/cassandra.yamlto set cluster names, node IP addresses (listen_address), and directory paths for data, commit logs, and saved caches.
- Modify
- Start Cassandra:
BASHbin/cassandra -f # The -f flag runs it in the foreground - Verify Installation: Run the node tool status command to check if the node is up.
BASHbin/nodetool status - Access CQL Shell:
BASHbin/cqlsh
2. Cassandra Architecture
Cassandra’s architecture is designed specifically to prevent single points of failure and to handle massive, continuous data streams.
Peer-to-Peer Architecture
- Ring Topology: Cassandra operates on a distributed, peer-to-peer (P2P) ring architecture.
- Masterless: Unlike Hadoop HDFS or MongoDB (which use master-slave architectures), every node in a Cassandra cluster is identical. There is no master node.
- Any Node can handle requests: A client can connect to any node (acting as the "coordinator" node) to read or write data. If the coordinator does not hold the requested data, it routes the request to the nodes that do.
Gossip Protocol
- Concept: A peer-to-peer communication protocol used by nodes to exchange state and location information about themselves and other nodes.
- Mechanism: Every second, each node initiates a Gossip exchange with one to three other random nodes. They exchange versioned information.
- Fault Detection: Gossip is the underlying mechanism for fault detection. If a node stops gossiping, other nodes will eventually mark it as "Down" (dead), routing traffic away from it.
Replication and Consistency Levels
Replication
Cassandra replicates data across multiple nodes to ensure fault tolerance.
- Replication Strategy: Defined at the Keyspace level.
- SimpleStrategy: Used for single datacenter deployments.
- NetworkTopologyStrategy: Used for multiple datacenters. Allows specifying how many replicas should be in each datacenter.
- Replication Factor (RF): The total number of copies of the data across the cluster. An RF of 3 means three nodes will hold a copy of a specific row.
Consistency Levels (Tunable Consistency)
Cassandra offers tunable consistency, meaning you can balance strong consistency and high availability per operation.
- Consistency Level (CL): Defines how many replica nodes must acknowledge a read or write operation before it is considered successful.
- Common Write/Read CLs:
- ONE: Only one replica needs to respond. (High availability, low consistency).
- QUORUM: A majority of replicas must respond
(RF / 2) + 1. (Strong consistency balance). - ALL: All replicas must respond. (Highest consistency, lowest availability—if one node is down, the operation fails).
- Note: Strong consistency is achieved when
Write CL + Read CL > Replication Factor.
Read and Write Paths
Write Path
Cassandra’s write path is highly optimized for write-heavy workloads.
- CommitLog: Data is first written to the CommitLog on disk. This ensures durability (crash recovery).
- Memtable: Simultaneously, data is written to an in-memory data structure called the Memtable. Once written to both, the write is acknowledged as successful.
- Flush to SSTable: When the Memtable fills up, its contents are flushed to disk into an immutable file called an SSTable (Sorted String Table).
Read Path
Reads in Cassandra are more complex because data for a single row might be fragmented across multiple SSTables and the Memtable.
- Memtable Check: Cassandra checks the active Memtable.
- Row Cache: Checks the row cache (if enabled).
- Bloom Filter: An in-memory, probabilistic data structure. It quickly tells Cassandra if an SSTable might contain the requested data, or definitely does not. This saves unnecessary disk I/O.
- Key Cache: An in-memory cache holding partition key locations.
- Partition Summary & Index: If not in the key cache, Cassandra reads the Partition Summary (in memory) to find the location in the Partition Index (on disk), which points to the exact byte offset in the SSTable.
- SSTable: Data is fetched from the SSTable, merged with Memtable data, and returned.
3. Cassandra Data Model
Cassandra is a partitioned row store. Data modeling in Cassandra is query-driven (you must model your data based on the queries you intend to run).
Keyspace
- The outermost container for data in Cassandra.
- Analogous to a Database in an RDBMS.
- This is where replication strategy and replication factor are defined.
Table (formerly Column Family)
- A container for rows, similar to a table in an RDBMS.
- Enforces a schema using CQL, defining the primary key and data types.
Primary Key
The Primary Key in Cassandra uniquely identifies a row, but it also dictates exactly how and where the data is stored on the cluster. It has two components:
PRIMARY KEY (Partition Key, Clustering Columns)
Partition Key
- The first part of the Primary Key.
- Cassandra hashes the Partition Key to determine exactly which node in the cluster will store the row.
- Ensures that data is evenly distributed across the cluster.
Clustering Columns
- The second part of the Primary Key (optional).
- Determines the on-disk sorting order of data within a single partition.
- Crucial for range queries (e.g., finding events between two dates).
Example:
PRIMARY KEY ((user_id), created_at)
user_idis the Partition Key (all data for a user is kept together on one node).created_atis the Clustering Column (data within that user's partition is sorted by time).
Wide Rows
- A "wide row" occurs when a single partition key contains a massive number of clustering columns (and therefore, rows).
- Cassandra is designed to handle wide rows (up to 2 billion cells per partition), making it excellent for time-series data where a device ID is the partition key and timestamps are clustering columns.
4. CQL (Cassandra Query Language)
CQL is the primary language for communicating with Cassandra. It looks similar to SQL but lacks complex relational features like JOINs or subqueries, due to the distributed nature of the data.
Data Types
- Numeric:
int,bigint,float,double,decimal,varint. - Text/Strings:
text,varchar,ascii. - Time:
timestamp,date,time,timeuuid. - Identifiers:
uuid. - Collections:
list,set,map(useful for denormalizing data).
Creating Keyspaces and Tables
Creating a Keyspace:
CREATE KEYSPACE IF NOT EXISTS store
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
Creating a Table:
USE store;
CREATE TABLE IF NOT EXISTS user_activity (
user_id uuid,
activity_date date,
activity_time timestamp,
action text,
PRIMARY KEY ((user_id, activity_date), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);
(Here,
(user_id, activity_date) is a composite partition key, and activity_time is the clustering column sorted descending).
CRUD Operations
Insert:
INSERT INTO user_activity (user_id, activity_date, activity_time, action)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2023-10-25', '2023-10-25 10:00:00', 'login');
Update:
In Cassandra, UPDATE is essentially an "Upsert". If the primary key exists, it updates; if not, it inserts.
UPDATE user_activity
SET action = 'logout'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
AND activity_date = '2023-10-25'
AND activity_time = '2023-10-25 10:00:00';
Delete:
Deletions in Cassandra do not immediately remove data. Instead, they insert a marker called a Tombstone.
DELETE FROM user_activity
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
AND activity_date = '2023-10-25';
Select:
SELECT * FROM user_activity
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
AND activity_date = '2023-10-25';
Filtering and Indexing
Filtering (ALLOW FILTERING):
If you try to query a column that is not part of the Primary Key, Cassandra will reject it because it requires scanning all partitions across the cluster. You can bypass this using ALLOW FILTERING, but it is highly discouraged in production due to massive performance hits.
SELECT * FROM user_activity WHERE action = 'login' ALLOW FILTERING;
-- WARNING: Scans the entire cluster!
Secondary Indexing:
Secondary indexes allow querying on non-primary key columns without ALLOW FILTERING. However, they are still distributed across the cluster and can be slow for high-cardinality data.
CREATE INDEX action_idx ON user_activity (action);
SELECT * FROM user_activity WHERE action = 'login'; -- Now works without ALLOW FILTERING
5. Cassandra Administration - Compaction
Because Cassandra SSTables are immutable (read-only), updates and deletes result in new data being written to new SSTables. Over time, this leads to data fragmentation across many files, slowing down read performance.
Purpose of Compaction
Compaction is a background administrative process that merges multiple SSTables into a single new SSTable.
- Merges Data: Combines fragments of a row spread across multiple SSTables.
- Evicts Tombstones: Physically removes deleted data (tombstones) once they expire (past the
gc_grace_secondsthreshold). - Improves Read Speed: Reduces the number of SSTables a read operation must consult.
Compaction Strategies
Cassandra provides different strategies optimized for different workloads:
-
SizeTieredCompactionStrategy (STCS):
- Default strategy.
- Merges SSTables of similar sizes.
- Best for: Write-heavy workloads.
- Drawback: Can cause space spikes during compaction (requires up to 50% free disk space).
-
LeveledCompactionStrategy (LCS):
- Organizes SSTables into levels of exponentially increasing sizes.
- Guarantees that reads will have to check a very small number of SSTables.
- Best for: Read-heavy workloads or workloads with many updates/deletes.
- Drawback: More I/O intensive during writes.
-
TimeWindowCompactionStrategy (TWCS):
- Groups SSTables into time windows (e.g., 1 day). Compaction only occurs within that window.
- Best for: Time-series data (where old data is rarely updated and eventually expires via TTL - Time To Live).