Unit 5 - Notes

INT312

Unit 5: Introduction to Apache HBase

1. HBase Fundamentals and Data Model

What is Apache HBase?

Apache HBase is an open-source, non-relational (NoSQL), distributed database modeled after Google’s Bigtable. It is written in Java and runs on top of the Hadoop Distributed File System (HDFS).
HBase is designed to provide real-time, random read/write access to massive datasets (billions of rows and millions of columns).

Key Characteristics

  • Column-Oriented: Stores data grouped by column family rather than as whole rows (a wide-column design). Cells that have no value take up no space, which suits sparse datasets.
  • Schema-less: Columns can be added dynamically; only column families need to be predefined.
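  • Strongly Consistent: By default, each region is served by exactly one RegionServer at a time, so reads and writes to a row are strongly consistent and operations on a single row are atomic. This suits workloads that need fast, consistent access to individual rows; HBase does not provide multi-row transactions.
  • Scalable: Scales horizontally. Adding RegionServers adds capacity, and regions are split and rebalanced across servers automatically.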

HBase Data Model

The HBase data model is fundamentally different from an RDBMS. Data is stored in a multi-dimensional, sorted map.

  • Table: A collection of rows.
  • Row Key: The unique identifier for a row. Rows are lexicographically sorted by the Row Key. Designing a good Row Key is critical for HBase performance.
  • Column Family: A logical grouping of columns. All members of a column family are stored together on disk. Column families must be defined when the table is created (e.g., personal_info, contact_info).
  • Column Qualifier: The actual column name, added dynamically inside a Column Family. Expressed as ColumnFamily:ColumnQualifier (e.g., personal_info:name, contact_info:email).
  • Cell: The intersection of a Row Key, Column Family, and Column Qualifier. It contains the actual value/data.
  • Timestamp: Every cell has a timestamp associated with it. HBase can keep multiple versions of a cell's value, distinguished by timestamp. The number of versions kept is set per column family (VERSIONS); the default is 1 since HBase 0.96, while earlier releases kept the last 3.
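
The "multi-dimensional, sorted map" described above can be pictured as nested sorted maps. The following is only a toy in-memory sketch using the Java standard library, not HBase client code; the class name SortedMapModel and its methods are made up for illustration. It assumes ASCII row keys, for which Java's string ordering matches HBase's byte ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of the logical data model; this is not HBase client code.
final class SortedMapModel {
    // rowKey -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        NavigableMap<Long, String> versions = row.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Row keys in storage order: lexicographic, not numeric.
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        SortedMapModel table = new SortedMapModel();
        table.put("row2", "personal_info:name", 100L, "Old Name");
        table.put("row2", "personal_info:name", 200L, "New Name");
        table.put("row10", "contact_info:email", 100L, "a@example.com");
        System.out.println(table.rowKeys());
        System.out.println(table.getLatest("row2", "personal_info:name"));
    }
}
```

Note that "row10" sorts before "row2" because keys are compared character by character. This is why numeric parts of a Row Key are usually zero-padded (row02, row10).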

2. HBase Architecture

HBase follows a Master-Slave architecture. It relies heavily on ZooKeeper for coordination and HDFS for underlying storage.

Core Components

  1. HMaster (Master Node):
    • Responsible for monitoring all RegionServers in the cluster.
    • Handles metadata changes (DDL operations like creating or dropping tables).
    • Assigns regions to RegionServers and handles load balancing and failover.
  2. RegionServer (Slave/Worker Node):
    • Responsible for handling read and write requests from clients.
    • Hosts and manages multiple Regions.
    • Communicates directly with the client for data operations (DML).
  3. Regions:
    • The basic building block of the HBase cluster for scaling and load balancing.
    • A table is divided horizontally into Regions. Each Region contains a contiguous range of Row Keys.
    • As a Region grows beyond a configured threshold, it automatically splits in two.
  4. ZooKeeper:
    • Acts as a distributed coordination service.
    • Maintains the state of the cluster (which servers are alive, which holds the META table).
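    • Clients first ask ZooKeeper which RegionServer holds the META table, then read the META table to find the RegionServer hosting the region for the row they need. Clients cache these locations for later requests.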
  5. HDFS (Hadoop Distributed File System):
    • Provides the actual persistent storage.
    • HBase stores its data in HDFS in specific file formats, primarily HFiles (which store the actual data) and WAL (Write-Ahead Logs, used for recovery in case a RegionServer crashes).
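
Because each Region covers a contiguous range of Row Keys, finding the region for a row amounts to finding the greatest region start key that is less than or equal to the row key. The sketch below models only that idea with a sorted map; the class RegionLookup, the server names, and the split points are hypothetical, and a real client resolves locations through the META table rather than a local map.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of region routing; server names and split points are made up.
final class RegionLookup {
    // Region start key -> RegionServer hosting that region ("" starts the first region).
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    // Registers a region; calling this with a new start key also models a split.
    void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The owning region is the one with the greatest start key <= rowKey.
    String serverFor(String rowKey) {
        Map.Entry<String, String> region = regionsByStartKey.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row key: " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookup lookup = new RegionLookup();
        lookup.addRegion("", "rs1");   // rows before "m"
        lookup.addRegion("m", "rs2");  // rows from "m" onward
        System.out.println(lookup.serverFor("alice"));
        System.out.println(lookup.serverFor("zoe"));

        // The second region splits at "t"; the new region lands on rs3.
        lookup.addRegion("t", "rs3");
        System.out.println(lookup.serverFor("zoe"));
    }
}
```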

3. Installation of Apache HBase

HBase can be installed in three modes: Standalone, Pseudo-Distributed, and Fully Distributed. Below is a guide for a Pseudo-Distributed installation (assuming Hadoop and Java are already installed and running).

Prerequisites

  • Java (JDK 8 or later) installed and JAVA_HOME configured.
  • Hadoop installed and running. HBase itself needs only HDFS; YARN is required only if you run MapReduce jobs against HBase.

Step-by-Step Installation

Step 1: Download and Extract
Download the stable binary release from the Apache HBase website.

BASH
wget https://archive.apache.org/dist/hbase/2.4.9/hbase-2.4.9-bin.tar.gz
tar -zxvf hbase-2.4.9-bin.tar.gz
cd hbase-2.4.9

Step 2: Configure hbase-env.sh
Navigate to the conf directory and edit hbase-env.sh to set the Java path and tell HBase to manage its own ZooKeeper instance.

BASH
export JAVA_HOME=/path/to/your/jdk
export HBASE_MANAGES_ZK=true

Step 3: Configure hbase-site.xml
Edit conf/hbase-site.xml to specify the HDFS directory for HBase and enable pseudo-distributed mode.

XML
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/user/zookeeper</value>
    </property>
</configuration>

Step 4: Start HBase
Run the start script from the bin directory.

BASH
./bin/start-hbase.sh

(Verify by running jps. You should see HMaster, HRegionServer, and HQuorumPeer).


4. General Commands in Apache HBase (HBase Shell)

The HBase shell is a JRuby-based interactive tool used to execute commands. Launch it by typing ./bin/hbase shell.

Data Definition Language (DDL) Commands

  • Create a table: Creates a table with specific column families.
    RUBY
        create 'employees', 'personal_data', 'professional_data'
        
  • List tables: Shows all tables in HBase.
    RUBY
        list
        
  • Describe a table: Shows table structure and metadata.
    RUBY
        describe 'employees'
        
  • Disable/Enable a table: A table must be disabled before it can be dropped. Older HBase versions also required a table to be disabled before altering it.
    RUBY
        disable 'employees'
        enable 'employees'
        
  • Drop a table:
    RUBY
        drop 'employees'
        

Data Manipulation Language (DML) Commands

  • Put: Inserts or updates data in a specific cell.
    RUBY
        put 'employees', 'row1', 'personal_data:name', 'John Doe'
        put 'employees', 'row1', 'professional_data:role', 'Developer'
        
  • Get: Retrieves data for a specific row.
    RUBY
        get 'employees', 'row1'
        
  • Scan: Retrieves data from the entire table or a range of rows.
    RUBY
        scan 'employees'
        
  • Delete: Deletes a specific cell.
    RUBY
        delete 'employees', 'row1', 'personal_data:name'
        
  • Truncate: Disables, drops, and recreates the table, clearing all data.
    RUBY
        truncate 'employees'
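
A scan over "a range of rows" reads a contiguous slice of the sorted Row Keys. In the shell the range is given with the STARTROW and STOPROW options; the start row is inclusive and the stop row is exclusive. The sketch below models that behaviour on a plain sorted map. It is not the HBase API, and the class RangeScan is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a range scan over sorted row keys; not HBase client code.
final class RangeScan {
    // Returns row keys in [startRow, stopRow): start inclusive, stop exclusive.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> employees = new TreeMap<>();
        employees.put("row1", "John Doe");
        employees.put("row2", "Jane Roe");
        employees.put("row3", "Sam Poe");
        // Covers row1 and row2; row3 is the exclusive stop row.
        System.out.println(scan(employees, "row1", "row3"));
    }
}
```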
        

5. Filtering in HBase: Prefix and Single Value Column

Filters in HBase push selection criteria down to the RegionServers so that only the matching data is returned over the network. This reduces network traffic and client-side work, although the RegionServers still have to read every row that the filter evaluates.

1. PrefixFilter

The PrefixFilter takes a single argument (a prefix) and returns only those rows whose Row Keys start with that specific prefix.

HBase Shell Example:
If we have rows with keys: user123, user456, admin123.

RUBY
# Returns 'user123' and 'user456'
scan 'employees', {FILTER => "PrefixFilter('user')"}
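
The matching rule itself is a byte-by-byte comparison of the start of each Row Key against the prefix. The following sketch reproduces that rule with the standard library only; it illustrates the semantics rather than the filter's actual implementation, and the class PrefixMatch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of prefix matching on row keys; not HBase client code.
final class PrefixMatch {
    // True when rowKey starts with prefix, compared byte by byte.
    static boolean hasPrefix(byte[] rowKey, byte[] prefix) {
        if (rowKey.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (rowKey[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the row keys that start with the given prefix.
    static List<String> filter(List<String> rowKeys, String prefix) {
        byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
        List<String> matches = new ArrayList<>();
        for (String key : rowKeys) {
            if (hasPrefix(key.getBytes(StandardCharsets.UTF_8), prefixBytes)) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("admin123", "user123", "user456");
        System.out.println(filter(keys, "user"));
    }
}
```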

2. SingleColumnValueFilter

The SingleColumnValueFilter evaluates the value of a specific column (Family + Qualifier) and determines whether to include the entire row based on a comparison operator (e.g., =, !=, >, <).

HBase Shell Example:
Return all rows where the professional_data:role is 'Developer'.

RUBY
scan 'employees', {FILTER => "SingleColumnValueFilter('professional_data', 'role', =, 'binary:Developer')"}

(Note: By default, if the column does not exist in a row, the row is included. To prevent this, FilterIfMissing is often set to true in the Java API).
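
The effect of the missing-column rule is easiest to see on a small example. The sketch below models only the include/exclude decision for an equality comparison; it uses plain maps instead of the HBase API, and the class SingleColumnValueModel and its parameter names are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;

// Toy model of the row-inclusion decision; not HBase client code.
final class SingleColumnValueModel {
    // Decides whether a row passes an equality check on one column.
    // row maps "family:qualifier" to the cell value.
    static boolean includeRow(Map<String, String> row, String column,
                              String expected, boolean filterIfMissing) {
        String actual = row.get(column);
        if (actual == null) {
            // Column absent: the row is included unless filterIfMissing is set.
            return !filterIfMissing;
        }
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        Map<String, String> developer =
                Collections.singletonMap("professional_data:role", "Developer");
        Map<String, String> noRole =
                Collections.singletonMap("personal_data:name", "John Doe");

        System.out.println(includeRow(developer, "professional_data:role", "Developer", false));
        // A row without the column passes by default...
        System.out.println(includeRow(noRole, "professional_data:role", "Developer", false));
        // ...and is dropped once filterIfMissing is true.
        System.out.println(includeRow(noRole, "professional_data:role", "Developer", true));
    }
}
```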


6. Time To Live (TTL) for Columns in HBase

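Time To Live (TTL) is a feature in HBase that allows you to set an expiration time for your data. Once a cell's TTL has elapsed, the cell is no longer returned by reads, and it is physically removed from disk during compaction. No tombstone marker is written for expired data; tombstones are used only for explicit deletes.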

  • TTL is configured at the Column Family level.
  • The value is specified in seconds.
  • It is extremely useful for time-series data, logs, or sensor data where old data is no longer relevant.
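
The unit mismatch is a common source of mistakes: TTL is given in seconds, while cell timestamps are in milliseconds. The sketch below shows the arithmetic under the assumption that a cell counts as expired once its age exceeds the TTL. The class TtlModel and its method names are hypothetical, and this is not how the check is written inside HBase.

```java
import java.time.Duration;

// Toy model of TTL arithmetic; not HBase client code.
final class TtlModel {
    // Converts a retention period in days to the seconds value expected by TTL.
    static long ttlSecondsForDays(long days) {
        return Duration.ofDays(days).getSeconds();
    }

    // A cell is treated as expired once its age exceeds the TTL.
    // Timestamps are in milliseconds; the TTL is in seconds.
    static boolean isExpired(long cellTimestampMs, long ttlSeconds, long nowMs) {
        return nowMs - cellTimestampMs > ttlSeconds * 1000L;
    }

    public static void main(String[] args) {
        long ttl = ttlSecondsForDays(30);
        System.out.println(ttl);

        long writtenAt = 0L;
        long thirtyOneDaysLater = Duration.ofDays(31).toMillis();
        System.out.println(isExpired(writtenAt, ttl, thirtyOneDaysLater));
    }
}
```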

Setting TTL during Table Creation

Creates a table where data in the log_data column family automatically expires after 30 days (2,592,000 seconds).

RUBY
create 'server_logs', {NAME => 'log_data', TTL => 2592000}

Altering TTL for an Existing Table

You can change the TTL of an existing column family. Recent HBase versions apply the change while the table is online; older versions require the table to be disabled first.

RUBY
alter 'server_logs', {NAME => 'log_data', TTL => 86400} # 1 day


7. Interacting with HBase using Java API

HBase is written in Java, and its native Java API provides the most robust way to interact with the database.

Maven Dependencies

To use the Java API, include the HBase client dependency in your pom.xml:

XML
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.4.9</version>
</dependency>

Java API Code Example (CRUD Operations)

Below is an example of configuring, connecting, creating a table, putting data, and getting data using the modern Connection API.

JAVA
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class HBaseExample {
    public static void main(String[] args) throws IOException {
        
        // 1. Setup Configuration
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "localhost");
        config.set("hbase.zookeeper.property.clientPort", "2181");

        // 2. Establish Connection
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {
            
            TableName tableName = TableName.valueOf("Students");

            // 3. Create Table
            if (!admin.tableExists(tableName)) {
                ColumnFamilyDescriptor familyDesc = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info")).build();
                TableDescriptor tableDesc = TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(familyDesc)
                        .build();
                admin.createTable(tableDesc);
                System.out.println("Table created successfully.");
            }

            // 4. Put Data (Insert)
            try (Table table = connection.getTable(tableName)) {
                byte[] rowKey = Bytes.toBytes("student1");
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("major"), Bytes.toBytes("Computer Science"));
                
                table.put(put);
                System.out.println("Data inserted successfully.");

                // 5. Get Data (Read)
                Get get = new Get(rowKey);
                Result result = table.get(get);
                
                byte[] nameBytes = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                String name = Bytes.toString(nameBytes);
                System.out.println("Retrieved Student Name: " + name);
            }
        }
    }
}

Key Java API Classes:

  • HBaseConfiguration: Reads HBase config files (hbase-site.xml).
  • ConnectionFactory: Creates Connection objects for the cluster. A Connection is heavyweight and thread-safe, so it is normally created once and shared.
  • Admin: Interface for DDL operations (create, drop, alter tables).
  • Table: Interface for DML operations (put, get, scan, delete).
  • Put, Get, Scan, Delete: Represent the specific data operations to be performed.
  • Bytes: Utility class to convert Java types to and from byte arrays (HBase stores everything as byte arrays).
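
To make the "everything is a byte array" point concrete, the sketch below performs the same kind of conversions with the standard library alone. It assumes the usual encodings (UTF-8 for strings, 8 big-endian bytes for a long), which to the best of our understanding is what Bytes.toBytes produces; the class ByteEncoding is a stand-in written for this example, not part of HBase.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Standard-library stand-in for the kind of conversions Bytes performs.
final class ByteEncoding {
    // Encodes a long as 8 bytes, most significant byte first (big-endian).
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Encodes a string as UTF-8 bytes.
    static byte[] encodeString(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeLong(1L)));
        System.out.println(decodeLong(encodeLong(42L)));
        System.out.println(decodeString(encodeString("Alice")));
    }
}
```

The reader of a cell must know which type was written: the stored bytes carry no type information, so decoding a string's bytes as a long (or the reverse) gives a wrong value or an error rather than a conversion.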