Unit 5 - Notes
Unit 5: Introduction to Apache HBase
1. HBase Fundamentals and Data Model
What is Apache HBase?
Apache HBase is an open-source, non-relational (NoSQL), distributed database modeled after Google’s Bigtable. It is written in Java and runs on top of the Hadoop Distributed File System (HDFS).
HBase is designed to provide real-time, random read/write access to massive datasets (billions of rows and millions of columns).
Key Characteristics
- Column-Oriented: Stores data grouped by column family rather than as complete rows, and stores nothing for cells that are absent, which suits sparse datasets.
- Schema-less: Columns can be added dynamically; only column families need to be predefined.
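- Strongly Consistent: Reads and writes to a single row are atomic and strongly consistent, because each region is served by exactly one RegionServer at a time. This makes HBase suitable for high-throughput workloads that need up-to-date reads; it does not provide general multi-row transactions.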
- Scalable: Scales horizontally. When nodes are added to the cluster, regions are redistributed across them automatically.
HBase Data Model
The HBase data model is fundamentally different from an RDBMS. Data is stored in a multi-dimensional, sorted map.
- Table: A collection of rows.
- Row Key: The unique identifier for a row. Rows are lexicographically sorted by the Row Key. Designing a good Row Key is critical for HBase performance.
- Column Family: A logical grouping of columns. All members of a column family are stored together on the disk. Column families must be defined when the table is created (e.g., personal_info, contact_info).
- Column Qualifier: The actual column name, added dynamically inside a Column Family. Expressed as ColumnFamily:ColumnQualifier (e.g., personal_info:name, contact_info:email).
- Cell: The intersection of a Row Key, Column Family, and Column Qualifier. It contains the actual value/data.
- Timestamp: Every cell has a timestamp associated with it. HBase can maintain multiple versions of a cell's value, distinguished by timestamp. The number of versions kept is configured per column family; the default is 1 in current releases (releases before 0.96 kept 3).
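The "multi-dimensional, sorted map" described above can be pictured as nested sorted maps. The following is a minimal in-memory sketch using only the Java standard library, not HBase's storage engine; the class name DataModelSketch and its methods are hypothetical. It assumes ASCII row keys, for which Java's String ordering matches the byte-wise ordering HBase uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of one HBase table: rowKey -> column -> timestamp -> value. */
final class DataModelSketch {
    // Outer map is sorted by row key; the innermost map is sorted newest-first.
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one version of a cell; column is written as "family:qualifier". */
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey,
                        k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column,
                        k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String get(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        NavigableMap<Long, String> versions = row.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Row keys in the order a full scan would visit them. */
    List<String> rowKeysInScanOrder() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        DataModelSketch table = new DataModelSketch();
        table.put("row2", "personal_info:name", 1L, "Bob");
        table.put("row10", "personal_info:name", 1L, "Carol");
        table.put("row1", "personal_info:name", 1L, "Alice");
        table.put("row1", "personal_info:name", 2L, "Alicia"); // newer version

        System.out.println(table.get("row1", "personal_info:name")); // Alicia
        System.out.println(table.rowKeysInScanOrder()); // [row1, row10, row2]
    }
}
```

Note that row10 sorts before row2 because the comparison is lexicographic, not numeric. This is one reason Row Key design matters: zero-padding numeric parts (row02, row10) is a common way to make the two orders agree.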
2. HBase Architecture
HBase follows a Master-Slave architecture. It relies heavily on ZooKeeper for coordination and HDFS for underlying storage.
Core Components
- HMaster (Master Node):
- Responsible for monitoring all RegionServers in the cluster.
- Handles metadata changes (DDL operations like creating or dropping tables).
- Assigns regions to RegionServers and handles load balancing and failover.
- RegionServer (Slave/Worker Node):
- Responsible for handling read and write requests from clients.
- Hosts and manages multiple Regions.
- Communicates directly with the client for data operations (DML).
- Regions:
- The basic building block of the HBase cluster for scaling and load balancing.
- A table is divided horizontally into Regions. Each Region contains a contiguous range of Row Keys.
- As a Region grows beyond a configured threshold, it automatically splits in two.
- ZooKeeper:
- Acts as a distributed coordination service.
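- Maintains the state of the cluster (which servers are alive, and which RegionServer holds the META table, named hbase:meta in current releases).
- Clients first connect to ZooKeeper to find the RegionServer hosting the META table, then read the META table to find the RegionServer hosting the data they need.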
- HDFS (Hadoop Distributed File System):
- Provides the actual persistent storage.
- HBase stores its data in HDFS in specific file formats, primarily HFiles (which store the actual data) and WAL (Write-Ahead Logs, used for recovery in case a RegionServer crashes).
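Because each Region covers a contiguous range of Row Keys, finding the server for a row amounts to a lookup in a sorted table of region start keys. The sketch below illustrates that idea with a plain TreeMap; it is not how the HBase client is implemented. The class RegionLookupSketch and the server names are hypothetical, and the split is simplified: here the upper half is handed straight to another server, whereas in a real cluster region placement is decided by the HMaster.

```java
import java.util.TreeMap;

/** Toy routing table: region start key -> RegionServer hosting that region. */
final class RegionLookupSketch {
    // Each region covers [startKey, nextStartKey). "" is the first region's start key.
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    RegionLookupSketch() {
        // A new table starts as a single region covering every row key.
        regionStartToServer.put("", "rs1.example.com");
    }

    /** Returns the server whose region contains the given row key. */
    String locate(String rowKey) {
        // floorEntry finds the greatest start key that is <= rowKey.
        return regionStartToServer.floorEntry(rowKey).getValue();
    }

    /** Splits the region containing splitKey; keys >= splitKey move to the new region. */
    void split(String splitKey, String serverForUpperHalf) {
        regionStartToServer.put(splitKey, serverForUpperHalf);
    }

    public static void main(String[] args) {
        RegionLookupSketch routing = new RegionLookupSketch();
        System.out.println(routing.locate("user123")); // rs1.example.com

        routing.split("m", "rs2.example.com");
        System.out.println(routing.locate("admin123")); // rs1.example.com
        System.out.println(routing.locate("user123"));  // rs2.example.com
    }
}
```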
3. Installation of Apache HBase
HBase can be installed in three modes: Standalone, Pseudo-Distributed, and Fully Distributed. Below is a guide for a Pseudo-Distributed installation (assuming Hadoop and Java are already installed and running).
Prerequisites
- Java (JDK 8 or later) installed and JAVA_HOME configured.
- Hadoop installed and running (HDFS and YARN).
Step-by-Step Installation
Step 1: Download and Extract
Download the stable binary release from the Apache HBase website.
wget https://archive.apache.org/dist/hbase/2.4.9/hbase-2.4.9-bin.tar.gz
tar -zxvf hbase-2.4.9-bin.tar.gz
cd hbase-2.4.9
Step 2: Configure hbase-env.sh
Navigate to the conf directory and edit hbase-env.sh to set the Java path and tell HBase to manage its own ZooKeeper instance.
export JAVA_HOME=/path/to/your/jdk
export HBASE_MANAGES_ZK=true
Step 3: Configure hbase-site.xml
Edit conf/hbase-site.xml to specify the HDFS directory for HBase and enable pseudo-distributed mode.
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/user/zookeeper</value>
</property>
</configuration>
Step 4: Start HBase
Run the start script from the bin directory.
./bin/start-hbase.sh
(Verify by running jps. You should see HMaster, HRegionServer, and HQuorumPeer.)
4. General Commands in Apache HBase (HBase Shell)
The HBase shell is a JRuby-based interactive tool used to execute commands. Launch it by typing ./bin/hbase shell.
Data Definition Language (DDL) Commands
- Create a table: Creates a table with specific column families.
create 'employees', 'personal_data', 'professional_data'
- List tables: Shows all tables in HBase.
list
- Describe a table: Shows table structure and metadata.
describe 'employees'
- Disable/Enable a table: A table must be disabled before altering or dropping it.
disable 'employees'
enable 'employees'
- Drop a table:
drop 'employees'
Data Manipulation Language (DML) Commands
- Put: Inserts or updates data in a specific cell.
put 'employees', 'row1', 'personal_data:name', 'John Doe'
put 'employees', 'row1', 'professional_data:role', 'Developer'
- Get: Retrieves data for a specific row.
get 'employees', 'row1'
- Scan: Retrieves data from the entire table or a range of rows.
scan 'employees'
- Delete: Deletes a specific cell.
delete 'employees', 'row1', 'personal_data:name'
- Truncate: Disables, drops, and recreates the table, clearing all data.
truncate 'employees'
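A scan over "a range of rows" is bounded by a start row (inclusive) and a stop row (exclusive); in the shell this is written with the STARTROW and STOPROW options, for example scan 'employees', {STARTROW => 'row2', STOPROW => 'row4'}. The sketch below mimics those bounds on an in-memory sorted map so the inclusive/exclusive behavior is easy to see. RangeScanSketch is a hypothetical name and the code does not talk to HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy range scan over sorted row keys: start row inclusive, stop row exclusive. */
final class RangeScanSketch {
    /** Returns the row keys a bounded scan would visit, in order. */
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> employees = new TreeMap<>();
        employees.put("row1", "John Doe");
        employees.put("row2", "Jane Roe");
        employees.put("row3", "Sam Poe");
        employees.put("row4", "Ann Lee");

        // row4 is the stop row, so it is not returned.
        System.out.println(scan(employees, "row2", "row4")); // [row2, row3]
    }
}
```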
5. Filtering in HBase: Prefix and Single Value Column
Filters in HBase allow you to push down criteria to the RegionServers so that only the matching data is returned over the network, drastically improving performance.
1. PrefixFilter
The PrefixFilter takes a single argument (a prefix) and returns only those rows whose Row Keys start with that specific prefix.
HBase Shell Example:
If we have rows with keys: user123, user456, admin123.
# Returns 'user123' and 'user456'
scan 'employees', {FILTER => "PrefixFilter('user')"}
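Prefix matching works well in HBase because Row Keys are sorted, so all keys sharing a prefix sit next to each other. The following sketch shows that idea on a sorted set of keys: jump to the first key at or after the prefix, then stop at the first key that no longer matches. It illustrates the principle only and is not the implementation of PrefixFilter; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy prefix scan over sorted row keys. */
final class PrefixFilterSketch {
    /** Returns every row key that starts with the given prefix, in sorted order. */
    static List<String> scanWithPrefix(TreeSet<String> sortedRowKeys, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet seeks to the first key >= prefix; matching keys are contiguous from there.
        for (String key : sortedRowKeys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // past the prefix range, nothing later can match
            }
            matches.add(key);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("user123", "user456", "admin123"));
        System.out.println(scanWithPrefix(keys, "user")); // [user123, user456]
    }
}
```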
2. SingleColumnValueFilter
The SingleColumnValueFilter evaluates the value of a specific column (Family + Qualifier) and determines whether to include the entire row based on a comparison operator (e.g., =, !=, >, <).
HBase Shell Example:
Return all rows where the professional_data:role is 'Developer'.
scan 'employees', {FILTER => "SingleColumnValueFilter('professional_data', 'role', =, 'binary:Developer')"}
(Note: By default, if the column does not exist in a row, the row is included. To prevent this, filterIfMissing is often set to true, for example by calling setFilterIfMissing(true) on the filter in the Java API.)
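The effect of the missing-column rule is easiest to see with a row that has no role column at all. The sketch below models the decision for an equality comparison using plain maps; it is an illustration of the semantics described above, not the HBase filter class, and the name MissingColumnSketch is hypothetical.

```java
import java.util.Collections;
import java.util.Map;

/** Toy model of how a single-column equality filter treats rows lacking the column. */
final class MissingColumnSketch {
    /**
     * Decides whether a row is returned.
     * row maps "family:qualifier" to the cell value.
     */
    static boolean includeRow(Map<String, String> row, String column,
                              String expected, boolean filterIfMissing) {
        String actual = row.get(column);
        if (actual == null) {
            // Column absent: included by default, excluded when filterIfMissing is true.
            return !filterIfMissing;
        }
        return actual.equals(expected);
    }

    public static void main(String[] args) {
        Map<String, String> noRole = Collections.singletonMap("personal_data:name", "John Doe");

        System.out.println(includeRow(noRole, "professional_data:role", "Developer", false)); // true
        System.out.println(includeRow(noRole, "professional_data:role", "Developer", true));  // false
    }
}
```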
6. Time To Live (TTL) for Columns in HBase
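Time To Live (TTL) is a feature in HBase that allows you to set an expiration time for your data. Once a cell's TTL has elapsed, HBase stops returning it to reads, and the expired data is physically removed from disk during later compactions. No tombstone marker is written for expired data; tombstones are used for explicit deletes.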
- TTL is configured at the Column Family level.
- The value is specified in seconds.
- It is extremely useful for time-series data, logs, or sensor data where old data is no longer relevant.
Setting TTL during Table Creation
Creates a table where data in the log_data column family automatically expires after 30 days (2,592,000 seconds).
create 'server_logs', {NAME => 'log_data', TTL => 2592000}
Altering TTL for an Existing Table
You can change the TTL of an existing column family. Recent HBase versions can usually apply the change while the table stays online; older versions require disabling the table first.
alter 'server_logs', {NAME => 'log_data', TTL => 86400} # 1 day
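The TTL values used above come from converting days to seconds, and expiry is judged by comparing a cell's timestamp (in milliseconds) with the current time. The following sketch works through both steps with the standard library. TtlSketch and its methods are hypothetical names, and the exact boundary behavior inside HBase is an assumption here; the sketch treats a cell as expired once its age is strictly greater than the TTL.

```java
import java.util.concurrent.TimeUnit;

/** Toy TTL arithmetic: TTL is given in seconds, cell timestamps in milliseconds. */
final class TtlSketch {
    /** Converts a retention period in days to the TTL value in seconds. */
    static long daysToTtlSeconds(long days) {
        return TimeUnit.DAYS.toSeconds(days);
    }

    /** Returns true when the cell is older than the TTL at the given time. */
    static boolean isExpired(long cellTimestampMs, long ttlSeconds, long nowMs) {
        long ageMs = nowMs - cellTimestampMs;
        return ageMs > TimeUnit.SECONDS.toMillis(ttlSeconds);
    }

    public static void main(String[] args) {
        System.out.println(daysToTtlSeconds(30)); // 2592000
        System.out.println(daysToTtlSeconds(1));  // 86400

        long writtenAt = 0L;
        long twoDaysLater = TimeUnit.DAYS.toMillis(2);
        System.out.println(isExpired(writtenAt, 86400, twoDaysLater));   // true
        System.out.println(isExpired(writtenAt, 2592000, twoDaysLater)); // false
    }
}
```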
7. Interacting with HBase using Java API
HBase is written in Java, and its native Java API provides the most robust way to interact with the database.
Maven Dependencies
To use the Java API, include the HBase client dependency in your pom.xml:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.4.9</version>
</dependency>
Java API Code Example (CRUD Operations)
Below is an example of configuring, connecting, creating a table, putting data, and getting data using the modern Connection API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
public class HBaseExample {
    public static void main(String[] args) throws IOException {
        // 1. Setup Configuration
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.zookeeper.quorum", "localhost");
        config.set("hbase.zookeeper.property.clientPort", "2181");

        // 2. Establish Connection
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {
            TableName tableName = TableName.valueOf("Students");

            // 3. Create Table
            if (!admin.tableExists(tableName)) {
                ColumnFamilyDescriptor familyDesc =
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info")).build();
                TableDescriptor tableDesc = TableDescriptorBuilder.newBuilder(tableName)
                        .setColumnFamily(familyDesc)
                        .build();
                admin.createTable(tableDesc);
                System.out.println("Table created successfully.");
            }

            // 4. Put Data (Insert)
            try (Table table = connection.getTable(tableName)) {
                byte[] rowKey = Bytes.toBytes("student1");
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("major"), Bytes.toBytes("Computer Science"));
                table.put(put);
                System.out.println("Data inserted successfully.");

                // 5. Get Data (Read)
                Get get = new Get(rowKey);
                Result result = table.get(get);
                byte[] nameBytes = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                String name = Bytes.toString(nameBytes);
                System.out.println("Retrieved Student Name: " + name);
            }
        }
    }
}
Key Java API Classes:
- HBaseConfiguration: Reads HBase config files (hbase-site.xml).
- ConnectionFactory: Manages connections to the cluster.
- Admin: Interface for DDL operations (create, drop, alter tables).
- Table: Interface for DML operations (put, get, scan, delete).
- Put, Get, Scan, Delete: Represent the specific data operations to be performed.
- Bytes: Utility class to convert Java types to and from byte arrays (HBase stores everything as byte arrays).
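Since everything is stored and sorted as byte arrays, the way a value is encoded decides where it sorts. The sketch below compares two encodings of the same integers: as decimal text and as fixed-width big-endian bytes. It uses java.nio.ByteBuffer as a stand-in for the Bytes utility so that it runs without the HBase client on the classpath; ByteOrderSketch is a hypothetical name, Arrays.compareUnsigned requires Java 9 or later, and the numeric-order property shown holds for non-negative integers only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Toy comparison of how two integer encodings sort as raw bytes. */
final class ByteOrderSketch {
    /** Encodes the number as decimal text, e.g. 10 -> the bytes of "10". */
    static byte[] encodeAsText(int n) {
        return Integer.toString(n).getBytes(StandardCharsets.UTF_8);
    }

    /** Encodes the number as 4 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] encodeAsBigEndian(int n) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(n).array();
    }

    /** Decodes 4 big-endian bytes back to an int. */
    static int decodeBigEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    /** Byte-wise unsigned lexicographic comparison, the order used for sorted keys. */
    static int compareLexicographically(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        // As text, "10" sorts before "2" because '1' < '2'.
        System.out.println(compareLexicographically(encodeAsText(10), encodeAsText(2)) < 0); // true

        // As fixed-width big-endian bytes, 10 sorts after 2, matching numeric order.
        System.out.println(
                compareLexicographically(encodeAsBigEndian(10), encodeAsBigEndian(2)) > 0); // true

        System.out.println(decodeBigEndian(encodeAsBigEndian(10))); // 10
    }
}
```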