Apache HBase is a distributed, scalable, NoSQL database that stores data in a column-oriented format on top of Hadoop.
2. Which underlying file system does Apache HBase typically use to store its data?
HBase Architecture
Easy
A. EXT4
B. Amazon S3 exclusively
C. Hadoop Distributed File System (HDFS)
D. NTFS
Correct Answer: Hadoop Distributed File System (HDFS)
Explanation:
HBase is designed to run on top of HDFS, leveraging its distributed and fault-tolerant storage capabilities.
3. Which centralized service is used by HBase to maintain configuration information and distributed synchronization?
ZooKeeper Integration
Easy
A. Apache Hive
B. Apache Kafka
C. Apache ZooKeeper
D. Apache Pig
Correct Answer: Apache ZooKeeper
Explanation:
HBase uses Apache ZooKeeper to manage cluster state, track server failures, and coordinate distributed operations.
4. In the HBase architecture, which node is responsible for monitoring RegionServers and assigning regions to them?
HMaster
Easy
A. RegionServer
B. DataNode
C. HMaster
D. NameNode
Correct Answer: HMaster
Explanation:
The HMaster is the master node in HBase that assigns regions to RegionServers and handles load balancing.
5. Which component in HBase actually handles the read and write requests from the clients?
RegionServer
Easy
A. RegionServer
B. NameNode
C. ZooKeeper
D. HMaster
Correct Answer: RegionServer
Explanation:
RegionServers serve the actual data to clients and manage the execution of read/write operations for their assigned regions.
6. What uniquely identifies a specific row in an HBase table?
HBase Data Model
Easy
A. Column Qualifier
B. Primary Key
C. Timestamp
D. Row Key
Correct Answer: Row Key
Explanation:
In HBase, data is stored and sorted by the Row Key, which acts as the unique identifier for every row in a table.
7. In HBase, columns are grouped into logical and physical sets called what?
Column Families
Easy
A. Column Families
B. Row Keys
C. Tables
D. Namespaces
Correct Answer: Column Families
Explanation:
A Column Family is a grouping of columns in HBase. All column members of a column family are stored together on the disk.
8. Which of the following MUST be predefined when creating an HBase table?
HBase Schema
Easy
A. The exact number of rows
B. The data type of every cell
C. Column Families
D. Every single column name
Correct Answer: Column Families
Explanation:
Unlike RDBMS, HBase does not require columns to be predefined, but it does require Column Families to be defined at table creation.
9. Apache HBase is written in which programming language?
HBase Basics
Easy
A. Scala
B. C++
C. Python
D. Java
Correct Answer: Java
Explanation:
HBase is an open-source project written in Java, much like the rest of the Hadoop ecosystem.
10. Does Apache HBase natively support SQL queries out of the box?
HBase vs RDBMS
Easy
A. Yes, it uses MySQL as its query engine.
B. Yes, but only for UPDATE statements.
C. No, it does not support SQL natively without extra tools like Apache Phoenix.
D. Yes, it is fully SQL compliant.
Correct Answer: No, it does not support SQL natively without extra tools like Apache Phoenix.
Explanation:
HBase provides a Java API and shell for data access. To use SQL queries, additional abstraction layers like Apache Phoenix or Hive are required.
11. What is an HBase 'Region'?
HBase Architecture
Easy
A. The main configuration file
B. A geographical location of a server
C. A backup of the entire database
D. A continuous range of sorted rows stored together
Correct Answer: A continuous range of sorted rows stored together
Explanation:
A Region is a subset of a table's data, containing a continuous, sorted range of rows based on their Row Keys.
12. According to the CAP theorem, which two properties does HBase primarily guarantee?
HBase Properties
Easy
A. Availability and Partition Tolerance (AP)
B. Consistency and Availability (CA)
C. None of the above
D. Consistency and Partition Tolerance (CP)
Correct Answer: Consistency and Partition Tolerance (CP)
Explanation:
HBase is a CP system. It guarantees strong consistency and partition tolerance, occasionally sacrificing availability during node failures.
13. How does HBase handle multiple versions of data stored in the same cell?
HBase Data Model
Easy
A. It overwrites the old data immediately.
B. It cannot store multiple versions of data.
C. It stores them using different Row Keys.
D. It differentiates them using a Timestamp.
Correct Answer: It differentiates them using a Timestamp.
Explanation:
HBase cells can contain multiple versions of the same data, which are differentiated and sorted by a Timestamp.
14. Which API operation is used to insert or update data in an HBase table?
HBase Operations
Easy
A. UPDATE
B. PUT
C. INSERT
D. POST
Correct Answer: PUT
Explanation:
The 'Put' command/API is used to insert new rows or update existing data in an HBase table.
15. Which API operation is used to fetch a single, specific row of data from HBase?
HBase Operations
Easy
A. FETCH
B. PULL
C. GET
D. SELECT
Correct Answer: GET
Explanation:
The 'Get' command is used to retrieve specific data from HBase based on a known Row Key.
16. In HBase, what is a Column Qualifier?
HBase Data Model
Easy
A. The part of the column name that identifies a specific column within a Column Family
B. The data type of the column
C. The name of the database
D. A constraint that prevents null values
Correct Answer: The part of the column name that identifies a specific column within a Column Family
Explanation:
A Column Qualifier is added to a Column Family to identify a specific piece of data (e.g., in 'personal:name', 'name' is the qualifier).
17. If an HMaster node fails in a highly available HBase cluster, what happens?
HBase Architecture
Easy
A. All data is deleted.
B. A backup HMaster is elected to take over.
C. The NameNode takes over HMaster duties.
D. The entire cluster immediately shuts down.
Correct Answer: A backup HMaster is elected to take over.
Explanation:
In a highly available setup, ZooKeeper will detect the failure and promote a standby HMaster to active status.
18. Why would a system choose HBase over plain HDFS?
HBase vs HDFS
Easy
A. HDFS cannot store large files.
B. HBase is much cheaper to install than HDFS.
C. HBase provides fast, random read/write access to data, whereas HDFS is designed for sequential batch processing.
D. HBase does not require any servers.
Correct Answer: HBase provides fast, random read/write access to data, whereas HDFS is designed for sequential batch processing.
Explanation:
HDFS excels at batch processing and sequential reads, but HBase provides the low-latency, random read/write access needed for real-time applications.
19. Which operation is used to iterate over a range of rows in an HBase table?
HBase Operations
Easy
A. GET
B. SCAN
C. ITERATE
D. LOOP
Correct Answer: SCAN
Explanation:
The 'Scan' operation allows a client to iterate over multiple rows, often defined by a start row and stop row.
20. Which famous Google technology paper inspired the creation of Apache HBase?
HBase Basics
Easy
A. Spanner
B. Bigtable
C. Google File System (GFS)
D. MapReduce
Correct Answer: Bigtable
Explanation:
Apache HBase was created as an open-source implementation based on the concepts presented in Google's Bigtable paper.
21. In an HBase cluster, how is the state and configuration of the distributed environment maintained to ensure high availability?
HBase Architecture
Medium
A. Through Apache ZooKeeper, which manages cluster coordination and tracks server failures
B. Through a dedicated relational database like MySQL
C. By storing the metadata directly in HDFS blocks replicated across the cluster
D. By electing a backup HMaster that continuously polls RegionServers
Correct Answer: Through Apache ZooKeeper, which manages cluster coordination and tracks server failures
Explanation:
Apache ZooKeeper is a centralized service used in HBase to maintain configuration information, provide distributed synchronization, and manage the active HMaster and RegionServer statuses.
22. If a user needs to retrieve a specific version of a value in an HBase table, which combination of coordinates is required?
HBase is a multidimensional map. To locate a specific cell's version, the exact coordinates required are the Row Key, Column Family, Column Qualifier, and the Timestamp.
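The four coordinates can be pictured as keys into nested maps. The sketch below is a toy model in plain Python, not the HBase client API; the table contents, timestamps, and function names are all made up for illustration.

```python
# Toy model only: nested dicts stand in for HBase's sorted, persistent map.
# table[row_key][column_family][qualifier][timestamp] -> value
table = {
    b"user1": {
        b"personal": {
            b"name": {
                1700000000000: b"Alice",
                1700000500000: b"Alicia",
            }
        }
    }
}


def get_version(table, row, family, qualifier, timestamp):
    """Return the value stored at exactly these four coordinates, or None."""
    return table.get(row, {}).get(family, {}).get(qualifier, {}).get(timestamp)


def get_latest(table, row, family, qualifier):
    """Return the value with the highest timestamp for the given column."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]
```

Dropping the timestamp from the lookup, as in `get_latest`, is what makes "give me the newest version" possible in this model.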
23. What happens automatically when an HBase table's region grows beyond its configured maximum file size?
Region and RegionServers
Medium
A. The region dynamically splits into two roughly equal child regions.
B. The RegionServer rejects further write requests until data is deleted.
C. The data is compressed and moved to an archive table.
D. The region is flushed to a secondary storage cluster.
Correct Answer: The region dynamically splits into two roughly equal child regions.
Explanation:
When an HBase region exceeds the predefined hbase.hregion.max.filesize, the RegionServer automatically splits it into two smaller daughter regions to distribute the load.
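As a rough illustration of the split idea, the sketch below divides a sorted list of row keys once it passes a threshold. It is a simplification under stated assumptions: `max_rows` is a hypothetical stand-in for a byte-size limit such as hbase.hregion.max.filesize, and the midpoint choice is only an approximation of how a real split point is picked.

```python
def split_region(rows, max_rows):
    """Split a sorted list of row keys into two halves once it exceeds max_rows.

    max_rows is a stand-in for a size threshold; this toy counts rows,
    whereas the real limit is expressed in bytes of stored data.
    """
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    # Each daughter region still holds a contiguous, sorted key range.
    return [rows[:mid], rows[mid:]]
```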
24. Which of the following scenarios is a valid reason to choose Apache HBase over directly storing files in HDFS?
HBase vs HDFS
Medium
A. The application requires streaming massive datasets for sequential batch processing.
B. The application needs to store large, immutable video files.
C. The application relies on complex SQL joins and multi-row transactions.
D. The application requires low-latency, random read and write access to billions of rows.
Correct Answer: The application requires low-latency, random read and write access to billions of rows.
Explanation:
HDFS is optimized for sequential reads of large files, whereas HBase is built on top of HDFS to provide fast, random, real-time read/write access to large datasets.
25. Why is it generally recommended to keep the number of column families in an HBase table small (typically 1 to 3)?
Column Families
Medium
A. Because HBase has a hard limit of 3 column families per table.
B. Because ZooKeeper cannot track metadata for more than 3 column families.
C. Because flushing occurs per region; flushing one column family forces the others to flush, creating many small HFiles.
D. Because read performance degrades exponentially as column families cannot be compressed.
Correct Answer: Because flushing occurs per region; flushing one column family forces the others to flush, creating many small HFiles.
Explanation:
In HBase, MemStores are flushed on a per-region basis. If one column family's MemStore fills up, all column families in that region are flushed, leading to unnecessary I/O and many small files if there are too many column families.
26. When a client sends a write request to a RegionServer, what is the correct operational sequence before the server acknowledges the write as successful?
HBase Write Path
Medium
A. Write to BlockCache -> Write to Write-Ahead Log (WAL) -> Acknowledge client
B. Write to MemStore -> Write to HFile -> Acknowledge client
C. Write to HFile directly -> Acknowledge client
D. Write to Write-Ahead Log (WAL) -> Write to MemStore -> Acknowledge client
Correct Answer: Write to Write-Ahead Log (WAL) -> Write to MemStore -> Acknowledge client
Explanation:
To ensure data durability and fast writes, HBase first appends the edit to the Write-Ahead Log (WAL) on disk, then stores it in the in-memory MemStore, and finally acknowledges the write to the client.
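The ordering described above can be sketched with a toy class. This is not HBase code: `ToyRegionServer`, its list-based WAL, and its dict-based MemStore are illustrative assumptions that only mimic the sequence of steps and why it allows recovery.

```python
class ToyRegionServer:
    """Illustrative only: a list and a dict stand in for the WAL and MemStore."""

    def __init__(self, wal=None):
        # The WAL is assumed to survive a crash; the MemStore does not.
        self.wal = wal if wal is not None else []
        self.memstore = {}

    def put(self, row, column, value):
        self.wal.append((row, column, value))  # 1. append the edit to the WAL
        self.memstore[(row, column)] = value   # 2. apply it to the MemStore
        return "ack"                           # 3. only then acknowledge

    def replay(self):
        """Rebuild the MemStore from the WAL after a simulated crash."""
        for row, column, value in self.wal:
            self.memstore[(row, column)] = value


server = ToyRegionServer()
server.put(b"r1", b"cf:q", b"v1")
server.put(b"r1", b"cf:q", b"v2")

# Simulate a crash: the in-memory state is gone, but the WAL is still there.
recovered = ToyRegionServer(wal=server.wal)
recovered.replay()
```

Because every acknowledged edit is already in the log, replaying it reproduces the lost in-memory state.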
27. How are rows physically sorted within an HBase table?
HBase Data Model
Medium
A. Chronologically by their insertion timestamp
B. Numerically by a system-generated auto-incrementing ID
C. Randomly based on the hash of the Column Family
D. Lexicographically by their Row Key
Correct Answer: Lexicographically by their Row Key
Explanation:
HBase stores row data sorted lexicographically (alphabetically/byte-order) by the Row Key. This sorting allows for efficient range scans.
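Byte-order sorting has a practical consequence that is easy to see with plain Python `bytes`, which compare byte by byte. The key names below are invented for illustration; the snippet does not touch HBase itself.

```python
# Keys are compared byte by byte, so numeric suffixes do not sort numerically.
keys = [b"row2", b"row10", b"row1"]
print(sorted(keys))  # [b'row1', b'row10', b'row2']

# Zero-padding to a fixed width makes byte order match numeric order.
padded = [b"row%05d" % n for n in (2, 10, 1)]
print(sorted(padded))  # [b'row00001', b'row00002', b'row00010']
```

This is why row keys that embed numbers are commonly padded to a fixed width when range scans need numeric order.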
28. In the event of a RegionServer crash, which HBase component is responsible for reassigning its regions to other healthy RegionServers?
HMaster and ZooKeeper
Medium
A. The Client Driver
B. Apache ZooKeeper
C. The HMaster
D. The NameNode
Correct Answer: The HMaster
Explanation:
The HMaster is responsible for monitoring RegionServers, handling load balancing, and reassigning regions when a RegionServer fails or joins the cluster.
29. What is the key difference between Minor Compaction and Major Compaction in HBase?
Compaction
Medium
A. Minor compaction runs only on the HMaster, while Major compaction runs on RegionServers.
B. Minor compaction merges smaller adjacent HFiles, while Major compaction merges all HFiles into one and removes deleted/expired cells.
C. Minor compaction merges all HFiles in a region, while Major compaction only merges MemStores.
D. Minor compaction deletes expired cells, while Major compaction only re-indexes the HFiles.
Correct Answer: Minor compaction merges smaller adjacent HFiles, while Major compaction merges all HFiles into one and removes deleted/expired cells.
Explanation:
Minor compactions combine a few smaller HFiles into a larger one to reduce file count. Major compactions rewrite all HFiles in a column family into a single HFile, permanently removing deleted or expired data.
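A heavily simplified picture of what a major compaction achieves is sketched below. It assumes one version per key and uses dicts as stand-ins for HFiles; `TOMBSTONE` and `major_compact` are hypothetical names, and real compaction also handles multiple versions and TTLs.

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker


def major_compact(hfiles):
    """Merge every toy HFile into one and drop deleted cells.

    Each toy HFile is a dict of key -> (timestamp, value). The newest
    timestamp wins, and keys whose newest entry is a tombstone are dropped.
    """
    merged = {}
    for hfile in hfiles:
        for key, (ts, value) in hfile.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Keep the output sorted by key and physically omit deleted entries.
    return {k: v for k, v in sorted(merged.items()) if v[1] is not TOMBSTONE}
```

For example, merging an older file holding `r1` and `r2` with a newer file that deletes `r1` and adds `r3` leaves a single file containing only `r2` and `r3`.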
30. A developer is inserting sequential time-series data into HBase using the timestamp as the row key. What performance issue is likely to occur?
Row Key Design
Medium
A. Data corruption due to overlapping timestamps
B. Automatic rejection of sequential keys by the HMaster
C. Region hotspotting, where all new writes hit a single RegionServer
D. ZooKeeper synchronization timeout
Correct Answer: Region hotspotting, where all new writes hit a single RegionServer
Explanation:
Because HBase sorts data lexicographically, sequentially increasing keys (like timestamps) will cause all new writes to be directed to a single region (and thus a single RegionServer), creating a bottleneck known as hotspotting.
31. During a read request, if a RegionServer does not find the requested data in the BlockCache or MemStore, what does it do to optimize the search on disk?
HBase Read Path
Medium
A. It scans all HFiles sequentially until the data is found.
B. It uses Bloom filters to skip HFiles that definitely do not contain the requested row key.
C. It queries the HMaster for the data location.
D. It requests the client to search the MapReduce output.
Correct Answer: It uses Bloom filters to skip HFiles that definitely do not contain the requested row key.
Explanation:
HBase uses Bloom filters, a space-efficient probabilistic data structure, to test whether a row key exists in an HFile, thereby drastically reducing unnecessary disk reads.
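For readers unfamiliar with the data structure, here is a minimal Bloom filter in plain Python. It illustrates the general technique only; the bit-array size, the number of hashes, and the use of SHA-256 are arbitrary choices for this sketch and say nothing about how HBase builds its filters.

```python
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A `False` answer lets a reader skip a file entirely, while a `True` answer still requires checking the file, because false positives are possible and false negatives are not.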
32. Which of the following guarantees does Apache HBase provide in the context of the CAP Theorem?
HBase vs RDBMS
Medium
A. Consistency and Partition Tolerance (CP)
B. Consistency and Availability (CA)
C. Eventual Consistency and High Availability
D. Availability and Partition Tolerance (AP)
Correct Answer: Consistency and Partition Tolerance (CP)
Explanation:
HBase is a CP system. It guarantees strong consistency for reads and writes at the row level and can tolerate network partitions, but it may sacrifice availability during certain failure scenarios (e.g., RegionServer crash recovery).
33. What is the primary purpose of the Write-Ahead Log (WAL) located on a RegionServer?
Region and RegionServers
Medium
A. To track the history of schema changes applied by the HMaster.
B. To audit user access and log analytical metrics for MapReduce.
C. To recover data that has not yet been flushed from the MemStore in the event of a server crash.
D. To cache frequently read rows to improve read performance.
Correct Answer: To recover data that has not yet been flushed from the MemStore in the event of a server crash.
Explanation:
The WAL records all write operations before they are saved in the MemStore. If a RegionServer crashes before the MemStore flushes to disk, the WAL is replayed to prevent data loss.
34. How does an HBase client initially discover the location of the hbase:meta table to perform read/write operations?
HBase Architecture
Medium
A. By asking the NameNode of the underlying HDFS cluster
B. By broadcasting a UDP request to all RegionServers
C. By querying Apache ZooKeeper
D. By connecting directly to the local HMaster process
Correct Answer: By querying Apache ZooKeeper
Explanation:
The client first connects to ZooKeeper, which holds the location of the RegionServer hosting the hbase:meta table. The client then queries the meta table to find the specific region for its data.
35. How does Apache HBase internally store data types such as Strings, Integers, or custom objects?
Data Types
Medium
A. As uninterpreted arrays of bytes
B. As JSON documents
C. As serialized XML strings
D. As heavily typed native Java objects
Correct Answer: As uninterpreted arrays of bytes
Explanation:
HBase does not enforce data types. All data—including row keys, column families, qualifiers, and values—is treated as uninterpreted byte arrays (byte[]). It is the client's responsibility to serialize and deserialize data.
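Client-side serialization can be illustrated with the Python standard library. The encodings chosen here (UTF-8 for text, an 8-byte big-endian integer) are assumptions for the sketch; any scheme works as long as the reader and writer agree on it.

```python
import struct

# The store would only ever see these byte strings, never the original types.
name_bytes = "Alice".encode("utf-8")
count_bytes = struct.pack(">q", 42)  # 8-byte big-endian signed integer

# The client must know which encoding was used in order to read values back.
name = name_bytes.decode("utf-8")
count = struct.unpack(">q", count_bytes)[0]
```

Nothing in the stored bytes records that one value was a string and the other an integer, which is why decoding with the wrong scheme silently yields garbage.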
36. Which of the following is NOT a responsibility of the HMaster in HBase?
HMaster and ZooKeeper
Medium
A. Handling DDL operations like create, alter, and drop tables
B. Serving client read and write requests for user tables
C. Monitoring all RegionServer instances in the cluster
D. Assigning regions to RegionServers at startup
Correct Answer: Serving client read and write requests for user tables
Explanation:
The HMaster handles administrative and metadata tasks. Client read and write data requests go directly to the RegionServers, bypassing the HMaster entirely to avoid bottlenecks.
37. To solve the region hotspotting problem caused by sequentially increasing row keys, which technique involves adding a deterministic prefix based on the original key?
Row Key Design
Medium
A. Row Compression
B. Hashing
C. Salting
D. Key Reversal
Correct Answer: Hashing
Explanation:
Hashing the row key (or applying a hash to generate a prefix) deterministically distributes sequential keys across multiple regions. (Note: Salting adds a random prefix, while Hashing uses a deterministic prefix).
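A deterministic prefix of this kind might look like the sketch below. The bucket count, the two-digit prefix format, and the choice of SHA-256 are all assumptions made for illustration, not a recommended or standard HBase key layout.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical bucket count, chosen for illustration


def bucketed_key(original_key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prepend a two-digit bucket derived from a hash of the original key."""
    digest = hashlib.sha256(original_key).digest()
    bucket = digest[0] % num_buckets
    # The same input always maps to the same bucket, so point reads can
    # recompute the full key without storing the prefix anywhere.
    return b"%02d-" % bucket + original_key
```

Because the prefix is recomputable from the original key, a client can still issue a direct lookup for a known key, which is not possible when the prefix is random.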
38. In HBase, when a cell is updated or deleted, what physically happens to the old data immediately?
HBase Data Model
Medium
A. The old data is overwritten in place on the disk.
B. The old data is moved to a temporary trash bin in HDFS.
C. The old data remains, and a new version is written; deletes are marked with a tombstone marker.
D. The old data is immediately stripped from the HFile through a synchronous compaction.
Correct Answer: The old data remains, and a new version is written; deletes are marked with a tombstone marker.
Explanation:
HFiles are immutable. Updates write a new cell version with a newer timestamp. Deletions write a 'tombstone' marker. The actual removal of old/deleted data happens later during a Major Compaction.
39. What is the primary function of the BlockCache in a RegionServer?
HBase Read Path
Medium
A. To store region metadata temporarily while communicating with ZooKeeper.
B. To cache frequently read data blocks in memory to speed up subsequent reads.
C. To buffer incoming write requests before appending to the WAL.
D. To group small HFiles together before executing a major compaction.
Correct Answer: To cache frequently read data blocks in memory to speed up subsequent reads.
Explanation:
The BlockCache is an LRU (Least Recently Used) in-memory cache on the RegionServer used to store frequently accessed read data, thereby reducing disk I/O latency.
40. If an HBase table defines a Column Family named metrics, which of the following represents a valid column qualifier creation process?
Column Families
Medium
A. Qualifiers are auto-generated by the HMaster sequentially.
B. Qualifiers are dynamically created by the client at the time of data insertion.
C. Qualifiers are extracted automatically from the Row Key hash.
D. Qualifiers must be strictly defined in the table schema before data insertion.
Correct Answer: Qualifiers are dynamically created by the client at the time of data insertion.
Explanation:
Unlike column families which must be pre-defined in the schema, column qualifiers in HBase are dynamic and can be created on-the-fly by the client when inserting data.
41. A telecommunications company uses Apache HBase to store call detail records (CDRs). They initially designed the RowKey as [Timestamp]-[CallerID]. They are experiencing severe write bottlenecks on a single RegionServer during peak hours. Which of the following RowKey redesign strategies effectively eliminates this 'hot-spotting' while maintaining optimal performance for time-range queries?
HBase Data Model & RowKey Design
Hard
A. Reversing the timestamp before appending the CallerID.
B. Hashing the CallerID and prepending a modulo-based bucket ID to the original RowKey.
C. Moving to a purely random UUID RowKey for perfectly uniform distribution.
D. Salting the RowKey by prepending a randomly generated byte array.
Correct Answer: Hashing the CallerID and prepending a modulo-based bucket ID to the original RowKey.
Explanation:
Prepending a hash-based bucket ID (salting deterministically based on CallerID) distributes writes evenly across RegionServers while still allowing optimized partial scans for a specific CallerID if the client queries all known buckets in parallel. Purely random UUIDs or random salting completely destroy the ability to efficiently scan by CallerID or time ranges.
42. During a high-throughput write operation in HBase, a RegionServer crashes immediately after writing a batch of mutations to the MemStore and appending them to the Write-Ahead Log (WAL), but before a flush to an HFile occurs. How does HBase ensure data durability and consistency in this scenario?
HBase Architecture & Write Path
Hard
A. The HMaster instructs the client to replay the failed mutations directly to the newly assigned RegionServer.
B. ZooKeeper detects the failure and immediately promotes the secondary MemStore replica on a different RegionServer to active.
C. HDFS automatically replicates the un-flushed MemStore blocks to another running RegionServer.
D. The HMaster splits the abandoned WAL into separate files per region and assigns them to new RegionServers for replay during region initialization.
Correct Answer: The HMaster splits the abandoned WAL into separate files per region and assigns them to new RegionServers for replay during region initialization.
Explanation:
When a RegionServer fails, the data held in its MemStore is lost. The HMaster detects the failure via ZooKeeper, takes ownership of the RegionServer's WAL on HDFS, splits it by region, and writes the recovered edits to the respective regions' new locations so they are replayed when the regions are brought online.
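The "split by region" step amounts to grouping log entries by the region they belong to while keeping their order. The sketch below shows just that grouping in plain Python; the `(region, edit)` tuple layout and the function name are hypothetical and much simpler than a real WAL entry.

```python
from collections import defaultdict


def split_wal(wal_entries):
    """Group (region, edit) pairs by region, preserving order within each region."""
    per_region = defaultdict(list)
    for region, edit in wal_entries:
        per_region[region].append(edit)
    return dict(per_region)
```

Each resulting list can then be replayed independently by whichever server takes over that region.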
43. An HBase cluster is experiencing high disk I/O and CPU utilization due to frequent major compactions. A data engineer proposes disabling major compactions entirely and relying solely on minor compactions. What is the primary negative consequence of this approach?
Compaction & Performance
Hard
A. The Write-Ahead Log (WAL) will never be rolled, eventually filling up the entire HDFS capacity.
B. Data locality across HDFS DataNodes will permanently drop to zero percent, requiring cross-rack reads for all queries.
C. Deleted cells (tombstones) and expired versions will never be purged, leading to infinitely growing storage and degraded read performance.
D. The MemStore will fill up faster, leading to more frequent flushes and OutOfMemory (OOM) errors.
Correct Answer: Deleted cells (tombstones) and expired versions will never be purged, leading to infinitely growing storage and degraded read performance.
Explanation:
Minor compactions only merge smaller HFiles into larger ones but do not drop deleted cells (tombstones) or cells that exceed the max versions/TTL. Only a major compaction rewrites all HFiles for a region/column family into a single HFile and purges deleted or expired data.
44. When an HBase client performs a Get request for a newly introduced RowKey for the very first time after the cluster has been restarted, what is the exact sequence of network interactions it performs to locate the correct RegionServer?
The client first connects to ZooKeeper to find the location of the hbase:meta table. It then connects to the RegionServer hosting hbase:meta to find the RegionServer hosting the user's specific row. Finally, it queries the target RegionServer. The client caches this routing information for future requests. The HMaster is not involved in the read/write path.
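The three-step lookup and the effect of caching can be mimicked with a toy client. Everything here is an assumption for illustration: dicts stand in for ZooKeeper and hbase:meta, server names are invented, and the cache is keyed by row key rather than by region range as a real client would do.

```python
class ToyClient:
    """Illustrative only: dicts stand in for ZooKeeper, hbase:meta and servers."""

    def __init__(self, zookeeper, meta_by_server):
        self.zookeeper = zookeeper            # {"meta_location": server_name}
        self.meta_by_server = meta_by_server  # {server_name: {row_key: server_name}}
        self.cache = {}
        self.lookups = 0                      # counts full ZooKeeper + meta trips

    def locate(self, row_key):
        if row_key not in self.cache:
            self.lookups += 1
            meta_server = self.zookeeper["meta_location"]   # step 1: ask ZooKeeper
            meta_table = self.meta_by_server[meta_server]   # step 2: read hbase:meta
            self.cache[row_key] = meta_table[row_key]
        return self.cache[row_key]            # step 3 would contact this server
```

After the first call for a key, later calls are answered from the cache, so neither the coordination service nor the meta table is contacted again.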
45. HBase provides strong consistency for row-level operations. Which internal mechanism guarantees that concurrent reads do not see partial updates from a parallel write operation to the same row?
HBase Consistency & MVCC
Hard
A. Synchronous replication to all HDFS DataNodes before returning an acknowledgment to the client.
B. Multi-Version Concurrency Control (MVCC) utilizing a sequence ID (Write Number) advanced upon completion of the MemStore update.
C. Exclusive row-level write locks acquired in the BlockCache.
D. Distributed locks managed by ZooKeeper on the target RowKey.
Correct Answer: Multi-Version Concurrency Control (MVCC) utilizing a sequence ID (Write Number) advanced upon completion of the MemStore update.
Explanation:
HBase uses MVCC to provide row-level ACID guarantees. A transaction is assigned a write number. The updates are applied to the MemStore but remain hidden from readers until the write completes and the region's read point is advanced past the transaction's write number.
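The read-point idea can be sketched in a few lines. This is a toy, single-threaded model with invented names, not HBase's actual MVCC implementation; it assumes writes complete in order and ignores locking entirely.

```python
class ToyMvccRegion:
    """Illustrative only: shows how a read point hides in-flight writes."""

    def __init__(self):
        self.read_point = 0   # highest write number visible to readers
        self.next_write = 1
        self.cells = []       # (write_number, column, value)

    def begin_write(self):
        write_number = self.next_write
        self.next_write += 1
        return write_number

    def apply(self, write_number, column, value):
        self.cells.append((write_number, column, value))

    def complete_write(self, write_number):
        # Simplification: assumes writes complete in order.
        self.read_point = max(self.read_point, write_number)

    def read_row(self):
        row = {}
        for write_number, column, value in self.cells:
            if write_number <= self.read_point:
                row[column] = value
        return row
```

A reader that runs between two `apply` calls of the same write sees neither column, and sees both once `complete_write` has advanced the read point.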
46. In a highly available HBase cluster, the Active HMaster node suddenly crashes due to a hardware failure. Assuming the ZooKeeper ensemble and all RegionServers remain healthy, what is the immediate impact on client applications executing heavy read and write (DML) workloads?
HBase Architecture & High Availability
Hard
A. The cluster will become strictly read-only to prevent split-brain scenarios until a new HMaster is elected.
B. Reads will succeed using cached region locations, but writes will fail because the WAL cannot be rotated.
C. Reads and writes will continue normally for existing regions, but schema changes (DDL) and handling of region splits/failures will halt until a Backup HMaster takes over.
D. All reads and writes will fail immediately with a MasterNotRunningException.
Correct Answer: Reads and writes will continue normally for existing regions, but schema changes (DDL) and handling of region splits/failures will halt until a Backup HMaster takes over.
Explanation:
The HMaster is responsible for administrative operations (DDL), region assignment, and handling RegionServer failures. It is not in the data path. Client reads and writes go directly to RegionServers, so DML operations continue uninterrupted during an HMaster outage.
47. An HBase administrator configures hbase.regionserver.global.memstore.size to 0.5 (50% of heap) and hfile.block.cache.size to 0.4 (40% of heap) on a RegionServer with a 32GB heap. During a sustained, mixed read/write workload, the RegionServer repeatedly crashes with OutOfMemoryError. What is the fundamental architectural constraint violated here?
MemStore & BlockCache
Hard
A. HBase strictly prohibits MemStore configurations exceeding 40% because of WAL serialization overhead.
B. The combined allocation of MemStore and BlockCache must strictly remain below 0.8 (80%) of the total heap to leave adequate room for internal RegionServer processing and RPC queues.
C. The BlockCache must always be strictly larger than the MemStore size to handle flush operations.
D. The combined memory allocation limits garbage collection, requiring G1GC which is incompatible with a 32GB heap.
Correct Answer: The combined allocation of MemStore and BlockCache must strictly remain below 0.8 (80%) of the total heap to leave adequate room for internal RegionServer processing and RPC queues.
Explanation:
HBase requires that the sum of hbase.regionserver.global.memstore.size and hfile.block.cache.size does not exceed 0.8 (80%). If it does, HBase throws an exception or risks OOM because it needs the remaining 20% of the heap for general processing, RPC buffers, and object overhead.
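The arithmetic for the scenario in the question can be checked directly. The helper below is a hypothetical sanity check written for this quiz, not an HBase utility; it applies the 0.8 limit stated in the explanation to the two configured fractions.

```python
def check_heap_fractions(memstore_fraction, block_cache_fraction, limit=0.8):
    """Return (combined, within_limit) for the two configured heap fractions."""
    combined = memstore_fraction + block_cache_fraction
    return combined, combined <= limit


# Values from the question: 0.5 for the MemStore, 0.4 for the BlockCache.
combined, ok = check_heap_fractions(0.5, 0.4)
heap_gb = 32
remaining_gb = heap_gb * (1 - combined)  # heap left for everything else
```

With these values the combined fraction is 0.9, which exceeds the limit, and only about 3.2 GB of the 32 GB heap would remain for RPC buffers and general processing.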
48. HBase achieves data locality by ensuring HFiles are written to HDFS DataNodes residing on the same physical machine as the RegionServer. Under which of the following circumstances is data locality temporarily lost, and how is it subsequently restored?
HBase Architecture & HDFS
Hard
A. Lost when client write volume exceeds WAL capacity; restored when WALs are archived.
B. Lost when a RegionServer fails and its regions are moved to another server; restored during the next Major Compaction.
C. Lost during MemStore flushes; restored when ZooKeeper triggers a locality sync.
D. Lost during HDFS NameNode failover; restored automatically by the HDFS Balancer script.
Correct Answer: Lost when a RegionServer fails and its regions are moved to another server; restored during the next Major Compaction.
Explanation:
When regions are reassigned to a new RegionServer (due to failure or load balancing), the existing HFiles remain on their original DataNodes, resulting in lost data locality. Locality is restored when the new RegionServer performs a Major Compaction, as the newly written HFiles will favor the local DataNode.
49. A developer configures a ROWCOL Bloom Filter on an HBase column family to optimize high-latency read queries. For which of the following access patterns will this specific Bloom Filter provide the most significant performance improvement?
HBase Read Path & Bloom Filters
Hard
A. Get operations querying a specific RowKey and a highly specific subset of Column Qualifiers that rarely exist.
B. Scan operations filtering on specific cell values using a ValueFilter.
C. Get operations querying the entire row (all column qualifiers) for a specific RowKey.
D. Scan operations retrieving all columns for a contiguous range of RowKeys.
Correct Answer: Get operations querying a specific RowKey and a highly specific subset of Column Qualifiers that rarely exist.
Explanation:
A ROWCOL Bloom filter hashes both the RowKey and the Column Qualifier. It is highly effective at ruling out the presence of specific columns for a specific row in an HFile, preventing unnecessary disk reads for point-lookups requesting specific missing qualifiers.
50. Consider an HBase table where a client inserts a cell at RowKey R1, Column Family CF1, Qualifier Q1, with a specific timestamp. Later, the client issues a Delete operation for R1:CF1:Q1 without specifying a timestamp. Assuming default behavior, what exactly does HBase write to the storage engine?
HBase Data Model
Hard
A. A tombstone marker for R1:CF1:Q1 with the server's current timestamp, which shadows all older versions during reads.
B. A DeleteFamily marker at the row level that masks the entire R1:CF1 combination regardless of qualifier.
C. It directly modifies the MemStore and HFile to physically erase the bytes associated with that timestamp.
D. A tombstone marker for R1:CF1:Q1 explicitly matched to that timestamp, leaving any versions with other timestamps visible.
Correct Answer: A tombstone marker for R1:CF1:Q1 with the server's current timestamp, which shadows all older versions during reads.
Explanation:
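When a delete is issued for a column without a specific timestamp, HBase does not erase anything in place. It writes a column-level tombstone marker (DeleteColumn) stamped with the current server timestamp. During read operations, this tombstone masks all versions of that cell with a timestamp older than or equal to the tombstone's timestamp; by default, the masked cells and the marker are physically removed only by a later major compaction. In the Java client this behavior corresponds to Delete.addColumns, whereas Delete.addColumn (singular) without a timestamp targets only the latest version.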
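The masking rule can be sketched with a toy model of one cell's versions. The function `visible_versions` and the integer timestamps are hypothetical; the sketch models only the read-time comparison, not HBase's storage format.

```python
def visible_versions(puts, tombstone_ts):
    """Return the timestamps of puts not masked by a column tombstone.

    puts: iterable of (timestamp, value) pairs for one row/family/qualifier.
    tombstone_ts: timestamp of a DeleteColumn-style marker, or None.
    """
    if tombstone_ts is None:
        return sorted(ts for ts, _ in puts)
    # The marker hides every version at or before its own timestamp.
    return sorted(ts for ts, _ in puts if ts > tombstone_ts)


puts = [(100, "v1"), (200, "v2")]
after_delete = visible_versions(puts, tombstone_ts=300)
after_new_put = visible_versions(puts + [(400, "v3")], tombstone_ts=300)
```

In this model both earlier versions disappear from reads once the tombstone at 300 exists, while a put written afterwards with timestamp 400 is visible again.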
51In the context of the CAP theorem, Apache HBase is classified as a CP (Consistent and Partition Tolerant) system. In the event of a network partition separating the HMaster and several RegionServers from Zookeeper, how does HBase sacrifice Availability to maintain Consistency?
HBase and CAP Theorem
Hard
A.RegionServers switch to read-only mode, serving stale data from BlockCache until Zookeeper reconnects.
B.HBase transparently routes all requests to a backup HDFS cluster, ensuring availability but sacrificing read latency.
C.The HMaster forcefully formats the WALs of the partitioned RegionServers, causing temporary write unavailability but ensuring zero data divergence.
D.The partitioned RegionServers voluntarily shut down (suicide) because they lose their ephemeral nodes in Zookeeper, making their hosted regions temporarily unavailable.
Correct Answer: The partitioned RegionServers voluntarily shut down (suicide) because they lose their ephemeral nodes in Zookeeper, making their hosted regions temporarily unavailable.
Explanation:
HBase favors Consistency over Availability. If a partition cuts a RegionServer off from ZooKeeper, its ZooKeeper session expires and its ephemeral node is deleted. The RegionServer then shuts itself down to prevent split-brain and inconsistent state, so its regions remain unavailable until they are reassigned.
52A data science team needs to calculate the real-time sum of a specific numerical column across millions of rows in an HBase table. Retrieving all rows to the client application is too slow due to network overhead. Which HBase feature provides the most efficient, distributed execution for this aggregation directly on the RegionServers?
HBase Advanced Features
Hard
A.Endpoint Coprocessors.
B.Client-side caching with Scan.setBatch().
C.Observer Coprocessors.
D.HBase MapReduce Integration.
Correct Answer: Endpoint Coprocessors.
Explanation:
Endpoint Coprocessors act similarly to stored procedures in an RDBMS. They allow custom computation (such as an aggregation or sum) to run on the RegionServers, next to the data, and return only the aggregated result to the client, which avoids shipping millions of rows over the network. Observer Coprocessors, by contrast, are akin to database triggers.
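The network saving can be sketched by comparing how many values cross the wire in each approach. This is a plain-Python simulation with made-up data; the function names are hypothetical and no HBase coprocessor API is used.

```python
# Hypothetical data: each inner list stands in for the values held by one region.
regions = [[3, 5, 8], [13, 21], [34, 55, 89, 144]]


def region_partial_sum(region_values):
    # Runs "next to the data": only one number leaves the region.
    return sum(region_values)


def client_side_total(all_regions):
    # Baseline: every value crosses the network before being summed.
    shipped = [v for region in all_regions for v in region]
    return sum(shipped), len(shipped)


def endpoint_style_total(all_regions):
    # Each region returns a single partial sum; the client adds them up.
    partials = [region_partial_sum(r) for r in all_regions]
    return sum(partials), len(partials)


baseline_total, values_shipped = client_side_total(regions)
endpoint_total, partials_shipped = endpoint_style_total(regions)
```

Both approaches produce the same total, but the endpoint-style version transfers one number per region instead of one per row.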
53HFiles use a block-based format (default 64KB block size) to store data on HDFS. Within the HFile structure, how does the Read Path rapidly locate a specific RowKey without scanning the entire file?
HBase Architecture & HFile
Hard
A.By loading the HFile's Data Block Index (located via the Trailer) into memory, which maps the start keys of data blocks to their physical offsets.
B.By utilizing an embedded B+ Tree structure stored at the beginning of the HFile.
C.By querying the hbase:meta table, which holds the exact byte offsets for every RowKey in the cluster.
D.By sequentially scanning the WAL to find the most recent memory offset for the HFile.
Correct Answer: By loading the HFile's Data Block Index (located via the Trailer) into memory, which maps the start keys of data blocks to their physical offsets.
Explanation:
An HFile contains a Trailer at the end of the file which points to the Meta and Data Block Indices. These indices are loaded into memory when the HFile is opened. They contain the start keys for each 64KB data block, allowing the RegionServer to perform a binary search and seek directly to the relevant block.
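The lookup described above amounts to a binary search over sorted block start keys. A minimal sketch, assuming a hypothetical in-memory index of four blocks (the keys and offsets are invented for illustration):

```python
import bisect

# Hypothetical in-memory index: sorted first keys of data blocks and their offsets.
block_start_keys = [b"apple", b"grape", b"mango", b"peach"]
block_offsets = [0, 65536, 131072, 196608]


def locate_block(row_key: bytes):
    """Return the offset of the only block that could hold row_key, or None."""
    # bisect_right finds the first start key greater than row_key;
    # the candidate block is the one just before it.
    idx = bisect.bisect_right(block_start_keys, row_key) - 1
    if idx < 0:
        return None  # row_key sorts before the first block in this file
    return block_offsets[idx]
```

For example, `locate_block(b"kiwi")` resolves to the block starting at `b"grape"`, so only that one block has to be read and searched.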
54A developer heavily uses reverse scans (Scan.setReversed(true)) to fetch the most recent records from a time-series HBase table where RowKeys are monotonically increasing timestamps. They observe significant performance degradation compared to forward scans. What is the fundamental architectural reason for this degradation?
HBase Performance Tuning
Hard
A.Reverse scans bypass the BlockCache entirely to avoid cache poisoning.
B.HFiles and MemStores use a forward-linked skip-list and internal block encodings (like prefix encoding) optimized strictly for forward sequential access.
C.Zookeeper must continually re-calculate region boundaries during a reverse scan, causing high CPU overhead.
D.Reverse scans require the RegionServer to perform an on-the-fly Major Compaction before returning data.
Correct Answer: HFiles and MemStores use a forward-linked skip-list and internal block encodings (like prefix encoding) optimized strictly for forward sequential access.
Explanation:
HBase is heavily optimized for forward scanning. HFile data blocks often use prefix-style data block encoding, in which each key is stored relative to the previous one, so the current row cannot be decoded without the row before it. Internal structures such as the MemStore's skip list are also forward-linked. Reverse scans force the system to seek backwards repeatedly and decode blocks sub-optimally, which makes them noticeably slower than forward scans.
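Why prefix encoding favors forward access can be seen in a toy encoder. This is a simplified scheme written for illustration, not HBase's on-disk encoding; `prefix_encode` and `decode_up_to` are hypothetical names.

```python
def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) relative to the previous key."""
    encoded, prev = [], b""
    for key in sorted_keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def decode_up_to(encoded, index):
    """Recover the key at `index`; this must replay every earlier entry."""
    prev = b""
    for shared, suffix in encoded[: index + 1]:
        prev = prev[:shared] + suffix
    return prev


keys = [b"row-0001", b"row-0002", b"row-0010"]
encoded = prefix_encode(keys)
last_key = decode_up_to(encoded, 2)
```

Reading the entries front to back decodes each key in one step, whereas jumping to an arbitrary entry, or stepping to the previous one, requires replaying the block from its start.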
55To improve write throughput on a massive HBase cluster, the administration enables the MultiWAL feature. What is the specific bottleneck this feature aims to resolve?
HBase Write Path
Hard
A.The inability of a single WAL to replicate to multiple HDFS data centers simultaneously.
B.The Zookeeper locking overhead when multiple RegionServers attempt to write to the same region simultaneously.
C.The low throughput of a single HDFS write pipeline when writing WAL edits sequentially to a single file.
D.The single-threaded nature of the MemStore flush operation.
Correct Answer: The low throughput of a single HDFS write pipeline when writing WAL edits sequentially to a single file.
Explanation:
By default, a RegionServer writes all edits to a single WAL file, limited by the throughput of a single HDFS write pipeline. MultiWAL allows a RegionServer to write to multiple WAL files in parallel, utilizing multiple HDFS pipelines and significantly increasing total write throughput.
56A column family CF1 is configured with VERSIONS => 3 and MIN_VERSIONS => 1, and a TTL (Time-To-Live) of 86400 seconds (1 day). If 5 updates are made to a specific cell within the last hour, and 1 update was made 2 days ago, what will a major compaction retain?
HBase Data Model
Hard
A.Only 1 update (the 2-day old one) because MIN_VERSIONS overrides TTL, and the recent ones exceed the version limit.
B.The most recent update only, as TTL aggressively purges any multi-versioned data to save space.
C.All 5 updates from the last hour (ignoring max versions to satisfy MIN_VERSIONS); the 2-day-old update is dropped.
D.The 3 most recent updates from the last hour; the older 2 updates and the 2-day-old update are dropped.
Correct Answer: The 3 most recent updates from the last hour; the older 2 updates and the 2-day-old update are dropped.
Explanation:
During a major compaction, HBase enforces VERSIONS => 3 and keeps only the 3 newest versions, all of which are well within the 1-day TTL. The other 2 updates from the last hour are dropped because they exceed the version limit, and the 2-day-old update is dropped because it is both past the TTL and beyond the 3-version limit. MIN_VERSIONS => 1 only ensures that at least one version is kept even if it has passed its TTL; here there are recent, unexpired versions, so it has no effect and the top 3 are kept.
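The interaction of the three settings can be worked through with a small retention function. This is a simplified model of the rule as described in the explanation, with hypothetical names and second-based timestamps; it is not HBase's compaction code.

```python
def retained_after_major_compaction(timestamps, now, max_versions, min_versions, ttl):
    """Toy retention rule for one cell; timestamps and ttl are in seconds."""
    newest_first = sorted(timestamps, reverse=True)
    kept = []
    for ts in newest_first:
        if len(kept) >= max_versions:
            break  # VERSIONS cap reached
        expired = (now - ts) > ttl
        if expired and len(kept) >= min_versions:
            continue  # past TTL and MIN_VERSIONS is already satisfied
        kept.append(ts)
    return kept


DAY = 86400
now = 10 * DAY
recent = [now - 50 * 60, now - 40 * 60, now - 30 * 60, now - 20 * 60, now - 10 * 60]
old = [now - 2 * DAY]
kept = retained_after_major_compaction(recent + old, now, 3, 1, DAY)
only_old = retained_after_major_compaction(old, now, 3, 1, DAY)
```

With the question's data, `kept` holds the three newest timestamps. The second call shows where MIN_VERSIONS matters: if the 2-day-old update were the only version, it would survive despite being past its TTL.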
57HBase relies on a Log-Structured Merge-Tree (LSM-Tree) architecture rather than a traditional B-Tree. Which of the following best describes the primary advantage of this choice for Big Data workloads?
Log-Structured Merge-Tree
Hard
A.LSM-Trees convert random, concurrent write operations into sequential disk I/O, maximizing write throughput.
B.LSM-Trees inherently support cross-row atomic transactions, which B-Trees cannot provide.
C.LSM-Trees provide strictly bounded, predictable latencies for highly randomized point reads compared to B-Trees.
D.LSM-Trees eliminate the need for an in-memory buffer, allowing all writes to go directly to disk without CPU overhead.
Correct Answer: LSM-Trees convert random, concurrent write operations into sequential disk I/O, maximizing write throughput.
Explanation:
LSM-Trees buffer writes in memory (the MemStore) and append them to a sequential log (the WAL) for durability. When the MemStore fills, its sorted contents are flushed to disk sequentially as a new file. This transforms random user writes into highly efficient sequential disk I/O, heavily optimizing write throughput, which is critical for Big Data ingestion.
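The write path can be sketched with a toy store that logs every write, buffers it in memory, and flushes sorted runs. The class `ToyLsmStore` and its flush threshold are assumptions for illustration; real flushes are triggered by memory size and write HFiles to HDFS.

```python
class ToyLsmStore:
    """Buffers writes in memory and flushes them as sorted, immutable runs."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []           # append-only log, in arrival order
        self.memstore = {}      # in-memory buffer
        self.flushed_runs = []  # each run is a sorted list written in one pass

    def put(self, key, value):
        self.wal.append((key, value))  # sequential append for durability
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Keys arrive in random order but leave memory fully sorted.
        self.flushed_runs.append(sorted(self.memstore.items()))
        self.memstore = {}


store = ToyLsmStore()
for key in ["m", "c", "x", "a"]:
    store.put(key, key.upper())
```

The keys arrive out of order, yet the only disk-bound operations in this model are appends to the log and one sorted run written per flush.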
58Zookeeper tracks the status of RegionServers using ephemeral nodes in a specific directory (e.g., /hbase/rs). What edge-case occurs if a RegionServer experiences a severe Java "Stop-The-World" Garbage Collection pause lasting longer than the Zookeeper session timeout?
HBase Architecture & Zookeeper
Hard
A.The RegionServer automatically transitions to a read-only state to serve stale data until GC finishes.
B.The HMaster pauses all DML operations across the entire cluster until the GC pause completes and the node responds.
C.Zookeeper assumes the RegionServer is dead, expires its session, and the HMaster begins reassigning its regions; when the GC finishes, the RegionServer shuts itself down.
D.Zookeeper dynamically extends the session timeout via a heartbeat retry mechanism, keeping the cluster stable.
Correct Answer: Zookeeper assumes the RegionServer is dead, expires its session, and the HMaster begins reassigning its regions; when the GC finishes, the RegionServer shuts itself down.
Explanation:
A long Stop-The-World GC pause prevents the RegionServer from sending heartbeats to ZooKeeper. If the session times out, ZooKeeper deletes the ephemeral node. The HMaster treats this as a failure and reassigns the regions. When the RegionServer wakes up from GC, it discovers that it has lost its ZooKeeper session and shuts itself down ("suicide") to prevent split-brain issues.
59You are pre-splitting an HBase table to prevent initial hot-spotting. The RowKeys are MD5 hashes represented as 32-character hexadecimal strings (0-9, a-f). You want to pre-split the table into exactly 16 regions. Which of the following represents the optimal set of split keys?
HBase Performance Tuning
Hard
A.Allowing HBase to dynamically auto-split the table based on the ConstantSizeRegionSplitPolicy.
B.A set of 16 keys, where the first character iterates from '0' to 'f' (i.e., ['0', '1', '2', ..., 'e', 'f']).
C.A set of 15 keys based on the integer representation of the hashes divided by 16.
D.A set of 15 keys, where the first character iterates from '1' to 'f' (i.e., ['1', '2', '3', ..., 'e', 'f']).
Correct Answer: A set of 15 keys, where the first character iterates from '1' to 'f' (i.e., ['1', '2', '3', ..., 'e', 'f']).
Explanation:
To divide a hexadecimal keyspace evenly into 16 regions based on the first character, you need 15 boundaries (split keys). The regions will automatically be: (-∞, 1), [1, 2), [2, 3) ... [e, f), [f, +∞). Providing 16 keys would result in 17 regions.
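The boundary arithmetic can be checked with a short sketch that builds the 15 split keys and maps a row key to its region. The helper `region_index` and the sample keys are hypothetical; the sketch relies only on the fact that split keys are compared as sorted strings.

```python
import bisect

# 15 boundaries: '1' through 'f'. Keys starting with '0' fall before the first one.
split_keys = [format(i, "x") for i in range(1, 16)]


def region_index(row_key: str) -> int:
    """Index (0-15) of the region whose [start, end) range contains row_key."""
    return bisect.bisect_right(split_keys, row_key)


first_region = region_index("0a1b2c3d4e5f60718293a4b5c6d7e8f9")
last_region = region_index("f" * 32)
```

Fifteen boundaries yield region indices 0 through 15, so each leading hexadecimal character maps to exactly one region.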
60Which of the following scenarios requires an external transaction management layer (such as Apache Phoenix or Tephra) because it falls outside HBase's out-of-the-box ACID guarantees?
HBase Consistency
Hard
A.Updating a cell while a concurrent thread is reading the same exact cell, requiring read-committed isolation.
B.Ensuring that a write to the MemStore is immediately durable even if the node crashes before flushing.
C.Atomically updating the name, age, and address columns of a single user within the same RowKey.
D.Atomically transferring a balance between two different RowKeys representing different user accounts.
Correct Answer: Atomically transferring a balance between two different RowKeys representing different user accounts.
Explanation:
HBase natively provides strict ACID guarantees only at the single-row level. Any operation spanning multiple rows (like transferring a balance between Row A and Row B) requires a secondary transaction manager (like Omid, Tephra, or Phoenix) to achieve atomicity across multiple rows or regions.