Unit 4 - Notes
Unit 4: Introduction to Apache Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface (HiveQL) to query data stored in various databases and file systems that integrate with Hadoop. It abstracts the complexity of MapReduce, making it easier for data analysts to query large datasets.
1. Hive Installation
Installing Apache Hive requires a pre-existing, functioning Hadoop cluster. Hive acts as a client-side application and does not need to be installed on all nodes; it only needs to be installed on the node from which queries will be submitted.
Prerequisites
- Java: Java 8 or later must be installed.
- Hadoop: Apache Hadoop must be installed, configured, and running (HDFS and YARN).
Step-by-Step Installation
Step 1: Download Apache Hive
Download the latest stable release of Apache Hive from the official Apache mirrors.
wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
Step 2: Extract the Tarball
Extract the downloaded file to your preferred installation directory.
tar -xzvf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin /usr/local/hive
Step 3: Set Environment Variables
Add the Hive path to the system's environment variables (e.g., ~/.bashrc or ~/.bash_profile).
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
Apply the changes:
source ~/.bashrc
Step 4: Configure Hadoop for Hive
Create HDFS directories for Hive to store its data and grant the necessary permissions.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
Step 5: Configure hive-site.xml
Navigate to the $HIVE_HOME/conf directory. Copy the template and create hive-site.xml. Configure the database connection for the Hive Metastore (Derby is used by default for local, but MySQL/PostgreSQL is recommended for production).
Step 6: Initialize the Metastore
Before starting Hive for the first time, initialize the Metastore schema using the schematool utility.
schematool -dbType derby -initSchema
Step 7: Start Hive
Launch the Hive CLI or Beeline (the modern JDBC client).
hive
# OR
beeline -u jdbc:hive2://
2. Hive Data Types
Hive supports a wide variety of data types, which are broadly categorized into Primitive and Complex data types.
Primitive Data Types
Primitive data types represent single values.
- Numeric Types:
TINYINT: 1-byte signed integer.SMALLINT: 2-byte signed integer.INT: 4-byte signed integer.BIGINT: 8-byte signed integer.FLOAT: 4-byte single-precision floating point.DOUBLE: 8-byte double-precision floating point.DECIMAL: Fixed-point arbitrary-precision.
- String Types:
STRING: Unbounded length text.VARCHAR: Variable-length text with a specified maximum length.CHAR: Fixed-length text.
- Date/Time Types:
TIMESTAMP: Represents a specific point in time (nanosecond precision).DATE: Represents a year/month/day without a time component.INTERVAL: Represents a time or day interval.
- Miscellaneous Types:
BOOLEAN:TRUEorFALSE.BINARY: Array of bytes.
Complex Data Types
Complex data types are built using primitive data types and allow for the storage of nested or multi-valued data.
- ARRAY: An ordered collection of elements of the same type.
- Syntax:
ARRAY<data_type> - Example:
ARRAY<STRING>(Accessed via zero-based index:arr[0])
- Syntax:
- MAP: An unordered collection of key-value pairs. Keys must be primitives.
- Syntax:
MAP<primitive_type, data_type> - Example:
MAP<STRING, INT>(Accessed via key:map['key'])
- Syntax:
- STRUCT: A collection of elements of different types, similar to an object or a row.
- Syntax:
STRUCT<col_name:data_type, ...> - Example:
STRUCT<first:STRING, last:STRING>(Accessed via dot notation:struct.first)
- Syntax:
- UNION: A data type that can hold exactly one value from its specified data types.
- Syntax:
UNIONTYPE<data_type, data_type, ...>
- Syntax:
3. Hive Partitioning
Partitioning is an optimization technique in Hive that divides a large table into smaller, manageable parts (partitions) based on the values of one or more columns (e.g., date, country, department).
Concept and Purpose
- Directory Mapping: Each partition corresponds to a separate directory in HDFS.
- Query Optimization: When a query filters on the partition column, Hive only reads the relevant directory (Partition Pruning), drastically reducing I/O and query time.
Types of Partitioning
- Static Partitioning: The partition value is hard-coded into the
INSERTorLOADstatement. Best used when loading a specific, known chunk of data. - Dynamic Partitioning: Hive determines the partition value dynamically based on the data being inserted. Requires setting configuration properties (
set hive.exec.dynamic.partition=true;).
Syntax and Example
Creating a Partitioned Table:
CREATE TABLE sales (
transaction_id INT,
amount DOUBLE
)
PARTITIONED BY (country STRING, year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Loading Data into a Static Partition:
LOAD DATA LOCAL INPATH '/tmp/us_2023_sales.csv'
INTO TABLE sales PARTITION (country='US', year=2023);
4. Hive Bucketing
Bucketing (or Clustering) is another data organization technique in Hive. While partitioning divides data into directories, bucketing divides data into a fixed number of files (buckets) within a directory or partition.
Concept and Purpose
- Hashing: Hive uses a hash function on the bucketed column to determine which bucket (file) a record belongs to.
- Map-Side Joins: If two tables are bucketed on the same join key and have a multiple of the same number of buckets, Map-side joins are highly optimized.
- Data Sampling: Bucketing allows for efficient data sampling on large datasets.
Bucketing vs. Partitioning
- Partitioning is based on column values (creates directories). Best for columns with low cardinality (e.g., year, country).
- Bucketing is based on the hash of a column (creates files). Best for columns with high cardinality (e.g., user_id, transaction_id).
Syntax and Example
Creating a Bucketed Table:
CREATE TABLE users (
user_id INT,
name STRING,
age INT
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
5. HiveQL Operations
HiveQL (HQL) is the query language for Hive, heavily resembling standard SQL. Operations are divided into DDL, DML, and DQL.
Data Definition Language (DDL)
Used to create, alter, and drop database objects.
- Create Database:
CREATE DATABASE IF NOT EXISTS retail_db; - Create Table (Internal/Managed): Data is owned by Hive. Dropping the table drops the data.
SQLCREATE TABLE employees (id INT, name STRING); - Create Table (External): Data is not owned by Hive. Dropping the table only drops the metadata, not the HDFS data.
SQLCREATE EXTERNAL TABLE ext_employees (id INT, name STRING) LOCATION '/user/data/employees'; - Drop Table:
DROP TABLE employees; - Alter Table:
ALTER TABLE employees RENAME TO staff;
Data Manipulation Language (DML)
Used to load and modify data.
- Load Data: Moves or copies files into the Hive warehouse.
SQLLOAD DATA LOCAL INPATH '/local/path/data.txt' INTO TABLE employees; - Insert Data: Inserts data using a query.
SQLINSERT INTO TABLE target_table SELECT * FROM source_table; - Update/Delete: Supported only on tables configured for ACID transactions (requires specific file formats like ORC and bucketing).
Data Query Language (DQL)
Used to retrieve data.
- Basic Select:
SELECT id, name FROM employees WHERE age > 30; - Group By & Aggregation:
SELECT department, COUNT(*) FROM employees GROUP BY department; - Joins: Supports
INNER JOIN,LEFT OUTER JOIN,RIGHT OUTER JOIN,FULL OUTER JOIN.
SQLSELECT e.name, d.dept_name FROM employees e JOIN departments d ON e.dept_id = d.id;
6. Hive Operators
Hive provides built-in operators to perform mathematical, logical, and relational computations within queries.
Relational Operators
Used to compare operands. They return a boolean value (TRUE, FALSE, or NULL).
A = B: TRUE if A equals B.A != BorA <> B: TRUE if A is not equal to B.A < B,A <= B,A > B,A >= B: Standard inequalities.A IS NULL/A IS NOT NULL: Checks for null values.A LIKE B: Simple pattern matching (e.g.,name LIKE 'Sm%').A RLIKE BorA REGEXP B: Advanced pattern matching using Java regular expressions.
Arithmetic Operators
Used for mathematical calculations. Return numeric types.
A + B: AdditionA - B: SubtractionA * B: MultiplicationA / B: Division (returns DOUBLE)A % B: Modulo (remainder)A & B,A | B,A ^ B,~A: Bitwise AND, OR, XOR, and NOT.
Logical Operators
Used to combine multiple conditions. Return boolean values.
A AND B: TRUE if both A and B are TRUE.A OR B: TRUE if either A or B is TRUE.NOT Aor!A: TRUE if A is FALSE.
Complex Type Operators
Used to access elements within complex data types (Arrays, Maps, Structs).
- Array Elements:
A[n](Returns the nth element of array A, starting at index 0). - Map Elements:
M[key](Returns the value corresponding to key in map M). - Struct Elements:
S.x(Returns the field x from the struct S).