Unit 4 - Notes

INT312 7 min read

Unit 4: Introduction to Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface (HiveQL) to query data stored in various databases and file systems that integrate with Hadoop. It abstracts the complexity of MapReduce, making it easier for data analysts to query large datasets.

1. Hive Installation

Installing Apache Hive requires a pre-existing, functioning Hadoop cluster. Hive acts as a client-side application and does not need to be installed on all nodes; it only needs to be installed on the node from which queries will be submitted.

Prerequisites

Java: Java 8 or later must be installed.
Hadoop: Apache Hadoop must be installed, configured, and running (HDFS and YARN).

Step-by-Step Installation

Step 1: Download Apache Hive
Download the latest stable release of Apache Hive from the official Apache mirrors.

BASH

wget https://downloads.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

Step 2: Extract the Tarball
Extract the downloaded file to your preferred installation directory.

BASH

tar -xzvf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin /usr/local/hive

Step 3: Set Environment Variables
Add the Hive path to the system's environment variables (e.g., ~/.bashrc or ~/.bash_profile).

BASH

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the changes:

BASH

source ~/.bashrc

Step 4: Configure Hadoop for Hive
Create HDFS directories for Hive to store its data and grant the necessary permissions.

BASH

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -chmod g+w /tmp

Step 5: Configure hive-site.xml
Navigate to the $HIVE_HOME/conf directory. Copy the template and create hive-site.xml. Configure the database connection for the Hive Metastore (Derby is used by default for local, but MySQL/PostgreSQL is recommended for production).

Step 6: Initialize the Metastore
Before starting Hive for the first time, initialize the Metastore schema using the schematool utility.

BASH

schematool -dbType derby -initSchema

Step 7: Start Hive
Launch the Hive CLI or Beeline (the modern JDBC client).

BASH

hive
# OR
beeline -u jdbc:hive2://

2. Hive Data Types

Hive supports a wide variety of data types, which are broadly categorized into Primitive and Complex data types.

Primitive Data Types

Primitive data types represent single values.

Numeric Types:
- TINYINT: 1-byte signed integer.
- SMALLINT: 2-byte signed integer.
- INT: 4-byte signed integer.
- BIGINT: 8-byte signed integer.
- FLOAT: 4-byte single-precision floating point.
- DOUBLE: 8-byte double-precision floating point.
- DECIMAL: Fixed-point arbitrary-precision.
String Types:
- STRING: Unbounded length text.
- VARCHAR: Variable-length text with a specified maximum length.
- CHAR: Fixed-length text.
Date/Time Types:
- TIMESTAMP: Represents a specific point in time (nanosecond precision).
- DATE: Represents a year/month/day without a time component.
- INTERVAL: Represents a time or day interval.
Miscellaneous Types:
- BOOLEAN: TRUE or FALSE.
- BINARY: Array of bytes.

Complex Data Types

Complex data types are built using primitive data types and allow for the storage of nested or multi-valued data.

ARRAY: An ordered collection of elements of the same type.
- Syntax: ARRAY<data_type>
- Example: ARRAY<STRING> (Accessed via zero-based index: arr[0])
MAP: An unordered collection of key-value pairs. Keys must be primitives.
- Syntax: MAP<primitive_type, data_type>
- Example: MAP<STRING, INT> (Accessed via key: map['key'])
STRUCT: A collection of elements of different types, similar to an object or a row.
- Syntax: STRUCT<col_name:data_type, ...>
- Example: STRUCT<first:STRING, last:STRING> (Accessed via dot notation: struct.first)
UNION: A data type that can hold exactly one value from its specified data types.
- Syntax: UNIONTYPE<data_type, data_type, ...>

3. Hive Partitioning

Partitioning is an optimization technique in Hive that divides a large table into smaller, manageable parts (partitions) based on the values of one or more columns (e.g., date, country, department).

Concept and Purpose

Directory Mapping: Each partition corresponds to a separate directory in HDFS.
Query Optimization: When a query filters on the partition column, Hive only reads the relevant directory (Partition Pruning), drastically reducing I/O and query time.

Types of Partitioning

Static Partitioning: The partition value is hard-coded into the INSERT or LOAD statement. Best used when loading a specific, known chunk of data.
Dynamic Partitioning: Hive determines the partition value dynamically based on the data being inserted. Requires setting configuration properties (set hive.exec.dynamic.partition=true;).

Syntax and Example

Creating a Partitioned Table:

SQL

CREATE TABLE sales (
    transaction_id INT,
    amount DOUBLE
)
PARTITIONED BY (country STRING, year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Loading Data into a Static Partition:

SQL

LOAD DATA LOCAL INPATH '/tmp/us_2023_sales.csv' 
INTO TABLE sales PARTITION (country='US', year=2023);

4. Hive Bucketing

Bucketing (or Clustering) is another data organization technique in Hive. While partitioning divides data into directories, bucketing divides data into a fixed number of files (buckets) within a directory or partition.

Concept and Purpose

Hashing: Hive uses a hash function on the bucketed column to determine which bucket (file) a record belongs to.
Map-Side Joins: If two tables are bucketed on the same join key and have a multiple of the same number of buckets, Map-side joins are highly optimized.
Data Sampling: Bucketing allows for efficient data sampling on large datasets.

Bucketing vs. Partitioning

Partitioning is based on column values (creates directories). Best for columns with low cardinality (e.g., year, country).
Bucketing is based on the hash of a column (creates files). Best for columns with high cardinality (e.g., user_id, transaction_id).

Syntax and Example

Creating a Bucketed Table:

SQL

CREATE TABLE users (
    user_id INT,
    name STRING,
    age INT
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

5. HiveQL Operations

HiveQL (HQL) is the query language for Hive, heavily resembling standard SQL. Operations are divided into DDL, DML, and DQL.

Data Definition Language (DDL)

Used to create, alter, and drop database objects.

Create Database: CREATE DATABASE IF NOT EXISTS retail_db;
Create Table (Internal/Managed): Data is owned by Hive. Dropping the table drops the data.
SQL
```
    CREATE TABLE employees (id INT, name STRING);
    
```
Create Table (External): Data is not owned by Hive. Dropping the table only drops the metadata, not the HDFS data.
SQL
```
    CREATE EXTERNAL TABLE ext_employees (id INT, name STRING) 
    LOCATION '/user/data/employees';
    
```
Drop Table: DROP TABLE employees;
Alter Table: ALTER TABLE employees RENAME TO staff;

Data Manipulation Language (DML)

Used to load and modify data.

Load Data: Moves or copies files into the Hive warehouse.

SQL

    LOAD DATA LOCAL INPATH '/local/path/data.txt' INTO TABLE employees;

Insert Data: Inserts data using a query.

SQL

    INSERT INTO TABLE target_table SELECT * FROM source_table;

Update/Delete: Supported only on tables configured for ACID transactions (requires specific file formats like ORC and bucketing).

Data Query Language (DQL)

Used to retrieve data.

Basic Select: SELECT id, name FROM employees WHERE age > 30;
Group By & Aggregation: SELECT department, COUNT(*) FROM employees GROUP BY department;

Joins: Supports INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN.

SQL

    SELECT e.name, d.dept_name 
    FROM employees e JOIN departments d ON e.dept_id = d.id;

6. Hive Operators

Hive provides built-in operators to perform mathematical, logical, and relational computations within queries.

Relational Operators

Used to compare operands. They return a boolean value (TRUE, FALSE, or NULL).

A = B : TRUE if A equals B.
A != B or A <> B : TRUE if A is not equal to B.
A < B, A <= B, A > B, A >= B : Standard inequalities.
A IS NULL / A IS NOT NULL : Checks for null values.
A LIKE B : Simple pattern matching (e.g., name LIKE 'Sm%').
A RLIKE B or A REGEXP B : Advanced pattern matching using Java regular expressions.

Arithmetic Operators

Used for mathematical calculations. Return numeric types.

A + B : Addition
A - B : Subtraction
A * B : Multiplication
A / B : Division (returns DOUBLE)
A % B : Modulo (remainder)
A & B, A | B, A ^ B, ~A : Bitwise AND, OR, XOR, and NOT.

Logical Operators

Used to combine multiple conditions. Return boolean values.

A AND B : TRUE if both A and B are TRUE.
A OR B : TRUE if either A or B is TRUE.
NOT A or !A : TRUE if A is FALSE.

Complex Type Operators

Used to access elements within complex data types (Arrays, Maps, Structs).

Array Elements: A[n] (Returns the nth element of array A, starting at index 0).
Map Elements: M[key] (Returns the value corresponding to key in map M).
Struct Elements: S.x (Returns the field x from the struct S).

Unit 3

Unit 5

Unit 4 - Notes

Table of Contents

Unit 4: Introduction to Apache Hive

1. Hive Installation

Prerequisites

Step-by-Step Installation

2. Hive Data Types

Primitive Data Types

Complex Data Types

3. Hive Partitioning

Concept and Purpose

Types of Partitioning

Syntax and Example

4. Hive Bucketing

Concept and Purpose

Bucketing vs. Partitioning

Syntax and Example

5. HiveQL Operations

Data Definition Language (DDL)

Data Manipulation Language (DML)

Data Query Language (DQL)

6. Hive Operators

Relational Operators

Arithmetic Operators

Logical Operators

Complex Type Operators