Unit 2 - Notes

INT312 7 min read

Unit 2: Hadoop Architecture

1. Introduction to Hadoop Architecture

Apache Hadoop is an open-source framework designed to store and process large datasets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop follows a Master-Slave architecture for both storage and processing.

  • Master Nodes: Manage the distributed file system and schedule resource allocation/job execution.
  • Slave/Worker Nodes: Store the actual data and perform the processing tasks dictated by the master nodes.

The core of Apache Hadoop consists of two main layers:

  1. Storage Layer: HDFS (Hadoop Distributed File System)
  2. Processing Layer: MapReduce (and YARN in Hadoop 2.x onwards)

2. Hadoop Storage: HDFS (Hadoop Distributed File System)

HDFS is a highly fault-tolerant distributed file system responsible for storing data across multiple machines. It is optimized for high throughput rather than low latency, making it ideal for large-scale batch processing.

Key Concepts of HDFS:

  • Blocks: HDFS splits massive files into smaller chunks called blocks. The default block size in Hadoop 2.x and later is 128 MB (it was 64 MB in Hadoop 1.x). This large size minimizes disk seeks and decreases the metadata burden on the NameNode.
  • Replication: To ensure fault tolerance and high availability, HDFS replicates each block across multiple nodes. The default replication factor is 3. (e.g., Block A is stored on Node 1, Node 2, and Node 3).
  • Rack Awareness: HDFS uses a rack awareness policy to distribute replicas. Typically, two replicas are placed on the same rack (on different nodes), and the third replica is placed on a different rack. This protects against a complete rack failure.
  • Write Once, Read Many (WORM): HDFS is designed for data that is written once and read multiple times. It does not support random writes or modifications to files after they are closed, though appends are supported.

3. Core Hadoop Daemons (Components)

In a classic Hadoop (MapReduce v1) cluster, there are four primary daemons running to manage storage and processing.

3.1 NameNode (Master - Storage)

  • Function: It is the centerpiece of HDFS. It manages the file system namespace and regulates access to files by clients.
  • Metadata Storage: It stores all the metadata of HDFS (e.g., file names, permissions, block mapping to DataNodes). This metadata is stored in RAM for fast access.
  • Key Files:
    • fsimage: A snapshot of the file system metadata at a specific point in time.
    • edits log: A transactional log of all changes made to the file system since the last fsimage was created.
  • Role: It does not store actual data. It simply tells the client which DataNodes hold the required blocks.

3.2 DataNode (Slave - Storage)

  • Function: These are the worker nodes in HDFS. They store the actual data blocks.
  • Operations: They serve read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
  • Heartbeats: DataNodes send periodic "heartbeats" (usually every 3 seconds) and "Block Reports" to the NameNode to prove they are alive and to update the NameNode on the blocks they currently hold.

3.3 JobTracker (Master - Processing)

(Note: JobTracker is specific to Hadoop 1.x / MRv1 architecture. In Hadoop 2.x, its duties were split into YARN's ResourceManager and ApplicationMaster).

  • Function: The JobTracker is responsible for resource management and job scheduling across the cluster.
  • Role: When a client submits a MapReduce job, the JobTracker determines the execution plan. It communicates with the NameNode to determine data location (Data Locality) and assigns tasks to the TaskTrackers closest to the data.
  • Monitoring: It monitors the TaskTrackers. If a TaskTracker fails, the JobTracker reschedules the task on another node.

3.4 TaskTracker (Slave - Processing)

  • Function: TaskTrackers are the worker nodes for processing. They run on the same physical nodes as the DataNodes to enable data locality (processing data where it resides).
  • Role: They accept Map, Reduce, and Shuffle tasks from the JobTracker.
  • Execution: They spawn separate JVMs (Java Virtual Machines) to execute the actual tasks to prevent task failure from bringing down the TaskTracker daemon.
  • Communication: They send periodic heartbeats to the JobTracker to indicate they are alive and report available processing slots.

4. Hadoop MapReduce Paradigm

MapReduce is a programming paradigm and processing engine for writing applications that rapidly process vast amounts of data in parallel on large clusters of commodity hardware.

The Principle of Data Locality

Instead of moving massive amounts of data to the processing unit (which clogs network bandwidth), MapReduce moves the processing logic to the nodes where the data resides.

Phases of MapReduce:

  1. Map Phase: The input data is processed line by line. The Mapper applies a user-defined function to the input data and transforms it into intermediate Key-Value (K, V) pairs.
  2. Shuffle and Sort Phase: This is a framework-managed phase bridging Map and Reduce. It takes the output of the Mappers, sorts it by key, and groups all values associated with the same key together.
  3. Reduce Phase: The Reducer takes the grouped Key-Value pairs (K, List[V]) as input. It runs a user-defined aggregate function to reduce the list of values into a single, final output value (or a smaller set of values).

5. MapReduce Terminology

  • Job: A complete execution of a MapReduce program. It consists of the Map phase, Shuffle/Sort phase, and Reduce phase.
  • Task: A smaller unit of a Job. A Job is broken down into multiple Map Tasks and Reduce Tasks.
  • InputSplit: A logical representation of data to be processed by a single Mapper. Usually, one HDFS block equals one InputSplit, but it can be adjusted. One Map task is spawned per InputSplit.
  • RecordReader: It interacts with the InputSplit and converts the raw data into Key-Value pairs to be fed into the Mapper. For a text file, the key is typically the byte offset of the line, and the value is the line's text.
  • Mapper: The user-defined function that processes the Key-Value pairs generated by the RecordReader and outputs intermediate Key-Value pairs.
  • Combiner (Mini-Reducer): An optional localized reducer that runs on the Map output on the same node before data is sent over the network to the Reducer. It decreases the amount of data transferred across the network.
  • Partitioner: Determines which Reducer will process a specific key. The default partitioner uses a hash function: Hash(Key) % Number of Reducers.
  • Reducer: The user-defined function that processes the grouped data from the Shuffle phase and generates the final output.

6. Practical Example: Word Count on Command Line

The "Word Count" program is the "Hello World" of Hadoop. It counts the frequency of each word in a given text file.

Step 1: Prepare the Input Data

Create a local text file with some sample text.

BASH
echo "hello hadoop hello mapreduce" > sample.txt
echo "hadoop architecture hadoop hdfs" >> sample.txt

Step 2: Move Data to HDFS

Before processing, the data must be loaded into the Hadoop Distributed File System.

BASH
# Create an input directory in HDFS
hadoop fs -mkdir /input_dir

# Copy the local file to HDFS
hadoop fs -put sample.txt /input_dir/

Step 3: Run the MapReduce Job

Hadoop comes with a pre-compiled JAR file containing example MapReduce programs, including WordCount.

BASH
# Execute the wordcount program from the examples jar
# Syntax: hadoop jar <jar_path> <program_name> <input_path> <output_path>
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input_dir /output_dir

Note: The /output_dir MUST NOT exist before running the command. Hadoop will create it. If it already exists, the job will fail to prevent overwriting data.

Step 4: Inspect the Output

Once the job completes successfully, check the output directory in HDFS.

BASH
# List the contents of the output directory
hadoop fs -ls /output_dir

You will typically see two files:

  • _SUCCESS: An empty file indicating the job finished successfully.
  • part-r-00000: The actual output file generated by the Reducer.

View the results:

BASH
hadoop fs -cat /output_dir/part-r-00000

Expected Output:
TEXT
architecture    1
hadoop          3
hdfs            1
hello           2
mapreduce       1