Unit 1 - Notes

INT312 8 min read

Unit 1: Introduction to Hadoop

1. Introduction to Big Data

Big Data refers to massive, complex datasets that grow exponentially over time. These datasets are so voluminous and complex that traditional data processing software and database management systems (like standard RDBMS) cannot capture, store, manage, or analyze them efficiently.

Key Characteristics of the Big Data Era:

Scale: Data is generated in Terabytes (TB), Petabytes (PB), and Exabytes (EB).
Sources: Data originates from diverse sources, including social media, IoT devices, sensors, enterprise applications, web logs, and mobile devices.
Purpose: The ultimate goal of Big Data is not just storage, but analytics—extracting meaningful insights to drive business decisions, predictive modeling, and machine learning.

2. Types of Data

Big Data is broadly categorized into three main types based on its level of organization and format:

A. Structured Data

Definition: Data that has a predefined data model, format, and structure. It is highly organized and easily searchable.
Storage: Typically stored in Relational Database Management Systems (RDBMS) using tables with rows and columns.
Examples: Employee databases, customer records, financial transaction logs, inventory systems.
Processing: Easily queried using Structured Query Language (SQL).

B. Semi-Structured Data

Definition: Data that does not reside in a relational database but has some organizational properties (like tags or markers) that make it easier to analyze. It contains semantic elements.
Storage: NoSQL databases, object storage.
Examples: XML files, JSON documents, CSV files, HTML web pages, email headers.
Processing: Requires specific parsing rules or schemas applied at the time of reading (schema-on-read).

C. Unstructured Data

Definition: Data that has no recognizable internal structure or predefined data model. It makes up approximately 80% to 90% of all data generated today.
Storage: Data lakes, Hadoop Distributed File System (HDFS), NoSQL databases.
Examples: Text documents, PDFs, images, audio files, video streams, social media posts, satellite imagery.
Processing: Requires advanced processing techniques like Natural Language Processing (NLP), computer vision, and machine learning to extract value.

3. The V's of Big Data

To understand Big Data, industry experts use the "V" framework. While it started with 3 V's, it has expanded to 5 core V's (and sometimes more).

Volume: Refers to the sheer amount of data generated every second. Traditional systems measure data in Gigabytes, whereas Big Data measures in Petabytes and Zettabytes.
Velocity: The speed at which new data is generated and the speed at which it needs to be processed. For example, millions of credit card transactions or social media posts happen in real-time and require rapid stream processing.
Variety: The different types and formats of data (Structured, Semi-Structured, Unstructured) coming from disparate sources.
Veracity: Refers to the messiness or trustworthiness of the data. With so much data available, quality and accuracy can be compromised (e.g., typos, hashtags, missing values). High veracity means the data is meaningful for analysis.
Value: The most important 'V'. It refers to the business value or actionable insights that can be extracted from the data. If massive data cannot be turned into value, the storage and processing efforts are useless.

4. Introduction to Hadoop

Apache Hadoop is an open-source software framework managed by the Apache Software Foundation. It is designed for distributed storage and distributed processing of very large datasets across computer clusters built from commodity hardware.

History & Origin:

Created by Doug Cutting and Mike Cafarella in 2005.
Inspired by Google’s white papers on the Google File System (GFS) and MapReduce.
Named after a yellow toy elephant owned by Doug Cutting’s son.

Key Features of Hadoop:

Highly Scalable: You can easily add more nodes (servers) to the cluster without changing data formats.
Fault-Tolerant: Data is replicated across multiple nodes. If one machine fails, data is not lost, and processing is redirected to another node.
Cost-Effective: Runs on cheap, industry-standard commodity hardware rather than expensive, specialized servers.
Data Locality: Instead of moving massive amounts of data to the processing unit, Hadoop moves the processing logic to the nodes where the data resides, significantly reducing network congestion.

5. Components of Hadoop

Modern Hadoop (Hadoop 2.x and 3.x) consists of four core modules:

A. HDFS (Hadoop Distributed File System)

The storage layer of Hadoop. It breaks large files into smaller blocks (default size is 128 MB or 256 MB) and distributes them across the cluster.

NameNode (Master): Maintains the namespace tree and metadata for all files and directories in the cluster (e.g., file names, permissions, block locations). It does not store the actual data.
DataNode (Slave): The worker nodes that store and retrieve the actual data blocks as instructed by the NameNode. They periodically send "heartbeats" and "block reports" to the NameNode.
Replication: By default, HDFS replicates each block 3 times across different DataNodes to ensure fault tolerance.

B. MapReduce

The processing/computation layer of Hadoop. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.

Map Phase: Takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs. (Filtering and sorting).
Reduce Phase: Takes the output from a map as an input and combines those data tuples into a smaller set of tuples. (Aggregation and summary).

C. YARN (Yet Another Resource Negotiator)

The resource management layer introduced in Hadoop 2.0. It acts as the operating system for Hadoop, managing cluster resources and scheduling jobs.

ResourceManager (Master): The ultimate authority that arbitrates resources among all applications in the system.
NodeManager (Slave): Responsible for individual compute nodes. It monitors resource usage (CPU, memory, disk) of containers and reports to the ResourceManager.
ApplicationMaster: Negotiates resources from the ResourceManager and works with NodeManager(s) to execute and monitor tasks.
Container: A fraction of the NodeManager capacity allocated for a specific task.

D. Hadoop Common

Also known as Hadoop Core, these are the common utilities, libraries, and dependencies required by the other Hadoop modules (HDFS, MapReduce, YARN) to function.

Note: The Hadoop Ecosystem extends beyond these core components to include tools like Hive (SQL-like querying), Pig (scripting), HBase (NoSQL database), Spark (in-memory processing), Sqoop (data transfer), and Zookeeper (coordination).

6. Installation of Apache Hadoop

Hadoop can be installed in three operating modes:

Local/Standalone Mode: Runs on a single machine as a single Java process. Used for debugging.
Pseudo-Distributed Mode: Runs on a single machine, but each Hadoop daemon (NameNode, DataNode, etc.) runs in a separate Java process. Used for learning and testing.
Fully-Distributed Mode: Runs across a cluster of multiple machines. Used in production.

Below are the detailed steps for installing Hadoop in Pseudo-Distributed Mode on a Linux system (e.g., Ubuntu).

Step 1: Prerequisites Installation

Hadoop requires Java and SSH for daemons to communicate seamlessly.

BASH

# Update repositories
sudo apt update

# Install Java (JDK 8 is highly recommended for Hadoop)
sudo apt install openjdk-8-jdk -y

# Install OpenSSH server and client
sudo apt install openssh-server openssh-client -y

Step 2: Configure Passwordless SSH

Hadoop needs passwordless SSH access to manage its nodes.

BASH

# Generate SSH keys
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Add the generated key to authorized keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Test SSH (should not ask for a password)
ssh localhost

Step 3: Download and Extract Hadoop

BASH

# Download Hadoop (replace with current stable version)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

# Extract the archive
tar -xzvf hadoop-3.3.6.tar.gz

# Move to a standard directory
sudo mv hadoop-3.3.6 /usr/local/hadoop

Step 4: Setup Environment Variables

Edit the ~/.bashrc file to add Hadoop and Java paths.

BASH

nano ~/.bashrc

# Add the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Apply the changes
source ~/.bashrc

Step 5: Configure Hadoop XML Files

Navigate to $HADOOP_HOME/etc/hadoop and edit the following configuration files:

1. hadoop-env.sh
Uncomment and set the JAVA_HOME variable:

BASH

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. core-site.xml
Configures the HDFS URI (NameNode address).

XML

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

3. hdfs-site.xml
Configures HDFS block replication and storage directories.

XML

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/data/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/data/datanode</value>
    </property>
</configuration>

4. mapred-site.xml
Tells Hadoop to use YARN as the MapReduce framework.

XML

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. yarn-site.xml
Configures the YARN NodeManager.

XML

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Step 6: Format the NameNode

Before starting Hadoop for the first time, you must format the HDFS filesystem. This initializes the directory structure for the NameNode.

BASH

hdfs namenode -format

(Warning: Never run this on a cluster with existing data, as it will delete all data.)

Step 7: Start Hadoop Daemons

Start the HDFS and YARN services.

BASH

# Start HDFS (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh

# Start YARN (ResourceManager, NodeManager)
start-yarn.sh

Step 8: Verify the Installation

Use the Java Virtual Machine Process Status Tool (jps) to check if all daemons are running.

BASH

jps

Expected Output:

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps

Web Interfaces for Verification:

NameNode UI: http://localhost:9870
ResourceManager UI: http://localhost:8088

Unit 2

Unit 1 - Notes

Table of Contents

Unit 1: Introduction to Hadoop

1. Introduction to Big Data

2. Types of Data

A. Structured Data

B. Semi-Structured Data

C. Unstructured Data

3. The V's of Big Data

4. Introduction to Hadoop

5. Components of Hadoop

A. HDFS (Hadoop Distributed File System)

B. MapReduce

C. YARN (Yet Another Resource Negotiator)

D. Hadoop Common

6. Installation of Apache Hadoop

Step 1: Prerequisites Installation

Step 2: Configure Passwordless SSH

Step 3: Download and Extract Hadoop

Step 4: Setup Environment Variables

Step 5: Configure Hadoop XML Files

Step 6: Format the NameNode

Step 7: Start Hadoop Daemons

Step 8: Verify the Installation