Unit 1 - Notes
Unit 1: Introduction to Hadoop
1. Introduction to Big Data
Big Data refers to massive, complex datasets that grow exponentially over time. These datasets are so voluminous and complex that traditional data processing software and database management systems (like standard RDBMS) cannot capture, store, manage, or analyze them efficiently.
Key Characteristics of the Big Data Era:
- Scale: Data is generated in Terabytes (TB), Petabytes (PB), and Exabytes (EB).
- Sources: Data originates from diverse sources, including social media, IoT devices, sensors, enterprise applications, web logs, and mobile devices.
- Purpose: The ultimate goal of Big Data is not just storage but analytics: extracting meaningful insights to drive business decisions, predictive modeling, and machine learning.
2. Types of Data
Big Data is broadly categorized into three main types based on its level of organization and format:
A. Structured Data
- Definition: Data that has a predefined data model, format, and structure. It is highly organized and easily searchable.
- Storage: Typically stored in Relational Database Management Systems (RDBMS) using tables with rows and columns.
- Examples: Employee databases, customer records, financial transaction logs, inventory systems.
- Processing: Easily queried using Structured Query Language (SQL).
B. Semi-Structured Data
- Definition: Data that does not fit the rigid tables of a relational database but carries organizational markers (such as tags, keys, or delimiters) that separate its elements and make it easier to analyze.
- Storage: NoSQL databases, object storage.
- Examples: XML files, JSON documents, CSV files, HTML web pages, email headers.
- Processing: Requires specific parsing rules or schemas applied at the time of reading (schema-on-read).
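As a minimal sketch of schema-on-read, the Python snippet below parses a few JSON records whose fields vary and applies a schema only while reading. The records and the field names (user, action, device, amount) are made up for illustration.

```python
import json

# Hypothetical semi-structured records: each line is self-describing JSON,
# but not every record carries the same fields.
raw_lines = [
    '{"user": "alice", "action": "login", "device": "mobile"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    '{"user": "carol"}',
]

# The "schema" is applied at read time: it names the fields we care about
# and the default to use when a record lacks one of them.
SCHEMA = {"user": None, "action": "unknown", "amount": 0.0}

def read_with_schema(line, schema):
    """Parse one JSON line and project it onto the given schema."""
    record = json.loads(line)
    return {field: record.get(field, default) for field, default in schema.items()}

rows = [read_with_schema(line, SCHEMA) for line in raw_lines]
for row in rows:
    print(row)
```

Nothing is rejected when it is stored. Fields outside the schema (such as device) are simply ignored, and missing fields receive defaults when the data is read.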
C. Unstructured Data
- Definition: Data that has no recognizable internal structure or predefined data model. It is commonly estimated to make up 80% to 90% of all data generated today.
- Storage: Data lakes, Hadoop Distributed File System (HDFS), NoSQL databases.
- Examples: Text documents, PDFs, images, audio files, video streams, social media posts, satellite imagery.
- Processing: Requires advanced processing techniques like Natural Language Processing (NLP), computer vision, and machine learning to extract value.
3. The V's of Big Data
To understand Big Data, industry experts use the "V" framework. While it started with 3 V's, it has expanded to 5 core V's (and sometimes more).
- Volume: Refers to the sheer amount of data generated every second. Traditional systems typically handle Gigabytes to a few Terabytes, whereas Big Data systems deal with Petabytes and beyond.
- Velocity: The speed at which new data is generated and the speed at which it needs to be processed. For example, millions of credit card transactions or social media posts happen in real-time and require rapid stream processing.
- Variety: The different types and formats of data (Structured, Semi-Structured, Unstructured) coming from disparate sources.
- Veracity: The trustworthiness and quality of the data. At large scale, accuracy is often compromised by typos, informal text such as hashtags and abbreviations, and missing values. High-veracity data is reliable enough to base analysis on.
- Value: The most important 'V'. It refers to the business value or actionable insights that can be extracted from the data. If massive data cannot be turned into value, the storage and processing efforts are useless.
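To make Volume and Velocity concrete, here is a rough back-of-the-envelope calculation in Python. The 100 MB/s per-disk read rate and the 1,000-disk cluster are assumed round numbers, not measurements.

```python
# Assumed round numbers, for illustration only.
PETABYTE = 10**15              # bytes (decimal definition)
DISK_READ_RATE = 100 * 10**6   # bytes per second for one disk (assumption)

def scan_time_days(total_bytes, disks):
    """Days needed to read total_bytes once when `disks` disks read in parallel."""
    seconds = total_bytes / (DISK_READ_RATE * disks)
    return seconds / 86_400

single = scan_time_days(PETABYTE, disks=1)
parallel = scan_time_days(PETABYTE, disks=1000)
print(f"1 disk:     {single:.1f} days")
print(f"1000 disks: {parallel * 24 * 60:.1f} minutes")
```

Under these assumptions a single disk needs about 116 days to read one Petabyte once, while 1,000 disks reading in parallel finish in under three hours. That gap is the basic argument for spreading both storage and processing across many machines.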
4. Introduction to Hadoop
Apache Hadoop is an open-source software framework managed by the Apache Software Foundation. It is designed for distributed storage and distributed processing of very large datasets across computer clusters built from commodity hardware.
History & Origin:
- Created by Doug Cutting and Mike Cafarella in 2005.
- Inspired by Google’s research papers on the Google File System (GFS, 2003) and MapReduce (2004).
- Named after a yellow toy elephant owned by Doug Cutting’s son.
Key Features of Hadoop:
- Highly Scalable: You can easily add more nodes (servers) to the cluster without changing data formats.
- Fault-Tolerant: Data is replicated across multiple nodes. If one machine fails, data is not lost, and processing is redirected to another node.
- Cost-Effective: Runs on cheap, industry-standard commodity hardware rather than expensive, specialized servers.
- Data Locality: Instead of moving massive amounts of data to the processing unit, Hadoop moves the processing logic to the nodes where the data resides, significantly reducing network congestion.
5. Components of Hadoop
Modern Hadoop (Hadoop 2.x and 3.x) consists of four core modules:
A. HDFS (Hadoop Distributed File System)
The storage layer of Hadoop. It breaks large files into blocks (128 MB by default in Hadoop 2.x and 3.x, configurable through dfs.blocksize; 256 MB is a common choice for very large files) and distributes them across the cluster.
- NameNode (Master): Maintains the namespace tree and metadata for all files and directories in the cluster (e.g., file names, permissions, block locations). It does not store the actual data.
- DataNode (Slave): The worker nodes that store and retrieve the actual data blocks as instructed by the NameNode. They periodically send "heartbeats" and "block reports" to the NameNode.
- Replication: By default, HDFS keeps three copies of each block (a replication factor of 3) on different DataNodes, so losing a single node does not lose data.
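The block and replication arithmetic can be sketched in a few lines of Python. This models only the bookkeeping and is not HDFS code. The 1000 MB file is an arbitrary example, and the function name hdfs_footprint is hypothetical.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size, block_size=128 * MB, replication=3):
    """Return (number of blocks, size of the last block, raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    blocks = math.ceil(file_size / block_size)
    last_block = file_size - (blocks - 1) * block_size
    # HDFS does not pad the final block, so raw usage is file size times replicas.
    raw_bytes = file_size * replication
    return blocks, last_block, raw_bytes

blocks, last_block, raw = hdfs_footprint(1000 * MB)
print(blocks, last_block // MB, raw // MB)   # prints: 8 104 3000
```

A 1000 MB file therefore occupies eight blocks (seven full blocks plus one of 104 MB) and consumes about 3000 MB of raw disk space across the cluster.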
B. MapReduce
The processing/computation layer of Hadoop. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
- Map Phase: Reads input records and transforms each one into intermediate key-value pairs (filtering and transformation). The framework then shuffles and sorts these pairs so that all values for the same key reach the same reducer.
- Reduce Phase: Takes the grouped map output as input and combines the values for each key into a smaller set of results (aggregation and summary).
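The classic word-count example shows how the phases fit together. The sketch below simulates them in plain Python on one machine. It does not use the Hadoop API; the function names and the two input lines are chosen only for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key and sort the keys, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

lines = ["big data big insights", "data moves to compute"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster, each map task would run on the node holding its input block, and the shuffle would move the intermediate pairs over the network to the reducers.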
C. YARN (Yet Another Resource Negotiator)
The resource management layer introduced in Hadoop 2.0. It acts as the operating system for Hadoop, managing cluster resources and scheduling jobs.
- ResourceManager (Master): The ultimate authority that arbitrates resources among all applications in the system.
- NodeManager (Slave): Responsible for individual compute nodes. It monitors resource usage (CPU, memory, disk) of containers and reports to the ResourceManager.
- ApplicationMaster: Negotiates resources from the ResourceManager and works with NodeManager(s) to execute and monitor tasks.
- Container: A fraction of the NodeManager capacity allocated for a specific task.
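To illustrate the idea of a container as a slice of a NodeManager's capacity, the following Python sketch computes how many identical containers fit on one node. It is a simplified model, not the YARN scheduler: the node size and the per-task request are assumed values, and real schedulers also apply queues, allocation rounding, and other policies.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    memory_mb: int
    vcores: int

def containers_that_fit(node: Resources, container: Resources) -> int:
    """How many identical containers one NodeManager can host at once."""
    by_memory = node.memory_mb // container.memory_mb
    by_vcores = node.vcores // container.vcores
    # The scarcer resource sets the limit.
    return min(by_memory, by_vcores)

node = Resources(memory_mb=16_384, vcores=8)     # assumed NodeManager capacity
request = Resources(memory_mb=3_072, vcores=1)   # assumed per-task request
print(containers_that_fit(node, request))        # prints: 5
```

Here memory is the limiting resource: eight cores would allow eight containers, but 16 GB of memory holds only five requests of 3 GB each.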
D. Hadoop Common
Also known as Hadoop Core, these are the common utilities, libraries, and dependencies required by the other Hadoop modules (HDFS, MapReduce, YARN) to function.
Note: The Hadoop ecosystem extends beyond these core components to include tools like Hive (SQL-like querying), Pig (scripting), HBase (NoSQL database), Spark (in-memory processing), Sqoop (data transfer), and ZooKeeper (coordination).
6. Installation of Apache Hadoop
Hadoop can be installed in three operating modes:
- Local/Standalone Mode: Runs on a single machine as a single Java process. Used for debugging.
- Pseudo-Distributed Mode: Runs on a single machine, but each Hadoop daemon (NameNode, DataNode, etc.) runs in a separate Java process. Used for learning and testing.
- Fully-Distributed Mode: Runs across a cluster of multiple machines. Used in production.
Below are the detailed steps for installing Hadoop in Pseudo-Distributed Mode on a Linux system (e.g., Ubuntu).
Step 1: Prerequisites Installation
Hadoop is written in Java and needs a JDK to run. Its start and stop scripts use SSH to launch the daemons on each node, including localhost.
# Update repositories
sudo apt update
# Install Java (JDK 8 is highly recommended for Hadoop)
sudo apt install openjdk-8-jdk -y
# Install OpenSSH server and client
sudo apt install openssh-server openssh-client -y
Step 2: Configure Passwordless SSH
Hadoop needs passwordless SSH access to manage its nodes.
# Generate SSH keys
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Add the generated key to authorized keys and restrict the file's permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# Test SSH (should not ask for a password)
ssh localhost
Step 3: Download and Extract Hadoop
# Download Hadoop (replace with current stable version)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
# Extract the archive
tar -xzvf hadoop-3.3.6.tar.gz
# Move to a standard directory
sudo mv hadoop-3.3.6 /usr/local/hadoop
Step 4: Setup Environment Variables
Edit the ~/.bashrc file to add Hadoop and Java paths.
nano ~/.bashrc
# Add the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Apply the changes
source ~/.bashrc
Step 5: Edit the Hadoop Configuration Files
Navigate to $HADOOP_HOME/etc/hadoop and edit the following configuration files:
1. hadoop-env.sh
Uncomment and set the JAVA_HOME variable:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
2. core-site.xml
Configures the HDFS URI (NameNode address).
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3. hdfs-site.xml
Configures HDFS block replication and storage directories.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
4. mapred-site.xml
Tells MapReduce to run its jobs on YARN.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. yarn-site.xml
Enables the MapReduce shuffle auxiliary service on the YARN NodeManager.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
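All of these *-site.xml files share the same name/value layout, so a setting can be double-checked with a few lines of Python. This is a sketch under assumptions: the XML is an inline copy of the core-site.xml shown earlier so that the example runs without a cluster, and read_properties is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Inline copy of a minimal core-site.xml, so the example needs no cluster.
CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop *-site.xml as a dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }

props = read_properties(CORE_SITE)
print(props["fs.defaultFS"])   # prints: hdfs://localhost:9000
```

To inspect a real file, read its text from $HADOOP_HOME/etc/hadoop and pass it to the same function. A typo in a property name shows up immediately as a missing or unexpected key.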
Step 6: Format the NameNode
Before starting Hadoop for the first time, you must format the HDFS filesystem. This initializes the directory structure for the NameNode.
hdfs namenode -format
(Warning: Never run this on a cluster with existing data. Formatting wipes the NameNode metadata, which makes every file already stored in HDFS unreachable.)
Step 7: Start Hadoop Daemons
Start the HDFS and YARN services.
# Start HDFS (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh
# Start YARN (ResourceManager, NodeManager)
start-yarn.sh
Step 8: Verify the Installation
Use the Java Virtual Machine Process Status Tool (jps) to check if all daemons are running.
jps
Expected output (each name is preceded by its process ID, and the order may vary):
- NameNode
- DataNode
- SecondaryNameNode
- ResourceManager
- NodeManager
- Jps
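Checking the list by eye is easy to get wrong, so the comparison can be scripted. The Python sketch below parses jps-style output and reports which expected daemons are missing. The sample output and its process IDs are made up; on a real machine the text would come from running jps.

```python
EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager"}

def missing_daemons(jps_output):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:          # jps prints "<pid> <name>" per process
            running.add(parts[1])
    return EXPECTED - running

# Made-up sample output; real process IDs will differ.
sample = """\
4210 NameNode
4388 DataNode
4631 SecondaryNameNode
4902 ResourceManager
5377 Jps
"""
print(sorted(missing_daemons(sample)))   # prints: ['NodeManager']
```

If a daemon is reported missing, its log file under $HADOOP_HOME/logs is usually the first place to look.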
Web Interfaces for Verification:
- NameNode UI: http://localhost:9870
- ResourceManager UI: http://localhost:8088
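The same check can be done without a browser. The following Python sketch requests each web UI and reports whether anything answered. It assumes the default ports listed above and uses only the standard library; is_reachable is a hypothetical helper name.

```python
import urllib.error
import urllib.request

# Default web UI addresses for a pseudo-distributed setup.
UIS = {
    "NameNode": "http://localhost:9870",
    "ResourceManager": "http://localhost:8088",
}

def is_reachable(url, timeout=2.0):
    """Return True if an HTTP request to url gets any response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True    # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False   # nothing listening, or the connection timed out

if __name__ == "__main__":
    for name, url in UIS.items():
        print(f"{name}: {'up' if is_reachable(url) else 'down'}")
```

A "down" result for a daemon that jps lists as running often points to a port conflict or a firewall rule rather than a failed daemon.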