Unit2 - Subjective Questions
INT312 • Practice Questions with Detailed Answers
Explain the core architecture of Hadoop and its primary components.
Hadoop Architecture is primarily designed to store and process huge amounts of data in a distributed environment. It follows a master-slave architecture.
The core components of Hadoop Architecture are:
- HDFS (Hadoop Distributed File System): The storage layer of Hadoop. It breaks large files into smaller blocks and distributes them across the cluster. It consists of a Master (NameNode) and multiple Slaves (DataNodes).
- MapReduce: The processing layer of Hadoop. It is a programming model that processes large data sets by dividing tasks into independent sub-tasks (Map phase) and aggregating the results (Reduce phase).
- YARN (Yet Another Resource Negotiator): Introduced in Hadoop 2.x, it acts as the resource management layer. It allocates system resources (CPU, memory) to various applications and schedules tasks on different cluster nodes.
Describe the architecture of the Hadoop Distributed File System (HDFS).
HDFS follows a Master-Slave architecture. The key components are:
- NameNode (Master):
  - It manages the file system namespace and regulates client access to files.
  - It stores metadata such as file names, permissions, and the mapping of blocks to DataNodes.
  - It maintains the EditLog (transaction log) and FsImage (snapshot of the file system).
- DataNode (Slave):
  - These are the actual worker nodes where data blocks are stored.
  - They serve read and write requests from the file system's clients.
  - They periodically send Heartbeats and Block Reports to the NameNode to prove they are alive and report their stored blocks.
- Secondary NameNode:
  - It periodically merges the EditLog and FsImage from the NameNode to prevent the EditLog from becoming too large. It is not a high-availability backup for the NameNode.
Differentiate between NameNode and DataNode in HDFS.
Here are the key differences between NameNode and DataNode:
- Role: NameNode is the master node; DataNode is the slave/worker node.
- Storage: NameNode stores only metadata (file paths, block maps, permissions); DataNode stores the actual data blocks.
- Hardware Requirements: NameNode requires high RAM (memory) to keep metadata fast and accessible; DataNodes require high disk storage capacity.
- Failure Impact: If the active NameNode fails (without HA configured), the entire cluster becomes inaccessible (Single Point of Failure). If a DataNode fails, the cluster continues to function as data is replicated on other DataNodes.
- Quantity: Typically, a cluster has one active NameNode (and one standby in HA), but can have hundreds or thousands of DataNodes.
Explain the role and functioning of the Secondary NameNode.
The Secondary NameNode is a helper node to the primary NameNode, not a hot backup.
Functioning:
- The primary NameNode keeps the current state of the file system in memory, backed up by two files on disk: the FsImage (a snapshot) and the EditLog (incremental changes).
- Over time, the EditLog can grow extremely large, making NameNode startup very slow.
- The Secondary NameNode periodically downloads the FsImage and EditLog from the primary NameNode.
- It merges them together into a new FsImage file.
- It uploads the new FsImage back to the primary NameNode and truncates the EditLog.
Role: Its primary role is to perform this "checkpointing" process to keep the EditLog size manageable and ensure faster NameNode restarts.
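The checkpoint merge described above can be pictured with a small sketch. This is a toy model, not HDFS code: the FsImage is assumed to be a plain dictionary of paths, the EditLog a list of hypothetical (operation, path, value) records, and `checkpoint` is an illustrative name.

```python
def checkpoint(fsimage, edit_log):
    """Replay EditLog records on top of a snapshot and return the new snapshot."""
    merged = dict(fsimage)  # work on a copy; the old snapshot stays untouched
    for op, path, value in edit_log:
        if op == "create":
            merged[path] = value
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[value] = merged.pop(path)  # value holds the new path
        else:
            raise ValueError(f"unknown op: {op}")
    return merged

# Hypothetical snapshot and edits, for illustration only.
fsimage = {"/data/a.txt": {"blocks": 2}}
edit_log = [
    ("create", "/data/b.txt", {"blocks": 1}),
    ("rename", "/data/a.txt", "/data/old_a.txt"),
    ("delete", "/data/b.txt", None),
]

new_fsimage = checkpoint(fsimage, edit_log)
edit_log = []  # edits already folded into the snapshot are no longer needed
print(new_fsimage)
```

Restarting from `new_fsimage` requires replaying zero edits, which is why a checkpointed NameNode starts faster.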
Describe the data read operation in HDFS with a step-by-step workflow.
The data read operation in HDFS involves the client interacting with both the NameNode and DataNodes:
- Open Request: The client opens the file it wishes to read by calling open() on the FileSystem object.
- Get Block Locations: The FileSystem calls the NameNode via RPC to determine the locations of the blocks comprising the file. The NameNode returns the addresses of the DataNodes that have a copy of that block, sorted by their proximity to the client.
- Read Data: The client calls read() on the input stream. The stream connects to the closest DataNode for the first block.
- Data Transfer: Data is streamed from the DataNode back to the client. When the end of the block is reached, the connection to the DataNode is closed.
- Sequential Reading: The client then connects to the best DataNode for the next block. This process continues until the entire file is read.
- Close: Once reading is complete, the client calls close() on the input stream.
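The read workflow can be sketched as a simulation. This is not the Hadoop client API: `read_file`, the block IDs, and the DataNode names are hypothetical, and DataNodes are modelled as dictionaries. The fallback to the next replica when a node is down is an added assumption to show why the NameNode returns more than one location per block.

```python
def read_file(block_locations, datanodes):
    """Read blocks in order, taking each one from the closest live replica.

    block_locations: list of (block_id, [DataNode names sorted by proximity])
    datanodes: dict name -> {block_id: bytes}, or None if the node is down
    """
    data = bytearray()
    for block_id, replicas in block_locations:
        for name in replicas:               # closest DataNode first
            store = datanodes.get(name)
            if store is not None and block_id in store:
                data += store[block_id]     # stream the block to the client
                break                       # then move on to the next block
        else:
            raise IOError(f"no live replica for {block_id}")
    return bytes(data)

# Hypothetical cluster state: dn2 is down.
datanodes = {
    "dn1": {"blk_1": b"hello "},
    "dn2": None,
    "dn3": {"blk_1": b"hello ", "blk_2": b"world"},
}
# What the NameNode would return: replicas per block, nearest first.
locations = [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn3"])]

content = read_file(locations, datanodes)
print(content)
```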
Explain the data write operation in HDFS.
Writing data to HDFS is a pipelined process:
- Create Request: The client asks the NameNode to create a new file (with no blocks initially).
- Validation: The NameNode checks if the file already exists and if the client has permissions. If valid, it records the new file in the namespace.
- Write Data: The client begins writing data to a data queue. The data is broken into packets.
- Pipeline Setup: A DataStreamer asks the NameNode for a list of DataNodes to host the replicas (e.g., a list of 3 DataNodes for a replication factor of 3). These nodes form a pipeline.
- Data Streaming: The packet is sent to the first DataNode, which stores it and forwards it to the second DataNode, which stores and forwards it to the third.
- Acknowledgments: Once a node writes the packet, it sends an ACK back up the pipeline.
- Close: When the client finishes writing, it closes the stream and notifies the NameNode that the file is complete.
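The pipelined write can be sketched as follows. This is a simplified simulation under stated assumptions: `write_packet` is a hypothetical helper, DataNodes are plain lists, and the 4-byte packet size is chosen only to keep the example small.

```python
def write_packet(packet, pipeline, stores):
    """Send one packet down the pipeline; return the order in which ACKs arrive."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    stores[head].append(packet)                    # this DataNode stores the packet
    downstream = write_packet(packet, rest, stores)  # and forwards it onward
    return downstream + [head]                     # ACKs travel back toward the client

PACKET_SIZE = 4  # toy value for illustration
data = b"hdfs-pipeline"
packets = [data[i:i + PACKET_SIZE] for i in range(0, len(data), PACKET_SIZE)]

pipeline = ["dn1", "dn2", "dn3"]       # list the NameNode would hand out
stores = {name: [] for name in pipeline}

for packet in packets:
    acks = write_packet(packet, pipeline, stores)

print(acks)
print(b"".join(stores["dn3"]))
```

The ACK list ends with the first DataNode, reflecting that the client learns a packet is safe only after every node in the pipeline has stored it.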
What is Rack Awareness in Hadoop? Why is it important?
Rack Awareness is a policy used in Hadoop to determine the physical location of cluster nodes across different server racks. Hadoop cluster administrators define a topology script that maps DataNodes to specific racks.
Importance:
- Data Reliability (Fault Tolerance): HDFS places replicas of blocks across different racks. By default (with replication factor 3), one replica is placed on a local rack, and the other two are placed on a different, remote rack. If an entire rack fails (e.g., due to a switch or power failure), the data is still available on the other rack.
- Network Performance: Network bandwidth between machines in the same rack is generally greater than between machines in different racks. Rack awareness allows Hadoop to optimize read/write operations by preferring data transfers within the same rack when possible, reducing cross-rack network traffic.
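One way to picture how rack information is used is a tree distance between node locations. The sketch below assumes locations are written as "/rack/node" paths (the kind of mapping a topology script provides) and counts the hops to the closest common ancestor; the function name and paths are illustrative.

```python
def distance(a, b):
    """Hops from a up to the closest common ancestor, plus hops down to b."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

client = "/rack1/dn1"
replicas = ["/rack2/dn4", "/rack1/dn2", "/rack1/dn1"]

# Prefer the same node, then the same rack, then a remote rack.
by_proximity = sorted(replicas, key=lambda r: distance(client, r))
print([(r, distance(client, r)) for r in by_proximity])
```

Under this model the same node is at distance 0, a node on the same rack at 2, and a node on another rack at 4, so sorting by distance keeps traffic inside the rack whenever a local copy exists.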
Explain the architecture of YARN (Yet Another Resource Negotiator).
YARN separates resource management and job scheduling/monitoring into separate daemons. Its architecture consists of:
- ResourceManager (RM): The global master daemon. It arbitrates resources among all competing applications in the cluster. It has two main components:
  - Scheduler: Allocates resources to various running applications subject to constraints of capacities, queues, etc.
  - ApplicationsManager: Accepts job submissions and negotiates the first container for executing the ApplicationMaster.
- NodeManager (NM): The per-machine slave daemon. It is responsible for launching applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting this to the ResourceManager.
- ApplicationMaster (AM): A framework-specific library. It negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks.
- Container: Represents an allocated fraction of resources (RAM, CPU core) on a specific NodeManager.
Differentiate between Hadoop 1.x and Hadoop 2.x architecture.
The primary differences between Hadoop 1.x and 2.x lie in processing and resource management:
- Resource Management:
  - Hadoop 1.x: Uses MapReduce (JobTracker and TaskTracker) for both resource management and data processing.
  - Hadoop 2.x: Introduces YARN. Resource management is handled by YARN (ResourceManager/NodeManager), while processing can be handled by MapReduce or other frameworks (Spark, Tez).
- Bottlenecks:
  - Hadoop 1.x: JobTracker was a single point of failure and a scalability bottleneck (limited to ~4000 nodes).
  - Hadoop 2.x: YARN decentralizes the workload (using ApplicationMasters), allowing clusters to scale beyond 10,000 nodes.
- Ecosystem Compatibility:
  - Hadoop 1.x: Only supports batch processing via MapReduce.
  - Hadoop 2.x: Supports batch, interactive, and real-time processing through YARN's generic resource management.
Describe the MapReduce programming model and its core phases.
MapReduce is a programming model for distributed processing; Hadoop's implementation of it is written in Java. It processes data as key-value pairs. The core phases are:
- Map Phase: The input data is split into independent chunks which are processed by Map tasks in parallel. The Map function takes input pairs (k1, v1) and produces intermediate output pairs list(k2, v2).
- Shuffle and Sort Phase: The intermediate outputs from the Map phase are transferred to the Reducers. The Hadoop framework sorts this data by key. All values associated with the same key are grouped together.
- Reduce Phase: The Reduce task takes the grouped key-value pairs (k2, list(v2)) and applies a reduce function to aggregate, filter, or combine the data, producing the final output list(k3, v3). This output is then written to HDFS.
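The three phases can be illustrated with a word count. The sketch below runs in a single process and only mimics the data flow; it does not use the Hadoop API, and the function names and input lines are illustrative.

```python
from collections import defaultdict

def map_fn(_offset, line):
    """(k1, v1) = (line offset, line text) -> list of (k2, v2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """(k2, list(v2)) -> list of (k3, v3) = (word, total count)."""
    return [(word, sum(counts))]

lines = ["big data big cluster", "data node"]

intermediate = [pair for offset, line in enumerate(lines)
                for pair in map_fn(offset, line)]
result = [out for key, values in shuffle_and_sort(intermediate)
          for out in reduce_fn(key, values)]
print(result)
```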
Explain the significance of the Shuffle and Sort phase in MapReduce.
The Shuffle and Sort phase is the bridge between the Map phase and the Reduce phase in Hadoop MapReduce.
- Shuffling: This is the process of transferring intermediate data from the Mappers (which can be distributed across many nodes) to the appropriate Reducers. The framework uses a partitioner to determine which reducer receives which key-value pairs (usually using a hash function like hash(key) mod R, where R is the number of reducers).
- Sorting: Before the Reducer processes the data, Hadoop automatically sorts the key-value pairs by key. This ensures that the Reducer receives data grouped by keys in a predictable, sorted order.
- Significance: It guarantees that all values for a single key are processed by the exact same reducer, which is essential for accurate aggregation (e.g., counting word frequencies).
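The hash(key) mod R rule can be sketched as below. As an assumption for illustration, `zlib.crc32` stands in for the key's hash function, because Python's built-in `hash()` for strings changes between interpreter runs; the function names and sample pairs are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Return the reducer index for a key, in the spirit of hash(key) mod R."""
    # crc32 is a stand-in hash that gives the same value on every run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route every (key, value) pair to one reducer, then sort each input by key."""
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    return [sorted(pairs) for pairs in reducer_inputs]

# Intermediate output of two hypothetical mappers.
mapper_outputs = [
    [("data", 1), ("big", 1)],
    [("big", 1), ("node", 1), ("data", 1)],
]

reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
for index, pairs in enumerate(reducer_inputs):
    print(index, pairs)
```

Because the reducer index depends only on the key, both occurrences of "big" land on the same reducer no matter which mapper emitted them.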
Discuss the concept of Block Size and Replication in HDFS.
Block Size:
In HDFS, files are divided into large chunks called blocks. The default block size in Hadoop 2.x/3.x is 128 MB (compared to a typical 4 KB in traditional OS file systems).
Reason: A large block size minimizes the cost of disk seeks and reduces the metadata size stored in the NameNode's RAM.
Replication:
To ensure fault tolerance, HDFS replicates each block across multiple DataNodes. The default Replication Factor is 3.
Mechanism: For a given block, HDFS stores:
- One replica on the local node (or a random node if the writer is outside the cluster).
- A second replica on a different rack.
- A third replica on the same remote rack but on a different node.
This ensures data survival even if an entire node or rack fails.
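A short worked example ties the two ideas together. It assumes a 500 MB file, the defaults stated above, and a hypothetical two-rack topology; the real placement policy picks the remote rack and nodes at random, whereas this sketch simply takes the first ones so the result is repeatable.

```python
BLOCK_SIZE_MB = 128   # default block size stated above
REPLICATION = 3       # default replication factor stated above

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last one may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

def place_replicas(writer_rack, writer_node, topology):
    """Local node first, then two different nodes on one remote rack."""
    remote_rack = next(rack for rack in topology if rack != writer_rack)
    second, third = topology[remote_rack][:2]
    return [(writer_rack, writer_node), (remote_rack, second), (remote_rack, third)]

blocks = split_into_blocks(500)
raw_storage_mb = sum(blocks) * REPLICATION

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}  # hypothetical
placement = place_replicas("rack1", "dn1", topology)

print(blocks)
print(raw_storage_mb)
print(placement)
```

The 500 MB file becomes three full 128 MB blocks plus one 116 MB block, and with three replicas it consumes 1500 MB of raw disk space across the cluster.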
How does Hadoop achieve High Availability (HA)? Explain the HA architecture.
In Hadoop 1.x, the NameNode was a Single Point of Failure (SPOF). Hadoop 2.x introduced High Availability (HA) to solve this.
HA Architecture:
HA configures a cluster with two NameNodes: an Active NameNode and a Standby NameNode.
- Active NameNode: Responsible for all client operations in the cluster.
- Standby NameNode: Maintains enough state to provide a fast failover if the Active node fails.
State Synchronization:
To keep states synchronized, both nodes communicate with a group of separate daemons called JournalNodes (using the Quorum Journal Manager). When the Active node makes namespace modifications, it logs a record to the JournalNodes. The Standby node constantly reads these edits and applies them to its own namespace.
Failover:
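Automatic failover relies on Apache ZooKeeper together with a ZKFailoverController (ZKFC) process that runs alongside each NameNode. The ZKFC monitors the health of its NameNode and holds a lock in ZooKeeper while that NameNode is Active. If the Active NameNode crashes, the lock is released and the Standby's ZKFC promotes the Standby NameNode to Active status, ensuring minimal cluster downtime.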
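The quorum idea behind the JournalNodes can be sketched as follows. This is a toy model with hypothetical names: each JournalNode is a list (or None when unreachable), and an edit counts as durable only when a majority of JournalNodes have recorded it.

```python
def quorum_write(edit, journal_nodes):
    """Append the edit on every reachable JournalNode; succeed on a majority."""
    acks = 0
    for log in journal_nodes.values():
        if log is not None:      # None models an unreachable JournalNode
            log.append(edit)
            acks += 1
    return acks > len(journal_nodes) // 2

# Three JournalNodes, one of them down: 2 of 3 is still a majority.
journal_nodes = {"jn1": [], "jn2": None, "jn3": []}
committed = quorum_write("mkdir /logs", journal_nodes)

# The Standby replays the edits it reads from a JournalNode.
standby_namespace = list(journal_nodes["jn1"])

# With two of three JournalNodes down, the write cannot reach a majority.
degraded = {"jn1": [], "jn2": None, "jn3": None}
rejected = quorum_write("mkdir /tmp", degraded)

print(committed, standby_namespace, rejected)
```

This is why JournalNodes are usually deployed in odd numbers: three nodes tolerate the loss of one, five tolerate the loss of two.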
What are the different execution modes of Hadoop? Describe each briefly.
Hadoop can be deployed in three main execution modes:
- Standalone (Local) Mode:
  - This is the default mode. Hadoop runs as a single Java process.
  - It uses the local file system instead of HDFS.
  - Useful for debugging and testing MapReduce programs locally.
- Pseudo-Distributed Mode:
  - Hadoop runs on a single machine, but each daemon (NameNode, DataNode, ResourceManager, NodeManager) runs in a separate Java process.
  - It simulates a multi-node cluster on a single machine.
  - Useful for testing cluster configurations and HDFS operations.
- Fully-Distributed Mode:
  - This is the production mode where Hadoop runs on a cluster of multiple machines.
  - Master daemons and Slave daemons are distributed across different physical or virtual servers.
Explain the roles of JobTracker and TaskTracker in Hadoop 1.x.
In Hadoop 1.x, the MapReduce framework used two main daemons for execution:
- JobTracker (Master):
  - There is one JobTracker per cluster.
  - It is responsible for resource management, tracking resource availability, and scheduling jobs.
  - When a client submits a job, the JobTracker splits it into Map and Reduce tasks and assigns them to available TaskTrackers.
  - It monitors TaskTrackers via heartbeats.
- TaskTracker (Slave):
  - There is one TaskTracker per worker node.
  - It executes the Map and Reduce tasks directed by the JobTracker.
  - It sends heartbeat messages to the JobTracker every few seconds to report its status (alive, available slots).
Discuss the function of the ApplicationMaster in YARN.
The ApplicationMaster (AM) in YARN is a framework-specific library tasked with negotiating resources and managing task execution for a single application.
Functions:
- Resource Negotiation: After a job is submitted, the YARN ResourceManager allocates a container for the AM. The AM then calculates the total resources (CPU, memory) needed for its application and negotiates these from the ResourceManager.
- Task Scheduling: Once it receives containers from the ResourceManager, the AM contacts the respective NodeManagers to start the tasks.
- Monitoring and Fault Tolerance: The AM tracks the status and progress of all its tasks. If a task fails or a container crashes, the AM is responsible for requesting a new container and restarting the task.
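The monitoring and retry behaviour can be sketched as a loop. This is only an illustration: `run_application`, the task names, and the limit of three attempts are assumptions, and `run_task` stands in for launching a task in a container on a NodeManager.

```python
def run_application(tasks, run_task, max_attempts=3):
    """Run every task, retrying failures; return the attempts each task needed.

    run_task(task, attempt) -> True on success, False on failure.
    """
    attempts = {}
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            attempts[task] = attempt
            if run_task(task, attempt):   # container finished successfully
                break
            # On failure the AM would request a new container and retry.
        else:
            raise RuntimeError(f"{task} failed {max_attempts} times")
    return attempts

def flaky_cluster(task, attempt):
    """Hypothetical cluster in which map_1 crashes on its first attempt."""
    return not (task == "map_1" and attempt == 1)

attempts = run_application(["map_0", "map_1", "reduce_0"], flaky_cluster)
print(attempts)
```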
What is a 'Heartbeat' in Hadoop, and why is it important?
A Heartbeat is a periodic signal sent by slave nodes to their respective master nodes in the Hadoop cluster.
- In HDFS: DataNodes send heartbeats to the NameNode (typically every 3 seconds).
- In YARN: NodeManagers send heartbeats to the ResourceManager.
Importance:
- Liveness Check: It proves that the slave node is alive and functioning. If the master does not receive a heartbeat from a node within a configured timeout (roughly 10 minutes by default), it marks the node as dead.
- Status Reporting: The heartbeat carries information about the node's total capacity and the fraction of resources in use. In HDFS, DataNodes additionally send Block Reports, which are separate and less frequent messages listing the blocks they store.
- Fault Tolerance: Identifying dead nodes allows the master to initiate recovery mechanisms, such as re-replicating lost HDFS blocks or rescheduling failed YARN tasks on other active nodes.
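The liveness check can be sketched as a comparison between the current time and the last heartbeat seen from each node. The constants mirror the figures in the text (3 seconds, about 10 minutes); the function name, node names, and timestamps are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3    # DataNode heartbeat period from the text
DEAD_TIMEOUT_S = 600        # the "10 minutes" timeout from the text

def dead_nodes(last_heartbeat, now, timeout=DEAD_TIMEOUT_S):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout)

# Timestamps in seconds; dn2 has been silent for 700 seconds.
last_heartbeat = {"dn1": 997, "dn2": 300, "dn3": 1000}
dead = dead_nodes(last_heartbeat, now=1000)
print(dead)
```

A node that misses a single heartbeat is not declared dead; only prolonged silence past the timeout triggers recovery.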
Explain how fault tolerance is achieved in HDFS.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware where failures are common. It achieves fault tolerance through:
- Data Replication: Every block of data is replicated across multiple DataNodes (default is 3). If one DataNode crashes, the data can be retrieved from another node hosting the replica.
- Heartbeats: DataNodes constantly send heartbeat signals to the NameNode. If no heartbeat arrives from a DataNode within the timeout period, the NameNode marks it as dead and stops routing read/write requests to it.
- Re-replication: When a DataNode fails, the number of live replicas of its blocks drops below the replication factor. The NameNode detects these under-replicated blocks from its block map and automatically instructs other DataNodes to copy them until the configured replication factor (3 by default) is restored.
- Checksums: HDFS computes checksums for data blocks during writes and verifies them during reads to detect data corruption. When a corrupt replica is found, the client reads the block from another replica, and the corrupt copy is reported to the NameNode so that it can be replaced.
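Re-replication can be sketched as a scan of the block map after a DataNode is declared dead. The names and the block map below are hypothetical, and target nodes are chosen alphabetically to keep the output repeatable, whereas a real cluster also weighs rack placement and load.

```python
def replication_plan(block_map, live_nodes, target=3):
    """For each under-replicated block, list the nodes that should get a copy.

    block_map: dict block_id -> set of DataNodes holding a replica
    """
    plan = {}
    for block, holders in block_map.items():
        alive = holders & live_nodes
        missing = target - len(alive)
        if alive and missing > 0:             # a live copy exists to clone from
            candidates = sorted(live_nodes - alive)
            plan[block] = candidates[:missing]
    return plan

block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn3", "dn4", "dn5"}     # dn2 has been marked dead

plan = replication_plan(block_map, live_nodes)
print(plan)
```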
Describe the concept of Data Locality in Hadoop and explain its interaction between HDFS and MapReduce/YARN.
Data Locality is the principle of moving the computation to the data, rather than moving large amounts of data to the computation.
Interaction between HDFS and YARN:
- When a MapReduce job is submitted, the input is divided into splits, and the locations (DataNodes and racks) of the underlying data blocks are obtained from the HDFS NameNode.
- The job's ApplicationMaster requests containers (execution environments) from the YARN ResourceManager with these locations as preferences, and the ResourceManager's scheduler attempts to allocate each container on a DataNode where the data block resides (Node-local).
- If node-local resources are unavailable, it tries to allocate containers on the same rack (Rack-local).
- As a last resort, it allocates containers on a different rack (Off-rack).
Importance: Data locality drastically minimizes network congestion and increases the overall throughput of the system, which is critical for processing Big Data.
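The three locality levels can be expressed as a small preference function. The sketch assumes a hypothetical mapping of nodes to racks and picks, among the nodes with free capacity, the one closest to a replica of the block; the function names are illustrative.

```python
LOCALITY_ORDER = {"node-local": 0, "rack-local": 1, "off-rack": 2}

def locality(block_replicas, node, rack_of):
    """Classify how close a candidate node is to the block's replicas."""
    if node in block_replicas:
        return "node-local"
    if rack_of[node] in {rack_of[replica] for replica in block_replicas}:
        return "rack-local"
    return "off-rack"

def pick_node(block_replicas, free_nodes, rack_of):
    """Choose the free node with the best locality for this block."""
    return min(sorted(free_nodes),
               key=lambda n: LOCALITY_ORDER[locality(block_replicas, n, rack_of)])

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r1", "dn4": "r2"}  # hypothetical
block_replicas = {"dn1", "dn2"}

for free in ({"dn2", "dn4"}, {"dn3", "dn4"}, {"dn4"}):
    chosen = pick_node(block_replicas, free, rack_of)
    print(sorted(free), chosen, locality(block_replicas, chosen, rack_of))
```

Only in the last case, where no node on the replicas' rack is free, does the block have to cross racks to reach the computation.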
Briefly describe the purpose of ZooKeeper, Hive, and Pig in the Hadoop Ecosystem in relation to the core architecture.
While HDFS and YARN form the core architecture, other tools interact with them to provide a complete ecosystem:
- ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In Hadoop architecture, it is critically used for High Availability (HA) to manage automatic failover of the Active NameNode and ResourceManager.
- Hive: A data warehouse infrastructure built on top of Hadoop. It provides an SQL-like interface (HiveQL) to query data stored in HDFS. Hive translates these queries into MapReduce (or Tez/Spark) jobs executed via YARN, abstracting the complexity of MapReduce programming.
- Pig: A high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Like Hive, Pig scripts are converted into MapReduce tasks by the framework, allowing for easier data transformation (ETL) pipelines on HDFS data.