Unit 1 - Subjective Questions
INT312 • Practice Questions with Detailed Answers
Define Big Data and explain the '5 V's' of Big Data.
Big Data refers to datasets whose size, type, and speed of creation make them too complex to be captured, managed, and processed by traditional databases and data processing tools.
The 5 V's of Big Data are:
- Volume: The massive amount of data generated every second (e.g., terabytes or petabytes).
- Velocity: The speed at which new data is generated and the pace at which it moves around (e.g., real-time streaming).
- Variety: The different types of data formats (structured, semi-structured like XML/JSON, and unstructured like audio/video).
- Veracity: The trustworthiness and accuracy of the data. Some sources are uncertain or noisy, so the data often needs cleansing before use.
- Value: The business value or insights that can be extracted from the collected data.
What is Apache Hadoop? Briefly explain its core components.
Apache Hadoop is an open-source framework managed by the Apache Software Foundation that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
Its core components are:
- HDFS (Hadoop Distributed File System): The storage layer of Hadoop that breaks data into blocks and distributes them across cluster nodes.
- MapReduce: The processing or computation layer that breaks down processing into Map (filter/transform) and Reduce (summarize) phases, with the framework sorting and grouping the intermediate data in between.
- YARN (Yet Another Resource Negotiator): The resource management layer introduced in Hadoop 2.x that manages cluster resources and job scheduling.
- Hadoop Common: The common utilities and libraries that support the other Hadoop modules.
Explain the architecture of HDFS in detail. What are the roles of the NameNode and DataNode?
HDFS follows a Master-Slave architecture.
1. NameNode (Master):
- It acts as the centerpiece of the HDFS architecture.
- Metadata Storage: It stores all the metadata about the file system (e.g., file names, permissions, and the mapping of blocks to DataNodes).
- It does not store actual data.
- It keeps track of the health of DataNodes via Heartbeats.
2. DataNode (Slave):
- These are the commodity hardware machines deployed across the cluster.
- Data Storage: They store the actual data in the form of blocks (default 128 MB).
- They serve read and write requests from the file system's clients.
- They send periodic heartbeats and block reports to the NameNode to prove they are alive and report their stored blocks.
3. Secondary NameNode:
- It is a helper node that periodically merges the edit logs with the fsimage to prevent the edit logs from growing too large. It is not a backup NameNode in the traditional sense.
Describe the MapReduce programming model. How does it process large datasets?
MapReduce is a programming paradigm for parallel processing of large datasets. It consists of two main functions: Map and Reduce.
1. Map Phase:
- The input dataset is split into independent chunks which are processed by the map tasks in a completely parallel manner.
- The Map function takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs.
2. Shuffle and Sort Phase:
- The output of the Map phase is sorted based on the keys and partitioned.
- Data with the same keys are shuffled and brought together to the same reducer.
3. Reduce Phase:
- The Reduce task takes the sorted key-value pairs and combines them to form a smaller set of tuples.
- It performs summary operations (like counting or aggregation) and writes the final output back to HDFS.
This model brings the computation to the data rather than moving data to the computation, vastly improving efficiency.
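The three phases above can be sketched in plain Python with the classic word-count job. This is a single-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) are hypothetical, and each input line stands in for one input split.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each line stands in for one input split handled by its own map task.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

In a real cluster the map calls run in parallel on different nodes and the grouped pairs travel over the network to the reducers; here everything runs in one process so the data flow is easy to follow.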
How does HDFS ensure fault tolerance and high availability?
HDFS ensures fault tolerance primarily through Data Replication.
- Block Replication: When a file is loaded into HDFS, it is broken into blocks. Each block is replicated across multiple DataNodes (the default replication factor is 3).
- Rack Awareness: The NameNode ensures that replicas are placed on different racks to protect against rack-level power or network failures.
- Heartbeat Mechanism: DataNodes send periodic heartbeats to the NameNode. If a DataNode fails to send a heartbeat, the NameNode declares it dead and initiates the replication of its blocks to other healthy nodes to maintain the replication factor.
- High Availability (HA) NameNode: In Hadoop 2.x onwards, an Active/Standby NameNode setup ensures that if the Active NameNode fails, the Standby takes over quickly, with minimal disruption to users.
What is YARN? Explain the architecture of YARN including ResourceManager and NodeManager.
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop introduced in version 2.x to separate resource management and job scheduling/monitoring from data processing.
Architecture Components:
1. ResourceManager (RM):
- The master daemon running on the master node.
- It is the ultimate authority that arbitrates resources among all applications in the system.
- It consists of a Scheduler (allocates resources) and an ApplicationsManager (accepts job submissions and negotiates the first container for the AppMaster).
2. NodeManager (NM):
- The per-machine worker (slave) daemon of the framework.
- It is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager.
3. ApplicationMaster (AM):
- A per-application, framework-specific entity.
- It negotiates resources from the RM and works with the NM(s) to execute and monitor the tasks.
4. Container:
- Represents a fraction of the NM's capacity (Memory, CPU) used to execute tasks.
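To make the idea of a container as a slice of a NodeManager's capacity concrete, here is a toy first-fit allocator. It is a sketch under simplifying assumptions: the `NodeManager` class and `allocate_container` function are hypothetical names, and real YARN schedulers (such as the Capacity and Fair schedulers) use queues, priorities, and locality rather than plain first-fit.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical stand-in for one worker node's free capacity."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)


def allocate_container(nodes, app_id, memory_mb, vcores):
    """Place a container on the first node with enough free memory and vcores.

    Returns the chosen node's name, or None if no node can host the request.
    """
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            node.containers.append((app_id, memory_mb, vcores))
            return node.name
    return None


if __name__ == "__main__":
    cluster = [NodeManager("nm1", 4096, 2), NodeManager("nm2", 8192, 4)]
    print(allocate_container(cluster, "app1", 3072, 1))
    print(allocate_container(cluster, "app1", 3072, 1))
```

The second request lands on a different node because the first node no longer has enough free memory, which is the bookkeeping the ResourceManager's Scheduler performs on a much larger scale.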
Compare and contrast Hadoop 1.x and Hadoop 2.x architectures.
Hadoop 1.x vs Hadoop 2.x:
- Resource Management:
  - 1.x: Handled by JobTracker, which managed both resources and job scheduling, leading to bottlenecks.
  - 2.x: Handled by YARN (ResourceManager for cluster resources, ApplicationMaster for job monitoring).
- NameNode:
  - 1.x: Single Point of Failure (SPOF). Only one NameNode existed.
  - 2.x: Introduced High Availability (HA) with Active and Standby NameNodes.
- Ecosystem Support:
  - 1.x: Only supported MapReduce for processing.
  - 2.x: Because of YARN, it supports multiple processing models (MapReduce, Spark, Storm, Flink).
- Scalability:
  - 1.x: Limited to ~4000 nodes due to JobTracker limitations.
  - 2.x: Scales easily beyond 10,000 nodes.
Explain the Heartbeat mechanism in Hadoop. Why is it important?
The Heartbeat is a periodic signal sent by DataNodes to the NameNode (and by NodeManagers to the ResourceManager in YARN) to indicate that they are alive and functioning properly.
- Default Interval: DataNodes send a heartbeat every 3 seconds.
- Health Monitoring: It acts as an "I am alive" message.
- Failure Detection: If the NameNode does not receive a heartbeat from a DataNode for a specific duration (about 10 minutes 30 seconds with default settings), it considers that node dead.
- Self-Healing: Upon detecting a dead node, the NameNode checks which blocks were on that node and instructs other nodes to replicate those blocks to maintain the required replication factor.
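The dead-node timeout of roughly ten minutes is not a single setting; it is commonly derived from the heartbeat interval and the NameNode's recheck interval. The sketch below reproduces that arithmetic, assuming the stock values of `dfs.heartbeat.interval` (3 seconds) and `dfs.namenode.heartbeat.recheck-interval` (5 minutes). The helper names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3        # assumed default of dfs.heartbeat.interval
RECHECK_INTERVAL_S = 5 * 60     # assumed default of dfs.namenode.heartbeat.recheck-interval


def dead_node_timeout_s(heartbeat_s=HEARTBEAT_INTERVAL_S, recheck_s=RECHECK_INTERVAL_S):
    """Timeout after which a silent DataNode is treated as dead."""
    return 2 * recheck_s + 10 * heartbeat_s


def is_dead(last_heartbeat_ts, now_ts, timeout_s=None):
    """True if the node has been silent for longer than the timeout."""
    if timeout_s is None:
        timeout_s = dead_node_timeout_s()
    return now_ts - last_heartbeat_ts > timeout_s


if __name__ == "__main__":
    print(dead_node_timeout_s())  # 630 seconds, i.e. 10 minutes 30 seconds
```

Missing a single 3-second heartbeat therefore does not mark a node dead; the long timeout avoids triggering expensive re-replication for brief network hiccups.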
What is the role of the Secondary NameNode? Is it a backup for the NameNode?
No, the Secondary NameNode is NOT a backup for the primary NameNode.
Role of Secondary NameNode:
- The primary NameNode keeps the file system metadata in RAM for fast access, and logs changes to a file called edits. The memory state is persisted in an fsimage file.
- Over time, the edits log grows very large, which would cause a massive delay if the NameNode had to restart and replay it.
- The Secondary NameNode periodically downloads the fsimage and edits files from the primary NameNode.
- It merges them into a new, updated fsimage and sends it back to the primary NameNode.
- This process is called Checkpointing, which ensures the edit logs don't grow indefinitely and speeds up primary NameNode restarts.
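Checkpointing can be pictured as replaying a log of changes onto a snapshot. The following sketch models the fsimage as a dictionary and the edits log as a list of tuples. Both representations, the three operation names, and the function names are illustrative assumptions; the real files are binary formats with many more operation types.

```python
def apply_edit(namespace, edit):
    """Apply one logged change to an in-memory namespace (path -> metadata)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edits):
    """Merge the edits into a copy of the fsimage; return the new image and an empty log."""
    new_image = dict(fsimage)
    for edit in edits:
        apply_edit(new_image, edit)
    return new_image, []


if __name__ == "__main__":
    image = {"/a": {"blocks": 1}}
    log = [("create", "/b", {"blocks": 2}), ("rename", "/a", "/c"), ("delete", "/b")]
    print(checkpoint(image, log))
```

After a checkpoint, a restarting NameNode only has to load the new image and replay the few edits logged since, rather than the whole history.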
What is a block in HDFS? Explain why HDFS uses such large block sizes (e.g., 128 MB).
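A Block is the basic unit of storage in HDFS. Large files are broken down into multiple blocks before being stored, and each block is stored and replicated independently. A file smaller than one block occupies only its actual size on disk, not a full block.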
- Default Size: In Hadoop 2.x, the default block size is 128 MB (compared to 4 KB in traditional OS file systems).
Why large block sizes?
- Minimize Seek Time: By making a block large enough, the time required to transfer the data from the disk becomes significantly larger than the time required to seek to the start of the block. This makes sequential reads highly efficient.
- Reduce Metadata Load on NameNode: The NameNode stores metadata about every block in RAM. Larger blocks mean fewer total blocks for a given dataset, reducing the memory footprint on the NameNode.
- Efficient Network Transfer: Less overhead in establishing network connections for massive data reads.
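The first two reasons can be checked with simple arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s transfer rate; these are illustrative figures, not measurements, and the function names are hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def seek_overhead(block_size_bytes, seek_time_s=0.010, transfer_rate_bps=100 * MB):
    """Fraction of the time to read one block that is spent seeking."""
    transfer_time_s = block_size_bytes / transfer_rate_bps
    return seek_time_s / (seek_time_s + transfer_time_s)


if __name__ == "__main__":
    print(block_count(1 * GB, 128 * MB))        # blocks the NameNode must track
    print(block_count(1 * GB, 4 * 1024))        # same file with 4 KB blocks
    print(round(seek_overhead(128 * MB), 4))    # seek share with 128 MB blocks
    print(round(seek_overhead(4 * 1024), 4))    # seek share with 4 KB blocks
```

Under these assumptions a 1 GB file needs 8 blocks at 128 MB but 262,144 blocks at 4 KB, and seeking takes under 1% of the read time for a 128 MB block versus over 99% for a 4 KB block.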
Explain the anatomy of a file read and write operation in HDFS.
HDFS Read Operation:
- The Client contacts the NameNode to get the block locations for the file.
- The NameNode replies with a list of DataNodes that host the blocks, sorted by proximity to the client.
- The Client directly contacts the closest DataNode and requests the block.
- Once the first block is read, the client proceeds to the next block, repeating the process until the whole file has been read.
HDFS Write Operation:
- The Client asks the NameNode for permission to write a file.
- The NameNode checks permissions and file existence, then provides a list of DataNodes to store the first block and its replicas.
- The Client writes data to the first DataNode.
- The first DataNode pipes the data to the second DataNode, which pipes it to the third (Data Replication Pipeline).
- Once all nodes in the pipeline acknowledge the write, an ack is sent back to the client.
- The client then asks the NameNode for a new set of DataNodes for the next block, and the process repeats until the file is fully written.
Discuss the Data Replication strategy in HDFS. Give an example with a replication factor of 3.
Data replication is HDFS's core strategy for reliability and fault tolerance. When a file is divided into blocks, each block is duplicated across multiple nodes.
Placement Policy (with Replication Factor = 3):
- Replica 1: Placed on the local machine where the client is writing the data (or a random node if the client is outside the cluster).
- Replica 2: Placed on a different node in a different rack from the first replica. This protects against rack failure.
- Replica 3: Placed on a different node in the same rack as the second replica. This minimizes inter-rack network traffic while maintaining redundancy.
If any node or rack goes down, HDFS uses the surviving replicas to recreate the lost blocks on other nodes.
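The placement policy above can be sketched as a small function. This is a simplified illustration: `place_replicas` and its `topology` argument (a mapping from rack name to node names) are hypothetical, and the real HDFS policy also weighs node load and free space and falls back gracefully when the ideal placement is impossible.

```python
import random


def place_replicas(topology, writer=None, rng=random):
    """Pick three nodes for one block following the rack-aware policy.

    topology: dict mapping rack name -> list of node names.
    writer:   node the client runs on, or None if the client is outside the cluster.
    """
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}

    # Replica 1: the writer's own node, or a random node for an external client.
    first = writer if writer in node_rack else rng.choice(sorted(node_rack))

    # Replicas 2 and 3: two different nodes on one rack other than the first replica's.
    remote_racks = sorted(
        rack for rack, nodes in topology.items()
        if rack != node_rack[first] and len(nodes) >= 2
    )
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote = rng.choice(remote_racks)
    second, third = rng.sample(sorted(topology[remote]), 2)
    return [first, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
    print(place_replicas(racks, writer="n1"))
```

With this layout, losing any single rack still leaves at least one replica available, while only one copy of the block has to cross between racks during the write.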
Explain the High Availability (HA) architecture of NameNode introduced in Hadoop 2.x.
In Hadoop 1.x, the NameNode was a Single Point of Failure (SPOF). Hadoop 2.x solved this by introducing High Availability (HA).
Architecture Details:
- Active and Standby Nodes: Two NameNodes are configured; one is in an Active state, and the other is in a Standby state.
- Shared Storage: Both nodes share a storage directory (e.g., Quorum Journal Nodes or a shared NFS directory). When the Active node modifies the namespace, it logs the modification to the shared storage. The Standby node constantly reads these edits to keep its state synchronized.
- DataNode Block Reports: DataNodes are configured with the IP addresses of both NameNodes and send block reports and heartbeats to both.
- Failover Controller: ZooKeeper coordinates the automatic failover. If the Active NameNode crashes, the ZooKeeper Failover Controller (ZKFC) detects it and automatically promotes the Standby to Active, allowing the cluster to continue operating with minimal downtime.
Detail the different phases of a MapReduce job (Map, Combine, Shuffle & Sort, Reduce).
A MapReduce job goes through several distinct phases:
1. Map Phase:
Data from HDFS is read as key-value pairs. The Map function processes these pairs and generates intermediate key-value pairs.
2. Combiner Phase (Optional but common):
Often called a 'mini-reducer', it runs on the Map output locally on the same node to aggregate data before sending it over the network, drastically reducing network I/O.
3. Shuffle and Sort Phase:
- Shuffle: The process of transferring intermediate data from the Mappers to the Reducers.
- Sort: The framework automatically sorts the intermediate keys so that all values associated with the same key are grouped together.
4. Reduce Phase:
The Reducer takes the grouped key-value pairs, iterates through the values, and applies the logic (e.g., sum, average) to produce the final output, which is then written to HDFS.
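The effect of the optional Combiner is easiest to see by counting how many intermediate records reach the shuffle. The sketch below runs the same word-count job with and without local aggregation. It is a single-process illustration with hypothetical function names, not Hadoop's API.

```python
from collections import Counter


def map_words(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it leaves the node."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def run_job(splits, use_combiner):
    """Return (final counts, number of records sent through the shuffle)."""
    shuffled = []
    for split in splits:
        output = map_words(split)
        if use_combiner:
            output = combine(output)
        shuffled.extend(output)  # these records would cross the network
    result = Counter()
    for key, value in shuffled:  # Reduce: sum the values for each key
        result[key] += value
    return dict(result), len(shuffled)


if __name__ == "__main__":
    data = ["a a a b", "b b a"]
    print(run_job(data, use_combiner=False))
    print(run_job(data, use_combiner=True))
```

Both runs produce the same counts, but the combiner cuts the shuffled records from 7 to 4 here. This only works because summation is associative and commutative; an operation such as a plain average cannot be used as its own combiner without carrying extra state.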
Briefly describe the components of the Hadoop Ecosystem (Hive, Pig, Sqoop, Flume, HBase).
The Hadoop ecosystem comprises several tools built around HDFS and YARN:
- Hive: A data warehousing tool that provides a SQL-like interface (HiveQL) to query and analyze data stored in HDFS. It abstracts the complexity of writing MapReduce jobs.
- Pig: A scripting platform (using Pig Latin) for processing and analyzing large datasets. It is excellent for ETL (Extract, Transform, Load) tasks.
- Sqoop: A tool designed to transfer data efficiently between Hadoop and traditional relational databases (RDBMS).
- Flume: A distributed service for efficiently collecting, aggregating, and moving large amounts of streaming log data into HDFS.
- HBase: A NoSQL, column-oriented database built on top of HDFS, providing real-time read/write access to massive datasets.
Discuss the scenarios where Hadoop is NOT the right tool for data processing.
While Hadoop is powerful, it is not a silver bullet. It is unsuitable for:
- Low Latency / Real-Time Data Access: Hadoop (specifically MapReduce) is designed for batch processing and high throughput, not for millisecond-level responses (unlike RDBMS).
- Processing Many Small Files: HDFS NameNode stores metadata in RAM. Millions of tiny files will exhaust NameNode memory rapidly.
- Highly Iterative Processing: Iterative algorithms, such as many used in machine learning, make multiple passes over the same data and are slow in MapReduce because it writes intermediate results to disk between jobs. (Spark, which can keep data in memory, is better here.)
- Complex Relational Transactions: Hadoop does not natively support ACID properties across multiple rows/tables as efficiently as a traditional RDBMS.
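The small-files problem can be estimated with a back-of-the-envelope calculation. The sketch below assumes roughly 150 bytes of NameNode heap per namespace object (one object per file plus one per block), a commonly quoted rule of thumb rather than an exact figure. The function name is hypothetical.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure


def namenode_memory_bytes(num_files, file_size_bytes, block_size_bytes=128 * MB):
    """Rough NameNode heap needed to track num_files files of equal size."""
    blocks_per_file = max(1, (file_size_bytes + block_size_bytes - 1) // block_size_bytes)
    objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # About 1 TB stored as 8,192 files of 128 MB each.
    print(namenode_memory_bytes(8_192, 128 * MB))
    # About 1 TB stored as 100 million files of 10 KB each.
    print(namenode_memory_bytes(100_000_000, 10 * 1024))
```

Under this assumption, the same terabyte costs a few megabytes of NameNode memory as large files but about 30 GB as tiny files, which is why small files are usually merged into larger container files before loading.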
What is Speculative Execution in Hadoop MapReduce?
Speculative Execution is an optimization technique in Hadoop.
In a distributed environment, a job is only as fast as its slowest task (straggler). A node might be slow due to hardware degradation or CPU overload.
- Instead of waiting for the slow task to finish or failing it immediately, the Hadoop framework launches a duplicate (speculative) task on a different, faster node.
- Whichever task finishes first (the original or the speculative one), its results are accepted, and the other running task is killed.
- This prevents slow nodes from bottlenecking the entire MapReduce job.
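A toy model shows why a duplicate task helps. It assumes every task still running at a fixed time gets a backup copy that takes a known duration, and that the first copy to finish wins. Real Hadoop instead decides which tasks to speculate by comparing their progress rates, so the function names and parameters here are purely illustrative.

```python
def job_time(task_times):
    """Without speculation the job finishes when its slowest task does."""
    return max(task_times)


def job_time_with_speculation(task_times, launch_at, backup_duration):
    """Job finish time when stragglers get a backup task.

    launch_at:       seconds into the job when backups are started.
    backup_duration: seconds a backup takes on a healthy node.
    """
    finish_times = []
    for original in task_times:
        if original > launch_at:
            # Whichever copy finishes first is accepted; the other is killed.
            finish_times.append(min(original, launch_at + backup_duration))
        else:
            finish_times.append(original)
    return max(finish_times)


if __name__ == "__main__":
    times = [60, 62, 300]  # one straggler on a slow node
    print(job_time(times))
    print(job_time_with_speculation(times, launch_at=90, backup_duration=60))
```

In this example the job finishes in 150 seconds instead of 300. The cost is duplicated work, which is why speculative execution is sometimes disabled on busy clusters.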
Explain the concept of Rack Awareness in HDFS and its advantages.
Rack Awareness is a policy by which the NameNode decides how to place blocks based on the physical rack topology of the cluster.
- Administrators configure a script that maps Node IPs to their respective physical racks.
- Advantages:
  - Fault Tolerance: By ensuring replicas of a block are placed on at least two different racks, HDFS guarantees data availability even if an entire rack goes offline (e.g., due to a switch or power failure).
  - Network Efficiency: Inter-rack bandwidth is generally lower than intra-rack bandwidth. HDFS tries to serve read requests from a replica located on the same rack as the client, reducing overall network congestion.
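Choosing the nearest replica can be sketched with a tree-distance function over topology paths. The paths of the form `/rack/node` and the function names below are assumptions for illustration. The distance counts the hops from each node up to their closest common ancestor, giving 0 for the same node, 2 for the same rack, and 4 for different racks.

```python
def distance(a, b):
    """Tree distance between two topology paths such as '/rack1/n1'."""
    if a == b:
        return 0
    parts_a = a.strip("/").split("/")
    parts_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(parts_a) - common) + (len(parts_b) - common)


def sort_by_proximity(client, replicas):
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda replica: distance(client, replica))


if __name__ == "__main__":
    holders = ["/rack2/n4", "/rack1/n2", "/rack1/n1"]
    print(sort_by_proximity("/rack1/n1", holders))
```

A client on `/rack1/n1` would read from its own node first, then from a neighbour on the same rack, and only cross racks as a last resort.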
Distinguish between traditional RDBMS and Apache Hadoop.
Differences between RDBMS and Hadoop:
- Data Type: RDBMS is designed for structured data. Hadoop can handle structured, semi-structured, and unstructured data.
- Schema: RDBMS uses "Schema-on-Write" (data must fit the schema when inserted). Hadoop uses "Schema-on-Read" (data is stored raw, schema is applied during querying).
- Processing: RDBMS excels at OLTP (Online Transaction Processing) with low latency. Hadoop excels at OLAP (Online Analytical Processing) and batch processing.
- Scaling: RDBMS generally scales vertically (adding more power to a single server). Hadoop scales horizontally (adding more commodity servers to a cluster).
- Cost: RDBMS often requires expensive proprietary hardware/software. Hadoop uses open-source software on commodity hardware.
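The Schema-on-Read idea from the list above can be illustrated in a few lines: raw text is stored untouched, and a schema is applied only when the data is read. The sample data, the `read_with_schema` function, and the choice to turn unparseable fields into `None` are assumptions for this sketch.

```python
import csv
import io

# Raw data as it might sit in storage: no schema was enforced when it was written.
RAW = "1,alice,30\n2,bob,not_a_number\n"


def read_with_schema(raw_text, schema):
    """Apply (name, type) pairs to each record at read time.

    Fields that do not fit the declared type become None instead of
    causing the load to be rejected.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, cast), value in zip(schema, record):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(read_with_schema(RAW, [("id", int), ("name", str), ("age", int)]))
```

A Schema-on-Write system would have rejected the second record at insert time; here it is stored anyway and the problem only surfaces, as a missing value, when a query applies the schema. The same raw data could also be read later with a different schema.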
Discuss the concept of Data Locality in Hadoop. Why is it a fundamental principle of the framework?
Data Locality refers to the ability to move the computation to the node where the data actually resides, rather than moving large amounts of data over the network to the computation node.
Why it is fundamental:
- In traditional systems, moving terabytes of data to a processing node causes massive network bottlenecks and slows down processing.
- Hadoop reverses this: HDFS splits the data into blocks and distributes them across the cluster. YARN/MapReduce then schedules the Map tasks directly on the DataNodes that hold the respective blocks.
- Result: This greatly reduces network traffic, makes the most of local disk I/O, and is a core reason Hadoop can scale to process petabytes of data efficiently.
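A locality-aware scheduler can be sketched as a greedy assignment: run each map task on a node that already holds the block if it has a free slot, and fall back to any free node otherwise. The function and its arguments are hypothetical, and real schedulers also consider rack-local placement and may wait briefly for a local slot.

```python
def schedule_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring nodes that hold the block.

    block_locations: dict mapping block id -> list of nodes with a replica.
    free_slots:      dict mapping node -> number of free task slots.
    Returns a dict mapping block id -> (node, "node-local" or "remote").
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, kind = local[0], "node-local"
        else:
            available = [node for node, count in slots.items() if count > 0]
            if not available:
                raise RuntimeError("no free slots left in the cluster")
            node, kind = available[0], "remote"  # block must be read over the network
        slots[node] -= 1
        assignment[block] = (node, kind)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n1"]}
    print(schedule_map_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

In this example two of the three tasks read their block from local disk, and only the third has to pull its block across the network because its only holder is already busy.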