Unit3 - Subjective Questions
INT312 • Practice Questions with Detailed Answers
Explain the MapReduce programming paradigm. What are its two primary phases?
MapReduce is a programming model and processing engine for processing and generating large datasets in parallel across the nodes of a cluster.
It consists of two primary phases:
- Map Phase: The input data is split into independent chunks, which the map tasks process in parallel. The Map function takes a key-value pair and produces a set of intermediate key-value pairs. The framework then sorts the map outputs and feeds them to the reduce tasks.
- Reduce Phase: The Reduce function accepts the intermediate key-value pairs (grouped by key) and merges or aggregates them to form a smaller set of values.
Mathematically, the transformations can be represented as:
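- Map: $(k_1, v_1) \rightarrow \text{list}(k_2, v_2)$
- Reduce: $(k_2, \text{list}(v_2)) \rightarrow \text{list}(k_3, v_3)$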
Describe the architecture of Classic MapReduce (MRv1). What are the main daemons involved?
Classic MapReduce (MRv1) operates using a Master-Slave architecture consisting of two primary daemons:
- JobTracker (Master): There is one JobTracker per cluster. It acts as the master node and is responsible for resource management, tracking resource availability, and scheduling jobs. It divides jobs into tasks and assigns them to available TaskTrackers.
- TaskTracker (Slave): There are multiple TaskTrackers (usually one per node). They act as slaves, executing the tasks (Map or Reduce) directed by the JobTracker. They continuously send heartbeat messages to the JobTracker to indicate their alive status and current resource availability.
Limitations: The JobTracker became a bottleneck and a single point of failure in massive clusters, leading to the development of YARN.
Explain the WordCount program using the MapReduce framework with a clear example.
The WordCount program counts the frequency of each word in a given text file.
Example: Consider the input text: "hello world hello hadoop"
1. Input Splitting & Record Reader:
The text is converted into Key-Value pairs. Key is the byte offset, Value is the line text.
- Pair 1: (0, "hello world hello hadoop")
2. Map Phase:
The Mapper tokenizes the value by space and outputs a count of 1 for each word.
- Output: ("hello", 1), ("world", 1), ("hello", 1), ("hadoop", 1)
3. Shuffle and Sort Phase:
The framework groups all values sharing the same key.
- Output: ("hadoop", [1]), ("hello", [1, 1]), ("world", [1])
4. Reduce Phase:
The Reducer sums up the lists of values for each key.
- Output: ("hadoop", 1), ("hello", 2), ("world", 1)
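The four steps above can be traced in a few lines of plain Python. This is a minimal single-process sketch rather than Hadoop code: the function names (map_fn, shuffle_and_sort, reduce_fn) are illustrative assumptions, and a real job would implement Hadoop's Mapper and Reducer classes in Java.

```python
from collections import defaultdict

def map_fn(offset, line):
    """Map: emit (word, 1) for every whitespace-separated token in the line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(word, counts):
    """Reduce: sum the list of counts for one word."""
    return (word, sum(counts))

records = [(0, "hello world hello hadoop")]  # (byte offset, line text)
mapped = [pair for offset, line in records for pair in map_fn(offset, line)]
grouped = shuffle_and_sort(mapped)
result = [reduce_fn(word, counts) for word, counts in grouped]
print(result)  # [('hadoop', 1), ('hello', 2), ('world', 1)]
```

In a real cluster the grouping step is performed by the framework across many machines; here it is collapsed into one in-memory function for clarity.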
What is the role of a Combiner in MapReduce? Why is it known as a 'Mini-Reducer'?
A Combiner is an optional component in MapReduce that executes locally on the node where the Mapper ran.
Role:
- Its primary job is to aggregate the output of the Mapper locally before it is sent over the network to the Reducer.
- This significantly reduces the volume of data transferred across the network, optimizing bandwidth and speeding up the shuffle phase.
Why 'Mini-Reducer':
- The Combiner performs a function very similar to the Reducer (e.g., aggregation, summing), but it only processes the data generated by a single Map task. For instance, in WordCount, if a Mapper outputs ("apple", 1) three times, the Combiner aggregates this to ("apple", 3) before sending it to the Reducer.
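The effect of local aggregation can be shown with a small Python sketch. It assumes a single mapper whose output is held in memory; map_words and combine are hypothetical names, not part of any Hadoop API.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) for each token."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation over ONE mapper's output (same logic as the reducer)."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = map_words("apple banana apple apple")
combined = combine(mapper_output)
print(len(mapper_output), "records before combining")
print(len(combined), "records after combining:", combined)
```

Reusing the reducer logic as a combiner is safe here because summation is associative and commutative. Hadoop may run a combiner zero, one, or several times, so the job must produce the same result in every case.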
Explain the Partitioning phase in MapReduce. How does the default Partitioner work?
Partitioning in MapReduce dictates which Reducer will process a specific intermediate key-value pair.
Mechanism:
- The Partitioner runs after the Map phase and the optional Combiner phase.
- The total number of partitions equals the number of Reduce tasks configured for the job (e.g., if there are N reducers, there are N partitions).
Default Partitioner (HashPartitioner):
- The default partitioner applies a hash function to the key: partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
- This ensures that all key-value pairs with the exact same key are sent to the exact same Reducer, which is essential for accurate aggregation.
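The formula above can be reproduced outside the JVM. The following Python sketch is an illustration under two assumptions: keys are plain ASCII strings, and the hash is Java's String.hashCode(), used here only as a stand-in (Hadoop key types such as Text define their own hashCode). The helper names java_string_hash and get_partition are hypothetical.

```python
def java_string_hash(s):
    """Replicate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h

INTEGER_MAX_VALUE = 0x7FFFFFFF

def get_partition(key, num_reduce_tasks):
    """Clear the sign bit so the value is non-negative, then take the modulus."""
    return (java_string_hash(key) & INTEGER_MAX_VALUE) % num_reduce_tasks

for key in ["hello", "world", "hadoop"]:
    print(key, "->", get_partition(key, 3))
```

The bitwise AND with Integer.MAX_VALUE matters because hashCode() can be negative, and a negative operand would otherwise yield a negative partition index in Java.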
Detail the Shuffle and Sort phase in MapReduce.
The Shuffle and Sort phase is the bridge between the Map and Reduce phases. It is handled entirely by the MapReduce framework.
Shuffle:
- It is the process by which intermediate data from Mappers is transferred to the Reducers.
- Reducers pull the data via HTTP from the nodes where Mappers have written their output.
Sort / Merge:
- Before the Reducer can process the data, all intermediate key-value pairs must be sorted by key.
- Since a Reducer gathers data from multiple Mappers, the incoming data streams are merged and sorted.
- This grouping ensures that the Reducer receives data in the format: (Key, List[Values]).
- This phase is often the most resource-intensive part of the job due to disk I/O and network bandwidth consumption.
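The merge-and-group step on the reducer side can be sketched in Python. This is a simplified in-memory model: it assumes each mapper's output for this reducer is already sorted by key, and it ignores the network transfer and disk spills that dominate the real phase. The name merge_and_group is illustrative.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each mapper's output for this reducer's partition, already sorted by key.
mapper_outputs = [
    [("hadoop", 1), ("hello", 1)],
    [("hello", 1), ("world", 1)],
]

def merge_and_group(sorted_runs):
    """Merge sorted runs, then group consecutive equal keys into (key, [values])."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]

reducer_input = merge_and_group(mapper_outputs)
print(reducer_input)  # [('hadoop', [1]), ('hello', [1, 1]), ('world', [1])]
```

A streaming merge is used instead of a full re-sort because the inputs are already sorted, which is the same reason the framework merges map outputs rather than sorting them again from scratch.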
What were the major limitations of MapReduce v1 (MRv1) that necessitated the creation of YARN?
The major limitations of MRv1 were:
- Scalability Bottleneck: The JobTracker managed both cluster resources and job execution. This dual responsibility caused it to choke when scaling beyond roughly 4,000 nodes or 40,000 tasks.
- Single Point of Failure (SPOF): There was only one JobTracker. If it failed, all running and queued jobs failed, and the cluster was essentially unusable.
- Static Resource Allocation: Resources were strictly divided into "Map slots" and "Reduce slots." A node might have free Reduce slots but couldn't use them to run pending Map tasks, leading to poor resource utilization.
- Non-MapReduce Applications: MRv1 could only run MapReduce jobs. It could not efficiently support other processing models like graph processing, real-time streaming, or interactive queries.
Describe the architecture of YARN (Yet Another Resource Negotiator).
YARN (MapReduce v2) decoupled the resource management and job scheduling/monitoring functions into separate daemons.
Key Components:
- ResourceManager (RM): The global master daemon responsible for tracking resources and allocating them among various applications. It contains two main components: the Scheduler and the ApplicationsManager.
- NodeManager (NM): The per-machine slave daemon responsible for launching and managing Containers (logical bundles of resources like CPU, Memory, Disk). It monitors resource usage and reports to the RM.
- ApplicationMaster (AM): A per-application daemon that negotiates resources from the RM and works with the NM(s) to execute and monitor the application's tasks.
- Container: The unit of resource allocation in YARN, representing a fraction of a Node's capacity.
Explain the role of the ResourceManager in YARN architecture.
The ResourceManager (RM) is the central authority in YARN that manages resources across the cluster.
Main Responsibilities:
- Global Allocation: It has the ultimate authority to allocate resources to various competing applications.
- Pluggable Scheduler: The RM includes a Scheduler (e.g., Capacity, Fair) that allocates resources based on constraints like queue capacities and limits.
- ApplicationsManager (ASM): This component of the RM is responsible for accepting job submissions, negotiating the first container for the ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
- Node Management: It keeps track of live nodes via heartbeat signals received from NodeManagers.
Explain the role of the NodeManager in YARN.
The NodeManager (NM) is the worker daemon in YARN, running on every node in the cluster.
Main Responsibilities:
- Container Management: It receives commands from the ApplicationMaster to start, monitor, and kill containers.
- Resource Monitoring: It tracks the resource usage (CPU, memory, disk, network) of the containers and ensures they do not exceed their allocated limits.
- Heartbeats: It periodically sends heartbeats to the ResourceManager to report its health and the status of the containers running on it.
- Log Management: It handles log aggregation, moving container logs to a distributed file system (like HDFS) after an application finishes.
What is an ApplicationMaster in YARN? Describe its responsibilities.
The ApplicationMaster (AM) is a framework-specific entity in YARN created for the lifetime of a single application (or job).
Responsibilities:
- Resource Negotiation: It calculates the resource requirements for the application and requests these resources (Containers) from the ResourceManager.
- Task Execution: Once containers are allocated, the AM contacts the respective NodeManagers to launch the tasks inside those containers.
- Monitoring and Fault Tolerance: It tracks the status and progress of its tasks. If a task fails, the AM requests a new container from the RM to restart the task.
- Lifecycle Management: It manages the application's lifecycle, from start to completion, and finally unregisters itself with the RM upon completion.
Describe the complete application execution workflow in YARN.
The execution workflow of an application in YARN involves several steps:
- Submission: The client submits an application (e.g., a MapReduce job) to the ResourceManager (RM).
- AM Allocation: The RM allocates a Container on a NodeManager (NM) to run the ApplicationMaster (AM).
- AM Launch: The NM launches the AM container.
- Resource Negotiation: The AM registers with the RM and requests additional Containers to run the actual application tasks (e.g., Mappers and Reducers).
- Container Launch: The RM grants containers to the AM. The AM then communicates with the respective NMs to start these containers.
- Execution: The application code executes inside the containers. The containers report their status back to the AM.
- Completion: Once all tasks are complete, the AM deregisters from the RM and shuts down. The RM reclaims all used containers.
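The sequence of steps above can be mimicked with a toy model in Python. Everything here is a hypothetical simplification: the classes, method names, and log messages are invented for illustration and do not correspond to YARN's actual client or protocol APIs.

```python
class ResourceManager:
    """Toy RM that only counts free containers and records a trace."""
    def __init__(self, free_containers):
        self.free = free_containers
        self.log = []

    def submit(self, app_id):
        self.log.append(f"RM: accepted {app_id}")
        self.allocate(1)  # the first container hosts the ApplicationMaster
        self.log.append(f"NM: launched AM container for {app_id}")
        return ApplicationMaster(app_id, self)

    def allocate(self, count):
        granted = min(count, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count

class ApplicationMaster:
    """Toy AM that registers, requests task containers, then unregisters."""
    def __init__(self, app_id, rm):
        self.app_id, self.rm = app_id, rm

    def run(self, num_tasks):
        self.rm.log.append(f"AM: registered {self.app_id}")
        granted = self.rm.allocate(num_tasks)
        self.rm.log.append(f"RM: granted {granted} task containers")
        self.rm.log.append(f"NM: ran {granted} tasks")
        self.rm.release(granted + 1)  # task containers plus the AM container
        self.rm.log.append(f"AM: unregistered {self.app_id}")
        return granted

rm = ResourceManager(free_containers=5)
am = rm.submit("app_1")
tasks_run = am.run(num_tasks=3)
print("\n".join(rm.log))
```

The model keeps the ordering of the workflow (submission, AM launch, registration, container grants, execution, cleanup) but omits heartbeats, queues, and failures.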
Differentiate between Classic MapReduce (MRv1) and YARN (MRv2).
Here are the key differences between MRv1 and MRv2 (YARN):
- Resource Management: In MRv1, JobTracker managed resources. In MRv2, ResourceManager handles it globally.
- Job Scheduling: In MRv1, JobTracker scheduled jobs. In MRv2, ApplicationMaster (per job) manages task scheduling and monitoring.
- Scalability: MRv1 scales up to ~4,000 nodes. YARN can scale beyond 10,000 nodes due to decoupled responsibilities.
- Resource Slots: MRv1 used static Map and Reduce slots. YARN uses dynamic, fine-grained Containers (memory/CPU bundles).
- Single Point of Failure (SPOF): MRv1's JobTracker was a SPOF. YARN supports ResourceManager High Availability (HA).
- Multi-tenancy: MRv1 only ran MapReduce. YARN is a general-purpose cluster manager that runs MapReduce, Spark, Flink, Storm, etc.
How is Fault Tolerance achieved in MapReduce?
MapReduce is designed to handle failures gracefully without aborting the entire job.
- Task Failure: If a Map or Reduce task crashes or hangs, the ApplicationMaster (in YARN) or JobTracker (in MRv1) notices the absence of heartbeat/progress. It marks the task as failed and reschedules it on a different node.
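- Node Failure: If an entire worker node (a TaskTracker in MRv1 or a NodeManager in YARN) fails, the master daemon detects the missing heartbeats. All tasks running on that node are rescheduled on healthy nodes. Completed map tasks from that node are also re-run if reducers still need their output, because map output is stored on the failed node's local disk.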
- Data Locality Fallback: If a node storing the input data fails, MapReduce leverages HDFS replication. It schedules the task on another node that contains a replica of the data.
- Task Attempts: A task is retried a configurable number of times (4 by default, set through mapreduce.map.maxattempts and mapreduce.reduce.maxattempts). Only if it keeps failing up to that limit does the entire job fail.
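The retry behaviour described above can be sketched as a simple loop. This is an illustrative model only: run_with_retries, flaky_task, and the node names are hypothetical, and real rescheduling decisions are made by the ApplicationMaster or JobTracker rather than by a loop like this.

```python
MAX_ATTEMPTS = 4  # matches the default attempt limit mentioned above

def run_with_retries(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Try the task on successive nodes; raise once the attempt limit is hit."""
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return task(node), attempt + 1
        except RuntimeError as err:
            failures.append((node, str(err)))
    raise RuntimeError(f"job failed after {max_attempts} attempts: {failures}")

def flaky_task(node):
    """Hypothetical task that only succeeds on a healthy node."""
    if node in {"node1", "node2"}:
        raise RuntimeError(f"{node} crashed")
    return f"output from {node}"

result, attempts = run_with_retries(flaky_task, ["node1", "node2", "node3"])
print(result, "after", attempts, "attempts")
```

With only unhealthy nodes available, the same function raises after the fourth attempt, which corresponds to the whole job being marked as failed.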
Explain the concept of Speculative Execution in MapReduce.
Speculative Execution is an optimization technique used in MapReduce to handle "straggler" tasks.
Concept:
- In a large cluster, some tasks may run significantly slower than others due to hardware degradation, network congestion, or CPU load. These are called stragglers.
- Instead of waiting indefinitely for the slow task to finish (which bottlenecks the whole job), the framework launches a duplicate (speculative) copy of the slow task on a different, healthy node.
- Resolution: The output of whichever attempt (the original or the speculative copy) finishes first is accepted, and the other attempt is killed immediately.
- Pros & Cons: It reduces overall job execution time but consumes extra cluster resources.
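A rough sketch of the idea in Python follows. The straggler rule used here (progress more than a fixed threshold below the average) is a simplified assumption chosen for illustration and is not Hadoop's actual estimator; the function names and numbers are hypothetical.

```python
def find_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress is more than `threshold` below the average."""
    average = sum(progress.values()) / len(progress)
    return sorted(task for task, p in progress.items() if p < average - threshold)

def resolve(original_finish, speculative_finish):
    """Accept whichever attempt finishes first; the other one would be killed."""
    if speculative_finish < original_finish:
        return "speculative"
    return "original"

progress = {"map_0": 0.9, "map_1": 0.95, "map_2": 0.3}  # fraction complete
stragglers = find_stragglers(progress)
print("stragglers:", stragglers)

# Suppose the original attempt would finish at t=120s and the copy at t=80s.
winner = resolve(original_finish=120, speculative_finish=80)
print("accepted attempt:", winner)
```

The extra cost is visible in the model: for every straggler, two attempts consume resources until one of them finishes.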
Describe the three main types of Schedulers available in YARN.
YARN provides three primary schedulers to manage resource allocation:
1. FIFO Scheduler:
- Jobs are placed in a simple First-In-First-Out queue.
- Pros: Very simple to understand and configure.
- Cons: Not suitable for shared clusters. A massive job will block all subsequent small jobs until it finishes.
2. Capacity Scheduler:
- Divides cluster resources into hierarchical queues, each with a guaranteed capacity (e.g., Marketing gets 30%, Engineering gets 70%).
- Pros: Ensures predictable resource sharing among organizations while allowing queues to use unallocated resources temporarily.
3. Fair Scheduler:
- Dynamically distributes resources so that all running applications get an equal (fair) share of resources over time.
- Pros: Excellent for ad-hoc queries; small jobs finish quickly even if a large job is already running, without needing strict queue definitions.
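The difference between FIFO and fair sharing can be made concrete with a small simulation. This Python sketch assumes all jobs are submitted at time 0, work is measured in abstract units, and the cluster processes a fixed number of units per second; it is a toy model, not the logic of YARN's scheduler implementations.

```python
def fifo_finish_times(jobs, capacity):
    """Run jobs one at a time in submission order, each using the full cluster."""
    clock, finish = 0.0, {}
    for name, work in jobs:
        clock += work / capacity
        finish[name] = clock
    return finish

def fair_finish_times(jobs, capacity):
    """Split capacity equally among unfinished jobs (all submitted at time 0)."""
    remaining = dict(jobs)
    clock, finish = 0.0, {}
    while remaining:
        share = capacity / len(remaining)
        # Advance to the moment the smallest remaining job completes.
        name, work = min(remaining.items(), key=lambda item: item[1])
        step = work / share
        clock += step
        for other in remaining:
            remaining[other] -= share * step
        del remaining[name]
        finish[name] = clock
    return finish

jobs = [("big", 100.0), ("small", 10.0)]  # work units, in submission order
fifo = fifo_finish_times(jobs, capacity=10.0)
fair = fair_finish_times(jobs, capacity=10.0)
print("FIFO:", fifo)  # FIFO: {'big': 10.0, 'small': 11.0}
print("Fair:", fair)  # Fair: {'small': 2.0, 'big': 11.0}
```

Under FIFO the small job waits behind the large one and finishes at 11 seconds; under fair sharing it finishes at 2 seconds, while the large job is delayed by only one second.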
What are InputFormat and RecordReader in the MapReduce framework?
InputFormat and RecordReader are responsible for reading input data and feeding it to the Mappers.
InputFormat:
- Defines how input files are split and read.
- It validates the input paths and calculates the InputSplits. Each InputSplit represents a chunk of data to be processed by a single Map task.
- Examples: TextInputFormat (default), KeyValueTextInputFormat.
RecordReader:
- It converts the byte-oriented view of the InputSplit into actual Key-Value pairs that the Mapper can understand.
- For example, the LineRecordReader (used by TextInputFormat) reads data line by line, passing the byte offset as the key and the line string as the value.
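The (byte offset, line) records described above can be imitated in Python. This sketch assumes the whole split is available as a bytes object and is UTF-8 encoded; unlike the real LineRecordReader, it does not handle lines that straddle split boundaries. The function name line_record_reader is illustrative.

```python
def line_record_reader(data):
    """Yield (byte offset, line text) pairs from a bytes object, one per line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)  # the next key is the offset of the next line

split = b"hello world\nhello hadoop\n"
records = list(line_record_reader(split))
for key, value in records:
    print(key, value)
```

The second record has key 12 because the first line occupies 11 bytes plus one newline byte.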
Explain the concept of Containers in YARN.
A Container is the fundamental unit of resource allocation in YARN.
Key Features:
- Resource Encapsulation: It represents a multi-dimensional bundle of resources on a specific node, typically comprising RAM (Memory) and vCores (CPU).
- Isolation: When an application requests resources, the ResourceManager grants a Container. The NodeManager allocates these resources and isolates the execution environment to ensure it does not exceed its quota.
- Execution: Both the ApplicationMaster and the actual application tasks (like Map/Reduce tasks, Spark Executors) run inside these Containers.
- Dynamic: Unlike the rigid "slots" in MRv1, Containers are dynamically sized based on the application's request.
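How dynamically sized containers are carved out of a node's capacity can be sketched as follows. The Node class and its allocate method are hypothetical and model only the two resource dimensions named above (memory and vCores); real allocation decisions are made by the ResourceManager's scheduler.

```python
from dataclasses import dataclass

@dataclass
class Node:
    memory_mb: int  # memory still available on the node
    vcores: int     # virtual cores still available on the node

    def allocate(self, memory_mb, vcores):
        """Grant a container only if both dimensions fit in what is left."""
        if memory_mb > self.memory_mb or vcores > self.vcores:
            return False
        self.memory_mb -= memory_mb
        self.vcores -= vcores
        return True

node = Node(memory_mb=8192, vcores=4)
requests = [(4096, 2), (2048, 1), (4096, 1)]  # (memory MB, vCores) per container
granted = [node.allocate(mem, cores) for mem, cores in requests]
print(granted, node)
```

The third request is refused because only 2048 MB remain, even though a vCore is still free. With MRv1 slots, by contrast, the sizes would have been fixed in advance regardless of what each task needed.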
How does High Availability (HA) work for the YARN ResourceManager?
YARN ResourceManager (RM) High Availability (HA) solves the Single Point of Failure problem in modern Hadoop clusters.
Mechanism:
- Active/Standby Architecture: The cluster is configured with two or more RMs. One is Active, and the others are in Standby mode.
- State Store: The Active RM continuously writes its state (metadata, running applications, allocated containers) to a shared storage system, typically Apache ZooKeeper (ZKRMStateStore).
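- Automatic Failover: The RMs take part in a ZooKeeper-based leader election. If the Active RM crashes or becomes unresponsive, its ZooKeeper session expires and one of the Standby RMs is elected as the new Active, with no manual intervention.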
- Recovery: The Standby RM promotes itself to Active, reads the state from ZooKeeper, and resumes the applications seamlessly without the users having to resubmit their jobs.
Explain the significance of Key-Value pairs in MapReduce and provide examples of common Writable data types.
In MapReduce, the entire data flow is structured around Key-Value (KV) pairs. It is the sole data structure used for inputs, intermediate data, and outputs.
Significance:
- Framework Logic: The framework inherently depends on keys for sorting, shuffling, and partitioning data across the cluster.
- Serialization: Data must be sent over the network. Therefore, KV types must implement the Writable interface to allow efficient binary serialization.
- Comparison: Keys must implement WritableComparable so they can be sorted during the shuffle phase.
Common Hadoop Data Types:
- IntWritable: Equivalent to Java int.
- LongWritable: Equivalent to Java long (often used as the default key for line offsets).
- Text: Equivalent to Java String (used for words or line content).
- FloatWritable / DoubleWritable: For decimal values.
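To give a feel for what compact binary serialization means, the following Python sketch packs integers in the fixed-width big-endian layout that Java's DataOutput.writeInt and writeLong produce, which is the layout IntWritable and LongWritable rely on. The helper names are hypothetical, and Text (a length prefix followed by UTF-8 bytes) is deliberately left out.

```python
import struct

def write_int(value):
    """4-byte big-endian signed int, the layout of DataOutput.writeInt."""
    return struct.pack(">i", value)

def write_long(value):
    """8-byte big-endian signed long, the layout of DataOutput.writeLong."""
    return struct.pack(">q", value)

def read_int(data):
    """Inverse of write_int: decode 4 big-endian bytes back into an int."""
    return struct.unpack(">i", data)[0]

encoded = write_int(1)
print(encoded.hex())            # 00000001
print(len(write_long(12)))      # 8
print(read_int(write_int(-5)))  # -5
```

A fixed-width binary form like this is far smaller than a general-purpose object serialization of the same value, which matters when billions of intermediate pairs cross the network during the shuffle.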