1 $What is MapReduce primarily used for in the Hadoop ecosystem?$

Introduction to MapReduce Easy

A.

Network routing

B.

Relational database management

C.

Distributed data storage

D.

Distributed data processing

2 $In the MapReduce framework, what is the standard output format of the Map phase?$

Map Phase Easy

A.

XML files

B.

Relational tables

C.

Plain text sentences

D.

Key-Value pairs

3 $What is the primary function of the Reducer in a MapReduce job?$

Reduce Phase Easy

A.

Aggregating and summarizing data

B.

Storing data on disks

C.

Managing cluster resources

D.

Splitting data into smaller blocks

4 $What does YARN stand for in Big Data processing?$

Introduction to YARN Easy

A.

Yet Another Relational Node

B.

Yet Another Resource Negotiator

C.

Yield And Resource Network

D.

Your Application Resource Network

5 $YARN was introduced in which major version of Hadoop to resolve the scalability bottlenecks of the classic MapReduce architecture?$

Introduction to YARN Easy

A.

Hadoop 4.x

B.

Hadoop 2.x

C.

Hadoop 1.x

D.

Hadoop 0.5

6 $Which component in YARN is considered the master daemon responsible for globally managing and allocating resources across the entire cluster?$

YARN Architecture Easy

A.

JobTracker

B.

ResourceManager

C.

NodeManager

D.

ApplicationMaster

7 $In YARN, which component runs on each individual worker node to monitor resource usage (CPU, memory) and report it to the master?$

YARN Architecture Easy

A.

NameNode

B.

ApplicationMaster

C.

NodeManager

D.

ResourceManager

8 $In YARN, what is the role of the ApplicationMaster?$

YARN Architecture Easy

A.

Replacing the DataNode

B.

Negotiating resources and tracking application status

C.

Managing the entire cluster's memory

D.

Storing the metadata of HDFS

9 $What term is used in YARN to represent a fraction of cluster resources (like memory, CPU) allocated to execute a specific task?$

YARN Architecture Easy

A.

Block

B.

Container

C.

Pod

D.

Split

10 $Which crucial phase occurs between the Map phase and the Reduce phase to group identical keys together?$

MapReduce Execution Flow Easy

A.

Split and Read

B.

Combine and Merge

C.

Write and Replicate

D.

Shuffle and Sort
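
The flow this question refers to can be sketched in a few lines of plain Python. This is a toy, single-process simulation and not Hadoop code; the function names `map_phase`, `shuffle_and_sort`, and `reduce_phase` are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    # Emit one (word, 1) pair per word, as a word-count Mapper would.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Group all values that share a key, then order the groups by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Aggregate the grouped values for one key.
    return key, sum(values)

lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
result = [reduce_phase(key, values) for key, values in grouped]
print(result)
```

Each reduce call sees one key together with every value emitted for it, which is only possible because the grouping step ran first.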

11 $Which optional component is often called a "mini-reducer" because it performs local aggregation on the Mapper output to reduce network bandwidth?$

MapReduce Execution Flow Easy

A.

Shuffler

B.

DataNode

C.

Partitioner

D.

Combiner

12 $What determines which specific Reducer instance will process a given key-value pair in a MapReduce job?$

MapReduce Execution Flow Easy

A.

JobTracker

B.

Combiner

C.

NameNode

D.

Partitioner

13 $In the older Hadoop 1.x architecture, which central daemon was strictly responsible for both resource management and job scheduling?$

MapReduce Architecture Easy

A.

NodeManager

B.

TaskTracker

C.

JobTracker

D.

ApplicationMaster

14 $In Hadoop 1.x, which worker daemon accepted tasks from the master daemon and executed the Map and Reduce operations?$

MapReduce Architecture Easy

A.

DataNode

B.

TaskTracker

C.

NodeManager

D.

ResourceManager

15 $What is the logical representation of a chunk of input data that is processed by a single Mapper instance?$

MapReduce Execution Flow Easy

A.

Input Split

B.

HDFS File

C.

Output Split

D.

Data Block

16 $Which of the following statements about MapReduce job execution is true?$

MapReduce Execution Flow Easy

A.

Mappers send data directly to HDFS without a Reducer.

B.

The Reduce phase starts simultaneously with the Map phase.

C.

MapReduce jobs do not require a Map phase.

D.

The Reduce phase cannot start processing data until all Map tasks have completed.

17 $What is a major advantage of YARN over the classic Hadoop 1.x MapReduce architecture?$

Introduction to YARN Easy

A.

It supports diverse processing engines (like Spark) alongside MapReduce.

B.

It removes the need for DataNodes entirely.

C.

It increases the physical block size of HDFS.

D.

It requires all programming to be done in Python.

18 $In Hadoop MapReduce, which standard Java interface must keys and values implement so they can be serialized and sent over the network?$

MapReduce Execution Flow Easy

A.

Externalizable

B.

Writable

C.

Cloneable

D.

Serializable

19 $In a standard Hadoop YARN cluster, where does the NodeManager typically run?$

YARN Architecture Easy

A.

Outside the Hadoop cluster

B.

Exclusively on the client machine

C.

On the same machines as DataNodes (worker nodes)

D.

On the same machine as the NameNode

20 $How many Output files are typically created by a MapReduce job that is configured to run with Reducers?$

Map Phase Easy

A.

files

B.

files

C.

1 combined file

D.

Depends on the number of Mappers

21 $Which of the following scenarios best demonstrates the appropriate use of a Combiner in a MapReduce job?$

MapReduce Architecture Medium

A.

Calculating the exact average of a dataset across multiple nodes before the Reducer phase.

B.

Splitting a massive file into smaller chunks to increase the number of Map tasks.

C.

Performing a local aggregation of word counts on the Mapper node to reduce network bandwidth during the shuffle phase.

D.

Sorting the output of the Reducer phase before writing it to HDFS.
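
As a rough illustration of local aggregation, the sketch below counts how many records would cross the network with and without a combine step. It is plain Python with invented sample data, not the Hadoop Combiner API.

```python
from collections import Counter

# Hypothetical output of two map tasks for a word-count job.
mapper_outputs = [
    [("data", 1), ("data", 1), ("node", 1)],
    [("data", 1), ("node", 1), ("node", 1)],
]

def combine(pairs):
    # Sum the counts per word locally, on the node that produced them.
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return sorted(local.items())

combined = [combine(pairs) for pairs in mapper_outputs]
records_without_combiner = sum(len(pairs) for pairs in mapper_outputs)
records_with_combiner = sum(len(pairs) for pairs in combined)

# The reduce step still produces the same totals from the combined records.
final = Counter()
for pairs in combined:
    for key, value in pairs:
        final[key] += value

print(records_without_combiner, records_with_combiner, dict(final))
```

Summation tolerates this because partial sums can be added in any grouping and order without changing the total.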

22 $If a file is 300 MB and the HDFS block size is 128 MB, how many Map tasks will typically be launched by default, and why?$

Map Phase Medium

A.

3 Map tasks, because the data is divided into three InputSplits corresponding to the three HDFS blocks (128 MB, 128 MB, and 44 MB).

B.

1 Map task, because a single file always maps to a single Map task to maintain data locality.

C.

4 Map tasks, because the framework always allocates an extra Mapper for overhead processing.

D.

2 Map tasks, because the framework calculates 300 / 128 ≈ 2.34 and rounds down.
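
The arithmetic behind this question can be checked with a small sketch. It assumes one split per block and ignores refinements in the real input format (for example, tolerance for a slightly oversized last split); `split_sizes` is a hypothetical helper.

```python
def split_sizes(file_mb, split_mb):
    # Cut the file into full-size splits plus one smaller remainder, if any.
    full, rest = divmod(file_mb, split_mb)
    return [split_mb] * full + ([rest] if rest else [])

sizes = split_sizes(300, 128)
print(sizes, len(sizes))
```

The remainder still needs its own task, so the count is rounded up rather than down.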

23 $How does a Partitioner operate in a MapReduce job?$

Shuffle and Sort Medium

A.

It filters out null keys from the Mapper output before the shuffle phase begins.

B.

It determines which Reducer will receive a specific key-value pair based on the hash of the key.

C.

It determines which Mapper processes which HDFS block based on node locality.

D.

It sorts the values for a given key in descending order before they reach the Reducer.

24 $What is the purpose of Speculative Execution in Hadoop MapReduce?$

MapReduce Architecture Medium

A.

To execute tasks speculatively in memory without writing intermediate data to disk.

B.

To predict the output of a Map task before it finishes to speed up the Reducer.

C.

To identify tasks running significantly slower than average and launch duplicate backup tasks on other nodes.

D.

To execute Map tasks on data that has not yet been written to HDFS.

25 $What happens if a developer configures a MapReduce job with setNumReduceTasks(0)?$

Reduce Phase Medium

A.

The ResourceManager will automatically assign exactly one Reducer to prevent data loss.

B.

The job will fail because a MapReduce job requires at least one Reducer.

C.

The job becomes a 'Map-only' job, and the Mapper outputs are written directly to HDFS without sorting or shuffling.

D.

The framework will hold the Mapper output in memory until a Reducer becomes available.

26 $Which of the following best describes the structural difference between Hadoop 1.x (MRv1) and YARN regarding resource management?$

YARN Architecture Medium

A.

YARN eliminates the need for an ApplicationMaster by allowing the NodeManager to schedule jobs directly.

B.

YARN combines the JobTracker and TaskTracker into a single unified service.

C.

YARN separates the dual responsibilities of the JobTracker (resource management and job scheduling/monitoring) into the ResourceManager and ApplicationMaster.

D.

YARN moves resource management to HDFS DataNodes, bypassing the need for a central coordinator.

27 $What are the two main components of the YARN ResourceManager?$

ResourceManager Medium

A.

JobTracker and TaskTracker

B.

ApplicationMaster and NameNode

C.

Scheduler and ApplicationsManager

D.

NodeManager and Container

28 $In a YARN cluster, what happens if an ApplicationMaster fails during the execution of a job?$

ApplicationMaster Medium

A.

The entire cluster shuts down to prevent data corruption.

B.

The NodeManager takes over the role of the ApplicationMaster for that specific job.

C.

The ResourceManager immediately fails the application and alerts the client.

D.

The ResourceManager restarts the ApplicationMaster in a new container, and the job may recover depending on the framework.

29 $A YARN Container represents a collection of physical resources. If a task inside a container attempts to use more RAM than allocated, what is the default behavior of YARN?$

Containers Medium

A.

The task will pause until memory becomes available in the cluster.

B.

The NodeManager will kill the container for exceeding its physical memory limit.

C.

The NodeManager will seamlessly allocate more memory from the node's reserve pool.

D.

The ApplicationMaster will negotiate an expansion of the container's size with the ResourceManager.

30 $Which YARN scheduler allocates a fraction of cluster capacity to multiple organizations, allowing them to utilize unused cluster capacity when available, but restricts them to their guaranteed minimum when the cluster is busy?$

YARN Scheduling Medium

A.

Preemptive Scheduler

B.

Fair Scheduler

C.

FIFO Scheduler

D.

Capacity Scheduler

31 $During the shuffle phase, MapReduce must transfer intermediate data across the network. How is this data managed on the Mapper node before transfer?$

MapReduce Architecture Medium

A.

It is written to a circular memory buffer, spilled to local disk when the buffer reaches a threshold, and then merged.

B.

It is written directly to an HDFS block so it can be replicated for fault tolerance.

C.

It is kept entirely in RAM until the Reducer requests it via an RPC call.

D.

It is immediately streamed to the Reducer task without being stored on the Mapper node.

32 $What is the primary role of the Heartbeat mechanism between the NodeManager and the ResourceManager in YARN?$

NodeManager Medium

A.

To update HDFS block locations to the NameNode.

B.

To inform the ResourceManager of the NodeManager's health and available resources, and to receive container execution commands.

C.

To transfer MapReduce job output data from the worker node to the master node.

D.

To negotiate resource limits directly with the ApplicationMaster.

33 $In the context of the Fair Scheduler in YARN, what does 'preemption' allow the scheduler to do?$

YARN Scheduling Medium

A.

Bypass the ApplicationMaster and schedule tasks directly on NodeManagers.

B.

Allocate memory dynamically beyond the node's physical limits.

C.

Predict which jobs will run longest and place them at the end of the queue.

D.

Kill containers from a queue that is over its fair share to free up resources for a starved queue.

34 $In a standard MapReduce job, what is the input format received by the reduce() function?$

Reduce Phase Medium

A.

A single key and an Iterable collection of values, e.g., (Key, Iterable<Value>).

B.

An array of key-value pairs representing the entire dataset.

C.

A single key and a single value, e.g., (Key, Value).

D.

A list of keys and a single aggregated value, e.g., (List<Key>, Value).

35 $Why is 'Data Locality' a critical optimization in the MapReduce framework?$

MapReduce Architecture Medium

A.

It ensures that intermediate shuffle data is encrypted locally before network transfer.

B.

It forces all data to be stored on local node disks rather than in HDFS.

C.

It guarantees that Reducers are always placed on the same rack as the client submitting the job.

D.

It schedules Map tasks on the exact same nodes where the required HDFS blocks reside, minimizing network congestion.

36 $To achieve High Availability (HA) for the YARN ResourceManager, an Active/Standby architecture is used. What component typically manages the state and leader election to handle automatic failover?$

ResourceManager Medium

A.

Apache Zookeeper

B.

Secondary NameNode

C.

JobHistoryServer

D.

HDFS JournalNodes

37 $Which interface allows a MapReduce application to broadcast read-only files (like lookup tables or dictionaries) to all worker nodes before tasks execute?$

MapReduce Architecture Medium

A.

Combiner

B.

InputFormat

C.

Partitioner

D.

DistributedCache

38 $When an ApplicationMaster requires resources to run tasks, how does it specify its request to the ResourceManager?$

ApplicationMaster Medium

A.

By requesting specific HDFS block locations directly from the NameNode.

B.

By sending a ResourceRequest containing memory, CPU requirements, preferred nodes/racks, and priority.

C.

By commanding the NodeManager to allocate a certain percentage of its local disk.

D.

By editing the yarn-site.xml file dynamically during runtime.

39 $What is the defining characteristic of a YARN Application?$

YARN Architecture Medium

A.

It must always consist of exactly one Map phase and one Reduce phase.

B.

It is the global queue managed by the Capacity Scheduler.

C.

It is a single job or a DAG of jobs coordinated by a single ApplicationMaster.

D.

It refers strictly to a daemon process running permanently on a NodeManager.

40 $If two keys, K1 and K2, yield the exact same hash code from the Partitioner, what is the consequence in the MapReduce pipeline?$

Shuffle and Sort Medium

A.

K1 and K2 will be merged into a single key by the Combiner.

B.

The Map task will fail with a HashCollisionException.

C.

The Partitioner will automatically assign one of the two keys to a random Reducer to balance the load.

D.

K1 and K2 will be sent to the same Reducer task, where they will be sorted and grouped separately.
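
To see what a hash collision between two distinct keys means in practice, here is a toy sketch in plain Python. `toy_hash` is a deliberately weak, made-up hash so that the keys "ab" and "ba" collide; it is not how Hadoop hashes keys.

```python
from collections import defaultdict

def toy_hash(key):
    # Hypothetical hash for illustration only: anagrams collide on purpose.
    return sum(ord(ch) for ch in key)

def partition(key, num_reducers):
    return toy_hash(key) % num_reducers

pairs = [("ab", 1), ("ba", 5), ("ab", 2), ("abc", 7)]
num_reducers = 4

# Route every pair to a reducer by the hash of its key.
by_reducer = defaultdict(list)
for key, value in pairs:
    by_reducer[partition(key, num_reducers)].append((key, value))

def reduce_input(reducer_pairs):
    # Inside one reducer, pairs are still grouped by the actual key.
    groups = defaultdict(list)
    for key, value in reducer_pairs:
        groups[key].append(value)
    return sorted(groups.items())

reducer_3 = reduce_input(by_reducer[3])
print(reducer_3)
```

The collision only decides the destination; grouping compares the keys themselves, so the two keys stay in separate groups.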

41 $In a MapReduce job, what happens if an InputSplit boundary occurs in the middle of a logical record (e.g., a line in a text file)?$

MapReduce Execution Framework Hard

A.

The Map task processes the partial record and raises an exception for the next Map task.

B.

The JobTracker/ResourceManager automatically realigns the block boundaries in HDFS before launching the Map tasks.

C.

The Map task reads past the end of its InputSplit block into the next block to finish the record, while the adjacent Map task skips the first partial record.

D.

The Map task skips the partial record entirely, resulting in data loss unless custom error handling is implemented.
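
The boundary handling can be sketched with a simplified line reader. The code below is an assumption-laden model in plain Python, loosely following the idea used by line-oriented record readers: a reader that does not start at offset 0 discards everything up to the first newline, and every reader keeps reading while a line starts at or before its split end. `read_split` and `read_all` are hypothetical names.

```python
def read_split(data, start, length):
    end = start + length
    pos = start
    if start != 0:
        # Skip the (possibly partial) first line; the previous reader owns it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # Keep reading while a line begins at or before the split end,
    # even if that line finishes inside the next split.
    while pos < len(data) and pos <= end:
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records

def read_all(data, split_size):
    # Read every fixed-size split of the data independently.
    return [
        read_split(data, start, min(split_size, len(data) - start))
        for start in range(0, len(data), split_size)
    ]

data = b"alpha\nbravo\ncharlie\ndelta\n"
per_split = read_all(data, 10)
print(per_split)
```

With 10-byte splits, "bravo" and "charlie" both straddle a boundary, yet each line is returned exactly once and no split returns a fragment.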

42 $A developer writes a MapReduce job to calculate the global average of values associated with a key. They use the same Reducer implementation as the Combiner to optimize network traffic. Which of the following describes the outcome of this decision?$

Combiners and Partitioners Hard

A.

The job will produce correct results only if all map tasks emit exactly the same number of records per key.

B.

The framework will throw an execution error because a Reducer cannot logically be used as a Combiner in MRv2.

C.

The job will execute faster and produce the correct global average because averages are commutative.

D.

The job will execute faster but produce incorrect results because calculating a mean is not an associative and commutative operation.
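
A small numeric check makes the pitfall concrete. The values below are invented; the sketch compares the true mean, a mean of per-mapper means (what reusing an averaging Reducer as a Combiner would effectively compute), and the usual workaround of carrying (sum, count) pairs.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical values for one key, produced by two different map tasks.
map_outputs = [[10, 20, 30], [40]]

true_mean = mean([v for part in map_outputs for v in part])

# Averaging locally and then averaging the averages weights each
# map task equally, regardless of how many records it saw.
mean_of_means = mean([mean(part) for part in map_outputs])

# Workaround: combine into (sum, count) pairs and divide only at the end.
partials = [(sum(part), len(part)) for part in map_outputs]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
combined_mean = total / count

print(true_mean, mean_of_means, combined_mean)
```

The two results agree only when every map task contributes the same number of records for the key.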

43 $During the shuffle phase, Reducers must fetch Map outputs from various NodeManagers. How does a Reducer task efficiently determine the locations of the completed Map outputs?$

Shuffle and Sort Phase Hard

A.

The NodeManagers actively push the partitioned data to the Reducer containers as soon as the Map tasks complete.

B.

It periodically queries the ApplicationMaster, which receives task completion reports and physical locations from the completed Map tasks.

C.

It queries the HDFS NameNode, which tracks the temporary spilled map outputs.

D.

It broadcasts a request to all NodeManagers in the cluster asking for map outputs associated with its partition.

44 $Under the YARN Fair Scheduler using Dominant Resource Fairness (DRF), consider a cluster with 100 CPUs and 1000 GB of RAM. App A requests containers with 2 CPUs and 10 GB of RAM, while App B requests containers with 1 CPU and 20 GB of RAM. How are the dominant shares calculated?$

YARN Scheduling Hard

A.

App A's dominant resource is CPU (2%); App B's dominant resource is Memory (2%).

B.

App A's dominant resource is CPU (2%); App B's dominant resource is CPU (1%) due to normalization.

C.

Both applications have CPU as their dominant resource because CPU scheduling inherently supersedes Memory scheduling in YARN.

D.

App A's dominant resource is Memory (1%); App B's dominant resource is CPU (1%).
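
The shares in this question can be worked out mechanically: divide each requested resource by the cluster total and take the largest fraction. The sketch below uses the numbers from the question; `dominant_share` and the dictionary keys are illustrative names, not a YARN API.

```python
def dominant_share(demand, capacity):
    # Fraction of the cluster each requested resource represents.
    shares = {name: demand[name] / capacity[name] for name in demand}
    # The dominant resource is the one with the largest fraction.
    name = max(shares, key=shares.get)
    return name, shares[name]

capacity = {"cpu": 100, "memory_gb": 1000}
app_a = {"cpu": 2, "memory_gb": 10}
app_b = {"cpu": 1, "memory_gb": 20}

share_a = dominant_share(app_a, capacity)
share_b = dominant_share(app_b, capacity)
print(share_a, share_b)
```

Per container, App A takes 2% of the CPUs and 1% of the memory, while App B takes 1% of the CPUs and 2% of the memory.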

45 $A MapReduce job performs a complex data transformation and inserts the output records directly into an external non-idempotent relational database from within the Map tasks. Speculative execution is enabled by default. What critical issue will arise in this scenario?$

Fault Tolerance and Speculative Execution Hard

A.

The MapReduce job will fail because YARN cannot serialize database connection objects across the cluster.

B.

Data duplication will occur because speculative tasks will insert duplicate records before the framework kills the slower task.

C.

The ApplicationMaster will deadlock because it cannot lock the external database rows.

D.

The external database will reject the connections due to Kerberos ticket mismatches generated by speculative containers.

46 $In a High Availability (HA) YARN cluster, a 'split-brain' scenario occurs where two ResourceManagers (RM1 and RM2) both believe they are active. How does YARN's architectural design prevent cluster corruption in this specific scenario?$

YARN Architecture Components Hard

A.

The NodeManagers utilize a Paxos protocol to vote on which RM to send heartbeats to, ostracizing the minority RM.

B.

The ActiveStandbyElector uses ZooKeeper to maintain an active lock, and YARN implements fencing where the active RM's epoch number is validated by the ZooKeeper-based state store before any state changes are committed.

C.

The Timeline Server acts as an arbiter and forcefully terminates the JVM of the RM with the oldest startup timestamp.

D.

The ApplicationMasters implement exponential backoff and will fail over to a pre-configured third Resource Manager (Witness RM).

47 $To achieve a total global ordering of output data in MapReduce, a developer decides to use the TotalOrderPartitioner. Which of the following prerequisites is strictly necessary for TotalOrderPartitioner to function efficiently without causing extreme data skew?$

Combiners and Partitioners Hard

A.

The number of Reducers must be set strictly equal to the number of Map tasks.

B.

The input dataset must be pre-sorted in HDFS before the Map phase begins.

C.

All keys must be mapped to exactly the same data type size (e.g., exactly 64-bit integers).

D.

A sampling phase must be executed prior to the job to determine partition boundaries, creating a partition file loaded into the Distributed Cache.
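
Range partitioning from sampled boundaries can be sketched as follows. This is a simplified model in plain Python, assuming integer keys and a small hand-picked sample; the boundary selection and tie handling in the real partitioner may differ, and both function names are hypothetical.

```python
import bisect

def boundaries_from_sample(sample, num_reducers):
    # Pick num_reducers - 1 cut points at evenly spaced ranks of the sample.
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]

def partition(key, boundaries):
    # Keys below the first boundary go to reducer 0, and so on.
    return bisect.bisect_right(boundaries, key)

sample = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
boundaries = boundaries_from_sample(sample, 3)

assignments = {}
for key in range(10):
    assignments.setdefault(partition(key, boundaries), []).append(key)
print(boundaries, assignments)
```

Because every key in reducer i is smaller than every key in reducer i + 1, concatenating the sorted reducer outputs in order yields a globally sorted result, and boundaries drawn from a representative sample keep the ranges roughly balanced.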

48 $If the ApplicationMaster (AM) container fails in YARN, what is the exact sequence of recovery initiated by the framework?$

YARN Architecture Components Hard

A.

The job fails immediately, and the client application must resubmit the entire MapReduce job from scratch.

B.

The ResourceManager launches a new AM; the new AM can recover the state of already completed tasks if application state recovery is enabled, avoiding full task re-execution.

C.

The ResourceManager allocates a new container for the AM, which must then request all previously completed map outputs again, as all intermediate data is purged.

D.

The NodeManager restarts the AM on the same node, maintaining all active task containers without interruption.

49 $During the Map phase, output records are buffered in memory before being spilled to disk. The spill threshold is defined by mapreduce.map.sort.spill.percent. What happens when the buffer usage reaches this threshold?$

Shuffle and Sort Phase Hard

A.

The framework immediately preempts the map task, sending the partially completed buffer over the network directly to the Reducer.

B.

The Map task pauses all record processing until the background thread completes spilling the buffer to HDFS.

C.

A background thread begins sorting and spilling the contents to the local disk, while the map task continues writing to the remaining space in the buffer.

D.

The NodeManager dynamically allocates more heap memory to the map container to prevent disk I/O bottlenecks.

50 $A MapReduce developer is implementing a secondary sort to sort values arriving at the reducer. They configure a custom WritableComparable as the Map output key and write a custom Partitioner. What third component MUST be heavily customized to ensure the Reducer receives all values for a given logical key in a single reduce() call?$

MapReduce Execution Framework Hard

A.

A custom GroupingComparator must be configured to group the composite keys based solely on their logical key portion.

B.

The InputFormat class must be overridden to chunk data based on the logical key.

C.

The OutputCommitter must be configured to merge partial files based on the secondary sort keys.

D.

A custom Combiner must be provided to pre-sort the values in the NodeManager's RAM.
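
The secondary-sort pattern can be mimicked in plain Python: sort on a composite key, then group on only its logical part. The records and names below are invented for illustration and do not use Hadoop classes.

```python
from itertools import groupby

# Hypothetical (logical_key, value) records in arbitrary arrival order.
records = [("nyc", 3), ("sf", 9), ("nyc", 1), ("sf", 2), ("nyc", 2)]

# Sort comparator: order by the full composite key (logical key, then value).
composite = sorted(records)

# Grouping step: compare only the logical key, so one group per logical key.
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(composite, key=lambda record: record[0])
}
print(grouped)
```

Grouping on the full composite key instead would produce one group per (key, value) pair, so each logical key would be spread over several reduce calls.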

51 $In the YARN Capacity Scheduler, what is the effect of configuring yarn.scheduler.capacity.<queue-path>.maximum-capacity lower than 100% for a specific queue?$

YARN Scheduling Hard

A.

It permanently throttles the CPU clock speed of all containers running in that queue to the specified percentage.

B.

It forces the queue to strictly pre-empt containers from other queues to guarantee its minimum capacity.

C.

It restricts the queue from utilizing idle cluster resources beyond that percentage, preventing it from overtaking the entire cluster during elastic expansion.

D.

It dictates the maximum percentage of a single node's resources that a container in this queue can request.

52 $Consider a scenario where you have two large datasets, A and B. You want to perform a Map-side join. Which of the following conditions is strictly necessary to implement a standard Map-side join efficiently using the framework's CompositeInputFormat?$

MapReduce Execution Framework Hard

A.

The MapReduce job must be configured to run with zero reducers and the Distributed Cache disabled.

B.

Dataset B must be small enough to fit entirely into the RAM of a single NodeManager.

C.

Both datasets must be partitioned using the same logic, sorted by the join key, and have exactly the same number of partitions.

D.

Both datasets must be compressed using a splittable codec such as bzip2 or LZO.

53 $What happens in YARN if a NodeManager experiences a transient network partition and fails to send heartbeats to the ResourceManager for a duration exceeding the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms?$

YARN Architecture Components Hard

A.

The NodeManager automatically shuts down its local operating system to fence the node from the cluster.

B.

The ResourceManager immediately deletes all HDFS blocks residing on that node to prevent data corruption.

C.

The ApplicationMaster running on that node assumes ResourceManager duties until the network partition heals.

D.

The ResourceManager marks the node as DEAD, considers all containers on it as failed, and notifies the respective ApplicationMasters to re-schedule those tasks.

54 $Uber mode (or Uber task optimization) in MapReduce v2 (YARN) is designed to optimize execution for small jobs. How does it alter the standard execution model?$

MapReduce Execution Framework Hard

A.

It bypasses the ResourceManager completely and launches tasks directly using the HDFS DataNode daemon.

B.

It utilizes GPU acceleration on the NodeManagers to execute map and reduce tasks in parallel threads.

C.

It runs only Map tasks and forcefully streams their outputs back to the client submitting the job, skipping the Reduce phase.

D.

It executes all map and reduce tasks sequentially within the ApplicationMaster's JVM, avoiding the overhead of requesting and launching separate containers.

55 $In the context of MapReduce job commitment, what is the primary role of the OutputCommitter class's two-phase commit protocol?$

MapReduce Execution Framework Hard

A.

To commit the final output data to an external RDBMS without holding long-lived database locks.

B.

To synchronize the ZooKeeper transaction logs before signaling the JobTracker of success.

C.

To allow tasks to write output to a temporary location and gracefully promote it to the final destination only if the task, and subsequently the job, successfully completes.

D.

To securely sign the output data blocks in HDFS with Kerberos tokens.

56 $During the Reduce phase, the execution fundamentally consists of three sub-phases: Copy (Shuffle), Sort (Merge), and Reduce. Which of the following accurately describes a critical operation during the Sort (Merge) phase?$

Shuffle and Sort Phase Hard

A.

The Reducer pushes the data back to HDFS temporarily because the memory buffer is cleared for the reduce() function.

B.

The Reducer merges the already-sorted map output files fetched from various NodeManagers to maintain a single, totally ordered stream of keys.

C.

The framework performs a full external Quicksort on the raw key-value pairs fetched from the mappers.

D.

The Reducer invokes the Partitioner again to ensure that keys were not routed to the wrong node due to network errors.
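
Merging inputs that are already sorted is much cheaper than sorting from scratch. The sketch below uses Python's `heapq.merge` on three invented runs standing in for fetched map outputs; it models the idea only, not the framework's on-disk merge.

```python
import heapq
from itertools import groupby

# Hypothetical map outputs for one reducer, each already sorted by key.
run_a = [("apple", 1), ("cat", 1)]
run_b = [("apple", 2), ("bird", 1)]
run_c = [("bird", 4), ("cat", 3)]

# A k-way merge keeps the combined stream ordered by key.
merged = list(heapq.merge(run_a, run_b, run_c, key=lambda kv: kv[0]))
merged_keys = [key for key, _ in merged]

# Equal keys are now adjacent, so they can be handed to reduce() as one group.
grouped = {
    key: sorted(value for _, value in group)
    for key, group in groupby(merged, key=lambda kv: kv[0])
}
print(merged_keys, grouped)
```

Each run is consumed front to back, so the merge needs only one pass over the fetched data.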

57 $YARN Federation addresses the scalability limits of a single ResourceManager cluster. How does YARN Federation manage a single application that requires more resources than a single sub-cluster can provide?$

YARN Architecture Components Hard

A.

It cannot manage this; YARN Federation strictly requires an application to fit entirely within the capacity of a single sub-cluster.

B.

It allows the ApplicationMaster to request resources from ResourceManagers of multiple sub-clusters simultaneously using a global policy.

C.

It automatically splits the MapReduce code into multiple distinct JARs and submits them independently.

D.

It statically provisions resources by mapping the application's user ID strictly to one master sub-cluster.

58 $If you submit a MapReduce job with the configuration mapreduce.job.reduces=0, what happens to the output data?$

MapReduce Execution Framework Hard

A.

The mappers write their outputs to the local disk of the NodeManager, where it remains until a subsequent reduce job is manually started.

B.

The mappers process the data, but no output is written to HDFS because the OutputFormat is exclusively bound to the Reducer.

C.

The job fails with an IllegalStateException because every MapReduce job requires at least one reducer.

D.

The framework bypasses the shuffle and sort phases, and the map tasks write their output directly to HDFS in the final output directory.

59 $In YARN, an ApplicationMaster operates within a container and must authenticate with the ResourceManager to request further resources. Which security mechanism does the AM use to securely communicate with the RM in a Kerberized cluster?$

YARN Architecture Components Hard

A.

It uses an X.509 client certificate hardcoded into the NodeManager truststore.

B.

It uses the client's original Kerberos Ticket Granting Ticket (TGT), forwarded over RPC.

C.

It generates a public/private keypair dynamically and registers the public key in ZooKeeper.

D.

It uses a short-lived AMRMToken (ApplicationMaster-ResourceManager Token) issued by the RM during AM launch.

60 $A developer writes a custom Partitioner for a MapReduce job to route records based on an 'AccountID' string. The logic uses (accountID.hashCode() % numReducers). Under what circumstance will this custom partitioner cause a severe job failure?$

Combiners and Partitioners Hard

A.

If accountID contains special characters that cannot be hashed.

B.

If numReducers is set to 1.

C.

If accountID.hashCode() evaluates to Integer.MIN_VALUE.

D.

If there is a massive data skew where one AccountID has 90% of the data.
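
The sign behaviour at stake here is specific to Java's 32-bit `int`, so the sketch below emulates it in Python rather than relying on Python's own `%` (which never returns a negative result for a positive divisor). The helpers `java_rem` and `java_abs` are hand-written emulations, and the three partition functions are illustrative variants: the bare remainder, a `Math.abs` guard, and the sign-bit mask that Hadoop's default `HashPartitioner` applies.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1

def java_rem(a, n):
    # Java's % truncates toward zero: the result takes the dividend's sign.
    r = abs(a) % n
    return -r if a < 0 else r

def java_abs(x):
    # Math.abs(Integer.MIN_VALUE) overflows and returns Integer.MIN_VALUE.
    return x if x == INT_MIN else abs(x)

def naive_partition(hash_code, num_reducers):
    return java_rem(hash_code, num_reducers)

def abs_partition(hash_code, num_reducers):
    return java_rem(java_abs(hash_code), num_reducers)

def masked_partition(hash_code, num_reducers):
    # Clearing the sign bit always yields a non-negative value.
    return (hash_code & INT_MAX) % num_reducers

print(naive_partition(-7, 3), abs_partition(-7, 3))
print(abs_partition(INT_MIN, 3), masked_partition(INT_MIN, 3))
```

Under these emulated semantics the bare remainder goes negative for any negative hash code, the `Math.abs` guard still goes negative for `Integer.MIN_VALUE`, and only the masked form stays within the valid partition range. A partition number outside that range is rejected by the framework, which fails the task.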

Unit 3 - Practice Quiz