Unit 3 - Practice Quiz

INT312 60 Questions

1 What is MapReduce primarily used for in the Hadoop ecosystem?

Introduction to MapReduce Easy
A. Distributed data processing
B. Network routing
C. Distributed data storage
D. Relational database management

2 In the MapReduce framework, what is the standard output format of the Map phase?

Map Phase Easy
A. Plain text sentences
B. Relational tables
C. XML files
D. Key-Value pairs

3 What is the primary function of the Reducer in a MapReduce job?

Reduce Phase Easy
A. Storing data on disks
B. Aggregating and summarizing data
C. Splitting data into smaller blocks
D. Managing cluster resources

4 What does YARN stand for in Big Data processing?

Introduction to YARN Easy
A. Yet Another Resource Negotiator
B. Yet Another Relational Node
C. Your Application Resource Network
D. Yield And Resource Network

5 YARN was introduced in which major version of Hadoop to resolve the scalability bottlenecks of the classic MapReduce architecture?

Introduction to YARN Easy
A. Hadoop 2.x
B. Hadoop 1.x
C. Hadoop 4.x
D. Hadoop 0.5

6 Which component in YARN is considered the master daemon responsible for globally managing and allocating resources across the entire cluster?

YARN Architecture Easy
A. ResourceManager
B. JobTracker
C. ApplicationMaster
D. NodeManager

7 In YARN, which component runs on each individual worker node to monitor resource usage (CPU, memory) and report it to the master?

YARN Architecture Easy
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. NameNode

8 In YARN, what is the role of the ApplicationMaster?

YARN Architecture Easy
A. Negotiating resources and tracking application status
B. Replacing the DataNode
C. Storing the metadata of HDFS
D. Managing the entire cluster's memory

9 What term is used in YARN to represent a fraction of cluster resources (like memory, CPU) allocated to execute a specific task?

YARN Architecture Easy
A. Container
B. Block
C. Split
D. Pod

10 Which crucial phase occurs between the Map phase and the Reduce phase to group identical keys together?

MapReduce Execution Flow Easy
A. Shuffle and Sort
B. Write and Replicate
C. Split and Read
D. Combine and Merge
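
The flow from Map output to Reduce input can be sketched in a few lines of plain Python. This is a toy, single-process illustration and not Hadoop code; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit one (word, 1) key-value pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group all values that share a key, then order the groups by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reducer: aggregate every value collected for one key."""
    return key, sum(values)


def word_count(lines):
    # Run every line through the mapper, then group, then reduce.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(intermediate)]


result = word_count(["big data big cluster", "data node"])
print(result)
```

In a real cluster the grouping step runs across the network between many Mapper and Reducer tasks; here it is a single in-memory dictionary.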

11 Which optional component is often called a "mini-reducer" because it performs local aggregation on the Mapper output to reduce network bandwidth?

MapReduce Execution Flow Easy
A. Combiner
B. Shuffler
C. DataNode
D. Partitioner

12 What determines which specific Reducer instance will process a given key-value pair in a MapReduce job?

MapReduce Execution Flow Easy
A. Partitioner
B. NameNode
C. JobTracker
D. Combiner

13 In the older Hadoop 1.x architecture, which central daemon was strictly responsible for both resource management and job scheduling?

MapReduce Architecture Easy
A. ApplicationMaster
B. NodeManager
C. JobTracker
D. TaskTracker

14 In Hadoop 1.x, which worker daemon accepted tasks from the master daemon and executed the Map and Reduce operations?

MapReduce Architecture Easy
A. DataNode
B. NodeManager
C. ResourceManager
D. TaskTracker

15 What is the logical representation of a chunk of input data that is processed by a single Mapper instance?

MapReduce Execution Flow Easy
A. HDFS File
B. Data Block
C. Input Split
D. Output Split

16 Which of the following statements about MapReduce job execution is true?

MapReduce Execution Flow Easy
A. MapReduce jobs do not require a Map phase.
B. The Reduce phase starts simultaneously with the Map phase.
C. The Reduce phase cannot start processing data until all Map tasks have completed.
D. Mappers send data directly to HDFS without a Reducer.

17 What is a major advantage of YARN over the classic Hadoop 1.x MapReduce architecture?

Introduction to YARN Easy
A. It supports diverse processing engines (like Spark) alongside MapReduce.
B. It requires all programming to be done in Python.
C. It removes the need for DataNodes entirely.
D. It increases the physical block size of HDFS.

18 In Hadoop MapReduce, which standard Java interface must keys and values implement so they can be serialized and sent over the network?

MapReduce Execution Flow Easy
A. Externalizable
B. Serializable
C. Cloneable
D. Writable

19 In a standard Hadoop YARN cluster, where does the NodeManager typically run?

YARN Architecture Easy
A. On the same machine as the NameNode
B. On the same machines as DataNodes (worker nodes)
C. Exclusively on the client machine
D. Outside the Hadoop cluster

20 How many output files are typically created by a MapReduce job that is configured to run with N Reducers?

Map Phase Easy
A. files
B. 1 combined file
C. files
D. Depends on the number of Mappers

21 Which of the following scenarios best demonstrates the appropriate use of a Combiner in a MapReduce job?

MapReduce Architecture Medium
A. Performing a local aggregation of word counts on the Mapper node to reduce network bandwidth during the shuffle phase.
B. Calculating the exact average of a dataset across multiple nodes before the Reducer phase.
C. Sorting the output of the Reducer phase before writing it to HDFS.
D. Splitting a massive file into smaller chunks to increase the number of Map tasks.

22 If a file is 300 MB and the HDFS block size is 128 MB, how many Map tasks will typically be launched by default, and why?

Map Phase Medium
A. 3 Map tasks, because the data is divided into three InputSplits corresponding to the three HDFS blocks (128 MB, 128 MB, and 44 MB).
B. 4 Map tasks, because the framework always allocates an extra Mapper for overhead processing.
C. 2 Map tasks, because the framework calculates 300 / 128 and rounds down.
D. 1 Map task, because a single file always maps to a single Map task to maintain data locality.
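
The split arithmetic in this question can be checked with a short sketch. It assumes the split size equals the HDFS block size and ignores the small tolerance that Hadoop's `FileInputFormat` allows on the final split; `input_splits` is a hypothetical helper, not a Hadoop API.

```python
def input_splits(file_size_mb, block_size_mb):
    """Return the split sizes for a file, assuming one split per HDFS block."""
    splits = []
    remaining = file_size_mb
    while remaining > 0:
        size = min(block_size_mb, remaining)  # the last split may be smaller
        splits.append(size)
        remaining -= size
    return splits


splits = input_splits(300, 128)
print(len(splits), splits)
```

One Map task is launched per split, so the length of the returned list is the number of Map tasks under these assumptions.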

23 How does a Partitioner operate in a MapReduce job?

Shuffle and Sort Medium
A. It determines which Mapper processes which HDFS block based on node locality.
B. It sorts the values for a given key in descending order before they reach the Reducer.
C. It determines which Reducer will receive a specific key-value pair based on the hash of the key.
D. It filters out null keys from the Mapper output before the shuffle phase begins.

24 What is the purpose of Speculative Execution in Hadoop MapReduce?

MapReduce Architecture Medium
A. To identify tasks running significantly slower than average and launch duplicate backup tasks on other nodes.
B. To execute Map tasks on data that has not yet been written to HDFS.
C. To execute tasks speculatively in memory without writing intermediate data to disk.
D. To predict the output of a Map task before it finishes to speed up the Reducer.

25 What happens if a developer configures a MapReduce job with setNumReduceTasks(0)?

Reduce Phase Medium
A. The ResourceManager will automatically assign exactly one Reducer to prevent data loss.
B. The job becomes a 'Map-only' job, and the Mapper outputs are written directly to HDFS without sorting or shuffling.
C. The framework will hold the Mapper output in memory until a Reducer becomes available.
D. The job will fail because a MapReduce job requires at least one Reducer.

26 Which of the following best describes the structural difference between Hadoop 1.x (MRv1) and YARN regarding resource management?

YARN Architecture Medium
A. YARN separates the dual responsibilities of the JobTracker (resource management and job scheduling/monitoring) into the ResourceManager and ApplicationMaster.
B. YARN moves resource management to HDFS DataNodes, bypassing the need for a central coordinator.
C. YARN eliminates the need for an ApplicationMaster by allowing the NodeManager to schedule jobs directly.
D. YARN combines the JobTracker and TaskTracker into a single unified service.

27 What are the two main components of the YARN ResourceManager?

ResourceManager Medium
A. NodeManager and Container
B. ApplicationMaster and NameNode
C. Scheduler and ApplicationsManager
D. JobTracker and TaskTracker

28 In a YARN cluster, what happens if an ApplicationMaster fails during the execution of a job?

ApplicationMaster Medium
A. The NodeManager takes over the role of the ApplicationMaster for that specific job.
B. The ResourceManager restarts the ApplicationMaster in a new container, and the job may recover depending on the framework.
C. The entire cluster shuts down to prevent data corruption.
D. The ResourceManager immediately fails the application and alerts the client.

29 A YARN Container represents a collection of physical resources. If a task inside a container attempts to use more RAM than allocated, what is the default behavior of YARN?

Containers Medium
A. The NodeManager will kill the container for exceeding its physical memory limit.
B. The ApplicationMaster will negotiate an expansion of the container's size with the ResourceManager.
C. The task will pause until memory becomes available in the cluster.
D. The NodeManager will seamlessly allocate more memory from the node's reserve pool.

30 Which YARN scheduler allocates a fraction of cluster capacity to multiple organizations, allowing them to utilize unused cluster capacity when available, but restricts them to their guaranteed minimum when the cluster is busy?

YARN Scheduling Medium
A. Preemptive Scheduler
B. Fair Scheduler
C. FIFO Scheduler
D. Capacity Scheduler

31 During the shuffle phase, MapReduce must transfer intermediate data across the network. How is this data managed on the Mapper node before transfer?

MapReduce Architecture Medium
A. It is written to a circular memory buffer, spilled to local disk when the buffer reaches a threshold, and then merged.
B. It is immediately streamed to the Reducer task without being stored on the Mapper node.
C. It is kept entirely in RAM until the Reducer requests it via an RPC call.
D. It is written directly to an HDFS block so it can be replicated for fault tolerance.

32 What is the primary role of the Heartbeat mechanism between the NodeManager and the ResourceManager in YARN?

NodeManager Medium
A. To negotiate resource limits directly with the ApplicationMaster.
B. To update HDFS block locations to the NameNode.
C. To inform the ResourceManager of the NodeManager's health and available resources, and to receive container execution commands.
D. To transfer MapReduce job output data from the worker node to the master node.

33 In the context of the Fair Scheduler in YARN, what does 'preemption' allow the scheduler to do?

YARN Scheduling Medium
A. Kill containers from a queue that is over its fair share to free up resources for a starved queue.
B. Allocate memory dynamically beyond the node's physical limits.
C. Predict which jobs will run longest and place them at the end of the queue.
D. Bypass the ApplicationMaster and schedule tasks directly on NodeManagers.

34 In a standard MapReduce job, what is the input format received by the reduce() function?

Reduce Phase Medium
A. A single key and an Iterable collection of values, e.g., (Key, Iterable<Value>).
B. A list of keys and a single aggregated value, e.g., (List<Key>, Value).
C. A single key and a single value, e.g., (Key, Value).
D. An array of key-value pairs representing the entire dataset.

35 Why is 'Data Locality' a critical optimization in the MapReduce framework?

MapReduce Architecture Medium
A. It ensures that intermediate shuffle data is encrypted locally before network transfer.
B. It forces all data to be stored on local node disks rather than in HDFS.
C. It schedules Map tasks on the exact same nodes where the required HDFS blocks reside, minimizing network congestion.
D. It guarantees that Reducers are always placed on the same rack as the client submitting the job.

36 To achieve High Availability (HA) for the YARN ResourceManager, an Active/Standby architecture is used. What component typically manages the state and leader election to handle automatic failover?

ResourceManager Medium
A. JobHistoryServer
B. HDFS JournalNodes
C. Apache Zookeeper
D. Secondary NameNode

37 Which interface allows a MapReduce application to broadcast read-only files (like lookup tables or dictionaries) to all worker nodes before tasks execute?

MapReduce Architecture Medium
A. Partitioner
B. Combiner
C. InputFormat
D. DistributedCache

38 When an ApplicationMaster requires resources to run tasks, how does it specify its request to the ResourceManager?

ApplicationMaster Medium
A. By editing the yarn-site.xml file dynamically during runtime.
B. By sending a ResourceRequest containing memory, CPU requirements, preferred nodes/racks, and priority.
C. By commanding the NodeManager to allocate a certain percentage of its local disk.
D. By requesting specific HDFS block locations directly from the NameNode.

39 What is the defining characteristic of a YARN Application?

YARN Architecture Medium
A. It must always consist of exactly one Map phase and one Reduce phase.
B. It refers strictly to a daemon process running permanently on a NodeManager.
C. It is a single job or a DAG of jobs coordinated by a single ApplicationMaster.
D. It is the global queue managed by the Capacity Scheduler.

40 If two keys, K1 and K2, yield the exact same hash code from the Partitioner, what is the consequence in the MapReduce pipeline?

Shuffle and Sort Medium
A. K1 and K2 will be sent to the same Reducer task, where they will be sorted and grouped separately.
B. The Map task will fail with a HashCollisionException.
C. The Partitioner will automatically assign one of the keys to a random Reducer to balance the load.
D. K1 and K2 will be merged into a single key by the Combiner.

41 In a MapReduce job, what happens if an InputSplit boundary occurs in the middle of a logical record (e.g., a line in a text file)?

MapReduce Execution Framework Hard
A. The Map task skips the partial record entirely, resulting in data loss unless custom error handling is implemented.
B. The JobTracker/ResourceManager automatically realigns the block boundaries in HDFS before launching the Map tasks.
C. The Map task processes the partial record and raises an exception for the next Map task.
D. The Map task reads past the end of its InputSplit block into the next block to finish the record, while the adjacent Map task skips the first partial record.
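
The record-boundary rule can be modelled with a small reader over a byte string. This is a simplified sketch loosely inspired by how a line-oriented record reader behaves, not the actual Hadoop implementation; `read_split` and the sample data are assumptions made for the example.

```python
def read_split(data, start, length):
    """Return the records a mapper would read for the byte range [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to and including the next newline.
        # The reader of the previous split is responsible for that record.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # "pos <= end" lets this reader finish a record that begins at or before
    # the split boundary, even if its bytes extend into the next split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
splits = [(0, 10), (10, 10), (20, 6)]  # (start, length); boundaries fall mid-record
per_split = [read_split(data, start, length) for start, length in splits]
print(per_split)
```

Every record is read exactly once even though the split boundaries cut through "bravo" and sit at the start of "delta".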

42 A developer writes a MapReduce job to calculate the global average of values associated with a key. They use the same Reducer implementation as the Combiner to optimize network traffic. Which of the following describes the outcome of this decision?

Combiners and Partitioners Hard
A. The job will produce correct results only if all map tasks emit exactly the same number of records per key.
B. The job will execute faster and produce the correct global average because averages are commutative.
C. The framework will throw an execution error because a Reducer cannot logically be used as a Combiner in MRv2.
D. The job will execute faster but produce incorrect results because calculating a mean is not an associative and commutative operation.
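
A small numeric sketch shows why reusing an averaging Reducer as the Combiner goes wrong, and one common way around it: have the Combiner emit partial (sum, count) pairs instead of local means. The sample values and variable names are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)


# Values for a single key, as emitted by two different map tasks.
map_outputs = [[10, 20, 30], [40]]

# Ground truth: the mean over all values for the key.
true_mean = mean([v for output in map_outputs for v in output])

# Reducer reused as Combiner: each mapper emits a local mean,
# and the Reducer then averages those means.
mean_of_means = mean([mean(output) for output in map_outputs])

# Safer design: the Combiner emits (sum, count) partials, which add up
# correctly in any order and grouping; the Reducer divides only once.
partials = [(sum(output), len(output)) for output in map_outputs]
total, count = map(sum, zip(*partials))
combined_mean = total / count

print(true_mean, mean_of_means, combined_mean)
```

The mean of means matches the true mean only when every map task contributes the same number of values for the key.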

43 During the shuffle phase, Reducers must fetch Map outputs from various NodeManagers. How does a Reducer task efficiently determine the locations of the completed Map outputs?

Shuffle and Sort Phase Hard
A. It periodically queries the ApplicationMaster, which receives task completion reports and physical locations from the completed Map tasks.
B. It queries the HDFS NameNode, which tracks the temporary spilled map outputs.
C. It broadcasts a request to all NodeManagers in the cluster asking for map outputs associated with its partition.
D. The NodeManagers actively push the partitioned data to the Reducer containers as soon as the Map tasks complete.

44 Under the YARN Fair Scheduler using Dominant Resource Fairness (DRF), consider a cluster with 100 CPUs and 1000 GB of RAM. App A requests containers with 2 CPUs and 10 GB of RAM, while App B requests containers with 1 CPU and 20 GB of RAM. How are the dominant shares calculated?

YARN Scheduling Hard
A. App A's dominant resource is CPU (2%); App B's dominant resource is CPU (1%) due to normalization.
B. App A's dominant resource is Memory (1%); App B's dominant resource is CPU (1%).
C. Both applications have CPU as their dominant resource because CPU scheduling inherently supersedes Memory scheduling in YARN.
D. App A's dominant resource is CPU (2%); App B's dominant resource is Memory (2%).
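
The dominant-share arithmetic for this scenario can be worked through in code. The sketch below only computes per-container shares from the numbers in the question; it is not the scheduler's allocation loop, and `dominant_share` is a hypothetical helper.

```python
def dominant_share(request, capacity):
    """Return (resource name, share) for the resource with the largest share of the cluster."""
    shares = {name: request[name] / capacity[name] for name in request}
    dominant = max(shares, key=shares.get)
    return dominant, shares[dominant]


capacity = {"cpu": 100, "memory_gb": 1000}

# App A: 2 CPUs is 2% of the cluster, 10 GB is 1% of the cluster.
app_a = dominant_share({"cpu": 2, "memory_gb": 10}, capacity)

# App B: 1 CPU is 1% of the cluster, 20 GB is 2% of the cluster.
app_b = dominant_share({"cpu": 1, "memory_gb": 20}, capacity)

print(app_a, app_b)
```

DRF then tries to equalize these dominant shares across applications as containers are granted.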

45 A MapReduce job performs a complex data transformation and inserts the output records directly into an external non-idempotent relational database from within the Map tasks. Speculative execution is enabled by default. What critical issue will arise in this scenario?

Fault Tolerance and Speculative Execution Hard
A. Data duplication will occur because speculative tasks will insert duplicate records before the framework kills the slower task.
B. The MapReduce job will fail because YARN cannot serialize database connection objects across the cluster.
C. The external database will reject the connections due to Kerberos ticket mismatches generated by speculative containers.
D. The ApplicationMaster will deadlock because it cannot lock the external database rows.

46 In a High Availability (HA) YARN cluster, a 'split-brain' scenario occurs where two ResourceManagers (RM1 and RM2) both believe they are active. How does YARN's architectural design prevent cluster corruption in this specific scenario?

YARN Architecture Components Hard
A. The ApplicationMasters implement exponential backoff and will fail over to a pre-configured third Resource Manager (Witness RM).
B. The NodeManagers utilize a Paxos protocol to vote on which RM to send heartbeats to, ostracizing the minority RM.
C. The ActiveStandbyElector uses ZooKeeper to maintain an active lock, and YARN implements fencing where the active RM's epoch number is validated by the ZooKeeper-based state store before any state changes are committed.
D. The Timeline Server acts as an arbiter and forcefully terminates the JVM of the RM with the oldest startup timestamp.

47 To achieve a total global ordering of output data in MapReduce, a developer decides to use the TotalOrderPartitioner. Which of the following prerequisites is strictly necessary for TotalOrderPartitioner to function efficiently without causing extreme data skew?

Combiners and Partitioners Hard
A. A sampling phase must be executed prior to the job to determine partition boundaries, creating a partition file loaded into the Distributed Cache.
B. All keys must be mapped to exactly the same data type size (e.g., exactly 64-bit integers).
C. The input dataset must be pre-sorted in HDFS before the Map phase begins.
D. The number of Reducers must be set strictly equal to the number of Map tasks.

48 If the ApplicationMaster (AM) container fails in YARN, what is the exact sequence of recovery initiated by the framework?

YARN Architecture Components Hard
A. The ResourceManager allocates a new container for the AM, which must then request all previously completed map outputs again, as all intermediate data is purged.
B. The ResourceManager launches a new AM; the new AM can recover the state of already completed tasks if application state recovery is enabled, avoiding full task re-execution.
C. The job fails immediately, and the client application must resubmit the entire MapReduce job from scratch.
D. The NodeManager restarts the AM on the same node, maintaining all active task containers without interruption.

49 During the Map phase, output records are buffered in memory before being spilled to disk. The spill threshold is defined by mapreduce.map.sort.spill.percent. What happens when the buffer usage reaches this threshold?

Shuffle and Sort Phase Hard
A. A background thread begins sorting and spilling the contents to the local disk, while the map task continues writing to the remaining space in the buffer.
B. The framework immediately preempts the map task, sending the partially completed buffer over the network directly to the Reducer.
C. The Map task pauses all record processing until the background thread completes spilling the buffer to HDFS.
D. The NodeManager dynamically allocates more heap memory to the map container to prevent disk I/O bottlenecks.

50 A MapReduce developer is implementing a secondary sort to sort values arriving at the reducer. They configure a custom WritableComparable as the Map output key and write a custom Partitioner. What third component MUST be heavily customized to ensure the Reducer receives all values for a given logical key in a single reduce() call?

MapReduce Execution Framework Hard
A. The OutputCommitter must be configured to merge partial files based on the secondary sort keys.
B. A custom GroupingComparator must be configured to group the composite keys based solely on their logical key portion.
C. The InputFormat class must be overridden to chunk data based on the logical key.
D. A custom Combiner must be provided to pre-sort the values in the NodeManager's RAM.
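
The interplay of the three pieces in a secondary sort can be imitated in plain Python: partition on the logical key, sort on the full composite key, and group on the logical key only. This is a toy model of the idea rather than Hadoop code; the records, the byte-sum hash, and the function names are assumptions for the example.

```python
from itertools import groupby

# Composite map output key: (logical key, value to sort by).
records = [("bob", 7), ("alice", 3), ("bob", 2), ("alice", 9), ("alice", 1)]


def partition(composite_key, num_reducers):
    """Partitioner role: route on the logical key only, so one reducer sees all its records."""
    logical_key, _ = composite_key
    return sum(logical_key.encode()) % num_reducers  # deterministic toy hash


def reduce_input(composite_keys):
    """Return what one reducer would see: one call per logical key, values already sorted."""
    ordered = sorted(composite_keys)  # sort comparator role: full composite key
    # Grouping comparator role: compare only the logical key portion.
    return [
        (logical_key, [value for _, value in group])
        for logical_key, group in groupby(ordered, key=lambda record: record[0])
    ]


grouped = reduce_input(records)
print(grouped)
```

Without the grouping step, each distinct composite key such as ("alice", 1) and ("alice", 3) would be treated as its own group and trigger a separate reduce() call.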

51 In the YARN Capacity Scheduler, what is the effect of configuring yarn.scheduler.capacity.<queue-path>.maximum-capacity lower than 100% for a specific queue?

YARN Scheduling Hard
A. It restricts the queue from utilizing idle cluster resources beyond that percentage, preventing it from overtaking the entire cluster during elastic expansion.
B. It forces the queue to strictly pre-empt containers from other queues to guarantee its minimum capacity.
C. It permanently throttles the CPU clock speed of all containers running in that queue to the specified percentage.
D. It dictates the maximum percentage of a single node's resources that a container in this queue can request.

52 Consider a scenario where you have two large datasets, A and B. You want to perform a Map-side join. Which of the following conditions is strictly necessary to implement a standard Map-side join efficiently using the framework's CompositeInputFormat?

MapReduce Execution Framework Hard
A. Both datasets must be partitioned using the same logic, sorted by the join key, and have exactly the same number of partitions.
B. Dataset B must be small enough to fit entirely into the RAM of a single NodeManager.
C. The MapReduce job must be configured to run with zero reducers and the Distributed Cache disabled.
D. Both datasets must be compressed using a splittable codec such as bzip2 or LZO.

53 What happens in YARN if a NodeManager experiences a transient network partition and fails to send heartbeats to the ResourceManager for a duration exceeding the yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms?

YARN Architecture Components Hard
A. The ApplicationMaster running on that node assumes ResourceManager duties until the network partition heals.
B. The ResourceManager immediately deletes all HDFS blocks residing on that node to prevent data corruption.
C. The ResourceManager marks the node as DEAD, considers all containers on it as failed, and notifies the respective ApplicationMasters to re-schedule those tasks.
D. The NodeManager automatically shuts down its local operating system to fence the node from the cluster.

54 Uber mode (or Uber task optimization) in MapReduce v2 (YARN) is designed to optimize execution for small jobs. How does it alter the standard execution model?

MapReduce Execution Framework Hard
A. It utilizes GPU acceleration on the NodeManagers to execute map and reduce tasks in parallel threads.
B. It bypasses the ResourceManager completely and launches tasks directly using the HDFS DataNode daemon.
C. It executes all map and reduce tasks sequentially within the ApplicationMaster's JVM, avoiding the overhead of requesting and launching separate containers.
D. It runs only Map tasks and forcefully streams their outputs back to the client submitting the job, skipping the Reduce phase.

55 In the context of MapReduce job commitment, what is the primary role of the OutputCommitter class's two-phase commit protocol?

MapReduce Execution Framework Hard
A. To allow tasks to write output to a temporary location and gracefully promote it to the final destination only if the task, and subsequently the job, successfully completes.
B. To securely sign the output data blocks in HDFS with Kerberos tokens.
C. To synchronize the ZooKeeper transaction logs before signaling the JobTracker of success.
D. To commit the final output data to an external RDBMS without holding long-lived database locks.

56 During the Reduce phase, the execution fundamentally consists of three sub-phases: Copy (Shuffle), Sort (Merge), and Reduce. Which of the following accurately describes a critical operation during the Sort (Merge) phase?

Shuffle and Sort Phase Hard
A. The framework performs a full external Quicksort on the raw key-value pairs fetched from the mappers.
B. The Reducer invokes the Partitioner again to ensure that keys were not routed to the wrong node due to network errors.
C. The Reducer merges the already-sorted map output files fetched from various NodeManagers to maintain a single, totally ordered stream of keys.
D. The Reducer pushes the data back to HDFS temporarily because the memory buffer is cleared for the reduce() function.

57 YARN Federation addresses the scalability limits of a single ResourceManager cluster. How does YARN Federation manage a single application that requires more resources than a single sub-cluster can provide?

YARN Architecture Components Hard
A. It allows the ApplicationMaster to request resources from ResourceManagers of multiple sub-clusters simultaneously using a global policy.
B. It statically provisions resources by mapping the application's user ID strictly to one master sub-cluster.
C. It cannot manage this; YARN Federation strictly requires an application to fit entirely within the capacity of a single sub-cluster.
D. It automatically splits the MapReduce code into multiple distinct JARs and submits them independently.

58 If you submit a MapReduce job with the configuration mapreduce.job.reduces=0, what happens to the output data?

MapReduce Execution Framework Hard
A. The mappers write their outputs to the local disk of the NodeManager, where it remains until a subsequent reduce job is manually started.
B. The mappers process the data, but no output is written to HDFS because the OutputFormat is exclusively bound to the Reducer.
C. The framework bypasses the shuffle and sort phases, and the map tasks write their output directly to HDFS in the final output directory.
D. The job fails with an IllegalStateException because every MapReduce job requires at least one reducer.

59 In YARN, an ApplicationMaster operates within a container and must authenticate with the ResourceManager to request further resources. Which security mechanism does the AM use to securely communicate with the RM in a Kerberized cluster?

YARN Architecture Components Hard
A. It uses an X.509 client certificate hardcoded into the NodeManager truststore.
B. It uses a short-lived AMRMToken (ApplicationMaster-ResourceManager Token) issued by the RM during AM launch.
C. It generates a public/private keypair dynamically and registers the public key in ZooKeeper.
D. It uses the client's original Kerberos Ticket Granting Ticket (TGT), forwarded over RPC.

60 A developer writes a custom Partitioner for a MapReduce job to route records based on an 'AccountID' string. The logic uses (accountID.hashCode() % numReducers). Under what circumstance will this custom partitioner cause a severe job failure?

Combiners and Partitioners Hard
A. If there is a massive data skew where one AccountID has 90% of the data.
B. If numReducers is set to 1.
C. If accountID.hashCode() evaluates to Integer.MIN_VALUE.
D. If accountID contains special characters that cannot be hashed.
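
The failure mode in this question comes from Java's `%` operator, which keeps the sign of the dividend, so a negative hash code (including Integer.MIN_VALUE) can produce a negative partition number, which the framework rejects as an illegal partition. The sketch below emulates Java's 32-bit `String.hashCode()` and `%` in Python to show the effect; the helper names and the sample ID "zzzzzz" are made up, and the emulation assumes characters in the Basic Multilingual Plane. Masking the sign bit, as Hadoop's default `HashPartitioner` does with `Integer.MAX_VALUE`, keeps the result in range.

```python
def java_string_hash(text):
    """Emulate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like Java int arithmetic
    return h - 0x100000000 if h >= 0x80000000 else h


def java_remainder(a, b):
    """Emulate Java's % operator: the result takes the sign of the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def naive_partition(account_id, num_reducers):
    # Mirrors accountID.hashCode() % numReducers from the question.
    return java_remainder(java_string_hash(account_id), num_reducers)


def safe_partition(account_id, num_reducers):
    # Clear the sign bit first, so the result is always in [0, num_reducers).
    return (java_string_hash(account_id) & 0x7FFFFFFF) % num_reducers


account_id = "zzzzzz"  # illustrative ID whose emulated hash code is negative
print(java_string_hash(account_id), naive_partition(account_id, 3), safe_partition(account_id, 3))
```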