1. Which of the following components provides the distributed storage in the Hadoop Architecture?
Hadoop Architecture
Easy
A.YARN
B.HDFS
C.Hive
D.MapReduce
Correct Answer: HDFS
Explanation:
HDFS (Hadoop Distributed File System) is the primary storage system of the Hadoop architecture, designed to store large files across multiple machines.
2. What are the two core components of the original Apache Hadoop framework?
Hadoop Architecture
Easy
A.HDFS and MapReduce
B.YARN and Zookeeper
C.Spark and Kafka
D.HBase and Pig
Correct Answer: HDFS and MapReduce
Explanation:
The original Hadoop framework consists of two core layers: HDFS for storage and MapReduce for processing.
3. Hadoop is primarily optimized for which type of data processing?
Hadoop Architecture
Easy
A.Real-time processing
B.Interactive querying
C.Stream processing
D.Batch processing
Correct Answer: Batch processing
Explanation:
Hadoop is designed for batch processing: it works through very large datasets in high-throughput jobs that may run for minutes or hours, rather than returning low-latency, real-time responses.
4. What does HDFS stand for?
Hadoop Storage: HDFS
Easy
A.Hadoop Distributed File System
B.Hyper Distributed File Storage
C.High Data File System
D.Hadoop Data Format System
Correct Answer: Hadoop Distributed File System
Explanation:
HDFS stands for Hadoop Distributed File System, which is the distributed file system that provides high-throughput access to application data.
5. What is the default block size in HDFS for Hadoop 2.x and later?
Hadoop Storage: HDFS
Easy
A.64 MB
B.128 MB
C.256 MB
D.512 MB
Correct Answer: 128 MB
Explanation:
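In Hadoop 2.x and later versions, the default block size in HDFS is 128 MB (it was 64 MB in Hadoop 1.x). Large blocks keep disk seek time small relative to transfer time and reduce the number of blocks the NameNode has to track.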
6. How does HDFS primarily achieve fault tolerance?
Hadoop Storage: HDFS
Easy
A.By replicating data blocks across multiple nodes
B.By continuously backing up to the cloud
C.By encrypting all files
D.By using a relational database
Correct Answer: By replicating data blocks across multiple nodes
Explanation:
HDFS achieves reliability and fault tolerance by replicating each data block (three replicas by default) across different nodes in the cluster.
7. Which data access model does HDFS follow?
Hadoop Storage: HDFS
Easy
A.Write-once, read-many
B.Write-many, read-many
C.Write-once, read-once
D.Write-many, read-once
Correct Answer: Write-once, read-many
Explanation:
HDFS is designed for a 'write-once, read-many' model. A file once created, written, and closed need not be changed, which simplifies data coherency.
8. In the MapReduce paradigm, what is the role of the Reduce function?
Hadoop MapReduce paradigm
Easy
A.To split the data into smaller chunks
B.To aggregate and summarize intermediate results
C.To store the final data in a relational database
D.To filter and map data to key-value pairs
Correct Answer: To aggregate and summarize intermediate results
Explanation:
The Reduce function takes the intermediate output from the Map phase and aggregates or summarizes it to produce the final output.
9. What is the primary data structure passed between the Map and Reduce phases?
Hadoop MapReduce paradigm
Easy
A.Arrays
B.XML nodes
C.Key-Value pairs
D.JSON objects
Correct Answer: Key-Value pairs
Explanation:
The MapReduce framework operates exclusively on <key, value> pairs, meaning the map function outputs intermediate key-value pairs which are then processed by the reduce function.
10. Which phase occurs directly between the Map phase and the Reduce phase to group data by keys?
Hadoop MapReduce paradigm
Easy
A.File Writing
B.Data Splitting
C.Data Ingestion
D.Shuffle and Sort
Correct Answer: Shuffle and Sort
Explanation:
The 'Shuffle and Sort' phase collects the output from the mappers, sorts it by key, and transfers it to the reducers, so that all values for the same key arrive at the same reducer.
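The grouping step described above can be sketched in plain Python. This is a toy in-memory model, not Hadoop code: the function name `shuffle_and_sort` and the sample mapper outputs are made up for illustration.

```python
from collections import defaultdict

def shuffle_and_sort(map_outputs):
    """Group intermediate (key, value) pairs by key and return them in key order."""
    grouped = defaultdict(list)
    for mapper_output in map_outputs:          # one list per map task
        for key, value in mapper_output:
            grouped[key].append(value)
    # Keys are unique after grouping, so sorting the items orders by key.
    return sorted(grouped.items())

# Hypothetical output of two map tasks in a word count job.
map_outputs = [
    [("hadoop", 1), ("hdfs", 1), ("hadoop", 1)],
    [("yarn", 1), ("hadoop", 1)],
]
reducer_input = shuffle_and_sort(map_outputs)
print(reducer_input)
# [('hadoop', [1, 1, 1]), ('hdfs', [1]), ('yarn', [1])]
```

In a real cluster the grouped data is also partitioned across several reducers; this sketch assumes a single reducer to keep the idea visible.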
11. In MapReduce terminology, what is an 'InputSplit'?
MapReduce Terminology
Easy
A.A command to divide the cluster into smaller networks
B.A physical file on the disk
C.A logical representation of data processed by a single Map task
D.An error that splits a job into two
Correct Answer: A logical representation of data processed by a single Map task
Explanation:
An InputSplit is a logical division of the input data. Each Map task processes exactly one InputSplit.
12. What does a 'RecordReader' do in a MapReduce job?
MapReduce Terminology
Easy
A.It combines the outputs of multiple Reducers
B.It reads the final output from HDFS
C.It translates an InputSplit into key-value pairs for the Mapper
D.It monitors the health of the DataNodes
Correct Answer: It translates an InputSplit into key-value pairs for the Mapper
Explanation:
The RecordReader reads data from an InputSplit and converts it into key-value pairs so the Mapper can process them.
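As a rough illustration of that conversion, the sketch below turns the raw bytes of one split into (byte offset, line) pairs, which is the shape of record that a line-oriented reader hands to the Mapper. The function name `line_record_reader` and the sample bytes are hypothetical; this is plain Python, not the Hadoop API.

```python
def line_record_reader(data: bytes):
    """Yield (byte_offset, line_text) pairs for each line in the given bytes."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: position of the line's first byte. Value: the line without its terminator.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

split = b"hello hadoop\nhello hdfs\n"
records = list(line_record_reader(split))
print(records)
# [(0, 'hello hadoop'), (13, 'hello hdfs')]
```

Each pair would then be passed to one call of the map function.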
13. What is the term for the output produced by the Mapper before it reaches the Reducer?
MapReduce Terminology
Easy
A.Aggregated Data
B.Raw Data
C.Intermediate Data
D.Final Output
Correct Answer: Intermediate Data
Explanation:
The key-value pairs generated by the Map function are called Intermediate Data. This data is stored temporarily on the local disk of the node that ran the map task (not in HDFS) before being sent to the Reducer.
14. In HDFS, which node is responsible for storing the metadata about the file system?
Hadoop - Namenode, DataNode
Easy
A.NameNode
B.DataNode
C.JobTracker
D.TaskTracker
Correct Answer: NameNode
Explanation:
The NameNode acts as the master server in HDFS. It manages the file system namespace and stores all metadata, such as file names, permissions, and block locations.
15. What is the primary function of a DataNode in HDFS?
Hadoop - Namenode, DataNode
Easy
A.To schedule MapReduce jobs
B.To run the JobTracker
C.To manage user permissions
D.To store the actual data blocks
Correct Answer: To store the actual data blocks
Explanation:
DataNodes are the worker nodes in HDFS. Their primary role is to store and retrieve the actual data blocks when told to by the NameNode or the client.
16. What happens if the NameNode fails in a traditional Hadoop 1.x cluster (without High Availability)?
Hadoop - Namenode, DataNode
Easy
A.MapReduce jobs switch to local mode automatically
B.The entire HDFS becomes inaccessible
C.The cluster continues to operate normally
D.A DataNode automatically becomes the new NameNode
Correct Answer: The entire HDFS becomes inaccessible
Explanation:
In early versions of Hadoop without High Availability, the NameNode was a single point of failure. If it failed, the metadata was unavailable, making the entire file system inaccessible.
17. In the MapReduce version 1 (MRv1) architecture, which component manages the resources and schedules jobs across the cluster?
Job Tracker and TaskTracker
Easy
A.DataNode
B.NameNode
C.JobTracker
D.TaskTracker
Correct Answer: JobTracker
Explanation:
The JobTracker in MRv1 is the master service responsible for resource management, scheduling MapReduce jobs, and tracking their progress.
18. Where does a TaskTracker typically run in a Hadoop MRv1 cluster?
Job Tracker and TaskTracker
Easy
A.On a dedicated master node
B.On the same node as a DataNode
C.Outside the Hadoop cluster
D.On the NameNode
Correct Answer: On the same node as a DataNode
Explanation:
To achieve data locality, a TaskTracker usually runs on the same physical machine as a DataNode, allowing it to process data that is stored locally.
19. When running a typical Word Count program in Hadoop, what is the expected output format?
word count on command line
Easy
A.A list of unique words alongside their frequency of occurrence
B.A single integer representing the total number of words
C.A compressed zip file of all words
D.A graphical chart of word frequencies
Correct Answer: A list of unique words alongside their frequency of occurrence
Explanation:
The standard Word Count MapReduce program outputs a text file containing key-value pairs, where the key is a unique word and the value is the number of times it appeared.
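To make the shape of that output concrete, here is a small local imitation in Python. It is a sketch rather than a MapReduce job: the helper names `word_count` and `format_output` are invented for this example, and the tab between word and count mirrors the default separator of Hadoop's text output.

```python
from collections import Counter

def word_count(lines):
    """Count whitespace-separated words and return (word, count) pairs in key order."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())

def format_output(pairs):
    """Render one 'word<TAB>count' line per unique word."""
    return "\n".join(f"{word}\t{count}" for word, count in pairs)

lines = ["hello hadoop", "hello hdfs"]
print(format_output(word_count(lines)))
# hadoop	1
# hdfs	1
# hello	2
```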
20. Which command is commonly used on the command line to execute a compiled MapReduce JAR file?
word count on command line
Easy
A.hdfs execute
B.hadoop run
C.hadoop jar
D.mapreduce start
Correct Answer: hadoop jar
Explanation:
The hadoop jar command is used to run a MapReduce job packaged in a JAR file on a Hadoop cluster.
21. A user wants to store a file of size 300 MB in HDFS with a configured block size of 128 MB. Assuming the replication factor is set to 3, how many physical block replicas will be stored across the cluster in total?
Hadoop Storage: HDFS
Medium
A.6 blocks
B.3 blocks
C.12 blocks
D.9 blocks
Correct Answer: 9 blocks
Explanation:
The file is split into 3 logical blocks (128 MB, 128 MB, and 44 MB). Since the replication factor is 3, each of the 3 logical blocks will have 3 replicas, resulting in 3 × 3 = 9 physical block replicas stored in the cluster.
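The same arithmetic can be checked with a few lines of Python. The function name `hdfs_block_layout` is made up for this sketch, and it assumes a positive file size given in whole megabytes.

```python
import math

def hdfs_block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (logical blocks, size of last block in MB, total physical replicas)."""
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only holds the remainder; it is not padded to a full block.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    return num_blocks, last_block_mb, num_blocks * replication

print(hdfs_block_layout(300))
# (3, 44, 9)
```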
22. In a Hadoop cluster configured with Rack Awareness and a replication factor of 3, how does the cluster typically place the replicas to ensure fault tolerance while optimizing write bandwidth?
Hadoop Architecture
Medium
A.One replica is placed on the local rack, and the other two are placed on two different nodes in a different rack.
B.The first two replicas are placed on the local node, and the third is placed on a remote rack.
C.Each replica is placed on a completely different rack in the data center.
D.All three replicas are placed on different nodes within the same rack.
Correct Answer: One replica is placed on the local rack, and the other two are placed on two different nodes in a different rack.
Explanation:
To balance fault tolerance and write bandwidth, Hadoop's default rack awareness policy places the first replica on the local rack, and the second and third replicas on two different nodes within a single remote rack. This protects against a single rack failure.
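A simplified model of this placement rule is sketched below in Python. It is not the HDFS placement code: the names `rack_of` and `place_replicas` and the topology are hypothetical, and the sketch assumes that the writer runs on a DataNode and that every remote rack has at least two nodes.

```python
import random

def rack_of(node, topology):
    """Return the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")

def place_replicas(writer_node, topology, rng=random):
    """Pick three nodes: the writer's node, then two distinct nodes on one remote rack."""
    local_rack = rack_of(writer_node, topology)
    remote_rack = rng.choice(sorted(r for r in topology if r != local_rack))
    second, third = rng.sample(topology[remote_rack], 2)  # two different nodes
    return [writer_node, second, third]

# Hypothetical three-rack cluster.
topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
print(replicas, [rack_of(n, topology) for n in replicas])
```

Only one replica crosses between racks for the second copy, and the third copy stays inside the remote rack, which is how the policy limits inter-rack write traffic while still surviving the loss of a whole rack.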
23. What is the primary architectural advantage of the 'Data Locality' principle in Hadoop?
Hadoop Architecture
Medium
A.It guarantees that all localized databases are synchronized with the NameNode.
B.It moves data across the network to specialized compute nodes to increase processing speed.
C.It schedules computational tasks on the node where the data physically resides, minimizing network congestion.
D.It ensures that data is stored locally on the client machine before being uploaded to HDFS.
Correct Answer: It schedules computational tasks on the node where the data physically resides, minimizing network congestion.
Explanation:
Data locality means moving the computation to the data rather than moving data to the computation. By executing map tasks on the nodes where the HDFS blocks are stored, Hadoop drastically reduces network overhead and improves performance.
24. An administrator notices that the NameNode is running out of RAM, even though the cluster's total storage capacity is mostly empty. What is the most likely cause of this issue?
Hadoop Storage: HDFS
Medium
A.The Secondary NameNode has failed to back up the data properly.
B.The DataNodes are sending heartbeats too frequently.
C.The cluster is storing an excessive number of very small files.
D.The replication factor is set too high, consuming extra RAM.
Correct Answer: The cluster is storing an excessive number of very small files.
Explanation:
The NameNode stores metadata for every file and block in memory (RAM). Even if the files are tiny (e.g., a few kilobytes), each file and each of its blocks is still tracked as an in-memory object that costs a roughly fixed amount of memory (about 150 bytes, as a common rule of thumb). Millions of small files can exhaust NameNode memory long before disk space runs out.
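A back-of-the-envelope estimate shows how quickly this adds up. The sketch below uses the roughly 150-byte figure quoted above and assumes, for illustration only, that every small file costs one file object plus one block object; the function name `namenode_heap_estimate` is invented and real heap usage will differ.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the explanation, not a measured value

def namenode_heap_estimate(num_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory in bytes for files that each occupy `blocks_per_file` blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its block objects
    return objects * bytes_per_object

small_files = namenode_heap_estimate(10_000_000)  # ten million tiny files
print(small_files / 1e9, "GB")
# 3.0 GB
```

Under these assumptions, ten million tiny files need about 3 GB of NameNode memory no matter how little disk space they occupy.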
25. What is the actual role of the Secondary NameNode in a standard Hadoop cluster?
Hadoop Storage: HDFS
Medium
A.It periodically downloads the fsimage and edits files, merges them, and uploads the updated fsimage to the primary NameNode.
B.It serves as an instant failover node if the primary NameNode crashes.
C.It manages metadata for secondary storage devices attached to DataNodes.
D.It acts as a load balancer for client read/write requests to the NameNode.
Correct Answer: It periodically downloads the fsimage and edits files, merges them, and uploads the updated fsimage to the primary NameNode.
Explanation:
The Secondary NameNode is not a high-availability backup. Its purpose is to perform checkpointing by periodically merging the edits log into the fsimage to prevent the edits file from growing too large, which would cause long NameNode startup times.
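The checkpoint merge can be pictured as replaying a log of operations on top of a snapshot. The Python sketch below is a toy model: the operation names (`create`, `delete`, `rename`), the dictionary layout, and the function `apply_edits` are hypothetical and do not reflect the real on-disk fsimage or edits formats.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace snapshot with every logged operation applied in order."""
    namespace = dict(fsimage)  # leave the old snapshot untouched
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]            # args[0]: replication factor
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)  # args[0]: new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

fsimage = {"/data/a.txt": 3}
edits = [
    ("create", "/data/b.txt", 3),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt"),
]
new_fsimage = apply_edits(fsimage, edits)
print(new_fsimage)
# {'/data/c.txt': 3}
```

After a checkpoint the merged snapshot replaces the old one and a fresh, short edits log takes over, which is what keeps NameNode restarts fast.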
26. During an HDFS write operation, a client wants to write a block with a replication factor of 3. How is the data practically transferred to the DataNodes?
Hadoop Storage: HDFS
Medium
A.The client writes it locally, and HDFS automatically replicates it in the background after the file is closed.
B.The NameNode coordinates the transfer by receiving the data from the client and pushing it to the DataNodes.
C.The client sends the data to the first DataNode, which pipes it to the second, which in turn pipes it to the third.
D.The client sends the block simultaneously to all three DataNodes.
Correct Answer: The client sends the data to the first DataNode, which pipes it to the second, which in turn pipes it to the third.
Explanation:
HDFS uses a replication pipeline. The client streams the data to the first DataNode, which immediately forwards it to the second DataNode, which forwards it to the third. This pipelining efficiently utilizes network bandwidth.
27. If a MapReduce job is configured with zero Reducers (setNumReduceTasks(0)), what is the final output of the job?
Hadoop MapReduce paradigm
Medium
A.The job fails because at least one reducer is required to aggregate data.
B.The job simply verifies data integrity but writes no output.
C.The output consists of the unsorted key-value pairs exactly as outputted by the Map tasks, stored in HDFS.
D.The output consists of the sorted key-value pairs directly from the Map phase, stored in HDFS.
Correct Answer: The output consists of the unsorted key-value pairs exactly as outputted by the Map tasks, stored in HDFS.
Explanation:
If the number of reducers is zero, the MapReduce job becomes a 'map-only' job. The shuffle and sort phases are skipped, and the output of the map phase is written directly to HDFS without being sorted.
28. How does a Combiner optimize a MapReduce job that calculates the total sales per region?
Hadoop MapReduce paradigm
Medium
A.It automatically adjusts the number of map tasks based on cluster availability.
B.It runs on the Reducer node to filter out invalid records before the final reduction.
C.It merges small files in HDFS into larger ones before the Map phase begins.
D.It performs a local aggregation of map output data on the Map node, reducing the amount of data sent across the network during the shuffle phase.
Correct Answer: It performs a local aggregation of map output data on the Map node, reducing the amount of data sent across the network during the shuffle phase.
Explanation:
A Combiner acts as a 'mini-reducer' running locally on the output of a Map task. By aggregating data locally before it is sent over the network to the Reducers, it significantly minimizes the bandwidth required during the shuffle phase.
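The saving is easy to see in a small Python sketch of one map task's output for the sales-per-region job. The function name `combine` and the sample figures are invented for illustration; the point is only that summing locally shrinks the number of records that must cross the network.

```python
from collections import Counter

def combine(pairs):
    """Locally sum the amounts per region for the output of a single map task."""
    totals = Counter()
    for region, amount in pairs:
        totals[region] += amount
    return sorted(totals.items())

# Hypothetical (region, sale amount) pairs emitted by one map task.
map_output = [("east", 100), ("west", 50), ("east", 25), ("east", 75), ("west", 10)]
combined = combine(map_output)
print(combined)
# [('east', 200), ('west', 60)]
print(len(map_output), "records ->", len(combined), "records")
# 5 records -> 2 records
```

This works because addition is associative and commutative, so partial sums can be added again by the Reducer. An operation without that property (a plain average, for example) cannot be used as a combiner unchanged.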
29. During the Shuffle and Sort phase of MapReduce, what specific guarantee is provided to the Reducer regarding its input?
Hadoop MapReduce paradigm
Medium
A.The Reducer will receive data split into blocks matching the HDFS block size.
B.All keys assigned to a single reducer will arrive in randomized order to prevent data skew.
C.Each reducer will receive an exactly equal amount of data, regardless of key distribution.
D.Values associated with the same key are grouped together, and the keys are presented to the Reducer in sorted order.
Correct Answer: Values associated with the same key are grouped together, and the keys are presented to the Reducer in sorted order.
Explanation:
The framework guarantees that all values for a given key are grouped together and passed to the reducer's reduce() method. Furthermore, the keys are sorted, allowing the reducer to process the data sequentially.
30. In the context of MapReduce Terminology, what is the primary difference between an HDFS Block and an InputSplit?
MapReduce Terminology
Medium
A.They are identical concepts; Hadoop uses the terms interchangeably depending on the version.
B.An HDFS Block is a physical division of data on disk, whereas an InputSplit is a logical division of data that defines the input for a single Map task.
C.An InputSplit is a physical chunk of data handled by the JobTracker, while a Block is an abstract data structure used by the Reducer.
D.An InputSplit determines the number of Reducers, while an HDFS Block determines the number of Mappers.
Correct Answer: An HDFS Block is a physical division of data on disk, whereas an InputSplit is a logical division of data that defines the input for a single Map task.
Explanation:
A Block is a physical chunk of a file stored in HDFS (e.g., 128 MB). An InputSplit is a logical representation of data created by the InputFormat, which the RecordReader parses into key-value pairs for exactly one map task. Often they align, but an InputSplit respects logical record boundaries.
31. Why must keys emitted by the Mapper implement the WritableComparable interface in Hadoop?
MapReduce Terminology
Medium
A.To ensure that values can be logically split across multiple reducers.
B.Because Hadoop requires all data types to inherit from standard Java Collections.
C.So they can be serialized over the network and sorted during the shuffle phase.
D.So they can be compressed securely before writing to HDFS.
Correct Answer: So they can be serialized over the network and sorted during the shuffle phase.
Explanation:
In Hadoop, data passed between phases must be serializable (Writable). Furthermore, because keys must be sorted and grouped during the shuffle phase before reaching the Reducer, they must also be comparable (Comparable). Thus, keys implement WritableComparable.
32. What component is directly responsible for converting raw input data (e.g., lines of a text file) into the initial <key, value> pairs processed by the Mapper?
MapReduce Terminology
Medium
A.The OutputCommitter
B.The Partitioner
C.The InputSplitter
D.The RecordReader
Correct Answer: The RecordReader
Explanation:
The RecordReader interacts directly with the InputSplit. It reads the raw data from the data source and translates it into <key, value> pairs that are passed one by one to the map() function.
33. In a MapReduce v1 (MRv1) architecture, what is the primary role of the JobTracker?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Medium
A.To store the metadata of HDFS files and direct clients to the correct DataNodes.
B.To execute the individual map and reduce tasks assigned by the NameNode.
C.To allocate resources, schedule jobs, monitor TaskTrackers, and re-execute failed tasks.
D.To merge the edit logs into the fsimage to keep the NameNode from crashing.
Correct Answer: To allocate resources, schedule jobs, monitor TaskTrackers, and re-execute failed tasks.
Explanation:
In classic Hadoop (MRv1), the JobTracker is the central service responsible for resource management, task scheduling, monitoring the progress of jobs, and handling failures by re-assigning tasks to available TaskTrackers.
34. A TaskTracker node unexpectedly loses power while executing a Map task. How does the cluster recognize and handle this failure?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Medium
A.The TaskTracker reboots and resumes the task from the last saved checkpoint in HDFS.
B.The NameNode detects missing block heartbeats and reschedules the Map task on another rack.
C.The JobTracker stops receiving heartbeats from the TaskTracker, marks it as dead, and schedules the incomplete task on another available TaskTracker.
D.The JobTracker immediately fails the entire MapReduce job to prevent data corruption.
Correct Answer: The JobTracker stops receiving heartbeats from the TaskTracker, marks it as dead, and schedules the incomplete task on another available TaskTracker.
Explanation:
TaskTrackers send regular heartbeats to the JobTracker. If the JobTracker misses heartbeats for a certain threshold, it assumes the TaskTracker has failed and re-assigns any running tasks from that node to other healthy nodes.
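The detection logic amounts to comparing each tracker's last heartbeat time against a timeout. The Python sketch below models only that comparison: the function names, tracker and task identifiers, timestamps, and the 600-second timeout are illustrative assumptions, not values read from a Hadoop configuration.

```python
def find_dead_trackers(last_heartbeat, now, timeout_s=600):
    """Return trackers whose most recent heartbeat is older than the timeout."""
    return sorted(name for name, seen in last_heartbeat.items() if now - seen > timeout_s)

def tasks_to_reschedule(running_tasks, dead_trackers):
    """Return the tasks that were running on trackers now considered dead."""
    dead = set(dead_trackers)
    return sorted(task for task, tracker in running_tasks.items() if tracker in dead)

# Hypothetical state: seconds since some epoch for each tracker's last heartbeat.
last_heartbeat = {"tt1": 1000, "tt2": 350, "tt3": 990}
running_tasks = {"map_0": "tt1", "map_1": "tt2", "map_2": "tt2", "map_3": "tt3"}

dead = find_dead_trackers(last_heartbeat, now=1000)
print(dead)
# ['tt2']
print(tasks_to_reschedule(running_tasks, dead))
# ['map_1', 'map_2']
```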
35. When a JobTracker determines that a specific Map task is running unusually slow compared to others in the same job, what mechanism can it use to mitigate this?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Medium
A.Data Rebalancing
B.Speculative Execution
C.Dynamic Partitioning
D.Garbage Collection
Correct Answer: Speculative Execution
Explanation:
Speculative Execution is a feature where the JobTracker detects a slow-running task (a straggler) and launches a duplicate, or speculative, task on another node. Whichever task finishes first is used, and the other is killed.
36. Which of the following best describes the relationship between DataNodes and TaskTrackers in a traditional Hadoop 1.x cluster?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Medium
A.TaskTrackers manage the metadata while DataNodes handle the actual computations.
B.DataNodes act as a backup for TaskTrackers in case the JobTracker fails.
C.They are entirely separate entities running on isolated hardware to prevent CPU and I/O contention.
D.They typically run on the same physical machines to enable data locality for MapReduce tasks.
Correct Answer: They typically run on the same physical machines to enable data locality for MapReduce tasks.
Explanation:
To achieve data locality, slave nodes in a classic Hadoop cluster run both a DataNode daemon (for storage) and a TaskTracker daemon (for computation). This allows tasks to be processed on the same node where the data is stored.
37. If a DataNode successfully writes its block but its disk subsequently fails, how does the NameNode eventually find out about the missing data?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Medium
A.The DataNode sends regular block reports along with its heartbeats to the NameNode.
B.The Secondary NameNode scans the disks and updates the primary NameNode.
C.The JobTracker notifies the NameNode when a map task fails to read the data.
D.The NameNode continually pings all block locations asynchronously to verify integrity.
Correct Answer: The DataNode sends regular block reports along with its heartbeats to the NameNode.
Explanation:
DataNodes periodically send Block Reports to the NameNode. A Block Report contains a list of all HDFS blocks stored on that DataNode. If a disk fails, the blocks will be missing from the report, prompting the NameNode to replicate them elsewhere.
38. A user attempts to run a pre-compiled word count MapReduce job using the command: hadoop jar wc.jar WordCount /user/data/input /user/data/output. However, the job immediately fails before running any map tasks. What is the most likely cause?
word count on command line
Medium
A.The input directory is empty, which throws a fatal execution exception.
B.The jar file lacks a combiner class, which is mandatory for WordCount.
C.The /user/data/output directory already exists in HDFS.
D.The user forgot to specify the number of reducers in the command arguments.
Correct Answer: The /user/data/output directory already exists in HDFS.
Explanation:
Hadoop strictly requires that the output directory for a MapReduce job does not exist prior to execution. This safety mechanism prevents a job from accidentally overwriting the output of a previous job. If it exists, the job setup fails immediately.
39. When executing a WordCount program via the command line (hadoop jar wordcount.jar org.example.WordCount /input /output), what happens to the output data produced by the Reducers?
word count on command line
Medium
A.It is appended directly to the input files to keep data localized.
B.It is written as multiple part files (e.g., part-r-00000) inside the /output directory in HDFS.
C.It is printed directly to the terminal stdout.
D.It is stored in the local file system of the node where the command was executed.
Correct Answer: It is written as multiple part files (e.g., part-r-00000) inside the /output directory in HDFS.
Explanation:
By default, MapReduce jobs write their final output to HDFS in the specified output directory. Each Reducer creates its own output file, typically named part-r-00000, part-r-00001, etc., depending on the number of reduce tasks.
40. In a Hadoop High Availability (HA) cluster, what prevents the 'split-brain' scenario where two NameNodes both think they are active and attempt to alter the filesystem simultaneously?
Hadoop Architecture
Medium
A.The Secondary NameNode acts as an arbiter to vote on the true active node.
B.The JobTracker coordinates a distributed lock that limits metadata edits.
C.Fencing mechanisms are configured to isolate or power off the previously active NameNode.
D.DataNodes will only send heartbeats to the IP address with the lowest latency.
Correct Answer: Fencing mechanisms are configured to isolate or power off the previously active NameNode.
Explanation:
To prevent split-brain in an HA setup, Hadoop uses 'fencing' methods. If a failover occurs, the cluster employs techniques (like shutting down the old NameNode's network port or even cutting its power via an intelligent PDU) to ensure it can no longer write to the shared storage.
41. During an HDFS write operation, if the second DataNode in the replication pipeline fails while receiving a block, what is the immediate sequence of actions taken by the HDFS client and the remaining DataNodes?
Hadoop Storage: HDFS
Hard
A.The entire block is discarded, the client requests a completely new pipeline from the NameNode, and the write operation restarts from the beginning.
B.The client reports the failure to the NameNode, which immediately allocates a new DataNode to maintain the replication factor before continuing the write.
C.The pipeline is closed, the failed DataNode is removed, the remaining DataNodes are given a new generation stamp, and the write resumes with the remaining DataNodes.
D.The first DataNode caches the data in memory, waits for the NameNode to restart the second DataNode, and then resumes the data transfer.
Correct Answer: The pipeline is closed, the failed DataNode is removed, the remaining DataNodes are given a new generation stamp, and the write resumes with the remaining DataNodes.
Explanation:
When a DataNode fails in the pipeline, the pipeline is temporarily closed. Any data in the ack queue is returned to the data queue. The remaining DataNodes receive a new generation stamp so the failed node's partial block is discarded if it recovers. The write resumes on the remaining nodes; the NameNode will later asynchronously replicate the block to meet the target replication factor.
42. In the MRv1 architecture, what happens if a TaskTracker stops sending heartbeats to the JobTracker due to a temporary network partition that exceeds the timeout period?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Hard
A.The JobTracker marks the TaskTracker as dead, fails its running tasks, and reschedules them on other nodes, while the TaskTracker pauses its tasks.
B.The JobTracker marks the TaskTracker as dead and reschedules its tasks; when the partition resolves, the TaskTracker attempts to reconnect and is instructed to kill its old tasks.
C.The TaskTracker promotes itself to an independent JobTracker for its local tasks and merges results back when the network is restored.
D.The JobTracker delegates the tracking to a Standby JobTracker, which polls the TaskTracker directly until the network recovers.
Correct Answer: The JobTracker marks the TaskTracker as dead and reschedules its tasks; when the partition resolves, the TaskTracker attempts to reconnect and is instructed to kill its old tasks.
Explanation:
If a TaskTracker's heartbeat times out, the JobTracker assumes it is dead and reschedules its tasks. The TaskTracker, however, continues executing its tasks blindly until the network is restored. Once it reconnects, the JobTracker recognizes the stale state and sends a command to the TaskTracker to kill those obsolete tasks.
43. How does the MapReduce framework handle speculative execution when a task is straggling due to systemic data skew (e.g., one Reducer receives 90% of the data) rather than hardware degradation?
Hadoop MapReduce paradigm
Hard
A.It splits the skewed Reducer task into multiple sub-reducers, effectively parallelizing the heavy partition.
B.It successfully mitigates the delay by launching a speculative task that dynamically re-partitions the skewed data.
C.It detects data skew via the partitioner metrics and automatically disables speculative execution for that specific task.
D.It launches a speculative task, but both the original and speculative tasks will take equally long since the skew is inherent to the data, potentially wasting cluster resources.
Correct Answer: It launches a speculative task, but both the original and speculative tasks will take equally long since the skew is inherent to the data, potentially wasting cluster resources.
Explanation:
Speculative execution is designed to counter hardware slowness by launching duplicate tasks on different nodes. However, it cannot solve data skew. If a Reducer is slow because it received vastly more data, the speculative duplicate will receive the exact same skewed data and suffer the same delay, simply wasting cluster resources.
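The reason a duplicate task cannot help is that partitioning is deterministic: every record with the same key goes to the same reducer, whichever node runs it. The Python sketch below illustrates this with a hash-modulo partitioner. Here `zlib.crc32` is a deterministic stand-in for the key's Java hash code, and the function name and key distribution are invented for the example.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Assign a key to a reducer by hashing it, in the style of a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Hypothetical skewed input: one key accounts for 90 of 100 records.
keys = ["the"] * 90 + ["hadoop", "hdfs", "yarn", "hive", "pig"] * 2
load = Counter(partition(k, 4) for k in keys)

hot_reducer = partition("the", 4)
print("records per reducer:", dict(load))
print("reducer", hot_reducer, "handles", load[hot_reducer], "of", len(keys), "records")
```

A speculative copy of the hot reducer would be handed exactly the same partition, so it finishes no sooner. Reducing the skew itself (for example with a different key design or a custom partitioner) is what changes the picture.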
44. Suppose an HDFS file is 130 MB and the block size is 64 MB. The file contains textual records where a single logical record spans across the boundary of the first and second block. How does TextInputFormat handle the InputSplit boundary to ensure data integrity?
MapReduce Terminology
Hard
A.The framework copies the overflowing record entirely into the second block before assigning the InputSplits to the Mappers.
B.The JobTracker detects the split boundary violation and merges the two blocks into a single 128 MB InputSplit processed by one Mapper.
C.The first Map task processes exactly 64 MB. The second Map task reads the truncated record from the beginning of the second block, resulting in a data parsing error.
D.The first Map task processes the first block and reads past the 64 MB boundary into the second block until the end of the current record. The second Map task ignores the first partial record in its block.
Correct Answer: The first Map task processes the first block and reads past the 64 MB boundary into the second block until the end of the current record. The second Map task ignores the first partial record in its block.
Explanation:
TextInputFormat uses LineRecordReader, which is designed to handle records crossing block boundaries. The Mapper assigned to a split will read past the end of its block to finish the last record. Conversely, the Mapper for the next split will skip the first partial record it encounters, knowing the previous Mapper handled it.
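The boundary rule can be modelled in a few lines of Python. This is a simplified sketch of the idea, not the LineRecordReader source: the function name `read_split`, the sample data, and the 10-byte split size are made up, and records are assumed to be newline-terminated text.

```python
def read_split(data: bytes, start: int, length: int):
    """Return the lines owned by the split that covers bytes [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Discard everything up to and including the first newline:
        # the reader of the previous split is responsible for that line.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # Keep reading while a line STARTS at or before the split end,
    # even if that line runs past the boundary.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode("utf-8"))
        pos = line_end + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"   # 26 bytes
splits = [(0, 10), (10, 10), (20, 6)]      # hypothetical 10-byte splits
per_split = [read_split(data, start, length) for start, length in splits]
print(per_split)
# [['alpha', 'bravo'], ['charlie', 'delta'], []]
```

The line "bravo" crosses the first boundary and is read in full by the first split, while the second split skips its leading fragment, so every record is processed exactly once.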
45. Which of the following best describes the structural transition of metadata when the NameNode is restarted, specifically regarding the FsImage and EditLog?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Hard
A.The Secondary NameNode takes over client requests while the primary NameNode merges the FsImage and EditLog into a new FsImage.
B.The NameNode loads the FsImage into memory, leaves the EditLog untouched, and asynchronously merges them in the background while serving clients.
C.The NameNode discards the old FsImage, regenerates it entirely from the block reports of the DataNodes, and then replays the EditLog.
D.The NameNode applies the EditLog to the FsImage in memory, creates a new FsImage on disk, and truncates the old EditLog before accepting new client requests.
Correct Answer: The NameNode applies the EditLog to the FsImage in memory, creates a new FsImage on disk, and truncates the old EditLog before accepting new client requests.
Explanation:
During startup, the NameNode loads the FsImage into memory and replays the EditLog to reach the latest state. It then flushes this updated state back to disk as a new FsImage and clears the EditLog (since its transactions are now in the FsImage). Only after this process is complete and block reports are received does it exit SafeMode.
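The startup sequence can be illustrated with a toy model. The sketch below assumes a namespace that is just a dictionary of paths and an EditLog with only "create" and "delete" operations; start_namenode is a hypothetical name, and the real on-disk formats are far richer.

```python
def start_namenode(fsimage: dict, edit_log: list):
    """Replay the edit log onto the loaded image, then checkpoint."""
    # Step 1: load the last checkpoint into memory.
    namespace = dict(fsimage)
    # Step 2: replay every logged transaction in order.
    for op, path in edit_log:
        if op == "create":
            namespace[path] = "file"
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    # Step 3: persist the merged state as a new image and start
    # with an empty edit log, since its transactions are now in the image.
    new_fsimage = dict(namespace)
    new_edit_log = []
    return namespace, new_fsimage, new_edit_log
```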
46When executing a Word Count job via the Hadoop command line using hadoop jar, what is the effect of setting -D mapreduce.job.reduces=0?
word count on command line
Hard
A.The framework automatically uses a Combiner to act as the Reducer, yielding partially aggregated counts per Map task.
B.The job executes normally, but the final output is merged into a single file by the JobTracker instead of Reducers.
C.The Map output is written directly to HDFS without a shuffle, sort, or reduce phase, resulting in output files containing unaggregated key-value pairs.
D.The job fails immediately because a MapReduce job strictly requires at least one Reducer.
Correct Answer: The Map output is written directly to HDFS without a shuffle, sort, or reduce phase, resulting in output files containing unaggregated key-value pairs.
Explanation:
Setting the number of Reducers to 0 creates a Map-only job. In this scenario, the shuffle, sort, and reduce phases are completely bypassed. The Mappers output their raw key-value pairs directly to HDFS. For Word Count, this means outputting (word, 1) for every occurrence without any global aggregation.
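To make the difference visible, here is a minimal in-memory simulation of Word Count with and without a reduce phase. It is not Hadoop code; run_job and map_words are hypothetical names, and the "output" is simply a Python list.

```python
from collections import Counter


def map_words(line: str) -> list:
    """Emit (word, 1) for every word occurrence, like a Word Count Mapper."""
    return [(word, 1) for word in line.split()]


def run_job(lines: list, num_reducers: int) -> list:
    mapped = [pair for line in lines for pair in map_words(line)]
    if num_reducers == 0:
        # Map-only job: no shuffle, sort, or reduce; raw pairs are the output.
        return mapped
    counts = Counter()
    for word, one in mapped:
        counts[word] += one
    # Sorted to mimic the key ordering a reducer sees.
    return sorted(counts.items())
```

Calling run_job(["to be or not to be"], 0) returns six unaggregated pairs, while the same call with one reducer returns one total per distinct word.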
47In a Hadoop cluster configured with Rack Awareness, if a client running on a DataNode requests to write a file with a replication factor of 3, how does the block placement policy distribute the replicas?
Hadoop Architecture
Hard
A.All three replicas are placed on different racks to maximize fault tolerance.
B.Replica 1 on the local node, Replica 2 on a random node in a different rack, Replica 3 on another node in that same different rack.
C.Replica 1 on a random node in a different rack, Replica 2 and 3 on different nodes in the local rack.
D.Replica 1 on the local node, Replica 2 on a random node in the same rack, Replica 3 on a random node in a different rack.
Correct Answer: Replica 1 on the local node, Replica 2 on a random node in a different rack, Replica 3 on another node in that same different rack.
Explanation:
Hadoop's default block placement policy places the first replica on the local node (or a random node if the client is outside the cluster). To balance fault tolerance and write bandwidth, the second replica is placed on a node in a different remote rack, and the third replica is placed on a different node in that same remote rack. This minimizes cross-rack network traffic while ensuring survival if a single rack fails.
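A simplified sketch of this placement rule follows. It assumes the writer runs on a DataNode, that at least one remote rack has two or more nodes, and that the topology is a plain node-to-rack dictionary; place_replicas is a hypothetical name, and the fallbacks of the real policy (full disks, busy nodes, single-rack clusters) are omitted.

```python
import random


def place_replicas(writer: str, topology: dict, rng=random) -> list:
    """Pick three replica targets for replication factor 3."""
    local_rack = topology[writer]
    # Replica 2: any node on a rack other than the writer's rack.
    remote_nodes = sorted(n for n, rack in topology.items() if rack != local_rack)
    second = rng.choice(remote_nodes)
    # Replica 3: a different node on the same rack as replica 2.
    # (rng.choice raises IndexError if that rack has no other node.)
    peers = [n for n in remote_nodes if topology[n] == topology[second] and n != second]
    third = rng.choice(peers)
    # Replica 1 stays on the writer's own node.
    return [writer, second, third]
```

The result always spans exactly two racks, so only one cross-rack transfer is needed during the write pipeline.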
48An HDFS client opens a file for appending (append()). Simultaneously, a network partition isolates the client from the NameNode but not from the DataNodes. How does HDFS handle lease management for this file?
Hadoop Storage: HDFS
Hard
A.The NameNode's lease for the client expires after the hard limit (usually 1 hour). The NameNode initiates lease recovery, closing the file and potentially discarding uncommitted blocks.
B.The DataNodes detect the lack of NameNode heartbeats and automatically revoke the client's write access, saving partial blocks.
C.The client immediately receives an IOException from the DataNodes because DataNodes require continuous token validation from the NameNode during appends.
D.The client continues to write to the DataNodes indefinitely; the NameNode cannot intervene until the network is restored.
Correct Answer: The NameNode's lease for the client expires after the hard limit (usually 1 hour). The NameNode initiates lease recovery, closing the file and potentially discarding uncommitted blocks.
Explanation:
In HDFS, writers hold a lease managed by the NameNode. If the client cannot renew the lease (due to a partition), the soft limit (1 min) and eventually the hard limit (1 hour) will expire. Upon hard limit expiration, the NameNode forcefully revokes the lease, triggers lease recovery on the last block via the DataNodes, and closes the file, guaranteeing the file system's integrity.
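The two limits can be expressed as a small state function. This is only an illustration using the values quoted above (1 minute and 1 hour); lease_state and the constant names are hypothetical, and actual limits may differ by Hadoop version and configuration.

```python
SOFT_LIMIT_S = 60      # soft limit quoted above: 1 minute
HARD_LIMIT_S = 3600    # hard limit quoted above: 1 hour


def lease_state(seconds_since_renewal: int) -> str:
    """Classify a writer's lease by the time since its last renewal."""
    if seconds_since_renewal >= HARD_LIMIT_S:
        # The NameNode revokes the lease and starts lease recovery.
        return "hard-expired"
    if seconds_since_renewal >= SOFT_LIMIT_S:
        # The writer no longer has exclusive access; another client
        # may take over the lease.
        return "soft-expired"
    return "active"
```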
49Consider a MapReduce job where the map output keys are custom objects representing composite keys: [String category, Long timestamp]. You want the Reducer to process data grouped by category, but sorted internally by timestamp. Which components must be explicitly configured to achieve this Secondary Sorting?
Hadoop MapReduce paradigm
Hard
A.A custom Combiner to pre-sort by timestamp and a Partitioner on category.
B.A custom Partitioner on category, a custom GroupingComparator on category, and a custom SortComparator on [category, timestamp].
C.A custom Partitioner on [category, timestamp], and a custom GroupingComparator on timestamp.
D.Only a custom SortComparator on [category, timestamp] is required; Hadoop inherently handles the grouping.
Correct Answer: A custom Partitioner on category, a custom GroupingComparator on category, and a custom SortComparator on [category, timestamp].
Explanation:
For Secondary Sorting, the Partitioner must partition only by the natural key (category) so all related records go to the same Reducer. The SortComparator must sort the composite key by [category, timestamp] during the shuffle phase. Finally, the GroupingComparator must group only by the natural key (category) so the Reducer's reduce() method receives all timestamps for a category in one iterable.
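The interplay of the three components can be sketched in plain Python. This is an in-memory model, not the Hadoop API: partition and secondary_sort are hypothetical names, keys are (category, timestamp) tuples, and a simple byte-sum stands in for the partitioner's hash.

```python
from itertools import groupby


def partition(key: tuple, num_reducers: int) -> int:
    """Partitioner: route by the natural key (category) only."""
    category, _timestamp = key
    return sum(category.encode("utf-8")) % num_reducers


def secondary_sort(pairs: list, num_reducers: int) -> dict:
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    grouped = {}
    for bucket in buckets:
        # SortComparator: order by the full composite key.
        bucket.sort(key=lambda kv: kv[0])
        # GroupingComparator: one reduce() call per category.
        for category, group in groupby(bucket, key=lambda kv: kv[0][0]):
            grouped[category] = [value for _key, value in group]
    return grouped
```

Each category's values arrive already ordered by timestamp, without the reducer having to buffer and sort them itself.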
50What is the primary constraint placed on the Combiner function in the MapReduce paradigm to ensure the correctness of the final output?
MapReduce Terminology
Hard
A.Its input key-value types must match the output key-value types, and the operation it performs must be both commutative and associative.
B.It must guarantee execution exactly once per Map output split before the data is shuffled.
C.It must implement the WritableComparable interface to ensure intermediate data is sortable.
D.It must be an exact programmatic clone of the Mapper class.
Correct Answer: Its input key-value types must match the output key-value types, and the operation it performs must be both commutative and associative.
Explanation:
A Combiner acts as a mini-reducer on the map side. Because Hadoop does not guarantee how many times a Combiner will be called (it could be zero, once, or multiple times), the function must be commutative and associative (like addition or finding a maximum). Furthermore, its input types must match its output types so it can safely be chained or bypassed.
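A short experiment shows why the constraint matters. The sketch below applies a function once per map output and again at the reducer, then compares that with applying it only at the reducer; the helper names are hypothetical and the data is made up for illustration.

```python
def with_combiner(map_outputs: list, fn):
    """Apply fn per map output (combiner), then again over the partials (reducer)."""
    partials = [fn(chunk) for chunk in map_outputs]
    return fn(partials)


def without_combiner(map_outputs: list, fn):
    """Apply fn once over all values (reducer only)."""
    return fn([value for chunk in map_outputs for value in chunk])


def mean(values: list) -> float:
    return sum(values) / len(values)


map_outputs = [[1, 2, 3], [10]]
sum_is_safe = with_combiner(map_outputs, sum) == without_combiner(map_outputs, sum)
max_is_safe = with_combiner(map_outputs, max) == without_combiner(map_outputs, max)
mean_is_safe = with_combiner(map_outputs, mean) == without_combiner(map_outputs, mean)
```

Here sum and max give the same answer either way, but mean does not: the mean of the per-mapper means (6.0) differs from the true mean (4.0). A mean would have to be restructured, for example by carrying (sum, count) pairs, before it could be used with a Combiner.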
51How does HDFS ensure data integrity during a read operation if a client detects a checksum mismatch for a block?
Hadoop Storage: HDFS
Hard
A.The DataNode dynamically reconstructs the block from parity bits stored on the local disk before sending it to the client.
B.The client throws a ChecksumException, terminating the application immediately without retry.
C.The client reports the bad block and the DataNode to the NameNode, then proceeds to read from another replica of the block.
D.The NameNode detects the mismatch via a heartbeat, marks the DataNode as dead, and routes the client to a secondary NameNode.
Correct Answer: The client reports the bad block and the DataNode to the NameNode, then proceeds to read from another replica of the block.
Explanation:
When an HDFS client reads data, it verifies checksums. If it detects a corrupted block, it reports the corruption to the NameNode (so the NameNode can schedule block replication from a healthy replica to fix the corruption) and transparently switches to another DataNode that holds a healthy replica of the block to continue the read operation.
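The client-side behaviour can be modelled roughly as follows. This sketch assumes one checksum per block computed with zlib.crc32, whereas HDFS checksums smaller chunks within a block; read_block is a hypothetical name, and the returned list stands in for the report sent to the NameNode.

```python
import zlib


def read_block(replicas: list, expected_crc: int):
    """Try replicas in order; return (data, datanodes_with_corrupt_copies)."""
    corrupt = []
    for datanode, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data, corrupt
        # Checksum mismatch: remember the bad replica (to be reported)
        # and fall through to the next DataNode.
        corrupt.append(datanode)
    raise IOError("no healthy replica found")
```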
52In a heavily utilized MRv1 cluster, the JobTracker must schedule tasks based on data locality. If a node has a free slot, but no pending Map tasks have data local to that node, what does the delay scheduling strategy, often used by the Fair and Capacity schedulers, do?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Hard
A.The JobTracker immediately assigns a non-local task to utilize the free slot, prioritizing cluster utilization over locality.
B.The JobTracker waits for a short, configurable period of time before assigning a non-local task, hoping a task with local data becomes available.
C.The JobTracker preempts a running task on another node to migrate it to the node with the free slot.
D.The JobTracker assigns a Reduce task instead, since Reduce tasks do not depend on data locality.
Correct Answer: The JobTracker waits for a short, configurable period of time before assigning a non-local task, hoping a task with local data becomes available.
Explanation:
Delay scheduling is an optimization used by schedulers such as the Fair and Capacity schedulers. Instead of immediately assigning a rack-local or off-switch task when node-local data isn't available, the scheduler temporarily skips the job and waits a short time. This greatly increases the probability that a node-local assignment becomes possible shortly, improving overall cluster throughput.
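The decision made at each scheduling opportunity can be sketched like this. It is a simplified model: assign_task is a hypothetical name, the wait is counted in skipped opportunities rather than wall-clock time, and the rack-local middle tier is left out.

```python
def assign_task(free_node: str, pending: list, skipped: int, max_skips: int):
    """pending is a list of (task_id, set_of_nodes_holding_its_data)."""
    if not pending:
        return None, "idle"
    for task_id, preferred_nodes in pending:
        if free_node in preferred_nodes:
            return task_id, "node-local"
    if skipped < max_skips:
        # No local work: pass on this slot and hope a better one appears.
        return None, "wait"
    # Waited long enough: accept a non-local assignment.
    return pending[0][0], "non-local"
```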
53During the Shuffle and Sort phase of MapReduce, what dictates the transition of map output data from memory to disk on the Mapper side?
Hadoop MapReduce paradigm
Hard
A.The Mapper stores all key-value pairs in JVM heap memory until the map task finishes, at which point the entire dataset is flushed to disk simultaneously.
B.The OutputCommitter evaluates the block size limit; once 64 MB of data is accumulated, the framework initiates a blocking write to HDFS.
C.The Mapper writes directly to the disk cache of the operating system; Hadoop relies on the OS to flush data to disk asynchronously.
D.Data is buffered in a circular in-memory buffer; when the buffer reaches a certain threshold (e.g., 80%), a background thread begins spilling the contents to disk while the Mapper continues writing to the remaining space.
Correct Answer: Data is buffered in a circular in-memory buffer; when the buffer reaches a certain threshold (e.g., 80%), a background thread begins spilling the contents to disk while the Mapper continues writing to the remaining space.
Explanation:
Map tasks write their output to a circular memory buffer (default 100 MB). When the buffer fill reaches a threshold (default 80%), a background thread spills the data to local disk. The Mapper continues producing output into the remaining 20% of the buffer. If the buffer fills completely, the map thread blocks until the spill finishes.
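The threshold arithmetic can be checked with a rough model. The sketch below uses the defaults quoted above (100 MB buffer, 80% threshold) and counts only threshold-triggered spills; count_spills is a hypothetical name, and the model ignores record metadata, the concurrent background thread, and the final flush at task end.

```python
MB = 1024 * 1024


def count_spills(record_sizes: list, buffer_bytes: int = 100 * MB, spill_percent: int = 80) -> int:
    """Count how many times the fill level reaches the spill threshold."""
    threshold = buffer_bytes * spill_percent // 100   # 80 MB with the defaults
    used = 0
    spills = 0
    for size in record_sizes:
        used += size
        if used >= threshold:
            spills += 1
            used = 0   # spilled contents are released from the buffer
    return spills
```

For example, twenty 10 MB writes cross the 80 MB threshold twice, leaving 40 MB in the buffer for the final flush.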
54A user executes a Word Count job from the command line using a compressed input file (input.txt.gz). What determines whether Hadoop can split this compressed file into multiple InputSplits?
word count on command line
Hard
A.The file size; if it exceeds the HDFS block size, Hadoop forces a split regardless of the compression algorithm.
B.The InputFormat class used; TextInputFormat automatically decompresses and splits all formats, while SequenceFileInputFormat does not.
C.The compression codec used; algorithms like Gzip do not support splitting, so the entire file must be processed by a single Mapper, whereas bzip2 is splittable.
D.The command-line argument -D mapreduce.input.fileinputformat.split.maxsize; it overrides any compression limitations.
Correct Answer: The compression codec used; algorithms like Gzip do not support splitting, so the entire file must be processed by a single Mapper, whereas bzip2 is splittable.
Explanation:
Not all compression codecs are splittable. Gzip creates a continuous compressed stream without synchronization markers, meaning a Mapper cannot jump to the middle of the file to start reading. Thus, a Gzip file must be processed by one Mapper. Codecs like bzip2 or LZO (with indexing) provide sync points, allowing Hadoop to split them across multiple Mappers.
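The Gzip limitation can be observed directly with Python's standard gzip module. This sketch is only a demonstration of the stream property, not of Hadoop's codec classes; can_decode_from is a hypothetical helper.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Report whether decompression succeeds when started at the given offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


payload = b"hello hadoop\n" * 10000
blob = gzip.compress(payload)
from_start = can_decode_from(blob, 0)
from_middle = can_decode_from(blob, len(blob) // 2)
```

Decoding works from the start of the stream but not from the middle, which is why a second Mapper cannot begin reading at an arbitrary block boundary of a .gz file.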
55In the context of the WritableComparable interface, which is strictly required for MapReduce keys, what is the specific purpose of the compareTo() and readFields() methods, respectively?
MapReduce Terminology
Hard
A.compareTo() handles sorting of keys during the shuffle phase; readFields() deserializes the object state from an incoming DataInput stream.
B.compareTo() ensures uniqueness for the GroupingComparator; readFields() serializes the object state into a byte array.
C.compareTo() evaluates the equality of values; readFields() reads configuration properties from the JobContext.
D.compareTo() dictates which Reducer a key is assigned to; readFields() writes the object to HDFS.
Correct Answer: compareTo() handles sorting of keys during the shuffle phase; readFields() deserializes the object state from an incoming DataInput stream.
Explanation:
The WritableComparable interface combines Writable (for serialization/deserialization) and Comparable (for sorting). compareTo() is used to sort keys during the shuffle and sort phase. readFields() is the method from the Writable interface used to deserialize the data from the binary stream (DataInput) back into the object's fields.
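The division of labour between the methods can be mirrored in a Python analogue. This is not the Java interface itself: CompositeKey, write, read_fields, and compare_to are illustrative stand-ins for write(), readFields(), and compareTo(), and the byte layout is an assumption chosen for the example.

```python
import io
import struct
from functools import cmp_to_key


class CompositeKey:
    def __init__(self, category: str = "", timestamp: int = 0):
        self.category = category
        self.timestamp = timestamp

    def write(self, out) -> None:
        """Serialize the fields to a binary stream."""
        raw = self.category.encode("utf-8")
        out.write(struct.pack(">H", len(raw)))
        out.write(raw)
        out.write(struct.pack(">q", self.timestamp))

    def read_fields(self, inp) -> None:
        """Deserialize: overwrite this object's fields from the stream."""
        (size,) = struct.unpack(">H", inp.read(2))
        self.category = inp.read(size).decode("utf-8")
        (self.timestamp,) = struct.unpack(">q", inp.read(8))

    def compare_to(self, other) -> int:
        """Return a negative, zero, or positive number for sort ordering."""
        mine = (self.category, self.timestamp)
        theirs = (other.category, other.timestamp)
        return (mine > theirs) - (mine < theirs)


def round_trip(key: CompositeKey) -> CompositeKey:
    buffer = io.BytesIO()
    key.write(buffer)
    buffer.seek(0)
    copy = CompositeKey()
    copy.read_fields(buffer)
    return copy


def sort_keys(keys: list) -> list:
    return sorted(keys, key=cmp_to_key(lambda a, b: a.compare_to(b)))
```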
56Which of the following describes the most critical limitation of the MRv1 architecture (JobTracker/TaskTracker) that ultimately necessitated the shift to YARN (Yet Another Resource Negotiator)?
Hadoop Architecture
Hard
A.TaskTrackers were incapable of running Java Virtual Machines (JVMs), requiring all map tasks to execute as native C++ threads.
B.The JobTracker could only process unstructured data, making it incompatible with SQL-like query engines such as Hive or Pig.
C.The JobTracker was tightly coupled with both cluster resource management and job lifecycle scheduling, creating a massive scalability bottleneck around 4,000 nodes.
D.MRv1 required NameNodes to participate in MapReduce shuffle operations, overloading HDFS metadata operations.
Correct Answer: The JobTracker was tightly coupled with both cluster resource management and job lifecycle scheduling, creating a massive scalability bottleneck around 4,000 nodes.
Explanation:
In MRv1, the JobTracker performed dual duties: managing cluster resources and tracking the status/lifecycle of every single task in every job. This tight coupling and heavy workload caused the JobTracker to become a CPU and memory bottleneck, severely limiting the maximum size of a Hadoop cluster (typically maxing out around 4,000 nodes). YARN solved this by splitting these roles into the ResourceManager and ApplicationMaster.
57If the JobTracker JVM fails and undergoes a restart in a classic MRv1 setup, what is the fate of the currently executing jobs and the TaskTrackers?
Hadoop - Namenode, DataNode, Job Tracker and TaskTracker
Hard
A.The JobTracker recovers the exact state of all tasks from the FsImage and seamlessly reconnects to the TaskTrackers.
B.TaskTrackers independently continue running tasks and hold the results in a distributed cache until the JobTracker reconnects.
C.All running jobs fail entirely because the job metadata and task state held in the JobTracker's memory are lost; TaskTrackers reconnect to the new JobTracker as empty nodes.
D.The Secondary JobTracker instantaneously promotes itself, ensuring zero downtime and continuous task execution.
Correct Answer: All running jobs fail entirely because the job metadata and task state held in the JobTracker's memory are lost; TaskTrackers reconnect to the new JobTracker as empty nodes.
Explanation:
In classic MRv1, the JobTracker is a single point of failure (SPOF) without native High Availability. Job states, task progress, and scheduling metadata are kept in the JobTracker's RAM. If it crashes, all running jobs are lost and must be resubmitted. Upon restart, TaskTrackers reconnect but old jobs cannot be resurrected.
58In MapReduce, DistributedCache is used to broadcast side data. If an application utilizes DistributedCache.addCacheArchive(), how does the TaskTracker process this payload before task execution?
Hadoop MapReduce paradigm
Hard
A.It queries the NameNode for the archive contents dynamically via RPC calls every time a task requests a file.
B.It un-archives the file automatically on the local disk of the worker node, and provides the path to the task via symlinks in the task's working directory.
C.It loads the archive strictly into the JVM heap space of each Mapper, making it accessible via standard memory references.
D.It copies the archive to the HDFS block pool on the node, strictly enforcing replication logic before task initialization.
Correct Answer: It un-archives the file automatically on the local disk of the worker node, and provides the path to the task via symlinks in the task's working directory.
Explanation:
When an archive (e.g., zip, tar, tgz) is added to the DistributedCache using addCacheArchive() (or Job.addCacheArchive() in the newer API), the framework automatically copies the archive to the local disk of the TaskTracker node, extracts (un-archives) it, and creates a symlink in the working directory of the task, allowing local file I/O access to the extracted contents.
59Regarding data localization, what distinguishes a Rack-local task from a Node-local task in Hadoop MapReduce?
MapReduce Terminology
Hard
A.Node-local tasks execute within the JVM of the JobTracker; Rack-local tasks execute on the remote TaskTracker nodes.
B.Node-local tasks fetch data via HTTP; Rack-local tasks fetch data via RPC over the top-of-rack switch.
C.Node-local tasks process data residing on the same DataNode as the TaskTracker; Rack-local tasks process data residing on a different DataNode on the same rack (behind the same top-of-rack switch).
D.Node-local tasks are Map tasks; Rack-local tasks are strictly Reduce tasks.
Correct Answer: Node-local tasks process data residing on the same DataNode as the TaskTracker; Rack-local tasks process data residing on a different DataNode on the same rack (behind the same top-of-rack switch).
Explanation:
Data locality involves placing the computation near the data. A 'Node-local' map task runs on the exact same physical machine that holds the HDFS block, avoiding network transit. A 'Rack-local' task runs on a different machine than the data block, but on the same rack, meaning data only traverses the top-of-rack switch, which is slower than node-local but faster than off-rack.
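The three locality levels can be written as a small classifier. This is an illustrative sketch, assuming the topology is a plain node-to-rack dictionary; locality is a hypothetical function name.

```python
def locality(task_node: str, block_nodes: set, rack_of: dict) -> str:
    """Classify a map task by how far it is from a replica of its block."""
    if task_node in block_nodes:
        # Same machine as a replica: no network transfer needed.
        return "node-local"
    block_racks = {rack_of[node] for node in block_nodes}
    if rack_of[task_node] in block_racks:
        # Different machine, same rack: data crosses only the rack switch.
        return "rack-local"
    # No replica on this rack: data crosses the cluster core network.
    return "off-rack"
```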
60Hadoop employs an abstraction called SequenceFile for storing binary key-value pairs. Within the architecture, what is the structural advantage of using SequenceFile.CompressionType.BLOCK over RECORD compression?
Hadoop Architecture
Hard
A.BLOCK compression forces the file to align exactly with HDFS block boundaries (e.g., 128 MB), preventing InputSplits from spanning across nodes.
B.BLOCK compression disables sync markers, relying entirely on the NameNode metadata to locate record boundaries.
C.BLOCK compression stores the key uncompressed and the value compressed, allowing for faster key sorting during the shuffle phase.
D.BLOCK compression compresses multiple records together as a single block, achieving much higher compression ratios than compressing individual records, while maintaining splittability.
Correct Answer: BLOCK compression compresses multiple records together as a single block, achieving much higher compression ratios than compressing individual records, while maintaining splittability.
Explanation:
SequenceFile supports NONE, RECORD, and BLOCK compression. RECORD compresses only the values. BLOCK compression aggregates multiple records (both keys and values) into blocks and compresses the block as a whole. Because it applies compression over a larger data payload, it achieves significantly better compression ratios. It also writes sync markers between blocks, ensuring the file remains splittable for MapReduce.
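The compression-ratio difference can be approximated with zlib. The sketch below is a rough model rather than the SequenceFile format: it ignores headers, length fields, and sync markers, the function names are hypothetical, and the sample records are made up for illustration.

```python
import zlib


def record_compressed_size(records: list) -> int:
    """RECORD-style: each value is compressed on its own; keys stay raw."""
    return sum(len(key) + len(zlib.compress(value)) for key, value in records)


def block_compressed_size(records: list) -> int:
    """BLOCK-style: many keys are compressed together, and so are many values."""
    keys = b"".join(key for key, _value in records)
    values = b"".join(value for _key, value in records)
    return len(zlib.compress(keys)) + len(zlib.compress(values))


records = [
    (b"key%05d" % i, b"status=OK region=us-east latency_ms=%d" % (i % 50))
    for i in range(1000)
]
record_total = record_compressed_size(records)
block_total = block_compressed_size(records)
```

On repetitive data like this, block_total comes out far smaller than record_total, because the compressor can exploit redundancy across records instead of restarting for every value.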