1 $What is Apache Hadoop primarily used for?$

Introduction to Hadoop Easy

A.

Distributed storage and processing of large datasets

B.

Designing web pages

C.

Compiling Java applications

D.

Creating relational databases

2 $Which core component of Hadoop is responsible for storing data?$

HDFS Easy

A.

MapReduce

B.

HDFS

C.

ZooKeeper

D.

YARN

3 $Which core component of Hadoop is responsible for processing data?$

MapReduce Easy

A.

HDFS

B.

Flume

C.

MapReduce

D.

Oozie

4 $What does HDFS stand for?$

HDFS Easy

A.

Hadoop Data File System

B.

Hyper Distributed File System

C.

Highly Distributed File System

D.

Hadoop Distributed File System

5 $Who originally created Hadoop?$

Introduction to Hadoop Easy

A.

Doug Cutting and Mike Cafarella

B.

James Gosling

C.

Linus Torvalds

D.

Bill Gates

6 $In Hadoop 2.x and later, which component is responsible for resource management and job scheduling?$

YARN Easy

A.

MapReduce

B.

YARN

C.

Hive

D.

HDFS

7 $What does YARN stand for?$

YARN Easy

A.

Yielding And Resource Node

B.

Yahoo Application Resource Network

C.

Yet Another Relational Network

D.

Yet Another Resource Negotiator

8 $What is the default block size in Hadoop 2.x and Hadoop 3.x HDFS?$

HDFS Easy

A.

128 MB

B.

256 MB

C.

64 MB

D.

512 MB

9 $Which daemon runs on the master node and manages the file system namespace in HDFS?$

Hadoop Architecture Easy

A.

ResourceManager

B.

DataNode

C.

NodeManager

D.

NameNode

10 $Which daemon runs on worker nodes to store the actual data blocks in HDFS?$

Hadoop Architecture Easy

A.

JobTracker

B.

ResourceManager

C.

NameNode

D.

DataNode

11 $In MapReduce, which function is responsible for aggregating the intermediate output of the Map phase into the final result?$

MapReduce Easy

A.

Map function

B.

Combine function

C.

Shuffle function

D.

Reduce function

12 $Which of the following is NOT one of the standard 'V's used to characterize Big Data?$

Big Data Basics Easy

A.

Velocity

B.

Variety

C.

Volatility

D.

Volume

13 $Which programming language is Hadoop primarily written in?$

Introduction to Hadoop Easy

A.

C++

B.

Java

C.

Scala

D.

Python

14 $Which Hadoop ecosystem tool provides a SQL-like interface for querying data stored in HDFS?$

Hadoop Ecosystem Easy

A.

Pig

B.

Sqoop

C.

Flume

D.

Hive

15 $Which Hadoop ecosystem tool is designed to transfer bulk data between Hadoop and structured relational databases?$

Hadoop Ecosystem Easy

A.

ZooKeeper

B.

Flume

C.

Oozie

D.

Sqoop

16 $What kind of data does the NameNode store?$

HDFS Easy

A.

Relational database tables

B.

Metadata about files and blocks

C.

Actual file data blocks

D.

MapReduce task outputs

17 $What happens if a DataNode fails in a Hadoop cluster?$

Hadoop Architecture Easy

A.

The NameNode schedules re-replication of the lost blocks from replicas on other DataNodes

B.

The user must manually restore the data from a backup

C.

The entire cluster crashes

D.

The data is permanently lost

18 $Which component in the MapReduce framework takes the initial input, processes it, and produces intermediate key-value pairs?$

MapReduce Easy

A.

Combiner

B.

Reducer

C.

Partitioner

D.

Mapper

19 $What is the default replication factor in HDFS?$

HDFS Easy

A.

5

B.

2

C.

3

D.

1

20 $Which Hadoop ecosystem component provides a centralized service for maintaining configuration information, naming, and distributed synchronization?$

Hadoop Ecosystem Easy

A.

Ambari

B.

Oozie

C.

Mahout

D.

ZooKeeper

21 $A user needs to store a file in HDFS. If the default block size is and the replication factor is $3$, what is the total storage space consumed in the cluster for this file?$

HDFS Architecture Medium

A.

B.

C.

D.

22 $In a Hadoop cluster, a client wants to read a file from HDFS. Which of the following describes the correct sequence of interactions?$

HDFS Architecture Medium

A.

The client broadcasts a request to all DataNodes to find which ones hold the required blocks.

B.

The client contacts the NameNode to read the actual data blocks directly.

C.

The client contacts the NameNode to get block locations, then reads the data directly from the DataNodes.

D.

The client contacts a DataNode, which retrieves metadata from the NameNode and streams data to the client.

23 $In a MapReduce job designed to count word frequencies, network bandwidth is becoming a bottleneck during the shuffle phase. Which component can be implemented to optimize this by performing local aggregation on the Map node before data is transferred?$

MapReduce Framework Medium

A.

Secondary Mapper

B.

Partitioner

C.

Reducer

D.

Combiner

24 $Which of the following scenarios is LEAST suitable for a Hadoop-based solution?$

Hadoop vs RDBMS Medium

A.

Processing petabytes of historical web server logs.

B.

Managing high-frequency, low-latency transactional updates for an e-commerce checkout system.

C.

Archiving large volumes of sensor data for predictive maintenance.

D.

Performing complex analytical queries on unstructured text data.

25 $In YARN, when a client submits a MapReduce application, which component is primarily responsible for negotiating resources from the ResourceManager and tracking the application's progress?$

YARN Architecture Medium

A.

ApplicationMaster

B.

JobTracker

C.

Container

D.

NodeManager

26 $A company analyzes social media feeds, server logs, relational database tables, and customer service call audio recordings to determine brand sentiment. Which of the '5 Vs' of Big Data is most prominently highlighted in this scenario?$

Big Data Characteristics Medium

A.

Variety

B.

Velocity

C.

Volume

D.

Veracity

27 $What is the primary function of the Secondary NameNode in a Hadoop 2.x cluster?$

HDFS Architecture Medium

A.

It acts as a backup storage location for the actual HDFS data blocks.

B.

It provides automatic failover and takes over immediately if the primary NameNode crashes.

C.

It periodically merges the EditLog with the FsImage to prevent the EditLog from becoming too large.

D.

It manages the DataNodes when the primary NameNode is overloaded with requests.

28 $What dictates the number of Mapper tasks spawned when a MapReduce job is executed on a dataset?$

MapReduce Framework Medium

A.

The number of Input Splits generated from the input files.

B.

The number of blocks configured for the Reducer phase.

C.

The configuration set by the user in mapreduce.job.maps only.

D.

The number of DataNodes in the cluster.

29 $An organization wants to stream high volumes of log data generated by multiple web servers directly into HDFS in near real-time. Which Hadoop ecosystem tool is specifically designed for this task?$

Hadoop Ecosystem Medium

A.

Apache Sqoop

B.

Apache Pig

C.

Apache Flume

D.

Apache Hive

30 $How does HDFS ensure fault tolerance and data reliability in the event of a DataNode hardware failure?$

HDFS Architecture Medium

A.

By writing data directly to the NameNode's local disk as a backup.

B.

By utilizing RAID 5 configurations on every DataNode.

C.

By relying on the Secondary NameNode to recover lost blocks.

D.

By replicating data blocks across multiple independent DataNodes.

31 $In YARN, which component is a per-node agent responsible for monitoring local resource usage (CPU, memory) and reporting it back to the ResourceManager?$

YARN Architecture Medium

A.

TaskTracker

B.

NodeManager

C.

ApplicationMaster

D.

JobTracker

32 $During a MapReduce job execution, exactly when does the 'Shuffle and Sort' phase occur?$

MapReduce Framework Medium

A.

Before the Map phase begins, to prepare data for the Mappers.

B.

Concurrently with the Map phase, reading directly from HDFS blocks.

C.

After the Reduce phase, to sort the final output before writing to HDFS.

D.

After the Map phase finishes and before the Reduce phase begins.

33 $If a Hadoop cluster exhibits 'Rack Awareness', how does the NameNode place the replicas of a block when the replication factor is 3?$

HDFS Architecture Medium

A.

All three replicas are placed on the same rack to maximize read speeds.

B.

One replica is on the local rack, and the other two are placed on a single remote rack.

C.

The placement is completely random across all available racks in the cluster.

D.

Each of the three replicas is placed on a completely different rack.

34 $A data analyst familiar with SQL needs to query massive datasets stored in HDFS but does not know Java or MapReduce. Which Hadoop component is best suited to translate SQL-like queries into MapReduce jobs?$

Hadoop Ecosystem Medium

A.

Apache Oozie

B.

Apache HBase

C.

Apache Hive

D.

Apache Pig

35 $What is the primary function of the Partitioner in a MapReduce job?$

MapReduce Framework Medium

A.

To partition the final output of the Reducer into smaller HDFS files.

B.

To determine which Reducer instance will receive a specific key and its associated values.

C.

To split the input data into manageable blocks for the Mappers.

D.

To combine intermediate keys locally on the Mapper to save network bandwidth.
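
The routing of keys to reducers can be sketched as a hash taken modulo the number of reducers. This is a minimal illustration, not Hadoop's implementation: the function name `hash_partition` is hypothetical, and `zlib.crc32` is used only as a deterministic stand-in for a key's hash code.

```python
import zlib

def hash_partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers).

    zlib.crc32 is a deterministic stand-in for the key's hash code;
    it is an assumption of this sketch, not what Hadoop itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key lands on the same reducer index.
for word in ["apple", "banana", "apple", "cherry"]:
    print(word, hash_partition(word, 4))
```

Because the index depends only on the key, all values for one key meet at a single reducer, which is also why a skewed key distribution can overload one reducer.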

36 $How does the NameNode detect that a DataNode has failed or is unreachable?$

HDFS Architecture Medium

A.

The ResourceManager alerts the NameNode of a Container failure.

B.

The NameNode actively pings every DataNode every 3 seconds.

C.

The DataNode sends an error alert to the NameNode right before failing.

D.

The DataNode stops sending periodic Heartbeat signals to the NameNode.

37 $An enterprise wants to perform bulk data transfers between their legacy Oracle Database (an RDBMS) and Hadoop HDFS. Which tool is specifically designed for this structured data transfer?$

Hadoop Ecosystem Medium

A.

Apache Kafka

B.

Apache Sqoop

C.

Apache Flume

D.

Apache ZooKeeper

38 $Data collected from IoT sensors occasionally contains null values, missing timestamps, and noisy signals due to hardware glitches. Managing this issue primarily addresses which of the '5 Vs' of Big Data?$

Big Data Characteristics Medium

A.

Volume

B.

Velocity

C.

Veracity

D.

Value

39 $When a client writes data to HDFS, how is the replication of blocks handled across the DataNodes?$

HDFS Architecture Medium

A.

The client writes to the first DataNode, which pipelines the data to the second, which pipelines it to the third.

B.

The client writes to all three DataNodes simultaneously in parallel.

C.

The client writes to the first DataNode, and a background MapReduce job replicates the data later.

D.

The NameNode receives the data from the client and broadcasts it to the DataNodes.

40 $In the MapReduce framework, what format does the Mapper output before it is passed to the framework for shuffling?$

MapReduce Framework Medium

A.

Intermediate key-value pairs.

B.

Raw text strings representing lines of data.

C.

Final aggregated key-value pairs.

D.

Serialized Java objects representing entire database rows.

41 $During an HDFS write operation with a replication factor of 3, a client is writing a block but the second DataNode in the pipeline suddenly crashes. What is the immediate, automatic sequence of events that the HDFS client and NameNode perform to handle this failure?$

HDFS Read/Write Mechanisms Hard

A.

The client abandons the write, deletes the partial block on all nodes, requests a completely new block allocation from the NameNode, and restarts the write.

B.

The NameNode pauses the client, spawns a new DataNode to replace the failed one in the pipeline, synchronizes the data, and resumes the write.

C.

The pipeline is closed, the good DataNodes are synchronized to a new generation stamp, the failed node is removed from the pipeline, and writing resumes to the remaining two nodes.

D.

The client buffers the data locally until the NameNode verifies the DataNode is dead via missed heartbeats, then routes the buffered data directly to the third DataNode.

42 $A MapReduce job processes financial transactions and uses Speculative Execution to mitigate straggler nodes. The Map tasks interact with an external REST API to update an external database during processing. Which of the following is the most significant risk in this architecture?$

MapReduce Execution Framework Hard

A.

The external REST API may become a bottleneck, causing the NameNode to dynamically kill all speculative tasks.

B.

Because Map tasks are not idempotent, speculative execution will lead to duplicate API calls and corrupted external state.

C.

Speculative execution only applies to Reduce tasks, so the Map tasks will not be duplicated.

D.

Speculative tasks do not share the same Distributed Cache, leading to inconsistent API endpoints.

43 $A file is 135 MB in size. It is uploaded to an HDFS cluster configured with a 128 MB logical block size and a replication factor of 3. Assuming standard HDFS behavior without Erasure Coding, what is the actual physical disk space consumed across the DataNodes?$

HDFS Block Allocation Hard

A.

768 MB

B.

384 MB

C.

405 MB

D.

256 MB
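
The arithmetic behind this kind of block-allocation question can be sketched as follows. This is a minimal illustration, assuming no Erasure Coding and that a partially filled final block occupies only as much disk as the data it holds. The function name `hdfs_physical_usage_mb` and the sample file size are hypothetical and deliberately differ from the question.

```python
import math

def hdfs_physical_usage_mb(file_mb, block_mb=128, replication=3):
    """Return (block_count, physical_mb) for one file under plain replication.

    Assumption: a partially filled last block uses only its actual data size
    on disk, so physical usage is the file size times the replication factor.
    """
    block_count = math.ceil(file_mb / block_mb)
    physical_mb = file_mb * replication
    return block_count, physical_mb

# Hypothetical 200 MB file: two logical blocks (128 MB + 72 MB), three copies of each.
blocks, used_mb = hdfs_physical_usage_mb(200)
print(blocks, used_mb)
```

The block size determines how many blocks the NameNode tracks; it does not pad the last block up to a full 128 MB on disk.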

44 $In a High Availability (HA) Hadoop cluster utilizing a Quorum Journal Manager (QJM), a 'split-brain' scenario must be prevented. If the Active NameNode enters a garbage collection pause and the Standby NameNode successfully transitions to Active, what fencing mechanism ensures the original Active NameNode does not corrupt the filesystem state when it resumes?$

NameNode High Availability Hard

A.

The ZooKeeper Failover Controller (ZKFC) sends a SIGKILL to all DataNodes holding blocks belonging to the old NameNode.

B.

The DataNodes will immediately format their block pools upon receiving heartbeats from two Active NameNodes.

C.

The original NameNode validates its state with the Secondary NameNode before committing any EditLogs.

D.

The JournalNodes will reject writes from the original NameNode because the new Active NameNode has incremented the epoch number.

45 $In a YARN cluster, a long-running ApplicationMaster (AM) unexpectedly crashes due to an OutOfMemoryError. The application has already completed 80% of its tasks. What is YARN's default recovery behavior for this specific application?$

YARN Architecture and Resource Management Hard

A.

The NodeManager running the AM promotes one of the active container processes to act as the new AM, maintaining uninterrupted task execution.

B.

The application is immediately marked as FAILED in the ResourceManager, and the user must manually resubmit the job with higher memory limits.

C.

The ResourceManager kills all running containers, restarts the AM, and the entire job must run from the beginning.

D.

The ResourceManager instantiates a new AM. Depending on the application framework's implementation (like MapReduce), the new AM can recover the state of already completed tasks and only re-run the pending tasks.

46 $A MapReduce job processes a large text file in HDFS. A logical record spans the boundary between HDFS Block A and HDFS Block B. How does the standard TextInputFormat handle the mapper assigned to Block A?$

MapReduce Execution Framework Hard

A.

The mapper for Block A reads past its own block boundary into Block B to complete the record, while the mapper for Block B skips the first partial record in its block.

B.

The mapper for Block A skips the partial record at the end of its block, leaving the mapper for Block B to fetch the beginning of the record via a remote read.

C.

The mapper for Block A processes up to the boundary, and the mapper for Block B resumes from the exact byte offset, requiring complex cross-node state management.

D.

The RecordReader throws an exception, as HDFS requires all records to be perfectly aligned within the block boundaries during file ingestion.

47 $During a MapReduce job, a custom partitioner is implemented to route keys to 10 reducers. Due to data skew, Reducer 0 receives 95% of the data, while Reducers 1-9 receive the remaining 5%. Which phase of the MapReduce pipeline will be most significantly bottlenecked, and why?$

MapReduce Execution Framework Hard

A.

The Output phase, because the OutputFormat enforces balanced file sizes across all part-r-0000X files.

B.

The Partitioning phase, because the partitioner must recalculate hashes dynamically to redistribute the load.

C.

The Shuffle and Sort phase, because Reducer 0 must pull and merge a massive amount of data over the network, leading to potential OOM errors and disk I/O bottlenecks.

D.

The Map phase, because mappers must wait for Reducer 0 to acknowledge receipt of the data before they can process new splits.

48 $A developer writes a Combiner for a MapReduce job calculating the mathematical average (mean) of a dataset. The Combiner uses the exact same logic as the Reducer: sum(values) / count(values). Why is this implementation fundamentally flawed?$

MapReduce Execution Framework Hard

A.

The Reducer expects a raw list of strings, but the Combiner outputs serialized floating-point numbers.

B.

The mathematical mean is not an associative and commutative operation, so applying it partially in the Combiner will yield mathematically incorrect final results.

C.

Combiners are only executed if data is spilled to disk; therefore, the average will be miscalculated in memory.

D.

A Combiner cannot output the same key-value types as the Mapper.
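
The pitfall in this scenario can be reproduced outside Hadoop with a small simulation. This is a sketch under stated assumptions: the function names and the two sample splits are hypothetical, and the partial `(sum, count)` pair is one common workaround, not necessarily the only one.

```python
def naive_combiner(values):
    # Flawed: emits a local average and discards how many values produced it.
    return sum(values) / len(values)

def sum_count_combiner(values):
    # Safer: emits a partial (sum, count) pair that can be merged in any grouping.
    return (sum(values), len(values))

def reduce_mean(partials):
    # Reducer: merge the partial (sum, count) pairs, then divide once at the end.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Two hypothetical map-side splits of unequal size.
split_a = [10, 20, 30, 40]
split_b = [100]

true_mean = sum(split_a + split_b) / len(split_a + split_b)
mean_of_means = (naive_combiner(split_a) + naive_combiner(split_b)) / 2
merged_mean = reduce_mean([sum_count_combiner(split_a), sum_count_combiner(split_b)])

print(true_mean, mean_of_means, merged_mean)
```

Averaging the local averages weights each split equally regardless of its size, while merging sums and counts reproduces the mean of the full dataset.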

49 $The Secondary NameNode in Hadoop 2.x is often misunderstood. Which of the following accurately describes its memory requirements and primary architectural function?$

HDFS Architecture and Fault Tolerance Hard

A.

It requires twice the memory of the primary NameNode because it simultaneously holds both the old FsImage and the newly merged FsImage in RAM.

B.

It requires very little memory because it only streams the EditLog directly to the Standby NameNode for High Availability failover.

C.

It requires the same amount of memory as the primary NameNode because it must load the FsImage into RAM to merge it with the EditLog, preventing the primary NameNode's EditLog from growing indefinitely.

D.

It acts as a caching layer for DataNode block reports, requiring memory proportional to the cluster's data velocity rather than its metadata size.

50 $Hadoop's default rack awareness policy determines replica placement to maximize data availability and cluster throughput. For a block with a replication factor of 3, how does HDFS place the replicas?$

HDFS Fault Tolerance Hard

A.

Replica 1 on the local node, Replica 2 on a node in a different rack, Replica 3 on a node in a third distinct rack.

B.

Replica 1 on the local node, Replica 2 on a node in a different rack, Replica 3 on a different node in that same different rack.

C.

All three replicas are placed on different nodes within the same rack to maximize write pipeline speed.

D.

Replica 1 on the local node, Replica 2 on a different node in the same rack, Replica 3 on a node in a different rack.

51 $A Hadoop cluster uses Quorum Journal Manager (QJM) for NameNode High Availability. If the design requirement is to tolerate up to f JournalNode failures, what is the minimum number of JournalNodes (N) required in the cluster, and what is the mathematical formula governing this?$

NameNode High Availability Hard

A.

, because QJM uses a simple majority voting system.

B.

, to account for potential split-brain scenarios and Byzantine failures.

C.

, because only one active node needs to access the journal at a time.

D.

, because the Active NameNode must write to a strict majority of nodes to successfully commit an edit.

52 $During the Shuffle and Sort phase of a MapReduce job, a mapper outputs data to a circular memory buffer (default 100MB). What happens when the buffer reaches its threshold (default 80%)?$

MapReduce Execution Framework Hard

A.

A background thread begins to spill the contents to disk, partitioning and sorting the data, while the mapper continues writing to the remaining 20% of the buffer.

B.

The data is immediately flushed to HDFS to ensure fault tolerance before the reducer reads it.

C.

The memory buffer expands dynamically by requesting more heap space from the JVM to prevent costly disk I/O.

D.

The mapper pauses execution until the reducer pulls the 80MB of data over the network.

53 $HDFS Short-Circuit Local Reads are enabled to improve performance for applications like HBase. How does this mechanism bypass standard DataNode data transfer?$

HDFS Read/Write Mechanisms Hard

A.

The DataNode copies the block into a shared YARN memory container that the client can access without disk I/O.

B.

The client connects via RPC to the NameNode, which streams the block directly to the client's memory.

C.

The client intercepts the DataNode's heartbeat and hijacks the TCP payload containing the requested block.

D.

The DataNode passes a UNIX domain socket file descriptor directly to the client, allowing the client to read the local file system bypassing the DataNode's JVM.

54 $In a multitenant YARN cluster using the Fair Scheduler, Queue A is heavily backlogged and Queue B is empty. A new application is submitted to Queue B but all cluster resources are currently occupied by Queue A. How does YARN guarantee Queue B gets its fair share?$

YARN Architecture and Resource Management Hard

A.

YARN queues the Queue B application until Queue A naturally completes its current containers.

B.

The ResourceManager instructs the ApplicationMaster of Queue A to gracefully shrink its heap size to accommodate Queue B.

C.

The Fair Scheduler triggers an HDFS rebalance to free up local disk space, allowing Queue B containers to spawn.

D.

The Fair Scheduler preempts resources by identifying containers in Queue A, sending them a warning, and forcefully killing them if they do not terminate within a timeout.

55 $The 'Small Files Problem' in HDFS severely degrades cluster performance. If a cluster stores 10 million 1KB files instead of a single 10GB file, what is the exact architectural bottleneck that occurs?$

HDFS Architecture and Fault Tolerance Hard

A.

The network fabric becomes saturated because small files bypass Rack Awareness policies.

B.

The NameNode's JVM Heap is exhausted because every file, block, and directory occupies roughly 150 bytes of RAM, regardless of the file's physical size.

C.

MapReduce cannot process small files because InputSplits require files to be exactly the size of an HDFS block.

D.

DataNodes become overwhelmed by the sheer number of TCP socket connections required to heartbeat the blocks.
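
A back-of-the-envelope estimate makes the contrast in this scenario concrete. The sketch below assumes the roughly 150 bytes per namespace object quoted in the question, counts one object per file plus one per block, ignores directories, and uses a 128 MB block size. The function name `namenode_heap_estimate` is hypothetical.

```python
import math

BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the question, not a measured value

def namenode_heap_estimate(file_count, file_size_bytes, block_size_bytes=128 * 1024**2):
    """Estimate NameNode heap in bytes for equally sized files.

    Assumption: each file costs one file object plus one object per block;
    directory objects are ignored for simplicity.
    """
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size_bytes))
    objects = file_count * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

small_files = namenode_heap_estimate(10_000_000, 1024)   # ten million 1 KB files
single_file = namenode_heap_estimate(1, 10 * 1024**3)    # one 10 GB file

print(small_files, single_file)
```

Under these assumptions the same 10 GB of data costs about 3 GB of NameNode heap as small files but only about 12 KB as a single file.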

56 $A developer needs to implement Secondary Sorting in MapReduce to sort values associated with a key before they arrive at the Reducer. Which combination of custom components is strictly required to implement this pattern?$

MapReduce Execution Framework Hard

A.

A Custom RecordReader and a HashMap inside the Mapper's setup() method.

B.

Custom WritableComparator for grouping, Custom WritableComparator for sorting, Custom Partitioner, and a Composite Key.

C.

DistributedCache, SequenceFileOutputFormat, and a Custom Partitioner.

D.

Custom Combiner, Custom InputFormat, and an Identity Reducer.

57 $A DataNode discovers that a block on its local disk has a checksum mismatch due to silent data corruption. How and when is this corruption addressed by HDFS?$

HDFS Architecture and Fault Tolerance Hard

A.

The client reading the block detects the error, patches it dynamically, and overwrites the corrupt block directly on the DataNode.

B.

The DataNode fixes the block locally using parity bits stored in the filesystem journal.

C.

The NameNode detects the corruption during the Secondary NameNode checkpoint process and halts cluster writes until the administrator manually runs fsck.

D.

The DataNode informs the NameNode during its next block report; the NameNode marks the block as corrupt and schedules a replication from a healthy replica to another DataNode.

58 $An application uses the Hadoop Distributed Cache to distribute a 500MB lookup table. By default, how does YARN manage the lifecycle of this localized file on a NodeManager?$

MapReduce Execution Framework Hard

A.

It splits the 500MB file into 128MB blocks and assigns a dedicated Mapper to serve as a distributed lookup service.

B.

It copies the file to the NodeManager's local disk, makes it accessible via symlink to the container's working directory, and deletes it once all containers for that job on the node finish.

C.

It injects the file into the HDFS block pool of the node, bypassing local OS caching.

D.

It permanently pins the file into the NodeManager's RAM, requiring a cluster restart to clear.

59 $In a YARN cluster, a NodeManager has 32GB of physical RAM and yarn.nodemanager.vmem-pmem-ratio is set to 2.1. A container is allocated 4GB of memory. What happens if the container's processes allocate 5GB of physical memory and 9GB of virtual memory?$

YARN Architecture and Resource Management Hard

A.

The container is allowed to run because 9GB is less than the virtual memory limit (+ tolerance).

B.

The ResourceManager instructs the ApplicationMaster to negotiate an additional 1GB of physical memory.

C.

The container begins to swap heavily to local disk, causing a task timeout.

D.

The NodeManager kills the container because the 5GB physical memory usage exceeds the 4GB allocated limit.
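
The two limits involved in this scenario can be computed with a short sketch. It assumes both the physical and the virtual memory checks are enabled on the NodeManager (either can be turned off in practice), and it reports every limit that is exceeded without modelling the order in which YARN evaluates them. The function name `exceeded_limits` and the sample numbers are hypothetical.

```python
def exceeded_limits(allocated_mb, pmem_used_mb, vmem_used_mb, vmem_pmem_ratio=2.1):
    """Return the list of memory limits a container has exceeded.

    Assumption: the physical limit equals the container allocation and the
    virtual limit equals the allocation times the vmem-pmem ratio.
    """
    violations = []
    if pmem_used_mb > allocated_mb:
        violations.append("physical")
    if vmem_used_mb > allocated_mb * vmem_pmem_ratio:
        violations.append("virtual")
    return violations

# Hypothetical 2048 MB container: virtual limit is 2048 * 2.1 = 4300.8 MB.
print(exceeded_limits(2048, 1500, 4000))  # within both limits
print(exceeded_limits(2048, 2500, 4000))  # over the physical limit only
print(exceeded_limits(2048, 1500, 5000))  # over the virtual limit only
```

A container that appears in either list is a candidate for being killed by the NodeManager under the stated assumptions.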

60 $Hadoop 3 introduced Erasure Coding (EC) to reduce storage overhead compared to traditional 3x replication. Using an EC policy of RS-6-3 (Reed-Solomon 6 data blocks, 3 parity blocks), what is the storage overhead percentage, and what is the fault tolerance?$

HDFS Block Allocation Hard

A.

Overhead is 50%; it can tolerate the loss of up to 3 DataNodes.

B.

Overhead is 30%; it can tolerate the loss of up to 2 DataNodes.

C.

Overhead is 150%; it can tolerate the loss of up to 3 DataNodes.

D.

Overhead is 200%; it can tolerate the loss of up to 6 DataNodes.
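
The overhead and fault-tolerance figures for a Reed-Solomon layout follow directly from the data and parity counts. The sketch below is a simplified model: it treats overhead as parity units divided by data units and tolerance as the number of parity units per block group. The function names are hypothetical, and the RS-10-4 layout is used as the sample so the numbers differ from the question.

```python
def erasure_coding_profile(data_units, parity_units):
    """Return (overhead_percent, tolerated_losses) for a Reed-Solomon style layout."""
    overhead_percent = 100 * parity_units / data_units
    return overhead_percent, parity_units

def replication_profile(replication_factor):
    """Return (overhead_percent, tolerated_losses) for plain n-way replication."""
    # n-way replication stores n - 1 extra copies and survives n - 1 losses.
    extra_copies = replication_factor - 1
    return 100 * extra_copies, extra_copies

print(erasure_coding_profile(10, 4))  # sample RS-10-4 layout
print(replication_profile(3))         # traditional 3x replication
```

Comparing the two tuples shows the trade-off the question is about: erasure coding can tolerate more losses than 3x replication while storing far less redundant data.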

Unit 1 - Practice Quiz