Unit 6 - Practice Quiz

CSE211 60 Questions

1 What type of processor is Nvidia most famous for developing?

Nvidia Case Study Easy
A. Central Processing Units (CPUs)
B. Field-Programmable Gate Arrays (FPGAs)
C. Graphics Processing Units (GPUs)
D. Digital Signal Processors (DSPs)

2 What is the name of Nvidia's proprietary parallel computing platform and programming model that allows software to use its GPUs for general purpose processing?

Nvidia Case Study Easy
A. CUDA
B. Vulkan
C. DirectX
D. OpenCL

3 What is the primary purpose of a supercomputer?

Introduction to Supercomputer Easy
A. Everyday personal computing and web browsing
B. Performing highly intensive computational tasks like climate modeling and scientific simulations
C. Running office applications like word processors
D. Serving web pages for a small business

4 Which unit is commonly used to measure the performance of supercomputers?

Introduction to Supercomputer Easy
A. FLOPS (Floating-Point Operations Per Second)
B. Revolutions Per Minute (RPM)
C. Gigahertz (GHz)
D. Megabytes per second (MB/s)

5 What is the basic unit of information in a quantum computer?

Introduction to Qubits and Quantum Computing Easy
A. Qubit
B. Bit
C. Register
D. Byte

6 A classical bit can be in a state of 0 or 1. Due to the principle of superposition, a qubit can be in a state of:

Introduction to Qubits and Quantum Computing Easy
A. 0, 1, or a combination of both simultaneously
B. Only 1
C. Only 0
D. Only a value between 0 and 1

7 What is the trend of placing multiple processing cores on a single chip called?

Latest Technology and Trends in Computer Architecture Easy
A. Multi-core architecture
B. Hyper-threading
C. Virtualization
D. Single-core processing

8 Which term describes the logical implementation of an Instruction Set Architecture (ISA)?

Microarchitecture Easy
A. Software
B. Firmware
C. Compiler
D. Microarchitecture

9 Which company is well-known for its Snapdragon series of processors, primarily used in Android smartphones?

Latest Processor for Smartphone or Tablet and Desktop Easy
A. Qualcomm
B. AMD
C. Nvidia
D. Intel

10 What is a key characteristic of a System on a Chip (SoC)?

Next Generation Processors Architecture Easy
A. It is exclusively used for supercomputers.
B. It is always the largest chip in a computer.
C. It only contains the CPU.
D. It integrates multiple components like CPU, GPU, and memory controllers onto a single chip.

11 Which microarchitectural technique breaks down an instruction's execution into smaller, sequential steps that can be overlapped to increase throughput?

Microarchitecture Easy
A. Interrupt handling
B. Caching
C. Branch prediction
D. Pipelining

12 The trend of designing processors optimized for specific tasks, such as AI acceleration, is known as:

Latest Technology and Trends in Computer Architecture Easy
A. General-Purpose Computing
B. Centralized Computing
C. Legacy Architecture
D. Domain-Specific Architecture (DSA)

13 Apple's M-series chips (e.g., M1, M2) used in its recent Mac computers and iPads are based on which instruction set architecture?

Latest Processor for Smartphone or Tablet and Desktop Easy
A. MIPS
B. SPARC
C. x86-64
D. ARM

14 Supercomputers achieve their immense speed by having thousands of processors that work on different parts of a problem at the same time. This is an example of:

Introduction to Supercomputer Easy
A. Serial processing
B. Single-threaded performance
C. Parallel processing
D. Standalone computing

15 What is the quantum mechanical phenomenon where two or more qubits become linked and share the same fate, even when separated by large distances?

Introduction to Qubits and Quantum Computing Easy
A. Decoherence
B. Superposition
C. Entanglement
D. Tunneling

16 Besides gaming, Nvidia GPUs are now widely used for what other high-demand application due to their parallel processing power?

Nvidia Case Study Easy
A. Running operating systems
B. Basic word processing
C. Web browsing
D. Artificial Intelligence (AI) and Deep Learning

17 What is the term for a design approach where a large processor is built by combining smaller, specialized integrated circuits called 'chiplets'?

Next Generation Processors Architecture Easy
A. Transistor design
B. Chiplet-based design
C. Integrated circuit
D. Monolithic design

18 What is the primary function of a CPU cache?

Microarchitecture Easy
A. To cool down the processor.
B. To provide long-term storage for files.
C. To store frequently accessed data and instructions close to the CPU for faster access.
D. To connect the processor to the internet.

19 What does RISC stand for in the context of processor architecture?

Latest Technology and Trends in Computer Architecture Easy
A. Really Integrated Silicon Chip
B. Re-engineered Integrated System Chip
C. Reduced Instruction Set Computer
D. Random Instruction Set Computer

20 A major architectural difference between processors in high-end desktops and most smartphones is that desktop CPUs often use the ____ architecture, while smartphone CPUs typically use the ____ architecture.

Latest Processor for Smartphone or Tablet and Desktop Easy
A. x86, ARM
B. ARM, x86
C. SPARC, MIPS
D. MIPS, SPARC

21 In the context of Nvidia's GPU architecture, what is the primary role of a Tensor Core, and how does it differ from a standard CUDA core?

Nvidia Case Study Medium
A. Tensor Cores manage the GPU's memory hierarchy and L2 cache, while CUDA cores are responsible for executing shader programs.
B. Tensor Cores are specialized for high-precision 64-bit floating-point arithmetic, while CUDA cores handle graphics texturing.
C. Tensor Cores execute matrix-multiply-accumulate (MMA) operations on small matrices (e.g., 4x4) at high speed, primarily for AI/ML workloads, whereas CUDA cores are general-purpose integer/FP32 processors.
D. Tensor Cores are responsible for ray tracing acceleration (RT Cores do this), while CUDA cores handle the overall thread scheduling for the entire Streaming Multiprocessor (SM).

22 A modern supercomputer is designed for a large-scale climate simulation that requires frequent, small data exchanges between thousands of nodes. Which interconnect technology characteristic would be most critical for this application's performance?

Introduction to Supercomputer Medium
A. High Power Efficiency
B. Low Latency
C. Support for TCP/IP Offloading
D. High Bandwidth

23 A quantum system is composed of 4 entangled qubits. If one qubit is measured and its state collapses to |0⟩, what is the immediate effect on the other three qubits?

Introduction to Qubits and Quantum Computing Medium
A. The other three qubits form a new entangled system independent of the measured qubit.
B. The state of the other three qubits is instantly constrained or determined, collapsing from their superposition into a specific state correlated with the measured qubit.
C. The other three qubits are unaffected and remain in a superposition of all possible states.
D. The other three qubits also collapse to |0⟩.

24 Why is the chiplet-based design approach becoming a major trend in high-performance processor manufacturing?

Latest Technology and Trends in Computer Architecture Medium
A. It improves manufacturing yield for large, complex processors and allows for mixing and matching different process technologies for different functions (e.g., I/O vs. CPU cores).
B. It allows for a single monolithic die to be clocked at a much higher frequency.
C. It simplifies the microarchitecture by removing the need for an on-chip memory controller.
D. It exclusively uses a new, more efficient silicon substrate material that cannot be used for monolithic designs.

25 What is the primary microarchitectural trade-off when implementing a deep, multi-stage pipeline in a processor?

Microarchitecture Medium
A. Simplified branch prediction logic in exchange for a more complex instruction decode stage.
B. Lower single-thread performance in exchange for better multi-threaded performance.
C. Increased clock frequency and instruction throughput, but at the cost of a higher branch misprediction penalty and increased instruction latency.
D. Decreased instruction throughput in exchange for lower power consumption.

26 Which characteristic of the RISC-V instruction set architecture (ISA) is a key differentiator from proprietary ISAs like x86 and ARM, making it attractive for custom silicon development?

Next Generation Processors Architecture Medium
A. Its open-standard, royalty-free nature and modular design, which allows for custom extensions to the base ISA.
B. Its inherent superiority in floating-point performance due to a unique vector processing design.
C. Its backward compatibility with all legacy x86-64 software without emulation.
D. Its strict requirement for a 10-stage pipeline, which guarantees high clock speeds.

27 Modern high-end mobile SoCs (System-on-Chip) like the Apple M-series or Qualcomm Snapdragon often outperform some desktop CPUs in specific tasks despite having a lower power budget. What architectural feature is a primary contributor to this efficiency?

Latest Processor for Smartphone or Tablet and Desktop Medium
A. A much larger L3 cache compared to desktop processors, which eliminates memory bottlenecks.
B. Exclusive use of VLIW (Very Long Instruction Word) architecture for all processing units.
C. Higher clock speeds achieved through exotic cooling solutions within the mobile device.
D. Tight integration of specialized hardware accelerators (e.g., Neural Processing Unit, Image Signal Processor, GPU) on the same die, which offload tasks from the general-purpose CPU cores.

28 In Nvidia's CUDA programming model, a 'warp' is a fundamental unit of execution. How does the microarchitecture of a Streaming Multiprocessor (SM) handle instruction dispatch for threads within a warp?

Nvidia Case Study Medium
A. Threads within a warp are dynamically grouped to execute different instructions based on data availability.
B. Each thread in a warp executes an independent instruction stream, similar to MIMD.
C. All 32 threads in a warp execute the same instruction at the same time on different data, following a SIMT (Single Instruction, Multiple Thread) model.
D. The SM selects one thread from the warp at random to execute per clock cycle.

29 When analyzing the scalability of a parallel program on a supercomputer with a very large number of processors, Gustafson's Law often provides a more optimistic prediction than Amdahl's Law. What is the fundamental assumption behind Gustafson's Law that accounts for this difference?

Introduction to Supercomputer Medium
A. The serial portion of the program diminishes as more processors are added.
B. The clock speed of each processor increases proportionally to the number of nodes.
C. The total problem size scales up with the number of processors, keeping the parallel execution time constant.
D. The communication latency between nodes becomes zero with enough processors.
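
The contrast between the two laws is easy to see numerically. A minimal Python sketch (the 5% serial fraction and 1024-processor count are illustrative assumptions, not values from the question):

```python
# Sketch of why Gustafson's Law is more optimistic than Amdahl's Law
# at large processor counts.

def amdahl_speedup(serial_fraction, n):
    # Fixed problem size: overall speedup is capped by the serial fraction.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

def gustafson_speedup(serial_fraction, n):
    # Scaled problem size: the parallel portion grows with the machine,
    # keeping parallel execution time constant while total work increases.
    return serial_fraction + (1.0 - serial_fraction) * n

s, n = 0.05, 1024  # assumed: 5% serial code, 1024 processors
print(f"Amdahl:    {amdahl_speedup(s, n):.1f}x")    # capped near 1/s = 20x
print(f"Gustafson: {gustafson_speedup(s, n):.1f}x")  # nearly linear scaling
```

Under Amdahl's fixed-size assumption the speedup saturates near 20x no matter how many processors are added; under Gustafson's scaled-size assumption it remains nearly linear in n.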

30 Quantum decoherence is a major obstacle in building functional quantum computers. From an architectural standpoint, what is the primary cause of decoherence?

Introduction to Qubits and Quantum Computing Medium
A. The inability of current technology to create a perfect superposition of |0⟩ and |1⟩.
B. Unwanted interaction between the quantum system (qubits) and its surrounding environment, which causes the loss of quantum properties like superposition and entanglement.
C. The inherent randomness of quantum measurement, which makes algorithm results unreliable.
D. Errors in the quantum logic gates that cause qubits to flip their states incorrectly.

31 An out-of-order execution engine in a modern CPU uses a Reorder Buffer (ROB) and a Reservation Station. What is the specific role of the Reservation Station in this microarchitecture?

Microarchitecture Medium
A. To act as the primary L1 instruction cache for fetching upcoming instructions.
B. To predict the outcome of branch instructions before they are executed.
C. To store the original program order of instructions to ensure correct program retirement.
D. To hold instructions that have been decoded but are waiting for their operands to become available or for an execution unit to be free.

32 What is the primary motivation behind the trend of developing Domain-Specific Architectures (DSAs), such as Google's TPU for machine learning?

Latest Technology and Trends in Computer Architecture Medium
A. To simplify software development by providing a single instruction set for all computing tasks.
B. To reduce the physical size of processors by removing unnecessary components like the memory controller.
C. To create a new general-purpose processor that can replace all existing CPUs and GPUs.
D. To achieve orders-of-magnitude improvements in performance and power efficiency for a specific target workload by tailoring the hardware to that workload's needs.

33 ARM's big.LITTLE technology is a form of heterogeneous multi-core processing. What is the primary architectural goal of pairing high-performance 'big' cores with high-efficiency 'LITTLE' cores?

Next Generation Processors Architecture Medium
A. To double the number of threads that can be run by having two distinct ISAs on the same chip.
B. To dynamically balance performance and power consumption by migrating tasks between core types based on workload intensity.
C. To execute both 32-bit and 64-bit instructions simultaneously on different core types.
D. To provide hardware-level redundancy in case one set of cores fails.

34 When comparing the architecture of a high-end desktop processor (e.g., Intel Core i9) to a flagship mobile SoC (e.g., Qualcomm Snapdragon 8 Gen-series), a key difference lies in their approach to component integration. Which statement best describes this difference?

Latest Processor for Smartphone or Tablet and Desktop Medium
A. Mobile SoCs are typically monolithic, integrating CPU, GPU, memory controller, NPU, ISP, and modem on a single die, whereas desktop CPUs are often chiplet-based and focus primarily on CPU cores and cache.
B. Desktop CPUs prioritize low-power efficiency cores, while mobile SoCs prioritize a large number of high-performance cores.
C. Mobile SoCs use a system-level cache that is an order of magnitude larger than the L3 cache found in desktop CPUs.
D. Desktop processors integrate more diverse components like NPUs and ISPs directly onto the CPU die, while mobile SoCs keep them as separate chips.

35 A quantum algorithm requires the creation of a uniform superposition of all possible states for a 3-qubit register, initially in the state |000⟩. Which quantum gate should be applied to each qubit to achieve this?

Introduction to Qubits and Quantum Computing Medium
A. A Hadamard gate
B. An X-gate (NOT gate)
C. A CNOT-gate
D. A Toffoli gate
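
The effect of applying a Hadamard gate to each qubit can be verified with a small linear-algebra sketch (NumPy, assuming the standard computational-basis ordering):

```python
import numpy as np

# Single-qubit Hadamard gate.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

# H applied to each of the three qubits: the tensor product H (x) H (x) H.
H3 = np.kron(np.kron(H, H), H)

state = np.zeros(8)
state[0] = 1.0          # register starts in the basis state |000>
state = H3 @ state

# Every one of the 8 basis states now has amplitude 1/sqrt(8).
print(np.round(state, 4))
```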

36 In a processor that supports Simultaneous Multithreading (SMT), how does the microarchitecture enable multiple threads to be active on a single physical core?

Microarchitecture Medium
A. By duplicating all execution units, effectively creating two cores in the space of one.
B. By running one thread on the integer units and another thread exclusively on the floating-point units.
C. By rapidly context-switching between threads on every clock cycle, flushing the pipeline each time.
D. By duplicating the architectural state (e.g., program counter, register file) for each thread and allowing instructions from different threads to share the execution units in the same pipeline.

37 The architectural shift from Nvidia's Ampere to Hopper generation introduced the Transformer Engine. What specific computational challenge in large AI models does this feature address?

Nvidia Case Study Medium
A. It dynamically selects the optimal numerical precision (e.g., FP8, FP16) for different layers of a Transformer model to boost performance and save memory without significant accuracy loss.
B. It accelerates the rendering of 3D graphics by optimizing triangle rasterization.
C. It provides a dedicated hardware block for data compression to reduce the GPU's memory bandwidth requirements.
D. It enables direct peer-to-peer communication between GPUs without involving the CPU, using a new version of NVLink.

38 What is the primary architectural purpose of using a parallel file system (e.g., Lustre, GPFS) in a large-scale supercomputer environment?

Introduction to Supercomputer Medium
A. To reduce the power consumption of the storage system by spinning down idle disks.
B. To enforce strict security policies by isolating each user's data on a separate physical disk.
C. To allow thousands of compute nodes to access and write to a shared storage pool simultaneously at very high aggregate bandwidth, avoiding I/O bottlenecks.
D. To provide data redundancy and automatic backups for all user data.

39 Dataflow architectures represent a fundamental departure from the traditional von Neumann architecture. What is the core principle of a dataflow machine's execution model?

Next Generation Processors Architecture Medium
A. An instruction is ready to execute as soon as its required input data (operands) are available.
B. All operations are performed directly on data held in a large, unified register file, bypassing memory.
C. Instructions are executed sequentially as determined by a program counter.
D. The processor fetches large blocks of data and instructions together from memory to reduce latency.

40 Processing-in-Memory (PIM) or Compute-in-Memory (CIM) is an emerging trend to overcome a major performance bottleneck. What specific bottleneck does this technology aim to mitigate?

Latest Technology and Trends in Computer Architecture Medium
A. The high cost of manufacturing large on-chip caches (L3 cache).
B. The 'Memory Wall' or von Neumann bottleneck, which is the separation of processing and data storage that leads to high latency and energy consumption from data movement.
C. The difficulty of writing correct parallel software for multi-core processors.
D. The performance gap between integer and floating-point execution units.

41 The NVIDIA Hopper architecture's Tensor Cores introduced the Transformer Engine. How does this engine fundamentally improve performance for models like GPT-3 compared to the A100 (Ampere) Tensor Cores, beyond simply offering higher raw FLOPS?

Nvidia Case Study Hard
A. It introduces a hardware-based systolic array scheduler that completely removes the need for CUDA warp-level scheduling for matrix operations.
B. It exclusively uses a novel 4-bit floating point format (FP4) for all matrix multiplications, quadrupling the throughput compared to Ampere's TF32.
C. It integrates the functionality of the NVLink switch directly into the Tensor Core, allowing direct data exchange between Tensor Cores of different GPUs without traversing the SM's memory hierarchy.
D. It dynamically selects between FP8 and FP16 precision for different layers of a transformer model on a per-op basis to maximize throughput while maintaining accuracy.

42 A CUDA kernel is designed to perform a large-scale stencil computation on a 2D grid. The kernel exhibits poor performance, and profiling reveals high global memory latency. The stencil requires each thread to access its own element and the 8 neighboring elements. The grid is too large to fit entirely in shared memory. Which optimization strategy would most effectively mitigate the global memory latency bottleneck in this specific scenario?

Nvidia Case Study Hard
A. Increase the CUDA grid size and decrease the block size to create more warps, hoping to hide latency through increased thread-level parallelism (TLP).
B. Implement tiling by loading a 2D tile of the grid from global memory into shared memory, including a 'halo' or 'ghost cell' region for neighbors, process the tile, and write results back.
C. Replace all global memory accesses with __ldg() intrinsic functions to cache the data in the L1/texture cache, assuming read-only access patterns.
D. Use pinned host memory (page-locked memory) for the grid data and stream the computation to overlap data transfers with kernel execution.
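
The tiling-with-halo idea can be sketched on the CPU in Python (a toy 9-point averaging stencil; the `tile` size and edge padding are illustrative choices, and on a GPU the `local` array would be staged in shared memory):

```python
import numpy as np

def stencil_tiled(grid, tile=4):
    # Halo around the whole grid so edge cells have 8 neighbours.
    padded = np.pad(grid, 1, mode="edge")
    out = np.empty_like(grid)
    n = grid.shape[0]
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            # Stage a tile plus a one-cell halo on every side; every
            # neighbour access below hits this local copy, not 'global' memory.
            local = padded[i:i + tile + 2, j:j + tile + 2]
            for di in range(min(tile, n - i)):
                for dj in range(min(tile, n - j)):
                    out[i + di, j + dj] = local[di:di + 3, dj:dj + 3].mean()
    return out

grid = np.arange(36, dtype=float).reshape(6, 6)
print(stencil_tiled(grid))
```

Each grid element is loaded into the staged tile once but read up to nine times by neighbouring stencil evaluations, which is exactly the reuse that shared-memory tiling exploits.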

43 Consider a supercomputer using a Dragonfly interconnect topology. A large-scale simulation requires frequent all-to-all communication patterns (e.g., an MPI_Alltoall operation). Which characteristic of the Dragonfly topology presents the most significant performance challenge for this specific communication pattern compared to a less scalable but more direct topology like a full crossbar?

Introduction to Supercomputer Hard
A. The potential for network contention on the high-radix global links connecting different groups, requiring adaptive routing to mitigate hotspots.
B. The static routing algorithm mandated by the topology, which cannot adapt to network load.
C. The reliance on optical cables for all links, which have higher latency than electrical links for intra-group communication.
D. The high diameter of the network, leading to excessive hop counts and latency for any communication pattern.

44 A 2-qubit system is in the state . Which of the following statements accurately describes this quantum state?

Introduction to Qubits and Quantum Computing Hard
A. If a Hadamard gate is applied to both qubits, the resulting state is |11⟩.
B. If the first qubit is measured in the computational basis and the result is |1⟩, the second qubit collapses to the state |0⟩ - |11⟩, which is not a valid quantum state.
C. The state is an entangled Bell state.
D. The state is a product state, meaning the qubits are not entangled.

45 Two processors, P1 and P2, share a memory location X managed by a MESI cache coherence protocol. Initially, X is not in either cache. Consider the following sequence of operations:
1. P1 reads X.
2. P2 writes to X.
3. P1 reads X.
How many bus transactions of the type BusRd (Bus Read) and BusRdX (Bus Read Exclusive) are generated on the shared bus?

Microarchitecture Hard
A. Three BusRd transactions and zero BusRdX transactions.
B. One BusRd transaction and one BusRdX transaction.
C. Two BusRd transactions and one BusRdX transaction.
D. One BusRd transaction and two BusRdX transactions.
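
The sequence in the question can be traced with a small state-machine sketch (a simplified two-cache MESI model; a write from an invalid or shared line is modelled as BusRdX, matching the transaction types named in the options):

```python
def run_sequence():
    state = {"P1": "I", "P2": "I"}  # both caches start Invalid
    bus = []

    def read(p, other):
        if state[p] == "I":
            bus.append("BusRd")
            # Load as Shared if the other cache holds the line, else Exclusive.
            state[p] = "S" if state[other] in ("M", "E", "S") else "E"
            if state[other] == "M":
                state[other] = "S"  # modified copy is written back and shared

    def write(p, other):
        if state[p] != "M":
            if state[p] in ("I", "S"):
                bus.append("BusRdX")  # read-for-ownership, invalidates others
            state[p], state[other] = "M", "I"

    read("P1", "P2")   # 1. P1 reads X  -> BusRd,  P1: E
    write("P2", "P1")  # 2. P2 writes X -> BusRdX, P2: M, P1: I
    read("P1", "P2")   # 3. P1 reads X  -> BusRd,  P1: S, P2: S
    return bus

print(run_sequence())  # ['BusRd', 'BusRdX', 'BusRd']
```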

46 The Apple M-series SoCs utilize a Unified Memory Architecture (UMA) where the CPU and GPU share the same physical memory pool. While this reduces data copying and latency, what is a significant microarchitectural challenge or trade-off this design imposes compared to a traditional discrete GPU architecture with its own VRAM?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. The inability to use specialized high-bandwidth memory like GDDR6, forcing the entire system to rely on lower-bandwidth LPDDR5, thus capping peak theoretical memory bandwidth.
B. Increased memory contention and sophisticated quality-of-service (QoS) requirements for the memory controller to arbitrate between CPU's latency-sensitive requests and GPU's bandwidth-hungry requests.
C. Increased power consumption due to the constant need for the CPU to perform cache coherence snooping on memory accesses initiated by the GPU.
D. A fundamental limitation on the maximum amount of addressable memory, as both CPU and GPU must share a single memory address space managed by the CPU's MMU.

47 Compute Express Link (CXL) 2.0 introduces memory pooling. How does the CXL.mem protocol ensure cache coherence for this pooled memory between the host CPU's caches and the CXL device's memory, without requiring the CXL device to be a fully coherent snooping agent?

Next Generation Processors Architecture Hard
A. It uses a directory-based coherence protocol where the CXL device's memory controller acts as the home node and directory, tracking sharers and handling invalidations requested by the host.
B. It enforces a write-through, no-allocate policy for all host accesses to the CXL memory pool, ensuring memory is always up-to-date and bypassing host caches entirely.
C. It requires explicit software-managed cache flushes from the host CPU before the CXL device can access the memory, making coherence a software responsibility.
D. It relies on the host CPU's existing snoopy coherence protocol, treating the CXL link as just another bus participant that must snoop all traffic from all cores.

48 In a chiplet-based processor design, such as AMD's Zen architecture, what is the most significant microarchitectural trade-off when determining the latency and bandwidth of the die-to-die interconnect fabric (e.g., Infinity Fabric)?

Latest Technology and Trends in Computer Architecture Hard
A. The complexity of the routing algorithm within the fabric versus the manufacturing cost associated with using advanced packaging technologies like 2.5D interposers.
B. Balancing the physical distance and signaling power against the NUMA (Non-Uniform Memory Access) factor introduced, which can cause performance variability for threads accessing remote L3 caches or memory controllers.
C. Ensuring the die-to-die clock synchronization is perfectly aligned, which often requires a dedicated global clock chiplet, increasing the bill of materials.
D. Minimizing the silicon area of the interconnect PHYs on each chiplet against the need to support legacy bus protocols like PCIe for backward compatibility.

49 Modern Exascale supercomputers like Frontier are built on heterogeneous architectures. From a system architecture perspective, what is the primary reason this heterogeneity is crucial for approaching the 20 MW power barrier for an ExaFLOP/s system?

Introduction to Supercomputer Hard
A. The use of multiple smaller GPU nodes reduces the total static power leakage compared to a system with a similar number of massive, monolithic CPU cores.
B. GPUs achieve a significantly higher FLOPS/watt ratio for highly parallel computations, allowing the bulk of the floating-point work to be done with greater energy efficiency than on CPUs alone.
C. CPUs in the system can be put into a deep sleep state while the GPUs perform all computations, effectively eliminating the CPU power draw for long periods.
D. The interconnects designed for GPU-centric systems (like NVLink) are an order of magnitude more power-efficient per bit transferred than traditional CPU interconnects.

50 The No-Cloning Theorem is a fundamental principle in quantum mechanics. How does this theorem necessitate a fundamentally different approach to error correction in quantum computers compared to classical error correction techniques like Triple Modular Redundancy (TMR)?

Introduction to Qubits and Quantum Computing Hard
A. It means that a corrupted qubit's state cannot be 're-written' or 'corrected,' forcing quantum algorithms to be redesigned to be inherently fault-tolerant.
B. It restricts error detection to only measuring the parity of qubits, as any other measurement would collapse the quantum state, making correction impossible.
C. It implies that quantum errors are always continuous (e.g., small phase rotations), whereas classical errors are discrete bit-flips, requiring analog correction methods.
D. It prevents the creation of identical copies of an arbitrary quantum state, forcing quantum error correction to use entanglement to distribute the logical information across multiple physical qubits without copying the state itself.

51 In NVIDIA's Ampere architecture, the third-generation Tensor Cores introduced support for Sparsity, which can double throughput. How does this feature work at a microarchitectural level, and what is its primary constraint?

Nvidia Case Study Hard
A. It prunes weights in a fine-grained 2:4 structured pattern (two non-zero weights in every four), allowing hardware to skip operations for the zero-valued weights, but it requires the neural network to be specifically retrained for this structure.
B. It dynamically detects any zero-valued weight during a matrix multiplication and gates the clock for the corresponding MAC unit for one cycle. This works on any sparse matrix without retraining.
C. It uses a form of data compression on the weight matrices, and the Tensor Core has a dedicated decompression unit that feeds the MAC array. The constraint is the high latency of the decompression step.
D. It only works for 8-bit integer (INT8) operations, where a special lookup table maps sparse patterns to dense computations, but it cannot be applied to floating-point calculations like FP16.
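
The 2:4 structured pattern can be illustrated with a small NumPy pruning sketch (the weight values are made up; real deployments re-train the network after imposing the pattern):

```python
import numpy as np

def prune_2_4(weights):
    # In every group of four weights, keep the two with the largest
    # magnitude and zero the other two (the 2:4 structured pattern).
    w = weights.reshape(-1, 4).copy()
    for row in w:
        drop = np.argsort(np.abs(row))[:2]  # indices of the two smallest
        row[drop] = 0.0
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, 0.7, -0.3, 0.2, 0.8, -0.6])
pruned = prune_2_4(w)
print(pruned)  # each group of four retains exactly two non-zeros
```

Because the hardware knows every group of four holds at most two non-zeros, it can store the values densely with small metadata indices and skip the zero multiplications, which is where the 2x throughput comes from.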

52 A modern out-of-order superscalar processor encounters a long-latency cache miss for a load instruction. Which microarchitectural components are most critical for enabling the processor to continue executing and making forward progress on independent instructions that follow the stalled load in program order?

Microarchitecture Hard
A. The Memory Order Buffer (MOB) and the Store-to-Load Forwarding logic.
B. The Reorder Buffer (ROB), Reservation Stations (or an Issue Queue), and a precise exception mechanism.
C. The Arithmetic Logic Units (ALUs), the Floating Point Units (FPUs), and the multiported register file.
D. The Branch Target Buffer (BTB), the Micro-op Cache, and the L1 instruction cache.

53 Processing-in-Memory (PIM) architectures aim to reduce the 'memory wall'. For a PIM system designed to accelerate graph analytics, which often involves pointer-chasing and irregular memory access, what is the most challenging architectural problem to solve efficiently?

Next Generation Processors Architecture Hard
A. Maintaining cache coherence between the main CPU caches and the data being modified by the PIM logic within the memory banks.
B. Designing a low-power logic process that can be economically integrated with a high-density DRAM process on the same die or package.
C. Providing a sufficiently powerful instruction set for the PIM units to handle complex graph traversal logic beyond simple vector operations.
D. Overcoming the limited memory bandwidth available to each individual PIM processing unit, as it's typically confined to a single memory bank.

54 Intel's Performance Hybrid Architecture (e.g., Alder Lake) uses Performance-cores (P-cores) and Efficient-cores (E-cores). Consider a multithreaded video encoding task. In which scenario would the OS scheduler, guided by the Intel Thread Director, make the most effective use of this hybrid architecture?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. Placing all threads of the encoding task exclusively on the E-cores to maximize power efficiency, leaving the P-cores free for any foreground user interaction.
B. Dynamically migrating all active threads between P-cores and E-cores in a round-robin fashion to evenly distribute heat and prevent thermal throttling.
C. Assigning the primary, latency-sensitive encoding thread and GUI thread to the P-cores, while offloading background, parallelizable tasks like motion estimation across all available E-cores.
D. Running all threads on the P-cores initially for a performance burst, and then moving them to the E-cores once the processor's power budget (PL2) is exceeded.

55 What is the fundamental reason a systolic array, the core of a TPU's matrix multiplication unit, is more power-efficient for dense matrix multiplication than a conventional GPU's SIMD architecture?

Latest Technology and Trends in Computer Architecture Hard
A. It operates at a much lower clock frequency than a GPU, relying on massive parallelism to achieve high throughput, and power scales super-linearly with frequency.
B. It maximizes data reuse by pumping data through a grid of processing elements (PEs), drastically reducing data movement from registers or local memory, which is a major source of power consumption.
C. It eliminates the need for complex instruction fetch, decode, and scheduling logic found in a GPU's Streaming Multiprocessor (SM), as the data flow itself dictates the computation.
D. It uses lower-precision arithmetic (e.g., INT8) which inherently consumes less power per operation than the FP32/FP64 units common in GPUs.
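
The data-reuse argument can be made concrete with a toy cycle-by-cycle model of an output-stationary systolic array (pure-Python sketch; each operand is fetched from memory once and then reused as it streams through the PE grid):

```python
def systolic_matmul(A, B):
    # Output-stationary model: PE(i, j) holds C[i][j] and accumulates the
    # operand pair that reaches it at global clock step t = s + i + j,
    # where s indexes the shared dimension. A[i][s] streams across row i
    # and B[s][j] streams down column j, so each value is read from
    # memory once but used by many PEs -- the source of the power savings.
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]
    for t in range(m + n + k - 2):          # global clock steps
        for i in range(m):
            for j in range(n):
                s = t - i - j               # operand pair at PE(i, j) now
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```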

56 In the context of physical qubit implementations, what is the primary reason that superconducting transmons are currently favored for building larger-scale quantum processors (e.g., by Google and IBM) despite having shorter coherence times than trapped ions?

Introduction to Qubits and Quantum Computing Hard
A. Superconducting qubits are based on well-established semiconductor fabrication techniques, which are perceived to be more scalable for manufacturing millions of qubits compared to the complexity of laser/vacuum systems for ion traps.
B. Superconducting qubits exhibit significantly lower measurement error rates because the readout process is based on high-fidelity microwave resonators.
C. The connectivity between transmons can be engineered with greater flexibility, allowing for more complex arrangements of qubits on a chip compared to the typically linear arrangement of ions in a trap.
D. The gate operations on superconducting qubits, based on microwave pulses, are significantly faster (nanoseconds) than the laser-based gates for trapped ions (microseconds), allowing for more operations to be performed within the coherence window.

57 A modern CPU's branch predictor combines a Pattern History Table (PHT) with a Global History Register (GHR) in a 'GAg' configuration. In what specific scenario would this GAg predictor significantly outperform a simple Bimodal predictor that only uses a PHT indexed by the branch address?

Microarchitecture Hard
A. A branch whose direction is a simple function of the loop counter (e.g., if (i % 2 == 0)).
B. A branch that is always taken for the first 1000 iterations of a loop and then not taken for the last 1000 iterations.
C. A branch inside a loop whose direction depends on the outcome of a completely different, preceding branch outside the loop (e.g., if (x > 0) { for(...) { if (y > 10) ... } }).
D. A program with many randomly behaving branches where there is no correlation between different branches or past history.
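The correlation in option C can be demonstrated with a minimal predictor simulation (2-bit saturating counters; the trace, table sizes, and branch labels are made up for illustration):

```python
import random

def simulate(trace, history_bits):
    """2-bit saturating counters. history_bits=0 models a bimodal predictor
    (table indexed by branch address); history_bits>0 models a GAg predictor
    (one shared pattern table indexed by the global history register)."""
    table, ghr, correct = {}, 0, 0
    for pc, taken in trace:
        idx = ghr if history_bits else pc
        ctr = table.get(idx, 1)              # start weakly not-taken
        correct += (ctr >= 2) == taken
        table[idx] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        if history_bits:
            ghr = ((ghr << 1) | taken) & ((1 << history_bits) - 1)
    return correct / len(trace)

random.seed(0)
trace = []
for _ in range(10_000):
    x = random.random() < 0.5
    trace.append(("if_x", x))  # unpredictable branch X
    trace.append(("if_y", x))  # branch Y perfectly correlated with X

print(simulate(trace, 0))  # bimodal: ~0.50, Y looks random without context
print(simulate(trace, 1))  # GAg:     ~0.75, Y is predicted from X via the GHR
```

The bimodal predictor sees Y's outcomes as a 50/50 coin flip; the GAg predictor indexes its table with X's outcome, so the two cases land in different counters and Y becomes near-perfectly predictable.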

58 The TOP500 list ranks supercomputers based on their performance on the High-Performance Linpack (HPL) benchmark, which solves a dense system of linear equations (Ax = b). Why is the HPL benchmark often criticized as being an unrepresentative measure of a supercomputer's capability for a broad range of modern scientific applications?

Introduction to Supercomputer Hard
A. HPL performance is primarily limited by the system's I/O and file system performance, not its computational power, making it a poor benchmark for CPU/GPU capabilities.
B. HPL has a very high computational intensity (ratio of floating-point operations to memory operations) and a regular access pattern, which doesn't stress the memory subsystem or interconnect in the same way as sparse, irregular applications like graph analytics or genomics.
C. The HPL algorithm is not easily parallelizable and does not scale well to the millions of cores found in modern systems, leading to artificially low performance numbers.
D. HPL can only be run using 64-bit floating-point (FP64) precision, whereas many modern AI and scientific workloads achieve sufficient accuracy and much higher performance using lower precisions like FP32 or FP16.
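The intensity gap behind option B can be put in numbers with a back-of-envelope model (the SpMV byte count assumes a typical CSR layout of ~8-byte value plus ~4-byte column index per nonzero):

```python
# Arithmetic intensity = flops per byte of data touched. HPL's intensity grows
# with problem size, so it stresses the FPUs; sparse kernels' intensity is a
# small constant, so they stress the memory subsystem and interconnect.

def hpl_intensity(n):
    flops = (2 / 3) * n ** 3   # LU factorization of an n x n dense system
    bytes_moved = 8 * n ** 2   # the FP64 matrix itself
    return flops / bytes_moved # grows linearly with n

def spmv_intensity(nnz):
    flops = 2 * nnz            # one multiply-add per nonzero
    bytes_moved = 12 * nnz     # ~8 B value + ~4 B column index per nonzero
    return flops / bytes_moved # constant, well below 1 flop/byte

print(hpl_intensity(100_000))    # ~8333 flops/byte: compute-bound
print(spmv_intensity(1_000_000)) # ~0.17 flops/byte: memory-bound
```

A machine tuned to win on the first regime can look far less impressive on the second, which is the core of the criticism (and the motivation for complementary benchmarks such as HPCG).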

59 NVIDIA's CUDA architecture exposes several distinct memory spaces (global, shared, constant, texture). For a kernel performing a 1D convolution, where a small, read-only filter is applied to a large input array, which memory space is most appropriate for storing the filter coefficients to achieve optimal performance, and why?

Nvidia Case Study Hard
A. Constant memory, because it is cached on-chip and optimized for uniform broadcast to all threads in a warp, which is exactly the access pattern for a convolution filter.
B. Global memory accessed via the __ldg() intrinsic, as this will cache the filter in the L1/texture cache system, providing the best performance for any read-only data.
C. Shared memory, because it provides the lowest latency access, and the filter can be pre-loaded into it by the first thread in each block.
D. Pinned (page-locked) host memory, mapped into the GPU's address space to avoid a device-to-device copy of the filter coefficients before the kernel launch.

60 Modern high-end desktop CPUs like AMD's Ryzen 9 with 3D V-Cache technology stack a large L3 cache die directly on top of the core complex die (CCD). What is the primary performance bottleneck that this specific architectural choice is designed to alleviate, particularly for applications like gaming?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. The power consumption associated with the Infinity Fabric interconnect that connects different CCDs and the I/O die.
B. The limited capacity of the L2 cache, as 3D V-Cache allows the L2 cache per core to be significantly larger.
C. The latency of accessing main memory (DRAM), by increasing the L3 cache hit rate so that far fewer requests need to travel off-chip.
D. The bandwidth between the CPU cores and the L3 cache, as the through-silicon vias (TSVs) used in 3D stacking offer a much wider interface.
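The benefit in option C follows from the standard average-memory-access-time formula; the hit rates and latencies below are illustrative placeholders, not AMD specifications:

```python
# AMAT-style model: a larger stacked L3 raises the hit rate, so fewer of the
# cache-unfriendly accesses (common in games) pay the full DRAM round trip.

def avg_latency_ns(l3_hit_rate, l3_hit_ns=10, dram_ns=80):
    # average latency = hit_rate * L3 latency + miss_rate * DRAM latency
    return l3_hit_rate * l3_hit_ns + (1 - l3_hit_rate) * dram_ns

print(avg_latency_ns(0.70))  # hypothetical baseline L3  -> 31.0 ns
print(avg_latency_ns(0.90))  # hypothetical stacked L3   -> 17.0 ns
```

Even a modest hit-rate improvement cuts average latency sharply because the DRAM penalty dominates the miss term.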