Unit 6 - Practice Quiz

CSE211 60 Questions

1 What type of processor is Nvidia most famous for developing?

Nvidia Case Study Easy
A. Central Processing Units (CPUs)
B. Graphics Processing Units (GPUs)
C. Digital Signal Processors (DSPs)
D. Field-Programmable Gate Arrays (FPGAs)

2 What is the name of Nvidia's proprietary parallel computing platform and programming model that allows software to use its GPUs for general purpose processing?

Nvidia Case Study Easy
A. DirectX
B. OpenCL
C. CUDA
D. Vulkan

3 What is the primary purpose of a supercomputer?

Introduction to Supercomputer Easy
A. Running office applications like word processors
B. Serving web pages for a small business
C. Performing highly intensive computational tasks like climate modeling and scientific simulations
D. Everyday personal computing and web browsing

4 Which unit is commonly used to measure the performance of supercomputers?

Introduction to Supercomputer Easy
A. Revolutions Per Minute (RPM)
B. FLOPS (Floating-Point Operations Per Second)
C. Gigahertz (GHz)
D. Megabytes per second (MB/s)

5 What is the basic unit of information in a quantum computer?

Introduction to Qubits and Quantum Computing Easy
A. Qubit
B. Bit
C. Register
D. Byte

6 A classical bit can be in a state of 0 or 1. Due to the principle of superposition, a qubit can be in a state of:

Introduction to Qubits and Quantum Computing Easy
A. Only a value between 0 and 1
B. Only 1
C. 0, 1, or a combination of both simultaneously
D. Only 0

7 What is the trend of placing multiple processing cores on a single chip called?

Latest Technology and Trends in Computer Architecture Easy
A. Virtualization
B. Multi-core architecture
C. Hyper-threading
D. Single-core processing

8 Which term describes the logical implementation of an Instruction Set Architecture (ISA)?

Microarchitecture Easy
A. Software
B. Compiler
C. Firmware
D. Microarchitecture

9 Which company is well-known for its Snapdragon series of processors, primarily used in Android smartphones?

Latest Processor for Smartphone or Tablet and Desktop Easy
A. Nvidia
B. Intel
C. Qualcomm
D. AMD

10 What is a key characteristic of a System on a Chip (SoC)?

Next Generation Processors Architecture Easy
A. It integrates multiple components like CPU, GPU, and memory controllers onto a single chip.
B. It is always the largest chip in a computer.
C. It is exclusively used for supercomputers.
D. It only contains the CPU.

11 Which microarchitectural technique breaks down an instruction's execution into smaller, sequential steps that can be overlapped to increase throughput?

Microarchitecture Easy
A. Caching
B. Interrupt handling
C. Pipelining
D. Branch prediction

12 The trend of designing processors optimized for specific tasks, such as AI acceleration, is known as:

Latest Technology and Trends in Computer Architecture Easy
A. General-Purpose Computing
B. Centralized Computing
C. Legacy Architecture
D. Domain-Specific Architecture (DSA)

13 Apple's M-series chips (e.g., M1, M2) used in their recent Mac computers and iPads are based on which instruction set architecture?

Latest Processor for Smartphone or Tablet and Desktop Easy
A. MIPS
B. SPARC
C. x86-64
D. ARM

14 Supercomputers achieve their immense speed by having thousands of processors that work on different parts of a problem at the same time. This is an example of:

Introduction to Supercomputer Easy
A. Parallel processing
B. Single-threaded performance
C. Serial processing
D. Standalone computing

15 What is the quantum mechanical phenomenon where two or more qubits become linked and share the same fate, even when separated by large distances?

Introduction to Qubits and Quantum Computing Easy
A. Decoherence
B. Superposition
C. Entanglement
D. Tunneling

16 Besides gaming, Nvidia GPUs are now widely used for what other high-demand application due to their parallel processing power?

Nvidia Case Study Easy
A. Basic word processing
B. Web browsing
C. Artificial Intelligence (AI) and Deep Learning
D. Running operating systems

17 What is the term for a design approach where a large processor is built by combining smaller, specialized integrated circuits called 'chiplets'?

Next Generation Processors Architecture Easy
A. Chiplet-based design
B. Monolithic design
C. Transistor design
D. Integrated circuit

18 What is the primary function of a CPU cache?

Microarchitecture Easy
A. To store frequently accessed data and instructions close to the CPU for faster access.
B. To provide long-term storage for files.
C. To cool down the processor.
D. To connect the processor to the internet.

19 What does RISC stand for in the context of processor architecture?

Latest Technology and Trends in Computer Architecture Easy
A. Random Instruction Set Computer
B. Reduced Instruction Set Computer
C. Really Integrated Silicon Chip
D. Re-engineered Integrated System Chip

20 A major architectural difference between processors in high-end desktops and most smartphones is that desktop CPUs often use the ____ architecture, while smartphone CPUs typically use the ____ architecture.

Latest Processor for Smartphone or Tablet and Desktop Easy
A. x86, ARM
B. SPARC, MIPS
C. MIPS, SPARC
D. ARM, x86

21 In the context of Nvidia's GPU architecture, what is the primary role of a Tensor Core, and how does it differ from a standard CUDA core?

Nvidia Case Study Medium
A. Tensor Cores are specialized for high-precision 64-bit floating-point arithmetic, while CUDA cores handle graphics texturing.
B. Tensor Cores are responsible for ray tracing acceleration (RT Cores do this), while CUDA cores handle the overall thread scheduling for the entire Streaming Multiprocessor (SM).
C. Tensor Cores manage the GPU's memory hierarchy and L2 cache, while CUDA cores are responsible for executing shader programs.
D. Tensor Cores execute matrix-multiply-accumulate (MMA) operations on small matrices (e.g., 4x4) at high speed, primarily for AI/ML workloads, whereas CUDA cores are general-purpose integer/FP32 processors.

22 A modern supercomputer is designed for a large-scale climate simulation that requires frequent, small data exchanges between thousands of nodes. Which interconnect technology characteristic would be most critical for this application's performance?

Introduction to Supercomputer Medium
A. High Bandwidth
B. High Power Efficiency
C. Support for TCP/IP Offloading
D. Low Latency
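
The latency and bandwidth characteristics named in the options above can be compared with a simple cost model. The sketch below assumes the common "latency plus size over bandwidth" approximation for a single message. The two interconnect profiles are hypothetical numbers chosen only for illustration, not measurements of any real network.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one message: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical interconnect profiles (illustrative values only).
LOW_LATENCY = {"latency_s": 1e-6, "bandwidth_bytes_per_s": 10e9}
HIGH_BANDWIDTH = {"latency_s": 10e-6, "bandwidth_bytes_per_s": 100e9}

if __name__ == "__main__":
    for size in (64, 1_000_000_000):  # a tiny message and a 1 GB message
        t_low = transfer_time(size, **LOW_LATENCY)
        t_high = transfer_time(size, **HIGH_BANDWIDTH)
        print(f"{size} bytes: low-latency={t_low:.3e} s, high-bandwidth={t_high:.3e} s")
```

Under this model the fixed latency term dominates for small messages, while the bandwidth term dominates for large ones.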

23 A quantum system is composed of 4 entangled qubits. If one qubit is measured and its state collapses to a definite basis state, what is the immediate effect on the other three qubits?

Introduction to Qubits and Quantum Computing Medium
A. The state of the other three qubits is instantly constrained or determined, collapsing from their superposition into a specific state correlated with the measured qubit.
B. The other three qubits also collapse to that same state.
C. The other three qubits form a new entangled system independent of the measured qubit.
D. The other three qubits are unaffected and remain in a superposition of all possible states.

24 Why is the chiplet-based design approach becoming a major trend in high-performance processor manufacturing?

Latest Technology and Trends in Computer Architecture Medium
A. It improves manufacturing yield for large, complex processors and allows for mixing and matching different process technologies for different functions (e.g., I/O vs. CPU cores).
B. It simplifies the microarchitecture by removing the need for an on-chip memory controller.
C. It allows for a single monolithic die to be clocked at a much higher frequency.
D. It exclusively uses a new, more efficient silicon substrate material that cannot be used for monolithic designs.

25 What is the primary microarchitectural trade-off when implementing a deep, multi-stage pipeline in a processor?

Microarchitecture Medium
A. Increased clock frequency and instruction throughput, but at the cost of a higher branch misprediction penalty and increased instruction latency.
B. Lower single-thread performance in exchange for better multi-threaded performance.
C. Decreased instruction throughput in exchange for lower power consumption.
D. Simplified branch prediction logic in exchange for a more complex instruction decode stage.

26 Which characteristic of the RISC-V instruction set architecture (ISA) is a key differentiator from proprietary ISAs like x86 and ARM, making it attractive for custom silicon development?

Next Generation Processors Architecture Medium
A. Its strict requirement for a 10-stage pipeline, which guarantees high clock speeds.
B. Its backward compatibility with all legacy x86-64 software without emulation.
C. Its open-standard, royalty-free nature and modular design, which allows for custom extensions to the base ISA.
D. Its inherent superiority in floating-point performance due to a unique vector processing design.

27 Modern high-end mobile SoCs (System-on-Chip) like the Apple M-series or Qualcomm Snapdragon often outperform some desktop CPUs in specific tasks despite having a lower power budget. What architectural feature is a primary contributor to this efficiency?

Latest Processor for Smartphone or Tablet and Desktop Medium
A. Exclusive use of VLIW (Very Long Instruction Word) architecture for all processing units.
B. A much larger L3 cache compared to desktop processors, which eliminates memory bottlenecks.
C. Tight integration of specialized hardware accelerators (e.g., Neural Processing Unit, Image Signal Processor, GPU) on the same die, which offload tasks from the general-purpose CPU cores.
D. Higher clock speeds achieved through exotic cooling solutions within the mobile device.

28 In Nvidia's CUDA programming model, a 'warp' is a fundamental unit of execution. How does the microarchitecture of a Streaming Multiprocessor (SM) handle instruction dispatch for threads within a warp?

Nvidia Case Study Medium
A. Threads within a warp are dynamically grouped to execute different instructions based on data availability.
B. The SM selects one thread from the warp at random to execute per clock cycle.
C. All 32 threads in a warp execute the same instruction at the same time on different data, following a SIMT (Single Instruction, Multiple Thread) model.
D. Each thread in a warp executes an independent instruction stream, similar to MIMD.

29 When analyzing the scalability of a parallel program on a supercomputer with a very large number of processors, Gustafson's Law often provides a more optimistic prediction than Amdahl's Law. What is the fundamental assumption behind Gustafson's Law that accounts for this difference?

Introduction to Supercomputer Medium
A. The communication latency between nodes becomes zero with enough processors.
B. The serial portion of the program diminishes as more processors are added.
C. The clock speed of each processor increases proportionally to the number of nodes.
D. The total problem size scales up with the number of processors, keeping the parallel execution time constant.
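
The two laws mentioned in this question can be written as short formulas. The sketch below uses the standard textbook forms, with `serial_fraction` as the fraction of time spent in serial code (measured on the parallel system in Gustafson's case). The value 0.05 is an assumed figure for illustration only.

```python
def amdahl_speedup(serial_fraction, processors):
    """Fixed problem size: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def gustafson_speedup(serial_fraction, processors):
    """Scaled problem size: speedup = N - s * (N - 1)."""
    return processors - serial_fraction * (processors - 1)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for n in (8, 64, 1024):
        print(n, round(amdahl_speedup(s, n), 2), round(gustafson_speedup(s, n), 2))
```

Amdahl's speedup is bounded above by 1/s no matter how many processors are added, whereas Gustafson's scaled speedup keeps growing roughly linearly with N.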

30 Quantum decoherence is a major obstacle in building functional quantum computers. From an architectural standpoint, what is the primary cause of decoherence?

Introduction to Qubits and Quantum Computing Medium
A. The inability of current technology to create a perfect superposition of |0⟩ and |1⟩.
B. Unwanted interaction between the quantum system (qubits) and its surrounding environment, which causes the loss of quantum properties like superposition and entanglement.
C. The inherent randomness of quantum measurement, which makes algorithm results unreliable.
D. Errors in the quantum logic gates that cause qubits to flip their states incorrectly.

31 An out-of-order execution engine in a modern CPU uses a Reorder Buffer (ROB) and a Reservation Station. What is the specific role of the Reservation Station in this microarchitecture?

Microarchitecture Medium
A. To hold instructions that have been decoded but are waiting for their operands to become available or for an execution unit to be free.
B. To predict the outcome of branch instructions before they are executed.
C. To store the original program order of instructions to ensure correct program retirement.
D. To act as the primary L1 instruction cache for fetching upcoming instructions.

32 What is the primary motivation behind the trend of developing Domain-Specific Architectures (DSAs), such as Google's TPU for machine learning?

Latest Technology and Trends in Computer Architecture Medium
A. To simplify software development by providing a single instruction set for all computing tasks.
B. To achieve orders-of-magnitude improvements in performance and power efficiency for a specific target workload by tailoring the hardware to that workload's needs.
C. To create a new general-purpose processor that can replace all existing CPUs and GPUs.
D. To reduce the physical size of processors by removing unnecessary components like the memory controller.

33 ARM's big.LITTLE technology is a form of heterogeneous multi-core processing. What is the primary architectural goal of pairing high-performance 'big' cores with high-efficiency 'LITTLE' cores?

Next Generation Processors Architecture Medium
A. To provide hardware-level redundancy in case one set of cores fails.
B. To double the number of threads that can be run by having two distinct ISAs on the same chip.
C. To execute both 32-bit and 64-bit instructions simultaneously on different core types.
D. To dynamically balance performance and power consumption by migrating tasks between core types based on workload intensity.

34 When comparing the architecture of a high-end desktop processor (e.g., Intel Core i9) to a flagship mobile SoC (e.g., Qualcomm Snapdragon 8 Gen-series), a key difference lies in their approach to component integration. Which statement best describes this difference?

Latest Processor for Smartphone or Tablet and Desktop Medium
A. Mobile SoCs are typically monolithic, integrating CPU, GPU, memory controller, NPU, ISP, and modem on a single die, whereas desktop CPUs are often chiplet-based and focus primarily on CPU cores and cache.
B. Desktop CPUs prioritize low-power efficiency cores, while mobile SoCs prioritize a large number of high-performance cores.
C. Mobile SoCs use a system-level cache that is an order of magnitude larger than the L3 cache found in desktop CPUs.
D. Desktop processors integrate more diverse components like NPUs and ISPs directly onto the CPU die, while mobile SoCs keep them as separate chips.

35 A quantum algorithm requires the creation of a uniform superposition of all possible states for a 3-qubit register, initially in the state |000⟩. Which quantum gate should be applied to each qubit to achieve this?

Introduction to Qubits and Quantum Computing Medium
A. A Toffoli gate
B. An X-gate (NOT gate)
C. A Hadamard gate
D. A CNOT-gate
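
The effect of single-qubit gates on a small register can be checked with a tiny state-vector simulation. The sketch below is a minimal pure-Python version that assumes real amplitudes and treats qubit 0 as the most significant bit of the basis-state index. The helper names (`apply_gate`, `apply_to_all`, `zero_state`) are hypothetical and exist only for this illustration. It compares an X gate and a Hadamard gate applied to every qubit of a 3-qubit register that starts in the all-zeros state.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)
H = [[INV_SQRT2, INV_SQRT2], [INV_SQRT2, -INV_SQRT2]]  # Hadamard gate
X = [[0.0, 1.0], [1.0, 0.0]]                           # NOT gate


def zero_state(num_qubits):
    """State vector with all amplitude on the all-zeros basis state."""
    state = [0.0] * (2 ** num_qubits)
    state[0] = 1.0
    return state


def apply_gate(state, gate, target, num_qubits):
    """Apply a 2x2 gate to one qubit of the register."""
    bit = num_qubits - 1 - target  # bit position of the target qubit
    new_state = [0.0] * len(state)
    for index, amplitude in enumerate(state):
        if amplitude == 0.0:
            continue
        old = (index >> bit) & 1
        cleared = index & ~(1 << bit)
        for new in (0, 1):
            new_state[cleared | (new << bit)] += gate[new][old] * amplitude
    return new_state


def apply_to_all(state, gate, num_qubits):
    """Apply the same single-qubit gate to every qubit in turn."""
    for target in range(num_qubits):
        state = apply_gate(state, gate, target, num_qubits)
    return state


if __name__ == "__main__":
    for name, gate in (("X", X), ("H", H)):
        final = apply_to_all(zero_state(3), gate, 3)
        print(name, [round(a * a, 3) for a in final])  # measurement probabilities
```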

36 In a processor that supports Simultaneous Multithreading (SMT), how does the microarchitecture enable multiple threads to be active on a single physical core?

Microarchitecture Medium
A. By duplicating all execution units, effectively creating two cores in the space of one.
B. By duplicating the architectural state (e.g., program counter, register file) for each thread and allowing instructions from different threads to share the execution units in the same pipeline.
C. By running one thread on the integer units and another thread exclusively on the floating-point units.
D. By rapidly context-switching between threads on every clock cycle, flushing the pipeline each time.

37 The architectural shift from Nvidia's Ampere to Hopper generation introduced the Transformer Engine. What specific computational challenge in large AI models does this feature address?

Nvidia Case Study Medium
A. It enables direct peer-to-peer communication between GPUs without involving the CPU, using a new version of NVLink.
B. It dynamically selects the optimal numerical precision (e.g., FP8, FP16) for different layers of a Transformer model to boost performance and save memory without significant accuracy loss.
C. It provides a dedicated hardware block for data compression to reduce the GPU's memory bandwidth requirements.
D. It accelerates the rendering of 3D graphics by optimizing triangle rasterization.

38 What is the primary architectural purpose of using a parallel file system (e.g., Lustre, GPFS) in a large-scale supercomputer environment?

Introduction to Supercomputer Medium
A. To reduce the power consumption of the storage system by spinning down idle disks.
B. To enforce strict security policies by isolating each user's data on a separate physical disk.
C. To allow thousands of compute nodes to access and write to a shared storage pool simultaneously at very high aggregate bandwidth, avoiding I/O bottlenecks.
D. To provide data redundancy and automatic backups for all user data.

39 Dataflow architectures represent a fundamental departure from the traditional von Neumann architecture. What is the core principle of a dataflow machine's execution model?

Next Generation Processors Architecture Medium
A. The processor fetches large blocks of data and instructions together from memory to reduce latency.
B. Instructions are executed sequentially as determined by a program counter.
C. An instruction is ready to execute as soon as its required input data (operands) are available.
D. All operations are performed directly on data held in a large, unified register file, bypassing memory.

40 Processing-in-Memory (PIM) or Compute-in-Memory (CIM) is an emerging trend to overcome a major performance bottleneck. What specific bottleneck does this technology aim to mitigate?

Latest Technology and Trends in Computer Architecture Medium
A. The performance gap between integer and floating-point execution units.
B. The 'Memory Wall' or von Neumann bottleneck, which is the separation of processing and data storage that leads to high latency and energy consumption from data movement.
C. The high cost of manufacturing large on-chip caches (L3 cache).
D. The difficulty of writing correct parallel software for multi-core processors.

41 The NVIDIA Hopper architecture's Tensor Cores introduced the Transformer Engine. How does this engine fundamentally improve performance for models like GPT-3 compared to the A100 (Ampere) Tensor Cores, beyond simply offering higher raw FLOPS?

Nvidia Case Study Hard
A. It integrates the functionality of the NVLink switch directly into the Tensor Core, allowing direct data exchange between Tensor Cores of different GPUs without traversing the SM's memory hierarchy.
B. It exclusively uses a novel 4-bit floating point format (FP4) for all matrix multiplications, quadrupling the throughput compared to Ampere's TF32.
C. It introduces a hardware-based systolic array scheduler that completely removes the need for CUDA warp-level scheduling for matrix operations.
D. It dynamically selects between FP8 and FP16 precision for different layers of a transformer model on a per-op basis to maximize throughput while maintaining accuracy.

42 A CUDA kernel is designed to perform a large-scale stencil computation on a 2D grid. The kernel exhibits poor performance, and profiling reveals high global memory latency. The stencil requires each thread to access its own element and the 8 neighboring elements. The grid is too large to fit entirely in shared memory. Which optimization strategy would most effectively mitigate the global memory latency bottleneck in this specific scenario?

Nvidia Case Study Hard
A. Increase the CUDA grid size and decrease the block size to create more warps, hoping to hide latency through increased thread-level parallelism (TLP).
B. Use pinned host memory (page-locked memory) for the grid data and stream the computation to overlap data transfers with kernel execution.
C. Replace all global memory accesses with __ldg() intrinsic functions to cache the data in the L1/texture cache, assuming read-only access patterns.
D. Implement tiling by loading a 2D tile of the grid from global memory into shared memory, including a 'halo' or 'ghost cell' region for neighbors, process the tile, and write results back.

43 Consider a supercomputer using a Dragonfly interconnect topology. A large-scale simulation requires frequent all-to-all communication patterns (e.g., an MPI_Alltoall operation). Which characteristic of the Dragonfly topology presents the most significant performance challenge for this specific communication pattern compared to a less scalable but more direct topology like a full crossbar?

Introduction to Supercomputer Hard
A. The potential for network contention on the high-radix global links connecting different groups, requiring adaptive routing to mitigate hotspots.
B. The high diameter of the network, leading to excessive hop counts and latency for any communication pattern.
C. The static routing algorithm mandated by the topology, which cannot adapt to network load.
D. The reliance on optical cables for all links, which have higher latency than electrical links for intra-group communication.

44 A 2-qubit system is in the state . Which of the following statements accurately describes this quantum state?

Introduction to Qubits and Quantum Computing Hard
A. If a Hadamard gate is applied to both qubits, the resulting state is |11⟩.
B. The state is an entangled Bell state.
C. The state is a product state, meaning the qubits are not entangled.
D. If the first qubit is measured in the computational basis and the result is |1⟩, the second qubit collapses to the state |0⟩ - |11⟩, which is not a valid quantum state.

45 Two processors, P1 and P2, share a memory location X managed by a MESI cache coherence protocol. Initially, X is not in either cache. Consider the following sequence of operations:
1. P1 reads X.
2. P2 writes to X.
3. P1 reads X.
How many bus transactions of the type BusRd (Bus Read) and BusRdX (Bus Read Exclusive) are generated on the shared bus?

Microarchitecture Hard
A. One BusRd transaction and two BusRdX transactions.
B. Three BusRd transactions and zero BusRdX transactions.
C. Two BusRd transactions and one BusRdX transaction.
D. One BusRd transaction and one BusRdX transaction.
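
Sequences like the one in this question can be traced mechanically. The sketch below is a simplified model of a single cache line under MESI, assuming a snooping bus, no evictions, and that a write from the Shared state is counted as BusRdX (some protocol variants issue a separate upgrade transaction instead). The function name `simulate_mesi` is hypothetical.

```python
def simulate_mesi(operations, num_processors=2):
    """Trace one cache line under MESI and count BusRd / BusRdX transactions.

    `operations` is a list of (processor_index, "read" | "write") pairs.
    """
    states = ["I"] * num_processors
    counts = {"BusRd": 0, "BusRdX": 0}
    for proc, op in operations:
        others = [p for p in range(num_processors) if p != proc]
        if op == "read":
            if states[proc] == "I":  # read miss goes to the bus
                counts["BusRd"] += 1
                shared = any(states[p] != "I" for p in others)
                for p in others:
                    if states[p] in ("M", "E"):
                        states[p] = "S"  # an M copy supplies the data, then downgrades
                states[proc] = "S" if shared else "E"
        elif op == "write":
            if states[proc] in ("I", "S"):  # needs exclusive ownership
                counts["BusRdX"] += 1
                for p in others:
                    states[p] = "I"
            states[proc] = "M"  # E -> M is a silent upgrade
        else:
            raise ValueError(f"unknown operation: {op}")
    return counts, states


if __name__ == "__main__":
    # P1 and P2 from the question are modeled as processors 0 and 1.
    sequence = [(0, "read"), (1, "write"), (0, "read")]
    print(simulate_mesi(sequence))
```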

46 The Apple M-series SoCs utilize a Unified Memory Architecture (UMA) where the CPU and GPU share the same physical memory pool. While this reduces data copying and latency, what is a significant microarchitectural challenge or trade-off this design imposes compared to a traditional discrete GPU architecture with its own VRAM?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. A fundamental limitation on the maximum amount of addressable memory, as both CPU and GPU must share a single memory address space managed by the CPU's MMU.
B. The inability to use specialized high-bandwidth memory like GDDR6, forcing the entire system to rely on lower-bandwidth LPDDR5, thus capping peak theoretical memory bandwidth.
C. Increased memory contention and sophisticated quality-of-service (QoS) requirements for the memory controller to arbitrate between CPU's latency-sensitive requests and GPU's bandwidth-hungry requests.
D. Increased power consumption due to the constant need for the CPU to perform cache coherence snooping on memory accesses initiated by the GPU.

47 Compute Express Link (CXL) 2.0 introduces memory pooling. How does the CXL.mem protocol ensure cache coherence for this pooled memory between the host CPU's caches and the CXL device's memory, without requiring the CXL device to be a fully coherent snooping agent?

Next Generation Processors Architecture Hard
A. It uses a directory-based coherence protocol where the CXL device's memory controller acts as the home node and directory, tracking sharers and handling invalidations requested by the host.
B. It relies on the host CPU's existing snoopy coherence protocol, treating the CXL link as just another bus participant that must snoop all traffic from all cores.
C. It enforces a write-through, no-allocate policy for all host accesses to the CXL memory pool, ensuring memory is always up-to-date and bypassing host caches entirely.
D. It requires explicit software-managed cache flushes from the host CPU before the CXL device can access the memory, making coherence a software responsibility.

48 In a chiplet-based processor design, such as AMD's Zen architecture, what is the most significant microarchitectural trade-off when determining the latency and bandwidth of the die-to-die interconnect fabric (e.g., Infinity Fabric)?

Latest Technology and Trends in Computer Architecture Hard
A. Ensuring the die-to-die clock synchronization is perfectly aligned, which often requires a dedicated global clock chiplet, increasing the bill of materials.
B. Minimizing the silicon area of the interconnect PHYs on each chiplet against the need to support legacy bus protocols like PCIe for backward compatibility.
C. The complexity of the routing algorithm within the fabric versus the manufacturing cost associated with using advanced packaging technologies like 2.5D interposers.
D. Balancing the physical distance and signaling power against the NUMA (Non-Uniform Memory Access) factor introduced, which can cause performance variability for threads accessing remote L3 caches or memory controllers.

49 Modern Exascale supercomputers like Frontier are built on heterogeneous architectures. From a system architecture perspective, what is the primary reason this heterogeneity is crucial for approaching the 20 MW power barrier for an ExaFLOP/s system?

Introduction to Supercomputer Hard
A. GPUs achieve a significantly higher FLOPS/watt ratio for highly parallel computations, allowing the bulk of the floating-point work to be done with greater energy efficiency than on CPUs alone.
B. The interconnects designed for GPU-centric systems (like NVLink) are an order of magnitude more power-efficient per bit transferred than traditional CPU interconnects.
C. CPUs in the system can be put into a deep sleep state while the GPUs perform all computations, effectively eliminating the CPU power draw for long periods.
D. The use of multiple smaller GPU nodes reduces the total static power leakage compared to a system with a similar number of massive, monolithic CPU cores.

50 The No-Cloning Theorem is a fundamental principle in quantum mechanics. How does this theorem necessitate a fundamentally different approach to error correction in quantum computers compared to classical error correction techniques like Triple Modular Redundancy (TMR)?

Introduction to Qubits and Quantum Computing Hard
A. It restricts error detection to only measuring the parity of qubits, as any other measurement would collapse the quantum state, making correction impossible.
B. It implies that quantum errors are always continuous (e.g., small phase rotations), whereas classical errors are discrete bit-flips, requiring analog correction methods.
C. It means that a corrupted qubit's state cannot be 're-written' or 'corrected,' forcing quantum algorithms to be redesigned to be inherently fault-tolerant.
D. It prevents the creation of identical copies of an arbitrary quantum state, forcing quantum error correction to use entanglement to distribute the logical information across multiple physical qubits without copying the state itself.

51 In NVIDIA's Ampere architecture, the third-generation Tensor Cores introduced support for Sparsity, which can double throughput. How does this feature work at a microarchitectural level, and what is its primary constraint?

Nvidia Case Study Hard
A. It prunes weights in a fine-grained 2:4 structured pattern (two non-zero weights in every four), allowing hardware to skip operations for the zero-valued weights, but it requires the neural network to be specifically retrained for this structure.
B. It only works for 8-bit integer (INT8) operations, where a special lookup table maps sparse patterns to dense computations, but it cannot be applied to floating-point calculations like FP16.
C. It dynamically detects any zero-valued weight during a matrix multiplication and gates the clock for the corresponding MAC unit for one cycle. This works on any sparse matrix without retraining.
D. It uses a form of data compression on the weight matrices, and the Tensor Core has a dedicated decompression unit that feeds the MAC array. The constraint is the high latency of the decompression step.
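
The 2:4 structured pattern mentioned in the options can be illustrated in software. The sketch below assumes magnitude-based pruning (keeping the two largest-magnitude weights in each group of four), which is one common way to produce the pattern. It is not a description of how any particular library or hardware pipeline implements it, and the function name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(weights):
    """Zero out all but the two largest-magnitude weights in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two weights with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


if __name__ == "__main__":
    row = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
    print(prune_2_of_4(row))
```

Because every group of four has at most two non-zero entries, the non-zero values can be stored compactly along with small position indices.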

52 A modern out-of-order superscalar processor encounters a long-latency cache miss for a load instruction. Which microarchitectural components are most critical for enabling the processor to continue executing and making forward progress on independent instructions that follow the stalled load in program order?

Microarchitecture Hard
A. The Reorder Buffer (ROB), Reservation Stations (or an Issue Queue), and a precise exception mechanism.
B. The Branch Target Buffer (BTB), the Micro-op Cache, and the L1 instruction cache.
C. The Arithmetic Logic Units (ALUs), the Floating Point Units (FPUs), and the multiported register file.
D. The Memory Order Buffer (MOB) and the Store-to-Load Forwarding logic.

53 Processing-in-Memory (PIM) architectures aim to reduce the 'memory wall'. For a PIM system designed to accelerate graph analytics, which often involves pointer-chasing and irregular memory access, what is the most challenging architectural problem to solve efficiently?

Next Generation Processors Architecture Hard
A. Maintaining cache coherence between the main CPU caches and the data being modified by the PIM logic within the memory banks.
B. Designing a low-power logic process that can be economically integrated with a high-density DRAM process on the same die or package.
C. Overcoming the limited memory bandwidth available to each individual PIM processing unit, as it's typically confined to a single memory bank.
D. Providing a sufficiently powerful instruction set for the PIM units to handle complex graph traversal logic beyond simple vector operations.

54 Intel's Performance Hybrid Architecture (e.g., Alder Lake) uses Performance-cores (P-cores) and Efficient-cores (E-cores). Consider a multithreaded video encoding task. In which scenario would the OS scheduler, guided by the Intel Thread Director, make the most effective use of this hybrid architecture?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. Dynamically migrating all active threads between P-cores and E-cores in a round-robin fashion to evenly distribute heat and prevent thermal throttling.
B. Placing all threads of the encoding task exclusively on the E-cores to maximize power efficiency, leaving the P-cores free for any foreground user interaction.
C. Running all threads on the P-cores initially for a performance burst, and then moving them to the E-cores once the processor's power budget (PL2) is exceeded.
D. Assigning the primary, latency-sensitive encoding thread and GUI thread to the P-cores, while offloading background, parallelizable tasks like motion estimation across all available E-cores.

55 What is the fundamental reason a systolic array, the core of a TPU's matrix multiplication unit, is more power-efficient for dense matrix multiplication than a conventional GPU's SIMD architecture?

Latest Technology and Trends in Computer Architecture Hard
A. It operates at a much lower clock frequency than a GPU, relying on massive parallelism to achieve high throughput, and power scales super-linearly with frequency.
B. It eliminates the need for complex instruction fetch, decode, and scheduling logic found in a GPU's Streaming Multiprocessor (SM), as the data flow itself dictates the computation.
C. It uses lower-precision arithmetic (e.g., INT8) which inherently consumes less power per operation than the FP32/FP64 units common in GPUs.
D. It maximizes data reuse by pumping data through a grid of processing elements (PEs), drastically reducing data movement from registers or local memory, which is a major source of power consumption.

56 In the context of physical qubit implementations, what is the primary reason that superconducting transmons are currently favored for building larger-scale quantum processors (e.g., by Google and IBM) despite having shorter coherence times than trapped ions?

Introduction to Qubits and Quantum Computing Hard
A. Superconducting qubits are based on well-established semiconductor fabrication techniques, which are perceived to be more scalable for manufacturing millions of qubits compared to the complexity of laser/vacuum systems for ion traps.
B. The connectivity between transmons can be engineered with greater flexibility, allowing for more complex arrangements of qubits on a chip compared to the typically linear arrangement of ions in a trap.
C. The gate operations on superconducting qubits, based on microwave pulses, are significantly faster (nanoseconds) than the laser-based gates for trapped ions (microseconds), allowing for more operations to be performed within the coherence window.
D. Superconducting qubits exhibit significantly lower measurement error rates because the readout process is based on high-fidelity microwave resonators.

57 A modern CPU's branch predictor combines a Pattern History Table (PHT) with a Global History Register (GHR) in a 'GAg' configuration. In what specific scenario would this GAg predictor significantly outperform a simple Bimodal predictor that only uses a PHT indexed by the branch address?

Microarchitecture Hard
A. A program with many randomly behaving branches where there is no correlation between different branches or past history.
B. A branch that is always taken for the first 1000 iterations of a loop and then not taken for the last 1000 iterations.
C. A branch inside a loop whose direction depends on the outcome of a completely different, preceding branch outside the loop (e.g., if (x > 0) { for(...) { if (y > 10) ... } }).
D. A branch whose direction is a simple function of the loop counter (e.g., if (i % 2 == 0)).

58 The TOP500 list ranks supercomputers based on their performance on the High-Performance Linpack (HPL) benchmark, which solves a dense system of linear equations (Ax = b). Why is the HPL benchmark often criticized as being an unrepresentative measure of a supercomputer's capability for a broad range of modern scientific applications?

Introduction to Supercomputer Hard
A. HPL performance is primarily limited by the system's I/O and file system performance, not its computational power, making it a poor benchmark for CPU/GPU capabilities.
B. HPL has a very high computational intensity (ratio of floating-point operations to memory operations) and a regular access pattern, which doesn't stress the memory subsystem or interconnect in the same way as sparse, irregular applications like graph analytics or genomics.
C. HPL can only be run using 64-bit floating-point (FP64) precision, whereas many modern AI and scientific workloads achieve sufficient accuracy and much higher performance using lower precisions like FP32 or FP16.
D. The HPL algorithm is not easily parallelizable and does not scale well to the millions of cores found in modern systems, leading to artificially low performance numbers.
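
Background sketch for the term "computational intensity" used above: a rough Python estimate of floating-point operations per byte of matrix data for a dense LU-based solve versus a sparse matrix-vector product. The function names and the traffic model (each stored matrix value counted once, 8-byte values, 4-byte column indices, vector traffic ignored) are simplifying assumptions for illustration, not measurements of any real system.

```python
def dense_lu_intensity(n, bytes_per_value=8):
    """Idealized flops per byte for solving a dense n x n system via LU.

    Assumes about (2/3) * n^3 floating-point operations and that each of the
    n^2 matrix entries is counted once (an upper bound on reuse).
    """
    flops = (2.0 / 3.0) * n ** 3
    bytes_touched = bytes_per_value * n * n
    return flops / bytes_touched


def spmv_intensity(bytes_per_value=8, bytes_per_index=4):
    """Idealized flops per byte for a sparse matrix-vector product.

    Each stored nonzero contributes one multiply and one add, and requires
    reading its value plus a column index; vector accesses are ignored.
    """
    return 2.0 / (bytes_per_value + bytes_per_index)
```

Under this model the dense solve's intensity grows linearly with n (n / 12 flops per byte for 8-byte values), whereas the sparse product stays fixed at about 0.17 flops per byte regardless of problem size.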

59 NVIDIA's CUDA architecture exposes several distinct memory spaces (global, shared, constant, texture). For a kernel performing a 1D convolution, where a small, read-only filter is applied to a large input array, which memory space is most appropriate for storing the filter coefficients to achieve optimal performance, and why?

Nvidia Case Study Hard
A. Constant memory, because it is cached on-chip and optimized for uniform broadcast to all threads in a warp, which is exactly the access pattern for a convolution filter.
B. Shared memory, because it provides the lowest latency access, and the filter can be pre-loaded into it by the first thread in each block.
C. Pinned (page-locked) host memory, mapped into the GPU's address space to avoid a host-to-device copy of the filter coefficients before the kernel launch.
D. Global memory accessed via the __ldg() intrinsic, as this will cache the filter in the L1/texture cache system, providing the best performance for any read-only data.

60 Modern high-end desktop CPUs like AMD's Ryzen 9 with 3D V-Cache technology stack a large L3 cache die directly on top of the core complex die (CCD). What is the primary performance bottleneck that this specific architectural choice is designed to alleviate, particularly for applications like gaming?

Latest Processor for Smartphone or Tablet and Desktop Hard
A. The bandwidth between the CPU cores and the L3 cache, as the through-silicon vias (TSVs) used in 3D stacking offer a much wider interface.
B. The latency of accessing main memory (DRAM), by increasing the L3 cache hit rate so that far fewer requests need to travel off-chip.
C. The limited capacity of the L2 cache, as 3D V-Cache allows the L2 cache per core to be significantly larger.
D. The power consumption associated with the Infinity Fabric interconnect that connects different CCDs and the I/O die.
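
Background sketch for the cache terminology above: a small Python helper that estimates the average latency of a request that reaches the L3 cache, given an L3 miss rate. The function name and all numbers in the usage note are hypothetical and chosen only to show how the formula behaves; they are not figures for any real processor.

```python
def avg_access_latency(l3_hit_ns, dram_penalty_ns, l3_miss_rate):
    """Average latency of a request that reaches L3.

    Uses the standard average-memory-access-time form:
    hit latency + miss rate * additional penalty for going to DRAM.
    """
    if not 0.0 <= l3_miss_rate <= 1.0:
        raise ValueError("l3_miss_rate must be between 0 and 1")
    return l3_hit_ns + l3_miss_rate * dram_penalty_ns
```

With hypothetical values of a 10 ns L3 hit and an 80 ns DRAM penalty, halving the miss rate from 0.5 to 0.25 lowers the average from 50 ns to 30 ns.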