Unit6 - Subjective Questions
CSE211 • Practice Questions with Detailed Answers
Compare and contrast the architectural differences between a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit). Why is a GPU better suited for parallel processing?
Comparison between CPU and GPU Architecture:
- Core Design:
- CPU: Consists of a few (4-64) heavy-weight cores designed for sequential serial processing. It focuses on maximizing the execution speed of a single thread.
- GPU: Consists of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously.
- Goal:
- CPU: Optimized for Low Latency (completing a task as quickly as possible).
- GPU: Optimized for High Throughput (completing as many tasks as possible in a given time).
- Control Logic and Cache:
- CPU: A significant portion of the die area is dedicated to complex control logic (branch prediction, out-of-order execution) and large caches to minimize memory access latency.
- GPU: Allocates more transistors to data processing (ALUs) rather than flow control or caching. It hides memory latency by switching between thousands of active threads.
Suitability for Parallel Processing:
A GPU is better suited for parallel processing because of its SIMD (Single Instruction, Multiple Data) or SIMT (Single Instruction, Multiple Threads) architecture. It can execute the same operation across massive blocks of data (like pixels in an image or matrices in Machine Learning) simultaneously, whereas a CPU processes them sequentially or with limited parallelism.
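The contrast can be sketched in code (illustrative only; function names and the 256-thread block size are assumptions, not from the notes). The CPU version walks the data sequentially in one thread, while the CUDA version assigns one lightweight thread per element so the same instruction is applied across the whole array in parallel.

```cuda
// CPU style: one thread walks all N elements in sequence.
void scaleOnCPU(const float *in, float *out, int n, float k) {
    for (int i = 0; i < n; ++i)
        out[i] = k * in[i];
}

// GPU style (CUDA): N threads each handle exactly one element,
// so the same instruction runs across the data simultaneously.
__global__ void scaleOnGPU(const float *in, float *out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        out[i] = k * in[i];
}
// A launch such as scaleOnGPU<<<(n + 255) / 256, 256>>>(...) creates enough
// threads to cover all n elements.
```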
Explain the concept of SIMT (Single Instruction, Multiple Threads) in the context of the Nvidia GPU architecture.
Single Instruction, Multiple Threads (SIMT) is an execution model used in Nvidia GPU architectures (like CUDA).
Key Concepts:
- Thread Execution: In SIMT, the programmer writes code for a single thread. The GPU hardware then instantiates thousands of these threads to execute the same program code.
- Warps: Threads are grouped into bundles called Warps (typically 32 threads in Nvidia hardware). The processor issues a single instruction, and all threads in the warp execute that instruction simultaneously on different data elements.
- Divergence: If threads within a warp need to execute different paths (e.g., an if-else block where some threads take the if branch and others take the else branch), the hardware serializes the execution. This is known as Warp Divergence and can reduce performance.
- Abstraction: Unlike SIMD (Single Instruction, Multiple Data), which exposes vector width to the software, SIMT allows threads to have their own instruction address counters and register states, providing a more flexible abstraction for parallel programming.
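A minimal CUDA sketch of how divergence arises inside a 32-thread warp (kernel names are illustrative, not part of any real API):

```cuda
// Threads 0, 2, 4, ... of each warp take the 'if' path and the rest take the
// 'else' path, so the warp executes the two paths one after the other
// (divergence) instead of in a single pass.
__global__ void divergentKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x % 2 == 0) {        // half of every warp
        data[i] = data[i] * 2.0f;      // path A: runs first, odd lanes sit idle
    } else {
        data[i] = data[i] + 1.0f;      // path B: runs next, even lanes sit idle
    }
}

// A divergence-free alternative branches on the warp index instead, so all
// 32 threads of a warp agree on the path:
__global__ void uniformKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int warpId = threadIdx.x / 32;     // warp size is 32 on current Nvidia hardware
    if (warpId % 2 == 0) data[i] = data[i] * 2.0f;
    else                 data[i] = data[i] + 1.0f;
}
```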
Define a Supercomputer and explain the metric FLOPS used to measure its performance. How does clustering contribute to supercomputing?
Supercomputer:
A supercomputer is a computer that performs at or near the highest operational rate currently achievable. Unlike mainframes, which focus on transaction processing, supercomputers focus on number crunching for scientific simulations (e.g., weather forecasting, nuclear modeling).
FLOPS (Floating Point Operations Per Second):
Performance is measured in FLOPS rather than MIPS (Millions of Instructions Per Second) because scientific calculations rely heavily on floating-point math.
Clustering in Supercomputing:
Modern supercomputers are essentially massive Clusters.
- They consist of thousands of individual nodes (computers) connected via a high-speed, low-latency interconnect (like InfiniBand).
- Parallelism: Problems are broken down into smaller chunks using MPI (Message Passing Interface), and each node processes a chunk simultaneously, combining the results.
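A minimal sketch of this split-compute-combine pattern using MPI (illustrative only; the partial-sum workload and variable names are assumptions, and a real build would use an MPI compiler wrapper such as mpicxx):

```cuda
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's ID
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes/nodes

    // Each process works on its own chunk of the problem (here: a partial sum).
    const long N = 1000000;
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double partial = 0.0;
    for (long i = start; i < end; ++i)
        partial += 1.0 / (i + 1);            // stand-in for real scientific work

    // Combine the partial results from all nodes onto rank 0.
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %f\n", total);
    MPI_Finalize();
    return 0;
}
```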
What is a Qubit? Mathematically describe the state of a qubit compared to a classical bit.
Definition of a Qubit:
A Qubit (Quantum Bit) is the fundamental unit of quantum information. Unlike a classical bit, which can strictly be either 0 or 1, a qubit can exist in a state of superposition, representing both 0 and 1 simultaneously until it is measured.
Mathematical Representation:
- Classical Bit: Can be represented as state |0⟩ or state |1⟩.
- Qubit: The state of a qubit is a linear combination (superposition) of the basis states:
|ψ⟩ = α|0⟩ + β|1⟩
Where:
- α and β are complex numbers known as probability amplitudes.
- The probability of measuring the qubit as '0' is |α|².
- The probability of measuring the qubit as '1' is |β|².
- Normalization Constraint: |α|² + |β|² = 1.
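As a quick worked example (a standard case, not taken from the notes above), the equal superposition with α = β = 1/√2 satisfies the normalization constraint and gives a 50/50 measurement outcome:

```latex
% Equal superposition with alpha = beta = 1/sqrt(2):
\[
  |\psi\rangle = \tfrac{1}{\sqrt{2}}\,|0\rangle + \tfrac{1}{\sqrt{2}}\,|1\rangle,
  \qquad
  P(0) = \left|\tfrac{1}{\sqrt{2}}\right|^{2} = \tfrac{1}{2},
  \qquad
  P(1) = \tfrac{1}{2},
  \qquad
  P(0) + P(1) = 1 .
\]
```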
Describe the two fundamental principles of Quantum Computing: Superposition and Entanglement.
1. Superposition:
- Concept: In classical physics, an object can only be in one state at a time (e.g., a coin is either heads or tails). In quantum mechanics, a system can exist in multiple states simultaneously.
- Application: A classical 2-bit register holds one of 4 values (00, 01, 10, 11) at a time. A 2-qubit quantum register in superposition holds all 4 values simultaneously; in general, n qubits can represent 2^n states at once, which lets quantum computers work on a massive number of distinct input configurations in parallel.
2. Entanglement:
- Concept: Entanglement is a phenomenon where pairs or groups of particles interact in such a way that the quantum state of each particle cannot be described independently of the state of the others, even when the particles are separated by a large distance.
- Application: If two qubits are entangled, measuring the state of one qubit immediately determines the state of the other (e.g., if one is '0', the other might effectively be forced to be '1'). This creates strong correlations that are used for quantum teleportation and faster-than-classical algorithms.
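A standard worked example (not drawn from the answer above) is the Bell state, which makes this correlation concrete:

```latex
% Bell state: an equal superposition of "both qubits 0" and "both qubits 1".
\[
  |\Phi^{+}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|00\rangle + |11\rangle\bigr)
\]
% Measuring the first qubit as 0 collapses the pair to |00>, so the second
% qubit must also read 0; measuring 1 forces |11>. Each outcome occurs with
% probability 1/2, but the two results always agree.
```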
Discuss the Nvidia CUDA (Compute Unified Device Architecture) model. How does it bridge the gap between software and GPU hardware?
Nvidia CUDA (Compute Unified Device Architecture):
CUDA is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers to use a CUDA-enabled GPU for general-purpose processing (GPGPU), not just graphics rendering.
Bridging Software and Hardware:
- Language Extension: CUDA extends standard languages like C, C++, and Fortran. Developers write a function called a Kernel, which is executed N times in parallel by N different CUDA threads.
- Hardware Abstraction:
- Grids and Blocks: Software organizes threads into Blocks, and Blocks into Grids. This maps logically to the hardware's Streaming Multiprocessors (SMs).
- The CUDA driver handles the complexity of scheduling these blocks onto the physical cores of the GPU.
- Memory Management: CUDA provides APIs for managing different types of memory (Global, Shared, Constant) explicitly, allowing developers to optimize data locality and bandwidth usage manually.
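A minimal sketch of this workflow (illustrative; error checking omitted, and the kernel name, array sizes, and 256-thread block size are assumptions): the kernel is written from the perspective of one thread, and the launch configuration tells the driver how many Blocks and threads to schedule onto the SMs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Kernel: written from the point of view of a single thread.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Each thread derives its global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // Guard against threads past the end.
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays.
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (Global) memory and copy inputs across the bus.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch: a Grid of Blocks; the driver schedules Blocks onto the SMs.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);   // Expected: 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```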
Explain the concept of big.LITTLE architecture found in modern smartphone processors (e.g., ARM based). Why is it efficient?
big.LITTLE Architecture:
big.LITTLE is a heterogeneous computing architecture developed by ARM, commonly found in smartphone SoCs (System on Chips).
Mechanism:
It couples distinct types of processor cores on a single die:
- 'big' cores: High-performance, power-hungry cores (e.g., Cortex-X or Cortex-A7xx series) designed for intensive tasks like gaming, video editing, or heavy web browsing.
- 'LITTLE' cores: Low-performance, highly energy-efficient cores (e.g., Cortex-A5xx series) designed for background tasks, idle processes, and music playback.
Efficiency & Benefits:
- Dynamic Switching: The operating system task scheduler (like Energy Aware Scheduling in Android) dynamically allocates tasks. Heavy tasks go to big cores; light tasks go to LITTLE cores.
- Battery Life: By not powering up the high-performance cores for trivial tasks (like checking emails in the background), the device saves significant battery power while maintaining peak performance when needed.
What are Chiplets in the context of next-generation processor architecture? How do they differ from monolithic designs?
Chiplets:
A chiplet design methodology breaks down a large System on Chip (SoC) into smaller, modular dies (chiplets) that are packaged together to function as a single processor.
Difference from Monolithic Design:
- Monolithic Design:
- All components (Cores, Cache, I/O, Memory Controllers) are fabricated on a single large piece of silicon.
- Issue: As chips get larger, yield rates drop (a single defect ruins the whole chip), and manufacturing becomes incredibly expensive.
- Chiplet Design (e.g., AMD Ryzen):
- Components are fabricated separately. For example, high-speed cores are made on a cutting-edge 5nm process, while the I/O die is made on a cheaper 12nm process.
- These separate dies are connected via a high-speed interconnect (like Infinity Fabric) within the same package.
- Benefit: Higher yield, lower cost, and the ability to mix and match process technologies.
Discuss the Memory Hierarchy in a typical GPU. How does Shared Memory differ from Global Memory?
GPU Memory Hierarchy:
To feed thousands of cores, GPUs utilize a specialized memory hierarchy designed to maximize bandwidth.
- Registers: The fastest memory, private to each thread.
- Shared Memory (L1): A user-managed cache located on the Streaming Multiprocessor (SM). Accessible by all threads in a specific block.
- L2 Cache: Unified cache shared across all SMs.
- Global Memory (VRAM): Large, off-chip memory (e.g., GDDR6 or HBM), accessible by all threads but with high latency.
Difference between Shared and Global Memory:
- Scope:
- Shared Memory: Visible only to threads within the same block.
- Global Memory: Visible to all threads in the application.
- Speed:
- Shared Memory: Extremely fast (low latency), similar to L1 cache. Used for inter-thread communication.
- Global Memory: Slow (high latency). Access patterns must be "coalesced" to maximize efficiency.
- Usage: Programmers explicitly load data from Global memory into Shared memory to perform repeated calculations efficiently.
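A short CUDA sketch of this explicit staging (illustrative; the kernel name and block size of 256 are assumptions): each block copies its slice of Global memory into Shared memory, synchronizes, and then performs a block-level reduction entirely out of the on-chip tile.

```cuda
#define BLOCK_SIZE 256   // launch with blockDim.x == BLOCK_SIZE (a power of two)

// Sums 'n' floats; each block produces one partial sum in 'blockSums'.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float tile[BLOCK_SIZE];          // on-chip, per-block Shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f; // one Global-memory read per thread
    __syncthreads();                            // wait until the whole tile is loaded

    // Tree reduction within the block, reusing the Shared-memory tile.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];        // one Global-memory write per block
}
```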
Explain Moore's Law and Dennard Scaling. Why have these trends slowed down in recent years, leading to the "Dark Silicon" era?
Moore's Law:
Observation by Gordon Moore that the number of transistors on a microchip doubles approximately every two years, while the cost per transistor falls.
Dennard Scaling:
The principle that as transistors get smaller, their power density stays constant. This meant we could increase clock speeds without significantly increasing power consumption.
The Slowdown and Dark Silicon:
- Physical Limits: We are approaching atomic scales (2nm, 3nm). Quantum tunneling makes it hard to shrink transistors further reliably.
- Thermal Wall (Breakdown of Dennard Scaling): As transistors shrank further, leakage current increased. We can no longer increase clock frequencies (stalled around 3-5 GHz) because the chips would melt.
- Dark Silicon: Because of the thermal constraints, we can fit more transistors on a chip (following Moore's Law), but we cannot power them all on simultaneously. A large portion of the chip must remain powered off ("dark") at any given time to prevent overheating. This has led to the rise of specialized accelerators (DSAs) rather than just faster general-purpose cores.
What is a System on Chip (SoC)? Detail the typical components found in a modern Smartphone SoC.
System on Chip (SoC):
An SoC is an integrated circuit that integrates all or most components of a computer or other electronic system on a single substrate. It is the standard for mobile computing due to space and power efficiency.
Typical Components of a Smartphone SoC:
- CPU: Usually multi-core ARM architecture (big.LITTLE configuration).
- GPU: Handles graphics rendering and UI fluidity.
- NPU (Neural Processing Unit): Specialized hardware for AI tasks (FaceID, photography enhancement).
- ISP (Image Signal Processor): Processes raw data from the camera sensors.
- Modem: 4G/5G, Wi-Fi, and Bluetooth connectivity modules.
- Memory Controller: Interfaces with RAM.
- DSP (Digital Signal Processor): For audio processing and sensor data management.
- Security Enclave: Dedicated hardware for storing biometrics and encryption keys.
Describe RISC-V and explain why it is considered a significant trend in next-generation processor architecture.
RISC-V:
RISC-V (pronounced "risk-five") is an open standard Instruction Set Architecture (ISA) based on established Reduced Instruction Set Computer (RISC) principles.
Significance in Next-Gen Architecture:
- Open Source: Unlike x86 (Intel/AMD) or ARM (Arm Holdings), which require expensive licensing fees, RISC-V is free and open. Anyone can design, manufacture, and sell RISC-V chips without paying royalties.
- Modularity: The ISA is designed with a small base integer ISA and optional extensions (e.g., for floating point, vector math, or crypto). Designers can implement only the extensions they need, making it ideal for custom embedded systems.
- Sovereignty: Countries and companies concerned about trade restrictions or supply chain control are adopting RISC-V to build independent semiconductor ecosystems.
- Innovation: It allows academia and startups to experiment with novel microarchitectures without legal hurdles.
Explain the concept of Unified Memory Architecture (UMA) as seen in modern processors like Apple's M-series chips.
Unified Memory Architecture (UMA):
Traditionally, the CPU has its own RAM (System Memory) and the GPU has its own RAM (Video Memory or VRAM). Data must be copied back and forth between them over a bus (like PCIe), which is slow and energy-inefficient.
How UMA Works (Apple Silicon/M-Series Context):
- Single Pool: The CPU, GPU, and Neural Engine share a single, high-bandwidth, low-latency pool of memory.
- Zero-Copy: Data does not need to be copied. If the CPU processes an image and the GPU needs to render it, the GPU simply reads the memory address where the CPU left it.
Advantages:
- Performance: Drastically reduces the latency associated with data copying.
- Efficiency: Reduces power consumption.
- Flexibility: The GPU has access to the full system memory amount (e.g., 16GB or 32GB), rather than being limited to a small dedicated VRAM allocation.
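Apple's UMA is a hardware design, but CUDA's managed memory offers a rough software analogue of the zero-copy idea and may help make it concrete (a sketch, not Apple's API; on discrete GPUs the driver may still migrate pages behind the scenes): the CPU and GPU dereference the same pointer, with no explicit cudaMemcpy.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void doubleValues(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *data;

    // One allocation visible to both CPU and GPU: no explicit copy calls.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;       // CPU writes the buffer...

    doubleValues<<<(n + 255) / 256, 256>>>(data, n);  // ...GPU touches the same pointer
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                // Expected: 2.0
    cudaFree(data);
    return 0;
}
```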
What are Domain Specific Architectures (DSAs)? Why are they becoming essential in modern computer architecture trends?
Domain Specific Architectures (DSAs):
DSAs are processors tailored to a specific class of problems/domains, unlike general-purpose CPUs which are designed to do everything reasonably well.
Examples:
- TPUs (Tensor Processing Units): For Machine Learning.
- GPUs: For Graphics and parallel math.
- ISPs: For Image processing.
Why they are essential:
- End of Scaling: With Moore's Law slowing and Dennard Scaling dead, simply adding more general-purpose cores doesn't yield proportional speedups.
- Efficiency: A DSA can be 10x-100x more energy-efficient for its specific task because the hardware logic maps directly to the algorithm (e.g., matrix multiplication in AI).
- Amdahl’s Law: To speed up the overall system, architects are offloading the most time-consuming parts of execution (like AI inference) to specialized hardware.
In the context of Quantum Computing, briefly explain Quantum Decoherence and Quantum Error Correction.
Quantum Decoherence:
- The Problem: Qubits are incredibly fragile. They must be isolated from the environment (heat, electromagnetic waves, vibration). Any interaction with the external environment causes the quantum state (superposition) to collapse into a classical state.
- This loss of quantum information is called Decoherence. It limits the time a quantum computer has to perform calculations (coherence time).
Quantum Error Correction:
- The Solution: In classical computing, we use simple redundancy (like checksums). In quantum computing, we cannot simply "copy" a qubit (No-Cloning Theorem).
- Mechanism: Quantum Error Correction involves using multiple physical qubits (e.g., 1,000) to create a single "Logical Qubit." The system spreads the information across these physical qubits using entanglement so that if one physical qubit flips due to noise, the overall logical information remains intact and can be corrected.
Explain the concept of Speculative Execution and Branch Prediction in microarchitecture. How does this improve performance?
Branch Prediction:
In a pipelined processor, when a conditional branch (like an if statement) is encountered, the processor doesn't know which path to take until the condition is calculated. Instead of waiting (stalling), the Branch Predictor guesses which path is most likely based on history.
Speculative Execution:
Based on the prediction, the processor starts executing instructions along the guessed path before it knows for sure if it is correct. This is Speculative Execution.
Performance Improvement:
- If Correct: The processor keeps the results. The pipeline remained full, and no time was wasted waiting for the condition check.
- If Incorrect: The processor flushes the pipeline and rolls back changes, then takes the correct path.
- Since modern predictors are >95% accurate, this technique keeps the pipeline utilized and significantly increases instruction throughput.
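The classic demonstration of this effect is summing "large" elements of an array before and after sorting it. The sketch below (illustrative host code; exact timings depend on the compiler and hardware, and an optimizer may compile the branch away entirely) shows why a predictable branch is cheaper than an unpredictable one.

```cuda
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// The branch inside the loop is the point: on sorted data its outcome is a long
// run of 'false' followed by a long run of 'true', which the predictor learns;
// on random data it is roughly 50/50 and frequently mispredicted, flushing the
// pipeline each time.
long long sumLarge(const std::vector<int> &v) {
    long long sum = 0;
    for (int x : v)
        if (x >= 128) sum += x;
    return sum;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    for (int &x : data) x = rng() % 256;

    auto time = [&](const char *label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sumLarge(data);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        printf("%s: sum=%lld, %lld ms\n", label, s, ms);
    };

    time("random order");                      // unpredictable branch
    std::sort(data.begin(), data.end());
    time("sorted order");                      // predictable branch, typically faster
    return 0;
}
```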
What is 3D Stacking technology in processor manufacturing? How does it impact bandwidth and footprint?
3D Stacking:
3D Stacking is a packaging technology where multiple layers of silicon circuitry (logic dies or memory dies) are stacked vertically on top of each other and interconnected using Through-Silicon Vias (TSVs).
Impact:
- Footprint: It allows for higher density in a smaller form factor (saving space on the motherboard).
- Bandwidth: By stacking memory directly on top of the logic (or very close to it), the distance signals travel is minimized. This drastically increases communication bandwidth and reduces power consumption compared to sending signals across a PCB.
- Examples: Intel's Foveros, AMD's 3D V-Cache (stacking extra cache on top of the CPU core to boost gaming performance).
Compare x86 and ARM architectures. Why is ARM dominating the mobile market while x86 dominates the high-performance desktop/server market (historically)?
Comparison:
- Instruction Set:
- x86 (Intel/AMD): CISC (Complex Instruction Set Computer). Instructions are complex, variable length, and can perform multiple operations (load, calc, store) in one go.
- ARM: RISC (Reduced Instruction Set Computer). Instructions are simple, fixed length, and generally perform one specific task per cycle.
- Power Consumption:
- x86: Historically prioritizes raw performance over power efficiency. Requires complex hardware to decode instructions.
- ARM: Prioritizes energy efficiency. Simple decoding logic leads to lower power draw.
Market Dominance:
- Mobile (ARM): The strict thermal and battery constraints of phones favor ARM's efficiency. You cannot put a hot, power-hungry chip in a phone without a fan.
- Desktop/Server (x86): Historically, these devices are plugged into the wall and have large cooling solutions, so x86's ability to handle complex legacy code and high raw throughput was preferred. Note: This line is blurring with Apple Silicon and ARM servers.
What is Neuromorphic Computing? How does it differ from traditional Von Neumann architecture?
Neuromorphic Computing:
This is a computer engineering method that designs elements of computer systems to mimic the biological structure of the human nervous system (neurons and synapses).
Differences from Von Neumann Architecture:
- Von Neumann:
- Separate Processing Unit (CPU) and Memory Unit.
- Data moves back and forth (The Von Neumann Bottleneck).
- Operates sequentially using a clock.
- Neuromorphic (e.g., Intel Loihi, IBM TrueNorth):
- Colocation: Memory and processing are integrated into the same unit (artificial neurons).
- Event-Driven: It does not use a global clock. Neurons "spike" (send signals) only when needed, similar to a brain.
- Power: Extremely low power consumption for tasks like pattern recognition and sensory data processing.
Explain the concept of GPGPU (General-Purpose computing on Graphics Processing Units). Provide two real-world applications.
GPGPU:
GPGPU refers to the technique of using a GPU, which typically handles computer graphics and image generation, to perform computation in applications traditionally handled by the CPU.
Why it works:
GPUs are massively parallel processors. If a problem can be broken down into thousands of independent small tasks, a GPU can process it hundreds of times faster than a CPU.
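As a sketch (illustrative; a naive kernel, not what production libraries ship), the core operation behind the Deep Learning application listed below is a matrix multiply, which maps naturally onto a 2D grid of CUDA threads with one thread per output element.

```cuda
// Naive dense matrix multiply: C = A * B for square N x N matrices stored
// row-major. Each thread computes exactly one element of C.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch with a 2D grid, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matMul<<<grid, block>>>(dA, dB, dC, N);
// Production libraries (cuBLAS, cuDNN) use tiled Shared-memory versions of this idea.
```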
Real-World Applications:
- Deep Learning / AI Training: Training neural networks involves multiplying massive matrices of numbers. GPUs can perform these matrix operations in parallel, reducing training time from months to days.
- Cryptocurrency Mining: Calculating hashes (like SHA-256 for Bitcoin) involves repetitive math operations that are easily parallelized on GPUs.
- Scientific Simulation: Protein folding (Folding@home) or weather modeling.