D.To divide an image into meaningful regions or objects
Correct Answer: To divide an image into meaningful regions or objects
Explanation:
Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels) to simplify its representation and make it more meaningful and easier to analyze.
2. Which term describes the process of assigning a class label to every single pixel in an image?
Overview of image segmentation
Easy
A.Object detection
B.Image classification
C.Semantic segmentation
D.Image compression
Correct Answer: Semantic segmentation
Explanation:
Semantic segmentation involves treating image segmentation as a pixel-wise classification problem, where every pixel is assigned to a specific class (e.g., car, road, sky).
3. In thresholding based segmentation, how are pixels typically separated into foreground and background?
thresholding based segmentation
Easy
A.By connecting edges together to form a boundary
B.By calculating the optical flow of moving objects
C.By generating new pixels using a neural network
D.By comparing each pixel's intensity to a specific threshold value
Correct Answer: By comparing each pixel's intensity to a specific threshold value
Explanation:
Thresholding works by evaluating pixel intensities; pixels with values greater than a threshold $T$ are separated from those with values less than $T$.
4. Which well-known method is commonly used to automatically calculate the optimal threshold value?
thresholding based segmentation
Easy
A.Vision Transformer
B.YOLO algorithm
C.Otsu's method
D.Canny edge detector
Correct Answer: Otsu's method
Explanation:
Otsu's method is a classic algorithm used to perform automatic image thresholding by maximizing the variance between the foreground and background classes.
5. Edge-based segmentation primarily relies on identifying what property in an image?
edge-based segmentation
Easy
A.The most frequently occurring color
B.Rapid changes or discontinuities in pixel intensity
C.Large, uniform regions of identical pixels
D.The overall brightness of the image
Correct Answer: Rapid changes or discontinuities in pixel intensity
Explanation:
Edge detection identifies boundaries of objects within images by looking for areas where the image brightness changes sharply or has discontinuities.
6. Which of the following is a widely used operator for detecting edges in an image?
edge-based segmentation
Easy
A.Sobel operator
B.ReLU operator
C.K-means operator
D.Max-pooling operator
Correct Answer: Sobel operator
Explanation:
The Sobel operator is a discrete differentiation operator used to compute an approximation of the gradient of the image intensity function, making it a standard tool for edge detection.
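As a minimal sketch of the gradient approximation described above, the following pure-Python code applies the two standard 3x3 Sobel kernels to a toy image. The function name `sobel_magnitude` and the toy data are illustrative assumptions, and border pixels are simply left at zero.

```python
# Sobel kernels for horizontal (x) and vertical (y) intensity changes.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(img):
    """Return the gradient magnitude for interior pixels; borders stay at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(KX[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(KY[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out

# Toy image (hypothetical data): dark left half, bright right half.
image = [[0, 0, 10, 10] for _ in range(4)]
mag = sobel_magnitude(image)
```

The interior pixels next to the vertical step get a large magnitude, which is what a later thresholding step would mark as an edge.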
7. Which segmentation technique starts with a 'seed' pixel and iteratively adds neighboring pixels that share similar properties?
region based segmentation
Easy
A.Image generation
B.Edge linking
C.Region growing
D.Global thresholding
Correct Answer: Region growing
Explanation:
Region growing is a pixel-based image segmentation method that begins with one or more seed points and grows regions by appending adjacent pixels that have properties similar to the seed.
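A small sketch of region growing, assuming 4-connectivity and a homogeneity test that compares each candidate to the seed intensity. The function name `region_grow`, the tolerance `tol`, and the toy image are hypothetical choices, not a reference implementation.

```python
from collections import deque

def region_grow(img, seed, tol):
    """Grow a region from `seed`, adding 4-connected neighbors whose
    intensity is within `tol` of the seed intensity."""
    h, w = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < h and 0 <= nc < w
            if inside and (nr, nc) not in region:
                if abs(img[nr][nc] - seed_val) <= tol:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region

# Toy image (hypothetical data): a dark 2x2 corner next to brighter pixels.
image = [
    [10, 11, 50],
    [12, 10, 52],
    [51, 49, 50],
]
grown = region_grow(image, seed=(0, 0), tol=3)
```

Lowering `tol` makes the criterion stricter and the grown region smaller; raising it lets the region leak into neighboring structures.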
8. What is the fundamental mechanism behind the 'split and merge' segmentation technique?
region based segmentation
Easy
A.Training a neural network to classify the image into two halves
B.Detecting all edges and deleting the internal pixels
C.Dividing the image into quadrants and merging adjacent regions that are homogeneous
D.Applying a global threshold and discarding small pixels
Correct Answer: Dividing the image into quadrants and merging adjacent regions that are homogeneous
Explanation:
The split and merge algorithm works by recursively splitting an image into smaller regions if they are not uniform, and then merging adjacent regions that satisfy a similarity criterion.
9. Which standard machine learning algorithm is frequently applied for clustering-based image segmentation?
clustering based segmentation
Easy
A.Logistic regression
B.K-means clustering
C.Linear regression
D.Decision trees
Correct Answer: K-means clustering
Explanation:
K-means clustering is highly popular for image segmentation. It groups pixels into clusters based on their color or intensity similarity.
10. When applying K-means clustering to segment an RGB color image, what do the data points typically represent?
clustering based segmentation
Easy
A.Text labels describing the image
B.Image file sizes
C.Bounding box coordinates around objects
D.Pixel color values (e.g., R, G, B vectors)
Correct Answer: Pixel color values (e.g., R, G, B vectors)
Explanation:
In K-means image segmentation, each pixel is treated as a data point in a color space (like RGB), and the algorithm groups pixels with similar color vectors.
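To make the "pixels as data points" idea concrete, here is a minimal K-means sketch over RGB tuples. The initial centers are supplied by the caller to keep the run deterministic, and the function name `kmeans_pixels` and the four sample pixels are illustrative assumptions.

```python
def kmeans_pixels(pixels, centers, iterations=10):
    """Plain k-means on RGB tuples with caller-supplied initial centers."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = list(centers)
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: each pixel goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
                  for p in pixels]
        # Update step: each center moves to the mean of its assigned pixels.
        for k in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centers

# Hypothetical pixels: two reddish and two bluish colors.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 245), (20, 5, 250)]
labels, centers = kmeans_pixels(pixels, centers=[(255, 0, 0), (0, 0, 255)])
```

Reshaping `labels` back to the image grid gives the segmentation map; replacing each pixel with its cluster center gives a color-quantized image.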
11. What is the primary function of generative models in computer vision?
Generative Models
Easy
A.To detect faces in a crowded scene
B.To sort images into chronological order
C.To compress images without losing quality
D.To create new, synthetic images that resemble the original training data
Correct Answer: To create new, synthetic images that resemble the original training data
Explanation:
Generative models learn the underlying distribution of the training data and can be used to generate entirely new samples (e.g., images) that look realistic.
12. Which of the following architectures is a famous example of a generative model?
GANs consist of a generator and a discriminator network that compete against each other, making them one of the most prominent frameworks for generating realistic images.
13. Vision Transformers (ViT) process an input image by first dividing it into what?
Vision transformers
Easy
A.Continuous edges
B.Single pixels
C.Fixed-size patches
D.Color channels only
Correct Answer: Fixed-size patches
Explanation:
Unlike standard CNNs, a Vision Transformer (ViT) divides an image into a sequence of non-overlapping, fixed-size patches, treating them similarly to words in a text sentence.
14. What core mechanism allows Vision Transformers to determine the relationship and importance of different image patches?
Vision transformers
Easy
A.Max-pooling
B.Otsu's thresholding
C.Self-attention mechanism
D.Gaussian blurring
Correct Answer: Self-attention mechanism
Explanation:
Transformers rely on the self-attention mechanism, which enables the model to weigh the importance of different patches of the image relative to one another.
15. What is the defining characteristic of a multimodal vision model?
multimodal vision models
Easy
A.It only processes high-resolution black and white images
B.It processes and connects multiple types of data modalities, such as images and text
C.It uses only convolutional layers without any fully connected layers
D.It splits a single image into multiple color channels
Correct Answer: It processes and connects multiple types of data modalities, such as images and text
Explanation:
Multimodal models integrate and process information from different modalities (like vision, language, and audio) simultaneously to perform complex tasks.
16. Which of the following is a classic example of a task performed by a multimodal vision model?
multimodal vision models
Easy
A.Image captioning (generating descriptive text from an image)
B.Grayscale conversion
C.Basic thresholding
D.Applying a median filter
Correct Answer: Image captioning (generating descriptive text from an image)
Explanation:
Image captioning requires the model to understand visual content (vision) and generate natural language descriptions (text), making it a multimodal task.
17. Large vision models (LVMs) are primarily characterized by which of the following?
large vision models
Easy
A.Being lightweight enough to run on basic digital watches
B.Having a massive number of parameters and being trained on vast datasets
C.Segmenting images using basic global thresholding only
D.Using only a single hidden layer of neurons
Correct Answer: Having a massive number of parameters and being trained on vast datasets
Explanation:
Large Vision Models scale up both the model size (billions of parameters) and the amount of training data, allowing them to learn highly robust and generalizable features.
18. What is a major advantage of utilizing large vision models over traditional, smaller models?
large vision models
Easy
A.Better generalization and strong performance on a wide variety of tasks (zero-shot learning)
B.They completely eliminate the need for computer memory
C.They can be trained from scratch in minutes on a standard laptop CPU
D.They require zero training data to function
Correct Answer: Better generalization and strong performance on a wide variety of tasks (zero-shot learning)
Explanation:
Because they are trained on massive datasets, large vision models demonstrate excellent generalization, often performing well on new tasks without task-specific fine-tuning (zero-shot capabilities).
19. CLIP (Contrastive Language-Image Pretraining) is a well-known example of which type of model?
Vision-Language Models
Easy
A.Simple Edge Detector
B.K-means Clustering Model
C.Vision-Language Model
D.Region Growing Algorithm
Correct Answer: Vision-Language Model
Explanation:
CLIP is a foundational Vision-Language Model developed by OpenAI that learns to connect images and text by predicting which text snippet corresponds to which image.
20. Which of the following is a primary use case for Vision-Language Models?
Vision-Language Models
Easy
A.Converting a color photograph into a grayscale array
B.Finding the optimal threshold limit for image binarization
C.Applying a Gaussian blur to reduce image noise
D.Searching for specific images using natural language text queries
Correct Answer: Searching for specific images using natural language text queries
Explanation:
Vision-Language Models align text and image embeddings in the same space, making them highly effective for cross-modal tasks like text-based image retrieval (searching for images using text).
21. In an autonomous driving application, a system needs to identify every individual car on the road and label the drivable road surface. Which combination of segmentation techniques is most appropriate?
Overview of image segmentation
Medium
A.Semantic segmentation for cars, instance segmentation for the road
B.Instance segmentation for cars, semantic segmentation for the road
C.Panoptic segmentation for the road, bounding box detection for cars
D.Image classification for cars, semantic segmentation for the road
Correct Answer: Instance segmentation for cars, semantic segmentation for the road
Explanation:
Instance segmentation differentiates between individual objects of the same class (like distinct cars), while semantic segmentation categorizes pixels into classes without distinguishing individuals (like the continuous road surface).
22. Why is generating superpixels often used as a preprocessing step in complex image segmentation tasks?
Overview of image segmentation
Medium
A.To artificially increase the dataset size for training neural networks
B.To group perceptually similar pixels, reducing the complexity of subsequent algorithms
C.To increase the resolution of the image for finer details
D.To convert a colored image into a binary mask directly
Correct Answer: To group perceptually similar pixels, reducing the complexity of subsequent algorithms
Explanation:
Superpixels group adjacent pixels with similar colors or textures into larger coherent regions. This significantly reduces the number of primitives the subsequent segmentation algorithm must process, improving computational efficiency.
23. Otsu's method is used to automatically find an optimal threshold value. Which of the following mathematical objectives does Otsu's method optimize?
thresholding based segmentation
Medium
A.Maximizing the number of segmented regions
B.Maximizing the within-class variance
C.Minimizing the within-class variance
D.Minimizing the between-class variance
Correct Answer: Minimizing the within-class variance
Explanation:
Otsu's method finds the optimal threshold by minimizing the within-class variance (the spread of pixel intensities within the background and foreground), which is mathematically equivalent to maximizing the between-class variance.
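The equivalence holds because the total variance of the image is fixed and equals the within-class variance plus the between-class variance, so minimizing one maximizes the other. A minimal sketch of the search, assuming integer intensities and the hypothetical function name `otsu_threshold`:

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.
    Pixels with value <= t form one class, the rest form the other."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0 so far
    sum0 = 0  # intensity sum of class 0 so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2  # proportional to sigma_B^2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

# Hypothetical bimodal sample: four dark and four bright pixels.
pixels = [20, 22, 25, 21, 200, 205, 198, 210]
t = otsu_threshold(pixels)
```

Every threshold between the two modes gives the same split here, so the sketch reports the first one it finds, which is the largest dark value.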
24. An image of a document has uneven illumination, with the left side heavily shadowed. Applying a global threshold results in the shadowed background being misclassified as text. Which approach best resolves this?
thresholding based segmentation
Medium
A.Histogram equalization followed by Otsu's method
B.Decreasing the global threshold value
C.Increasing the global threshold value
D.Adaptive (local) thresholding
Correct Answer: Adaptive (local) thresholding
Explanation:
Adaptive thresholding calculates the threshold for small regions of the image individually. This allows it to handle varying illumination across the image, unlike global thresholding which uses a single value for the entire image.
25. In the Canny edge detection algorithm, what is the primary purpose of the hysteresis thresholding step?
edge-based segmentation
Medium
A.To smooth the image using a Gaussian filter
B.To suppress non-maximum pixels to thin the edges
C.To link fragmented edges and eliminate noise-induced weak edges
D.To compute the gradient magnitude and direction
Correct Answer: To link fragmented edges and eliminate noise-induced weak edges
Explanation:
Hysteresis uses a high and a low threshold. Strong edges (above high) are kept, and weak edges (between low and high) are only kept if they are connected to a strong edge, effectively linking fragmented edges while discarding noise.
26. After applying an edge detector, a circular object in the image appears as a series of disconnected edge fragments. Which technique is most appropriate to fit a continuous boundary to these fragments?
edge-based segmentation
Medium
A.Hough Transform for circles
B.Otsu's thresholding
C.Watershed algorithm
D.K-means clustering
Correct Answer: Hough Transform for circles
Explanation:
The Hough Transform is a feature extraction technique used to find imperfect instances of objects within a certain class of shapes (like lines or circles) by voting in a parameter space, making it ideal for connecting fragmented boundaries.
27. In a region growing segmentation algorithm, starting from a set of seed points, how does the choice of the homogeneity criterion affect the final output?
region based segmentation
Medium
A.A very relaxed criterion results in over-segmentation.
B.The criterion only affects the speed of the algorithm, not the output size.
C.A very strict criterion results in under-segmentation.
D.A very strict criterion results in small, disjointed regions.
Correct Answer: A very strict criterion results in small, disjointed regions.
Explanation:
If the homogeneity criterion (e.g., tolerance for color difference) is too strict, the algorithm will stop growing prematurely, leading to over-segmentation and many small, fragmented regions.
28. The split-and-merge segmentation technique represents an image using which specific data structure?
region based segmentation
Medium
A.Binary Search Tree
B.Hash Table
C.Quadtree
D.Linked List
Correct Answer: Quadtree
Explanation:
The split-and-merge algorithm recursively divides the image into four smaller quadrants if a region is non-homogeneous. This hierarchical subdivision is perfectly represented by a quadtree data structure.
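The splitting phase can be sketched as a short recursion that builds the quadtree directly. This assumes a square image whose side is a power of two and uses "max minus min within the block" as a stand-in homogeneity test; the function name `split` and the parameter `max_range` are hypothetical.

```python
def split(img, r, c, size, max_range):
    """Recursively split a square block until max - min <= max_range.
    A leaf is (row, col, size); an internal node is a list of four children."""
    block = [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    if size == 1 or max(block) - min(block) <= max_range:
        return (r, c, size)
    half = size // 2
    return [split(img, r + dr, c + dc, half, max_range)
            for dr in (0, half) for dc in (0, half)]

# Toy image (hypothetical data): three uniform quadrants and one mixed one.
image = [
    [5, 5, 9, 9],
    [5, 5, 9, 9],
    [5, 5, 5, 0],
    [5, 5, 5, 5],
]
tree = split(image, 0, 0, 4, max_range=0)
```

Only the mixed bottom-right quadrant is subdivided further; a merge pass would then rejoin adjacent leaves that satisfy the same predicate.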
29. When using K-means clustering for image segmentation based on color, why is it often advantageous to convert the image from RGB to the CIELAB ($L^*a^*b^*$) color space first?
clustering based segmentation
Medium
A.RGB cannot be used with Euclidean distance metrics.
B.CIELAB automatically determines the optimal number of clusters ($K$).
C.CIELAB compresses the image, making K-means run faster.
D.CIELAB separates luminance ($L^*$) from chrominance ($a^*$, $b^*$), and its distances are more perceptually uniform.
Correct Answer: CIELAB separates luminance ($L^*$) from chrominance ($a^*$, $b^*$), and its distances are more perceptually uniform.
Explanation:
The $L^*a^*b^*$ space is designed to be perceptually uniform, meaning Euclidean distances in this space match human visual perception. By clustering on $a^*$ and $b^*$, the algorithm is also more robust to lighting variations.
30. Which of the following is a distinct advantage of Mean Shift clustering over K-means clustering for image segmentation?
clustering based segmentation
Medium
A.Mean Shift is significantly computationally faster than K-means for large images.
B.Mean Shift requires defining the number of clusters a priori.
C.Mean Shift is strictly a parametric model.
D.Mean Shift does not require specifying the number of clusters in advance.
Correct Answer: Mean Shift does not require specifying the number of clusters in advance.
Explanation:
Unlike K-means, which requires the user to specify $K$, Mean Shift is a non-parametric, density-based algorithm that automatically discovers the number of clusters based on the data distribution and a bandwidth parameter.
31. In a Generative Adversarial Network (GAN) trained to synthesize images, what does the minimax loss function conceptually represent?
Generative Models
Medium
A.The Generator minimizing the Discriminator's accuracy, while the Discriminator maximizes it.
B.The Discriminator generating fake images to fool the Generator.
C.Both models cooperating to minimize the distance to the real data distribution.
D.The Generator maximizing the image resolution while the Discriminator minimizes noise.
Correct Answer: The Generator minimizing the Discriminator's accuracy, while the Discriminator maximizes it.
Explanation:
GANs operate on a minimax game framework: the Generator tries to minimize the probability that the Discriminator correctly identifies fake images, while the Discriminator tries to maximize its ability to distinguish real from fake.
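For reference, the minimax game described above is usually written as the following value function, where $D(x)$ is the Discriminator's estimated probability that $x$ is real and $G(z)$ is the Generator's output for a noise vector $z$ (the symbol names follow the common convention and are not taken from this quiz):

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The Discriminator raises $V$ by scoring real images near 1 and generated images near 0; the Generator lowers $V$ by making $D(G(z))$ approach 1.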
32. In Denoising Diffusion Probabilistic Models (DDPMs), what is the primary function of the forward (diffusion) process?
Generative Models
Medium
A.To compress an image into a lower-dimensional latent vector.
B.To gradually add Gaussian noise to an image until it becomes pure noise.
C.To train a neural network to remove noise from an image.
D.To map textual prompts to a latent image space.
Correct Answer: To gradually add Gaussian noise to an image until it becomes pure noise.
Explanation:
The forward process in a diffusion model is a fixed Markov chain that progressively adds Gaussian noise to the data over $T$ steps, eventually destroying the structure and resulting in an isotropic Gaussian distribution.
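A rough sketch of the forward process, using the closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1 - \beta_s)$. The linear noise schedule, the step count of 1000, and the function names are assumptions chosen for illustration.

```python
import math
import random

def alpha_bar_schedule(betas):
    """Cumulative product of (1 - beta_t): the share of signal left after t steps."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def diffuse(x0, alpha_bar_t, rng):
    """Sample x_t directly from x_0 using the closed form of the forward process."""
    return [math.sqrt(alpha_bar_t) * v
            + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)
            for v in x0]

T = 1000
# Linear schedule from 1e-4 to 0.02 (an assumed, commonly used setting).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = alpha_bar_schedule(betas)

x0 = [1.0, -1.0, 0.5]  # hypothetical "image" with three pixel values
x_T = diffuse(x0, alpha_bars[-1], random.Random(0))
```

Because `alpha_bars` shrinks toward zero, the last sample is almost entirely noise, which is the "pure noise" endpoint the explanation refers to.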
33. How do Vision Transformers (ViTs) handle the 2D spatial structure of an image when processing it via a standard transformer architecture designed for 1D sequences?
Vision transformers
Medium
A.By splitting the image into fixed-size patches, flattening them, and adding 1D positional embeddings.
B.By using 2D convolutional layers before the transformer blocks.
C.By flattening the entire image into a single long sequence of pixels.
D.By relying entirely on the self-attention mechanism to deduce spatial relationships dynamically.
Correct Answer: By splitting the image into fixed-size patches, flattening them, and adding 1D positional embeddings.
Explanation:
ViTs divide an image into a grid of non-overlapping patches, flatten each patch into a 1D vector, and linearly project them. Positional embeddings are added to these patch embeddings so the model retains spatial information.
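The patching step can be sketched in a few lines. This illustrative version works on a single-channel image stored as a list of rows and omits the learned linear projection and the learned positional embeddings; `patchify` is a hypothetical helper name.

```python
def patchify(img, patch):
    """Split an H x W image into flattened patch vectors in row-major order."""
    h, w = len(img), len(img[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            patches.append([img[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

# 4x4 toy image whose pixel values equal their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

# 1D position index for each patch; a real model would look up an embedding here.
positions = list(range(len(tokens)))
```

Each entry of `tokens` plays the role of one "word" in the sequence, and `positions` records where in the grid it came from.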
34. If an image is divided into $N$ patches for a Vision Transformer, what is the computational complexity of the self-attention mechanism with respect to $N$?
Vision transformers
Medium
A.
B.
C.
D.
Correct Answer: $O(N^2)$
Explanation:
Standard self-attention computes the pairwise relationships between all elements in a sequence. For $N$ patches, it computes an $N \times N$ attention matrix, resulting in a quadratic complexity $O(N^2)$.
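The quadratic cost is visible in a direct implementation: every token is scored against every other token. The sketch below is a simplification that uses the tokens themselves as queries and keys (learned projections are omitted), and `self_attention_weights` is a hypothetical name.

```python
import math

def self_attention_weights(tokens):
    """Return the N x N matrix softmax(Q K^T / sqrt(d)) with Q = K = tokens."""
    d = len(tokens[0])
    weights = []
    for q in tokens:                      # N queries ...
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]        # ... each compared with N keys
        m = max(scores)                   # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Three hypothetical 2-dimensional patch embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = self_attention_weights(tokens)
```

Doubling the number of patches quadruples the number of entries in `attn`, which is the practical meaning of the quadratic complexity.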
35. In contrastive multimodal models like CLIP, how is the model trained to align images and text?
multimodal vision models
Medium
A.By reconstructing the image from the text embedding using a decoder.
B.By maximizing the cosine similarity between the embeddings of paired images and texts while minimizing it for incorrect pairs.
C.By training a classifier to predict the text tokens directly from the image pixels.
D.By generating a caption for the image and calculating the BLEU score.
Correct Answer: By maximizing the cosine similarity between the embeddings of paired images and texts while minimizing it for incorrect pairs.
Explanation:
CLIP uses a contrastive loss that pulls the embeddings of matching image-text pairs close together in the latent space (high cosine similarity) and pushes apart the embeddings of mismatched pairs.
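A toy version of this objective is a symmetric cross-entropy over the matrix of cosine similarities, where row i and column i belong to the same image-text pair. The function name `clip_style_loss`, the temperature value, and the 2-dimensional embeddings are assumptions for illustration only.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; pair i matches pair i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]

    def row_ce(rows):
        total = 0.0
        for idx, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[idx]     # negative log-probability of the true pair
        return total / len(rows)

    cols = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (row_ce(logits) + row_ce(cols))

# Hypothetical embeddings: correctly paired vs. deliberately swapped texts.
aligned = clip_style_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_style_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is close to zero when matching pairs point in the same direction and large when they do not, so gradient descent pulls true pairs together and pushes mismatched pairs apart.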
36. In a multimodal architecture that uses cross-attention to fuse image and text features, which component typically acts as the 'Query' in the cross-attention layer when generating text conditioned on an image?
multimodal vision models
Medium
A.The image representations
B.The positional encodings
C.The classification token (CLS)
D.The text representations
Correct Answer: The text representations
Explanation:
In cross-attention for text generation, the generated text representations act as the Queries, attending to the Keys and Values which are provided by the image representations, allowing the text generation to be guided by the visual context.
37. The Segment Anything Model (SAM) is characterized as a 'promptable' segmentation model. What does this mean in practice?
large vision models
Medium
A.It prompts the user to manually draw the segmentation boundaries.
B.It can output a segmentation mask based on various inputs like points, boxes, or text.
C.It generates natural language prompts to describe segments in an image.
D.It requires text prompts to generate an entirely new image.
Correct Answer: It can output a segmentation mask based on various inputs like points, boxes, or text.
Explanation:
SAM is designed to be interactive and promptable: given a prompt such as a click, a bounding box, or (as explored in the original paper) free-form text, it generates a segmentation mask corresponding to that prompt.
38. Which of the following best describes the 'zero-shot' capability of large vision models?
large vision models
Medium
A.The ability to predict object trajectories with zero initial data.
B.The ability to compress an image to zero loss.
C.The ability to train a model from scratch in zero time.
D.The ability to perform a task on a dataset or class it was never explicitly trained on, without further fine-tuning.
Correct Answer: The ability to perform a task on a dataset or class it was never explicitly trained on, without further fine-tuning.
Explanation:
Zero-shot transfer refers to a model's ability to accurately categorize or process data from classes or tasks it did not explicitly see during training, usually enabled by rich pre-training on massive, diverse datasets.
39. In a standard Visual Question Answering (VQA) system, how are the final predictions typically generated?
Vision-Language Models
Medium
A.By converting the image to text via OCR and querying a database.
B.By fusing extracted image features and extracted text features, then passing them through a classifier or decoder.
C.By inputting only the image into a CNN and classifying the scene.
D.By ignoring the image and treating the task as standard text-based question answering.
Correct Answer: By fusing extracted image features and extracted text features, then passing them through a classifier or decoder.
Explanation:
VQA requires understanding both modalities. The standard pipeline extracts features from the image (via a vision encoder) and the question (via a text encoder), fuses them (e.g., via attention or concatenation), and predicts an answer.
40. Which objective function is most commonly used to train the language decoder portion of an Image Captioning model?
Vision-Language Models
Medium
A.Contrastive loss between two different generated captions.
B.Intersection over Union (IoU) loss.
C.Mean Squared Error (MSE) loss between images.
D.Cross-entropy loss for next-token prediction, conditioned on the image and previous tokens.
Correct Answer: Cross-entropy loss for next-token prediction, conditioned on the image and previous tokens.
Explanation:
Image captioning relies on autoregressive language generation. The model is trained using cross-entropy loss to predict the next correct word in the sequence, given the image representation and the previously generated words.
41. In a complex urban scene containing mutually occluding cars (countable objects) and continuous regions of sky and road (amorphous regions), which of the following formulations perfectly describes the mathematical objective of panoptic segmentation?
Overview of image segmentation
Hard
A.Generating bounding boxes for all objects and subsequently predicting binary masks independently for each bounding box.
B.Assigning an instance ID to every pixel, treating amorphous regions as a single large instance.
C.Mapping each pixel $i$ to a tuple $(l_i, z_i)$, where $z_i$ uniquely identifies 'thing' instances and is ignored or set to a null value for 'stuff' classes.
D.Assigning a single semantic class label to every pixel in the image, ignoring distinct object identities.
Correct Answer: Mapping each pixel $i$ to a tuple $(l_i, z_i)$, where $z_i$ uniquely identifies 'thing' instances and is ignored or set to a null value for 'stuff' classes.
Explanation:
Panoptic segmentation unifies semantic and instance segmentation. Every pixel is assigned a semantic label $l_i$. For 'thing' classes (e.g., cars), an instance ID $z_i$ differentiates individual objects. For 'stuff' classes (e.g., sky), the instance ID is irrelevant.
42. Otsu's method determines the optimal threshold by maximizing the between-class variance $\sigma_B^2$. Under which of the following conditions does Otsu's method mathematically fail to identify the ideal threshold separating an object and background?
Thresholding based segmentation
Hard
A.When the image contains high-frequency salt-and-pepper noise.
B.When the image is illuminated by a uniform, spatially invariant light source.
C.When the histogram is perfectly bimodal with equal class probabilities.
D.When the variances of the object and background classes are drastically different, and the object occupies a disproportionately small fraction of the image area.
Correct Answer: When the variances of the object and background classes are drastically different, and the object occupies a disproportionately small fraction of the image area.
Explanation:
Otsu's method implicitly assumes that the histogram is bimodal and the two classes have approximately equal variances and sizes. When one class is vastly smaller or has a much larger variance, the maximum between-class variance shifts away from the true valley in the histogram.
43. When computing a local adaptive threshold for a window of size $w \times w$ using the local mean and variance, what is the asymptotic time complexity per pixel if the algorithm utilizes integral images (summed-area tables)?
Thresholding based segmentation
Hard
A.
B.
C.
D.
Correct Answer: $O(1)$
Explanation:
An integral image allows the sum (and sum of squares) of pixel values within any rectangular area to be computed using exactly four array references, regardless of the window size $w$. Therefore, calculating local mean and variance takes constant time per pixel.
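The four-lookup trick can be sketched as follows, assuming a summed-area table padded with one extra row and column of zeros so that no boundary checks are needed. The function names and the 3x3 toy image are illustrative.

```python
def integral_image(img):
    """S[r][c] holds the sum of img over rows < r and columns < c."""
    h, w = len(img), len(img[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            S[r + 1][c + 1] = img[r][c] + S[r][c + 1] + S[r + 1][c] - S[r][c]
    return S

def window_sum(S, r0, c0, r1, c1):
    """Sum over rows r0..r1 and columns c0..c1 (inclusive) with four lookups."""
    return S[r1 + 1][c1 + 1] - S[r0][c1 + 1] - S[r1 + 1][c0] + S[r0][c0]

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
S = integral_image(image)
total = window_sum(S, 0, 0, 2, 2)         # whole image
center_block = window_sum(S, 1, 1, 2, 2)  # bottom-right 2x2 block
```

Dividing a window sum by the window area gives the local mean; building a second table from squared pixel values gives the local variance in the same constant time.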
44. In the Canny edge detector, hysteresis thresholding uses a high threshold ($T_H$) and a low threshold ($T_L$). If a continuous edge segment consists of pixels with gradient magnitudes $M_1 > T_H$, $M_2 < T_L$, and $T_L \le M_3 \le T_H$ connected in the order $1 \to 2 \to 3$, which pixels are retained as strong edges in the final output?
Edge-based segmentation
Hard
A.None of the pixels.
B.Pixel 1 only.
C.Pixels 1, 2, and 3.
D.Pixels 1 and 3 only.
Correct Answer: Pixel 1 only.
Explanation:
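Pixel 1 is retained because its gradient magnitude exceeds the high threshold ($M_1 > T_H$). Pixel 2 is immediately discarded because its magnitude falls below the low threshold ($M_2 < T_L$). Since Pixel 3 is only connected to the strong edge (Pixel 1) via Pixel 2, which has been discarded, the path is broken. Thus, Pixel 3 is not promoted to a strong edge.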
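The same reasoning can be checked with a small sketch of hysteresis along a 1D chain of pixels. The threshold values and gradient magnitudes below are hypothetical numbers chosen to reproduce the situation in the explanation; they are not the values from the original question.

```python
def hysteresis_1d(mags, t_low, t_high):
    """Keep strong pixels (> t_high) plus weak pixels (>= t_low) that can be
    reached from a strong pixel through neighbors that are all at least weak."""
    n = len(mags)
    keep = [m > t_high for m in mags]
    stack = [i for i in range(n) if keep[i]]
    while stack:
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and not keep[j] and mags[j] >= t_low:
                keep[j] = True
                stack.append(j)
    return keep

# Middle pixel below the low threshold: the chain is broken.
broken = hysteresis_1d([120, 40, 80], t_low=50, t_high=100)
# Middle pixel is weak but not discarded: the chain links through.
linked = hysteresis_1d([120, 60, 80], t_low=50, t_high=100)
```

In the first case only the strong pixel survives; in the second, the weak middle pixel bridges the gap and all three are kept.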
45. The Laplacian of Gaussian (LoG) filter is defined as $\nabla^2 G_\sigma(x, y)$, the Laplacian of a 2D Gaussian $G_\sigma$ with standard deviation $\sigma$. In the frequency domain, what is the effect of the parameter $\sigma$ on the filter's magnitude response as a function of radial frequency $\omega$?
Edge-based segmentation
Hard
A.It acts as a low-pass filter where the cutoff frequency increases linearly with $\sigma$.
B.It acts as a high-pass filter where increasing $\sigma$ amplifies high frequencies proportionally to .
C.It acts as a band-pass filter where the peak frequency response is inversely proportional to $\sigma$.
D.It acts as an all-pass filter, shifting the phase of edges but leaving magnitudes unchanged.
Correct Answer: It acts as a band-pass filter where the peak frequency response is inversely proportional to $\sigma$.
Explanation:
The LoG is a band-pass filter. The Gaussian smoothing acts as a low-pass filter (attenuating high frequencies), while the Laplacian acts as a high-pass filter (attenuating low frequencies). The resulting peak response in the frequency domain occurs at $\omega = \sqrt{2}/\sigma$, making the peak frequency inversely proportional to $\sigma$.
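The location of the peak follows from a short derivation. Up to a constant factor, the Fourier transform of a Gaussian with standard deviation $\sigma$ is $e^{-\sigma^2\omega^2/2}$, and the Laplacian multiplies the spectrum by $-\omega^2$, so (with $H$ denoting the LoG frequency response, a symbol introduced here for convenience):

```latex
\begin{align*}
|H(\omega)| &\propto \omega^2 \, e^{-\sigma^2 \omega^2 / 2} \\
\frac{d\,|H(\omega)|}{d\omega} &\propto \left(2\omega - \sigma^2 \omega^3\right) e^{-\sigma^2 \omega^2 / 2} = 0
\quad\Longrightarrow\quad \omega_{\text{peak}} = \frac{\sqrt{2}}{\sigma}
\end{align*}
```

A larger $\sigma$ therefore moves the pass band toward lower frequencies, which is why wide LoG kernels respond to coarse structures and narrow ones to fine detail.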
46. In the Split-and-Merge segmentation algorithm, a region $R$ is split into four sub-regions if a homogeneity predicate $P(R) = \text{FALSE}$. To prevent the generation of blocky boundaries, which data structure and subsequent process are critical for optimal merging?
Region based segmentation
Hard
A.A Region Adjacency Graph (RAG) followed by merging adjacent regions if $P(R_i \cup R_j) = \text{TRUE}$.
B.A minimum spanning tree followed by normalized cuts.
C.A quadtree followed by merging only siblings that share the same parent node if $P(R_i \cup R_j) = \text{TRUE}$.
D.A KD-Tree followed by K-means clustering.
Correct Answer: A Region Adjacency Graph (RAG) followed by merging adjacent regions if $P(R_i \cup R_j) = \text{TRUE}$.
Explanation:
While a quadtree is used for the splitting phase, merging only siblings results in blocky artifacts because spatially adjacent regions from different parent nodes are ignored. A Region Adjacency Graph (RAG) allows the algorithm to evaluate and merge any spatially adjacent regions, mitigating blocky boundaries.
47. The standard watershed transform often suffers from severe over-segmentation due to local minima caused by noise. In marker-controlled watershed segmentation, how is the topological surface (gradient image) modified to enforce markers as the only regional minima?
Region based segmentation
Hard
A.By applying the Hough transform to force watershed lines into geometric shapes.
B.By applying a heavily regularized Gaussian blur to completely eliminate high-frequency noise.
C.By applying a global threshold to the gradient image before the flooding process.
D.By using morphological reconstruction (minima imposition) to modify the gradient image such that local minima only occur at marker locations.
Correct Answer: By using morphological reconstruction (minima imposition) to modify the gradient image such that local minima only occur at marker locations.
Explanation:
Minima imposition is a morphological operation that alters the topology of an image. It suppresses all local minima in the gradient image that do not fall within the predefined marker sets, ensuring that the subsequent watershed flooding originates strictly from the markers, thereby preventing over-segmentation.
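48In Mean Shift segmentation, pixels are mapped into a joint spatial-color feature space $(\mathbf{x}^s, \mathbf{x}^r)$, where $\mathbf{x}^s$ is the pixel position and $\mathbf{x}^r$ is its color. If the spatial bandwidth $h_s$ approaches infinity while the range bandwidth $h_r$ remains small, what is the asymptotic behavior of the segmentation?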
Clustering based segmentation
Hard
A.The algorithm reduces to a purely spatial clustering, resulting in uniformly sized superpixels regardless of color.
B.The algorithm ignores spatial constraints and reduces to exact K-means clustering in the RGB color space.
C.The algorithm fails to converge because the kernel density estimate becomes uniform everywhere.
D.The algorithm converges to density modes entirely based on color similarity, effectively performing global color quantization.
Correct Answer: The algorithm converges to density modes entirely based on color similarity, effectively performing global color quantization.
Explanation:
When the spatial bandwidth $h_s \to \infty$, the spatial distance between any two pixels has a negligible penalty in the kernel function. Thus, the Mean Shift vector is computed using all pixels in the image weighted only by their color proximity (controlled by the range bandwidth $h_r$), acting as a global color clustering algorithm.
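The limit can be checked numerically. The sketch below assumes a product of Gaussian kernels over position and color (one common choice, not the only one) and compares a single Mean Shift update at a very large spatial bandwidth against a color-only weighted mean; the data are random and the function name is hypothetical.

```python
import numpy as np

def mean_shift_color(point_s, point_r, pos, col, h_s, h_r):
    """One Mean Shift update of the color part under a Gaussian product kernel."""
    w_s = np.exp(-np.sum((pos - point_s) ** 2, axis=1) / (2 * h_s ** 2))
    w_r = np.exp(-np.sum((col - point_r) ** 2, axis=1) / (2 * h_r ** 2))
    w = w_s * w_r
    return (w[:, None] * col).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
pos = rng.uniform(0, 64, size=(500, 2))   # pixel positions
col = rng.uniform(0, 1, size=(500, 3))    # pixel colors

# Very large spatial bandwidth: the spatial weights are all close to 1.
shifted = mean_shift_color(pos[0], col[0], pos, col, h_s=1e9, h_r=0.1)

# Reference: weights from color proximity only, no spatial term at all.
w_r = np.exp(-np.sum((col - col[0]) ** 2, axis=1) / (2 * 0.1 ** 2))
color_only = (w_r[:, None] * col).sum(axis=0) / w_r.sum()

print(np.allclose(shifted, color_only))
```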
49When applying Gaussian Mixture Models (GMM) with the EM algorithm for image segmentation, how does the covariance matrix type affect the resulting segment shapes in the color space?
Clustering based segmentation
Hard
A.A spherical covariance matrix allows clusters to have arbitrary orientations as long as the volume is fixed.
B.Covariance constraints only affect the spatial domain, leaving color space clusters unaffected.
C.A diagonal covariance matrix restricts clusters to be perfectly spherical.
D.A full covariance matrix allows clusters to model elliptical distributions oriented at any angle.
Correct Answer: A full covariance matrix allows clusters to model elliptical distributions oriented at any angle.
Explanation:
In a GMM, a full covariance matrix contains off-diagonal elements, allowing the Gaussian distribution to model correlations between color channels. This forms oriented hyper-ellipsoids in the feature space. Spherical constraints restrict clusters to hyperspheres, and diagonal constraints restrict them to axis-aligned ellipsoids.
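To see the three constraints side by side, the sketch below applies each one to the same sample covariance of a single component. It uses the standard maximum-likelihood forms (keep everything, keep the diagonal, or average the diagonal); the 2x2 matrix and the helper names are made up for illustration.

```python
import numpy as np

def constrain_covariance(sample_cov, kind):
    """Project a sample covariance onto the full, diagonal, or spherical family."""
    s = np.asarray(sample_cov, dtype=float)
    d = s.shape[0]
    if kind == "full":
        return s.copy()                       # off-diagonal correlations kept
    if kind == "diag":
        return np.diag(np.diag(s))            # axis-aligned ellipsoid
    if kind == "spherical":
        return np.eye(d) * (np.trace(s) / d)  # one shared variance
    raise ValueError(f"unknown covariance kind: {kind}")

def major_axis(cov):
    """Unit eigenvector of the largest eigenvalue (sign is arbitrary)."""
    _, vecs = np.linalg.eigh(cov)
    return vecs[:, -1]

# Hypothetical covariance of two correlated color channels.
sample_cov = np.array([[3.0, 1.5],
                       [1.5, 2.0]])

full_axis = major_axis(constrain_covariance(sample_cov, "full"))
diag_axis = major_axis(constrain_covariance(sample_cov, "diag"))
spherical = constrain_covariance(sample_cov, "spherical")
print(np.abs(full_axis), np.abs(diag_axis))
```

The full covariance yields a tilted major axis, the diagonal one snaps it to a coordinate axis, and the spherical one has no preferred direction at all.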
50In the context of Denoising Diffusion Probabilistic Models (DDPMs), the forward diffusion process can be formulated as a discrete Markov chain. In the continuous-time limit, this process transforms into which of the following mathematical constructs?
Generative Models
Hard
A.A Stochastic Differential Equation (SDE).
B.A partial differential equation of the Navier-Stokes family.
C.A Hamiltonian Monte Carlo (HMC) trajectory.
D.A deterministic Ordinary Differential Equation (ODE) controlled by score matching without a Wiener process.
Correct Answer: A Stochastic Differential Equation (SDE).
Explanation:
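In the continuous-time limit, the sequence of noise injections in the forward diffusion process is rigorously described by a Stochastic Differential Equation (SDE) of the form $d\mathbf{x} = f(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$, where $f$ is the drift coefficient, $g$ is the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process (Brownian motion).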
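As a rough illustration, a forward SDE of this kind can be simulated with the Euler-Maruyama scheme. The sketch assumes a variance-preserving drift and diffusion with a constant noise rate `beta` (a simplification; practical schedules vary with time), and all parameter values are arbitrary.

```python
import numpy as np

def simulate_forward_sde(x0, beta=5.0, t_end=1.0, steps=1000, paths=20000, seed=0):
    """Euler-Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dw (assumed VP form)."""
    rng = np.random.default_rng(seed)
    dt = t_end / steps
    x = np.full(paths, float(x0))
    for _ in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=paths)   # Wiener increment
        x = x + (-0.5 * beta * x) * dt + np.sqrt(beta) * dw
    return x

samples = simulate_forward_sde(x0=3.0)

# Closed-form moments of this linear SDE at t_end = 1, for comparison.
expected_mean = 3.0 * np.exp(-0.5 * 5.0)
expected_var = 1.0 - np.exp(-5.0)
print(samples.mean(), expected_mean, samples.var(), expected_var)
```

Starting every path from the same value, the sample mean decays toward zero and the variance grows toward one, which is the sense in which the data distribution is gradually turned into noise.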
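51Variational Autoencoders (VAEs) optimize the Evidence Lower Bound (ELBO), defined as $\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$. What phenomenon occurs if the KL divergence term is heavily penalized (e.g., in a $\beta$-VAE with $\beta \gg 1$)?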
Generative Models
Hard
A.The discriminator network overpowers the generator, leading to mode collapse.
B.The decoder deterministically maps every point in the latent space to the dataset mean.
C.The reconstruction quality becomes near-perfect, but the latent space becomes highly entangled.
D.The model suffers from posterior collapse, where the latent representation ignores the input and unconditionally matches the prior.
Correct Answer: The model suffers from posterior collapse, where the latent representation ignores the input and unconditionally matches the prior.
Explanation:
When $\beta$ is very large, the optimization overwhelmingly favors minimizing the KL divergence. The easiest way for the network to achieve a KL divergence of 0 is to set the approximate posterior $q_\phi(z|x)$ exactly equal to the prior $p(z)$ for all inputs. This means $z$ carries no information about $x$, a state known as posterior collapse.
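For a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup, assumed here), the KL term has a closed form, which makes the collapse condition easy to verify: it is exactly zero when the encoder outputs zero mean and unit variance regardless of the input. The function name is illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Collapsed posterior: same output for every input, equal to the prior.
kl_collapsed = kl_to_standard_normal(np.zeros(8), np.zeros(8))

# Informative posterior: mean and variance depend on the input.
kl_informative = kl_to_standard_normal(np.full(8, 1.5), np.full(8, -2.0))

print(kl_collapsed, kl_informative)
```

Any posterior that encodes information about the input pays a strictly positive KL cost, and a large weight on that term multiplies this cost, so the collapsed solution becomes increasingly attractive to the optimizer.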
52To solve the mode collapse issue and vanishing gradients in standard GANs, the Wasserstein GAN (WGAN) introduces the Earth Mover's Distance. What strict constraint must be enforced on the discriminator (critic) for the WGAN formulation to be mathematically valid?
Generative Models
Hard
A.It must be an invertible normalizing flow network.
B.It must be a 1-Lipschitz continuous function.
C.It must output bounded probabilities in the range $[0, 1]$ using a sigmoid activation.
D.It must utilize Batch Normalization in all hidden layers to preserve variance.
Correct Answer: It must be a 1-Lipschitz continuous function.
Explanation:
The Kantorovich-Rubinstein duality used to compute the Wasserstein distance requires the critic (discriminator) function to be 1-Lipschitz continuous. This is typically enforced via weight clipping (in the original WGAN) or a gradient penalty (in WGAN-GP).
53A standard Vision Transformer (ViT) splits an image of size $H \times W$ into patches of size $P \times P$. What is the computational complexity of the global self-attention mechanism with respect to the input image dimensions, assuming embedding dimension $D$?
Vision transformers
Hard
A.
B.
C.
D.
Correct Answer: $O\left(\left(\frac{HW}{P^2}\right)^2 D\right)$
Explanation:
The number of patches (sequence length) is $N = HW / P^2$. The self-attention matrix computation scales quadratically with the sequence length, requiring $O(N^2 D)$ operations. Substituting $N$ yields $O\left(\left(\frac{HW}{P^2}\right)^2 D\right)$.
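A quick calculation shows what the quadratic scaling means in practice. The sketch counts only the multiply-accumulates of the query-key product for a single head (an approximation that ignores projections and the value product), and the image sizes are just common examples.

```python
def attention_cost(height, width, patch, dim):
    """Return (sequence length, approximate cost of the N x N attention scores)."""
    n = (height // patch) * (width // patch)   # number of patches
    return n, n * n * dim

n_small, cost_small = attention_cost(224, 224, patch=16, dim=768)
n_large, cost_large = attention_cost(448, 448, patch=16, dim=768)

print(n_small, n_large, cost_large // cost_small)
```

Doubling the side length of the image quadruples the number of patches and therefore multiplies the attention cost by sixteen.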
54When fine-tuning a Vision Transformer (ViT) on higher-resolution images than it was pre-trained on (keeping patch size constant), the sequence length increases. How is the pre-trained positional embedding typically adapted to handle the new sequence length?
Vision transformers
Hard
A.By applying bicubic interpolation to the 1D flattened positional embeddings directly.
B.By reshaping the positional embeddings into a 2D grid matching the original patch layout, interpolating to the new grid size, and flattening back to 1D.
C.By padding the new positional embeddings with zeros to preserve the learned magnitude.
D.By freezing the positional embeddings and applying a recurrent neural network to extrapolate the missing positions.
Correct Answer: By reshaping the positional embeddings into a 2D grid matching the original patch layout, interpolating to the new grid size, and flattening back to 1D.
Explanation:
Because positional embeddings represent the 2D spatial structure of the image patches, adapting them to a higher resolution requires reshaping the 1D sequence of embeddings back into their 2D spatial grid, performing 2D interpolation (e.g., bicubic), and then flattening them again.
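The reshape, interpolate, flatten sequence can be sketched with NumPy alone. Two assumptions are made here: the first row of the embedding table is a class token that is carried over unchanged, and bilinear interpolation is used to keep the code short (bicubic is the more typical choice). The function names are hypothetical.

```python
import numpy as np

def resize_grid(grid, new_h, new_w):
    """Bilinear resize of an (H, W, D) grid, sampling corner to corner."""
    old_h, old_w, _ = grid.shape
    ys = np.linspace(0, old_h - 1, new_h)
    xs = np.linspace(0, old_w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, old_h - 1)
    x1 = np.minimum(x0 + 1, old_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Adapt a (1 + old_grid**2, D) table to a (1 + new_grid**2, D) table."""
    cls_token = pos_embed[:1]                                   # kept as is
    grid = pos_embed[1:].reshape(old_grid, old_grid, -1)        # 1D -> 2D
    resized = resize_grid(grid, new_grid, new_grid)             # interpolate
    flat = resized.reshape(new_grid * new_grid, -1)             # 2D -> 1D
    return np.concatenate([cls_token, flat], axis=0)

rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(1 + 14 * 14, 8))    # e.g. 224 px image, 16 px patches
resized = resize_pos_embed(pos_embed, old_grid=14, new_grid=24)
print(resized.shape)
```

Interpolating the flattened 1D sequence directly would blend embeddings from the end of one row with the start of the next, which is why the 2D reshape comes first.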
55The Swin Transformer computes self-attention within local windows and introduces a shifted window partitioning in alternating layers. Mathematically, what is the primary purpose of this shifted window mechanism?
Vision transformers
Hard
A.To augment the data by translation, rendering the transformer perfectly translation-invariant.
B.To allow the direct computation of multiscale features without needing to merge patches in deeper layers.
C.To provide cross-window connections, expanding the receptive field hierarchically while maintaining linear computational complexity relative to image size.
D.To reduce the computational complexity of self-attention from quadratic to strictly logarithmic.
Correct Answer: To provide cross-window connections, expanding the receptive field hierarchically while maintaining linear computational complexity relative to image size.
Explanation:
Local window attention limits computation to isolated regions, which restricts the receptive field. By shifting the window partition by half a window size in successive layers, boundaries are crossed, enabling information flow between adjacent windows and expanding the receptive field hierarchically.
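The effect of the shift can be shown by tracking which window each position falls into. The sketch below models the shift as a cyclic roll by half a window, as the explanation describes, and ignores the attention masking that a real implementation applies to wrapped-around windows; the grid and window sizes are arbitrary.

```python
import numpy as np

def window_ids(height, width, window, shift=0):
    """Window index of every position after cyclically shifting by `shift`."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    r = (rows - shift) % height      # where the row lands after the roll
    c = (cols - shift) % width       # where the column lands after the roll
    return (r // window) * (width // window) + (c // window)

regular = window_ids(8, 8, window=4)             # layer l: plain partition
shifted = window_ids(8, 8, window=4, shift=2)    # layer l+1: shifted by window // 2

# Two neighbouring positions on opposite sides of a window boundary.
print(regular[3, 3], regular[4, 4], shifted[3, 3], shifted[4, 4])
```

Positions (3, 3) and (4, 4) sit in different windows under the plain partition but share a window after the shift, so information can pass between them in the next layer, while each window still holds the same fixed number of tokens.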
56CLIP (Contrastive Language-Image Pre-training) is trained using the InfoNCE loss to align image and text representations. If $I_i$ and $T_i$ are $L_2$-normalized image and text embeddings in a batch of size $N$, and $\tau$ is a learnable temperature parameter, what is the symmetric loss formulation minimized during training?
Multimodal vision models
Hard
A.
B.
C.
D.
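Correct Answer: $\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_i \cdot T_j / \tau)} + \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_j \cdot T_i / \tau)} \right]$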
Explanation:
CLIP uses a symmetric cross-entropy loss over the similarity matrix. It calculates the contrastive loss in both directions: image-to-text (finding the correct text for an image) and text-to-image (finding the correct image for a text) across the batch, then averages the two.
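A NumPy sketch of this symmetric loss follows. It treats the temperature as a fixed number rather than a learned parameter, uses toy embeddings, and the function names are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, tau):
    """Average of image-to-text and text-to-image cross-entropy over a batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)   # L2-normalize
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                               # N x N similarities
    i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))     # rows: pick the text
    t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))     # columns: pick the image
    return 0.5 * (i2t + t2i)

# Toy batch: four orthogonal embeddings, paired correctly and then mispaired.
emb = np.eye(4)
aligned = clip_loss(emb, emb, tau=0.07)
misaligned = clip_loss(emb, np.roll(emb, 1, axis=0), tau=0.07)
print(aligned, misaligned)
```

The matching pairs sit on the diagonal of the similarity matrix, so the loss is small when each image is most similar to its own text and large when the pairing is scrambled.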
57In the Segment Anything Model (SAM), how does the architecture intrinsically handle the ambiguity of a single point prompt (e.g., a point placed on a person's shirt could mean the shirt, or the whole person)?
Large vision models
Hard
A.By utilizing a recurrent neural network to request a second point from the user.
B.By implicitly decoding the point into text using CLIP and relying on the language model to guess the intended scale.
C.By aggressively applying conditional random fields (CRFs) to force the mask to snap to the largest semantic boundary.
D.By predicting multiple valid masks (e.g., whole, part, and subpart) along with a confidence score for each mask.
Correct Answer: By predicting multiple valid masks (e.g., whole, part, and subpart) along with a confidence score for each mask.
Explanation:
To resolve ambiguity inherent in prompt-based segmentation, SAM's mask decoder is designed to output multiple masks (usually 3: whole, part, subpart) for a single prompt, each accompanied by an estimated intersection-over-union (IoU) confidence score.
58DINOv2, a self-supervised large vision model, utilizes a student-teacher knowledge distillation framework without labels. To prevent the notorious feature collapse problem (where the model outputs a constant vector), which two opposing mechanisms are explicitly applied to the teacher's outputs?
Large vision models
Hard
A.Centering (subtracting a moving average) and Sharpening (using a low temperature in the softmax).
B.Instance normalization and layer normalization.
C.Dropout and Stochastic Depth.
D.Weight decay and gradient clipping.
Correct Answer: Centering (subtracting a moving average) and Sharpening (using a low temperature in the softmax).
Explanation:
In DINO/DINOv2, centering prevents one dimension from dominating (which prevents a form of collapse to a one-hot vector), but it encourages uniform distributions. Sharpening opposes this by forcing the distribution to have distinct peaks. Together, they prevent the student and teacher from collapsing into trivial constant solutions.
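The two operations are simple to write down. In the sketch below the temperatures, the momentum, and the tiny batch of logits are illustrative values rather than the settings of any particular release, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_probs(logits, center, temperature):
    """Center the teacher logits, then sharpen with a low softmax temperature."""
    return softmax((logits - center) / temperature)

def update_center(center, logits, momentum=0.9):
    """Moving average of the batch-mean teacher logits."""
    return momentum * center + (1.0 - momentum) * logits.mean(axis=0)

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

# Two samples whose logits are dominated by the same first dimension.
logits = np.array([[5.0, 1.0, 0.0],
                   [5.0, 0.0, 1.0]])
center = logits.mean(axis=0)     # stands in for a converged moving average

raw_argmax = logits.argmax(axis=1)
centered_argmax = teacher_probs(logits, center, temperature=1.0).argmax(axis=1)
soft = entropy(teacher_probs(logits, center, temperature=1.0))
sharp = entropy(teacher_probs(logits, center, temperature=0.1))
print(raw_argmax, centered_argmax, soft, sharp)
```

Centering removes the shared bias so the two samples no longer agree on one dominant dimension, and the lower temperature reduces the entropy of each output so the distributions do not flatten out.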
59In Vision-Language Models like Flamingo, how does the Perceiver Resampler module efficiently bridge the high-resolution, variable-length output of the vision encoder to the fixed context window of the Large Language Model?
Vision-Language Models
Hard
A.By interpolating the sequence of visual tokens to exactly match the length of the text prompt tokens.
B.By utilizing a dynamic routing capsule network that drops visual tokens with low semantic entropy.
C.By applying an average pooling operation over the spatial dimensions, reducing any image to a 1x1 embedding.
D.By taking a fixed number of learnable latent queries and applying cross-attention to the flattened vision features, yielding a fixed number of visual tokens.
Correct Answer: By taking a fixed number of learnable latent queries and applying cross-attention to the flattened vision features, yielding a fixed number of visual tokens.
Explanation:
The Perceiver Resampler uses a fixed number of learnable latent queries. It performs cross-attention where the queries attend to the flattened, variable-length sequence of visual features. This produces a constant number of output tokens representing the visual input, independent of the original image resolution or video length.
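The fixed-size output follows directly from the shape of the attention product, as the single-head, single-layer sketch below shows. It is a deliberate simplification (no stacking of layers, no feed-forward block, random weights), and every name and dimension in it is an assumption made for illustration.

```python
import numpy as np

def resample(features, latents, w_q, w_k, w_v):
    """Cross-attention: latent queries attend to a variable-length feature sequence."""
    q = latents @ w_q                                  # (M, d), M is fixed
    k = features @ w_k                                 # (N, d), N varies
    v = features @ w_v                                 # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (M, N)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)      # softmax over the N inputs
    return attn @ v                                    # (M, d), independent of N

rng = np.random.default_rng(0)
dim, num_latents = 32, 64
latents = rng.normal(size=(num_latents, dim))          # learnable in a real model
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

short_seq = rng.normal(size=(100, dim))                # e.g. a low-resolution image
long_seq = rng.normal(size=(900, dim))                 # e.g. more patches or frames

out_short = resample(short_seq, latents, w_q, w_k, w_v)
out_long = resample(long_seq, latents, w_q, w_k, w_v)
print(out_short.shape, out_long.shape)
```

Because the queries come from the latents rather than from the input, the number of output tokens is set by the number of latents and stays the same for both sequence lengths.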
60Models like BLIP-2 introduce a Q-Former to align visual features with the text space. During the representation learning pre-training stage of the Q-Former, three distinct objectives are optimized simultaneously. Which of the following is NOT one of these objectives?
Vision-Language Models
Hard
A.Masked Image Modeling (MIM)
B.Image-Text Contrastive Learning (ITC)
C.Image-Text Matching (ITM)
D.Image-Grounded Text Generation (ITG)
Correct Answer: Masked Image Modeling (MIM)
Explanation:
BLIP-2's Q-Former representation learning stage relies on three objectives: Image-Text Contrastive learning (aligning representations), Image-Text Matching (binary classification of pair validity), and Image-Grounded Text Generation (generating text given the image). Masked Image Modeling (MIM) is a visual self-supervised technique (e.g., used in MAE) but is not part of the Q-Former's alignment objectives.