1 $What is the primary goal of image segmentation?$

Overview of image segmentation Easy

A.

To divide an image into meaningful regions or objects

B.

To compress the image file size for storage

C.

To convert a color image into grayscale

D.

To increase the resolution of a digital image

2 $Which term describes the process of assigning a class label to every single pixel in an image?$

Overview of image segmentation Easy

A.

Image classification

B.

Object detection

C.

Image compression

D.

Semantic segmentation

3 $In thresholding based segmentation, how are pixels typically separated into foreground and background?$

thresholding based segmentation Easy

A.

By connecting edges together to form a boundary

B.

By comparing each pixel's intensity to a specific threshold value

C.

By generating new pixels using a neural network

D.

By calculating the optical flow of moving objects

4 $Which well-known method is commonly used to automatically calculate the optimal threshold value?$

thresholding based segmentation Easy

A.

Otsu's method

B.

Canny edge detector

C.

Vision Transformer

D.

YOLO algorithm

5 $Edge-based segmentation primarily relies on identifying what property in an image?$

edge-based segmentation Easy

A.

The most frequently occurring color

B.

Large, uniform regions of identical pixels

C.

The overall brightness of the image

D.

Rapid changes or discontinuities in pixel intensity

6 $Which of the following is a widely used operator for detecting edges in an image?$

edge-based segmentation Easy

A.

ReLU operator

B.

K-means operator

C.

Max-pooling operator

D.

Sobel operator

7 $Which segmentation technique starts with a 'seed' pixel and iteratively adds neighboring pixels that share similar properties?$

region based segmentation Easy

A.

Global thresholding

B.

Image generation

C.

Edge linking

D.

Region growing

8 $What is the fundamental mechanism behind the 'split and merge' segmentation technique?$

region based segmentation Easy

A.

Dividing the image into quadrants and merging adjacent regions that are homogeneous

B.

Applying a global threshold and discarding small pixels

C.

Detecting all edges and deleting the internal pixels

D.

Training a neural network to classify the image into two halves

9 $Which standard machine learning algorithm is frequently applied for clustering-based image segmentation?$

clustering based segmentation Easy

A.

Decision trees

B.

K-means clustering

C.

Logistic regression

D.

Linear regression

10 $When applying K-means clustering to segment an RGB color image, what do the data points typically represent?$

clustering based segmentation Easy

A.

Text labels describing the image

B.

Pixel color values (e.g., R, G, B vectors)

C.

Bounding box coordinates around objects

D.

Image file sizes

11 $What is the primary function of generative models in computer vision?$

Generative Models Easy

A.

To create new, synthetic images that resemble the original training data

B.

To sort images into chronological order

C.

To compress images without losing quality

D.

To detect faces in a crowded scene

12 $Which of the following architectures is a famous example of a generative model?$

Generative Models Easy

A.

K-Nearest Neighbors (KNN)

B.

Support Vector Machines (SVMs)

C.

Random Forests

D.

Generative Adversarial Networks (GANs)

13 $Vision Transformers (ViT) process an input image by first dividing it into what?$

Vision transformers Easy

A.

Single pixels

B.

Continuous edges

C.

Fixed-size patches

D.

Color channels only

14 $What core mechanism allows Vision Transformers to determine the relationship and importance of different image patches?$

Vision transformers Easy

A.

Otsu's thresholding

B.

Self-attention mechanism

C.

Gaussian blurring

D.

Max-pooling

15 $What is the defining characteristic of a multimodal vision model?$

multimodal vision models Easy

A.

It splits a single image into multiple color channels

B.

It processes and connects multiple types of data modalities, such as images and text

C.

It uses only convolutional layers without any fully connected layers

D.

It only processes high-resolution black and white images

16 $Which of the following is a classic example of a task performed by a multimodal vision model?$

multimodal vision models Easy

A.

Basic thresholding

B.

Grayscale conversion

C.

Image captioning (generating descriptive text from an image)

D.

Applying a median filter

17 $Large vision models (LVMs) are primarily characterized by which of the following?$

large vision models Easy

A.

Having a massive number of parameters and being trained on vast datasets

B.

Using only a single hidden layer of neurons

C.

Being lightweight enough to run on basic digital watches

D.

Segmenting images using basic global thresholding only

18 $What is a major advantage of utilizing large vision models over traditional, smaller models?$

large vision models Easy

A.

Better generalization and strong performance across a wide variety of tasks, including zero-shot settings

B.

They completely eliminate the need for computer memory

C.

They can be trained from scratch in minutes on a standard laptop CPU

D.

They require zero training data to function

19 $CLIP (Contrastive Language-Image Pretraining) is a well-known example of which type of model?$

Vision-Language Models Easy

A.

K-means Clustering Model

B.

Vision-Language Model

C.

Region Growing Algorithm

D.

Simple Edge Detector

20 $Which of the following is a primary use case for Vision-Language Models?$

Vision-Language Models Easy

A.

Searching for specific images using natural language text queries

B.

Applying a Gaussian blur to reduce image noise

C.

Finding the optimal threshold limit for image binarization

D.

Converting a color photograph into a grayscale array

21 $In an autonomous driving application, a system needs to identify every individual car on the road and label the drivable road surface. Which combination of segmentation techniques is most appropriate?$

Overview of image segmentation Medium

A.

Panoptic segmentation for the road, bounding box detection for cars

B.

Image classification for cars, semantic segmentation for the road

C.

Semantic segmentation for cars, instance segmentation for the road

D.

Instance segmentation for cars, semantic segmentation for the road

22 $Why is generating superpixels often used as a preprocessing step in complex image segmentation tasks?$

Overview of image segmentation Medium

A.

To convert a colored image into a binary mask directly

B.

To increase the resolution of the image for finer details

C.

To group perceptually similar pixels, reducing the complexity of subsequent algorithms

D.

To artificially increase the dataset size for training neural networks

23 $Otsu's method is used to automatically find an optimal threshold value. Which of the following mathematical objectives does Otsu's method optimize?$

thresholding based segmentation Medium

A.

Minimizing the between-class variance

B.

Maximizing the number of segmented regions

C.

Minimizing the within-class variance

D.

Maximizing the within-class variance
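The variance measures named in the options above can be computed directly for every candidate threshold. The following is a minimal sketch in plain Python, assuming 8-bit integer intensities; the function name `otsu_threshold` and the toy pixel list are illustrative and not taken from any particular library.

```python
def otsu_threshold(pixels, levels=256):
    """Return (t, between_var) for the threshold t with the largest between-class variance.

    Pixels with value <= t form class 0; the remaining pixels form class 1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in class 0 so far
    sum0 = 0    # sum of intensities in class 0 so far
    for t in range(levels - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not valid
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance for this split.
        between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t, best_var


# Hypothetical toy image: a dark cluster near 50 and a bright cluster near 200.
pixels = [48, 50, 52, 49, 51, 198, 200, 202, 201, 199]
threshold, between_var = otsu_threshold(pixels)
```

Because the total variance of the image is fixed and equals the within-class variance plus the between-class variance, the split found by this search also has the smallest within-class variance.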

24 $An image of a document has uneven illumination, with the left side heavily shadowed. Applying a global threshold results in the shadowed background being misclassified as text. Which approach best resolves this?$

thresholding based segmentation Medium

A.

Decreasing the global threshold value

B.

Adaptive (local) thresholding

C.

Increasing the global threshold value

D.

Histogram equalization followed by Otsu's method

25 $In the Canny edge detection algorithm, what is the primary purpose of the hysteresis thresholding step?$

edge-based segmentation Medium

A.

To compute the gradient magnitude and direction

B.

To suppress non-maximum pixels to thin the edges

C.

To smooth the image using a Gaussian filter

D.

To link fragmented edges and eliminate noise-induced weak edges

26 $After applying an edge detector, a circular object in the image appears as a series of disconnected edge fragments. Which technique is most appropriate to fit a continuous boundary to these fragments?$

edge-based segmentation Medium

A.

Otsu's thresholding

B.

K-means clustering

C.

Hough Transform for circles

D.

Watershed algorithm

27 $In a region growing segmentation algorithm, starting from a set of seed points, how does the choice of the homogeneity criterion affect the final output?$

region based segmentation Medium

A.

A very strict criterion results in under-segmentation.

B.

A very relaxed criterion results in over-segmentation.

C.

A very strict criterion results in small, disjointed regions.

D.

The criterion only affects the speed of the algorithm, not the output size.

28 $The split-and-merge segmentation technique represents an image using which specific data structure?$

region based segmentation Medium

A.

Binary Search Tree

B.

Quadtree

C.

Hash Table

D.

Linked List

29 $When using K-means clustering for image segmentation based on color, why is it often advantageous to convert the image from RGB to the CIELAB (\(L^*a^*b^*\)) color space first?$

clustering based segmentation Medium

A.

CIELAB compresses the image, making K-means run faster.

B.

CIELAB automatically determines the optimal number of clusters (\(K\)).

C.

RGB cannot be used with Euclidean distance metrics.

D.

CIELAB separates luminance (\(L^*\)) from chrominance (\(a^*\), \(b^*\)), and its distances are more perceptually uniform.

30 $Which of the following is a distinct advantage of Mean Shift clustering over K-means clustering for image segmentation?$

clustering based segmentation Medium

A.

Mean Shift is strictly a parametric model.

B.

Mean Shift requires defining the number of clusters a priori.

C.

Mean Shift is significantly computationally faster than K-means for large images.

D.

Mean Shift does not require specifying the number of clusters in advance.

31 $In a Generative Adversarial Network (GAN) trained to synthesize images, what does the minimax loss function conceptually represent?$

Generative Models Medium

A.

The Generator maximizing the image resolution while the Discriminator minimizes noise.

B.

The Discriminator generating fake images to fool the Generator.

C.

Both models cooperating to minimize the distance to the real data distribution.

D.

The Generator minimizing the Discriminator's accuracy, while the Discriminator maximizes it.

32 $In Denoising Diffusion Probabilistic Models (DDPMs), what is the primary function of the forward (diffusion) process?$

Generative Models Medium

A.

To gradually add Gaussian noise to an image until it becomes pure noise.

B.

To map textual prompts to a latent image space.

C.

To train a neural network to remove noise from an image.

D.

To compress an image into a lower-dimensional latent vector.

33 $How do Vision Transformers (ViTs) handle the 2D spatial structure of an image when processing it via a standard transformer architecture designed for 1D sequences?$

Vision transformers Medium

A.

By using 2D convolutional layers before the transformer blocks.

B.

By flattening the entire image into a single long sequence of pixels.

C.

By splitting the image into fixed-size patches, flattening them, and adding 1D positional embeddings.

D.

By relying entirely on the self-attention mechanism to deduce spatial relationships dynamically.

34 $If an image is divided into \(N\) patches for a Vision Transformer, what is the computational complexity of the self-attention mechanism with respect to \(N\)?$

Vision transformers Medium

A.

B.

C.

D.

35 $In contrastive multimodal models like CLIP, how is the model trained to align images and text?$

multimodal vision models Medium

A.

By generating a caption for the image and calculating the BLEU score.

B.

By training a classifier to predict the text tokens directly from the image pixels.

C.

By reconstructing the image from the text embedding using a decoder.

D.

By maximizing the cosine similarity between the embeddings of paired images and texts while minimizing it for incorrect pairs.

36 $In a multimodal architecture that uses cross-attention to fuse image and text features, which component typically acts as the 'Query' in the cross-attention layer when generating text conditioned on an image?$

multimodal vision models Medium

A.

The image representations

B.

The text representations

C.

The positional encodings

D.

The classification token (CLS)

37 $The Segment Anything Model (SAM) is characterized as a 'promptable' segmentation model. What does this mean in practice?$

large vision models Medium

A.

It can output a segmentation mask based on various inputs like points, boxes, or text.

B.

It prompts the user to manually draw the segmentation boundaries.

C.

It generates natural language prompts to describe segments in an image.

D.

It requires text prompts to generate an entirely new image.

38 $Which of the following best describes the 'zero-shot' capability of large vision models?$

large vision models Medium

A.

The ability to predict object trajectories with zero initial data.

B.

The ability to train a model from scratch in zero time.

C.

The ability to compress an image to zero loss.

D.

The ability to perform a task on a dataset or class it was never explicitly trained on, without further fine-tuning.

39 $In a standard Visual Question Answering (VQA) system, how are the final predictions typically generated?$

Vision-Language Models Medium

A.

By inputting only the image into a CNN and classifying the scene.

B.

By ignoring the image and treating the task as standard text-based question answering.

C.

By fusing extracted image features and extracted text features, then passing them through a classifier or decoder.

D.

By converting the image to text via OCR and querying a database.

40 $Which objective function is most commonly used to train the language decoder portion of an Image Captioning model?$

Vision-Language Models Medium

A.

Mean Squared Error (MSE) loss between images.

B.

Cross-entropy loss for next-token prediction, conditioned on the image and previous tokens.

C.

Intersection over Union (IoU) loss.

D.

Contrastive loss between two different generated captions.

41 $In a complex urban scene containing mutually occluding cars (countable objects) and continuous regions of sky and road (amorphous regions), which of the following formulations best describes the objective of panoptic segmentation?$

Overview of image segmentation Hard

A.

Assigning a single semantic class label to every pixel in the image, ignoring distinct object identities.

B.

Mapping each pixel \(i\) to a tuple \((l_i, z_i)\), where \(l_i\) is the semantic class and \(z_i\) uniquely identifies 'thing' instances and is ignored or set to a null value for 'stuff' classes.

C.

Assigning an instance ID to every pixel, treating amorphous regions as a single large instance.

D.

Generating bounding boxes for all objects and subsequently predicting binary masks independently for each bounding box.

42 $Otsu's method determines the optimal threshold by maximizing the between-class variance \(\sigma_B^2\). Under which of the following conditions does Otsu's method mathematically fail to identify the ideal threshold separating an object and background?$

Thresholding based segmentation Hard

A.

When the image contains high-frequency salt-and-pepper noise.

B.

When the variances of the object and background classes are drastically different, and the object occupies a disproportionately small fraction of the image area.

C.

When the image is illuminated by a uniform, spatially invariant light source.

D.

When the histogram is perfectly bimodal with equal class probabilities.

43 $When computing a local adaptive threshold for a window of size \(w \times w\) using the local mean and variance, what is the asymptotic time complexity per pixel if the algorithm utilizes integral images (summed-area tables)?$

Thresholding based segmentation Hard

A.

B.

C.

D.
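The integral-image lookup referred to in this question can be sketched as follows. This is a minimal illustration in plain Python, assuming a grayscale image stored as a list of rows; the helper names `integral_image`, `window_sum`, and `local_mean` are hypothetical.

```python
def integral_image(img):
    """Summed-area table with a zero border: S[y][x] is the sum of img rows < y, cols < x."""
    h, w = len(img), len(img[0])
    S = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            S[y + 1][x + 1] = S[y][x + 1] + row_sum
    return S


def window_sum(S, y0, x0, y1, x1):
    """Sum over rows y0..y1-1 and columns x0..x1-1 using exactly four table lookups."""
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0]


def local_mean(img, S, y, x, radius):
    """Mean of the window of half-width `radius` centered at (y, x), clipped at the borders."""
    h, w = len(img), len(img[0])
    y0, y1 = max(0, y - radius), min(h, y + radius + 1)
    x0, x1 = max(0, x - radius), min(w, x + radius + 1)
    area = (y1 - y0) * (x1 - x0)
    return window_sum(S, y0, x0, y1, x1) / area


# Hypothetical 3x3 image used only to exercise the helpers.
img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
S = integral_image(img)
center_mean = local_mean(img, S, 1, 1, radius=1)  # full 3x3 window
corner_mean = local_mean(img, S, 0, 0, radius=1)  # window clipped to 2x2
```

The number of lookups in `window_sum` does not depend on the window size. A local variance can be obtained the same way from a second table built over the squared intensities.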

44 $In the Canny edge detector, hysteresis thresholding uses a high threshold () and a low threshold (). If a continuous edge segment consists of pixels with gradient magnitudes,, and connected in the order, which pixels are retained as strong edges in the final output?$

Edge-based segmentation Hard

A.

None of the pixels.

B.

Pixels 1, 2, and 3.

C.

Pixels 1 and 3 only.

D.

Pixel 1 only.

45 $The Laplacian of Gaussian (LoG) filter is defined as \(\nabla^2 G_\sigma(x, y)\). In the frequency domain, what is the effect of the parameter \(\sigma\) on the filter's magnitude response as a function of radial frequency \(\omega\)?$

Edge-based segmentation Hard

A.

It acts as a high-pass filter where increasing amplifies high frequencies proportionally to .

B.

It acts as a band-pass filter where the peak frequency response is inversely proportional to \(\sigma\).

C.

It acts as a low-pass filter where the cutoff frequency increases linearly with \(\sigma\).

D.

It acts as an all-pass filter, shifting the phase of edges but leaving magnitudes unchanged.

46 $In the Split-and-Merge segmentation algorithm, a region \(R\) is split into four sub-regions if a homogeneity predicate \(P(R) = \text{FALSE}\). To prevent the generation of blocky boundaries, which data structure and subsequent process are critical for optimal merging?$

Region based segmentation Hard

A.

A Region Adjacency Graph (RAG) followed by merging adjacent regions \(R_i\) and \(R_j\) if \(P(R_i \cup R_j) = \text{TRUE}\).

B.

A minimum spanning tree followed by normalized cuts.

C.

A quadtree followed by merging only siblings that share the same parent node if \(P(R_i \cup R_j) = \text{TRUE}\).

D.

A KD-Tree followed by K-means clustering.

47 $The standard watershed transform often suffers from severe over-segmentation due to local minima caused by noise. In marker-controlled watershed segmentation, how is the topological surface (gradient image) modified to enforce markers as the only regional minima?$

Region based segmentation Hard

A.

By using morphological reconstruction (minima imposition) to modify the gradient image such that local minima only occur at marker locations.

B.

By applying the Hough transform to force watershed lines into geometric shapes.

C.

By applying a global threshold to the gradient image before the flooding process.

D.

By applying a heavily regularized Gaussian blur to completely eliminate high-frequency noise.

48 $In Mean Shift segmentation, pixels are mapped into a joint spatial-color feature space. If the spatial bandwidth \(h_s\) approaches infinity while the range bandwidth \(h_r\) remains small, what is the asymptotic behavior of the segmentation?$

Clustering based segmentation Hard

A.

The algorithm ignores spatial constraints and reduces to exact K-means clustering in the RGB color space.

B.

The algorithm converges to density modes entirely based on color similarity, effectively performing global color quantization.

C.

The algorithm fails to converge because the kernel density estimate becomes uniform everywhere.

D.

The algorithm reduces to a purely spatial clustering, resulting in uniformly sized superpixels regardless of color.

49 $When applying Gaussian Mixture Models (GMM) with the EM algorithm for image segmentation, how does the covariance matrix type affect the resulting segment shapes in the color space?$

Clustering based segmentation Hard

A.

A diagonal covariance matrix restricts clusters to be perfectly spherical.

B.

A full covariance matrix allows clusters to model elliptical distributions oriented at any angle.

C.

A spherical covariance matrix allows clusters to have arbitrary orientations as long as the volume is fixed.

D.

Covariance constraints only affect the spatial domain, leaving color space clusters unaffected.

50 $In the context of Denoising Diffusion Probabilistic Models (DDPMs), the forward diffusion process can be formulated as a discrete Markov chain. In the continuous-time limit, this process transforms into which of the following mathematical constructs?$

Generative Models Hard

A.

A Stochastic Differential Equation (SDE).

B.

A Hamiltonian Monte Carlo (HMC) trajectory.

C.

A partial differential equation of the Navier-Stokes family.

D.

A deterministic Ordinary Differential Equation (ODE) controlled by score matching without a Wiener process.

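51 $Variational Autoencoders (VAEs) optimize the Evidence Lower Bound (ELBO), defined as \(\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}(q_\phi(z \mid x) \,\|\, p(z))\). What phenomenon occurs if the KL divergence term is heavily penalized (e.g., in a \(\beta\)-VAE with \(\beta \gg 1\))?$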

Generative Models Hard

A.

The reconstruction quality becomes near-perfect, but the latent space becomes highly entangled.

B.

The model suffers from posterior collapse, where the latent representation ignores the input and unconditionally matches the prior.

C.

The discriminator network overpowers the generator, leading to mode collapse.

D.

The decoder deterministically maps every point in the latent space to the dataset mean.

52 $To mitigate mode collapse and vanishing gradients in standard GANs, the Wasserstein GAN (WGAN) replaces the original objective with the Earth Mover's (Wasserstein-1) Distance. What strict constraint must be enforced on the discriminator (critic) for the WGAN formulation to be mathematically valid?$

Generative Models Hard

A.

It must be an invertible normalizing flow network.

B.

It must utilize Batch Normalization in all hidden layers to preserve variance.

C.

It must be a 1-Lipschitz continuous function.

D.

It must output bounded probabilities in the range \([0, 1]\) using a sigmoid activation.

53 $A standard Vision Transformer (ViT) splits an image of size \(H \times W\) into patches of size \(P \times P\). What is the computational complexity of the global self-attention mechanism with respect to the input image dimensions, assuming embedding dimension \(D\)?$

Vision transformers Hard

A.

B.

C.

D.
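How the cost of global self-attention scales with image size can be checked with a short count. This is a rough sketch in Python, assuming square non-overlapping patches and counting only the two large matrix products of a single attention head; the function name `vit_attention_cost` and the example sizes are illustrative.

```python
def vit_attention_cost(height, width, patch, dim):
    """Return (num_patches, multiply_adds) for one global self-attention layer.

    Counts only the two dominant matrix products: Q @ K^T and attention @ V.
    Projections, softmax, and the class token are ignored.
    """
    n = (height // patch) * (width // patch)  # sequence length in patches
    scores = n * n * dim                      # Q @ K^T yields an n x n matrix
    weighted = n * n * dim                    # (n x n) attention times (n x dim) values
    return n, scores + weighted


# Doubling each image side with a fixed patch size quadruples the patch count.
n_small, cost_small = vit_attention_cost(224, 224, patch=16, dim=768)
n_large, cost_large = vit_attention_cost(448, 448, patch=16, dim=768)
ratio = cost_large / cost_small
```

Under these assumptions, four times as many patches leads to sixteen times as many multiply-adds in the attention products.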

54 $When fine-tuning a Vision Transformer (ViT) on higher-resolution images than it was pre-trained on (keeping patch size constant), the sequence length increases. How is the pre-trained positional embedding typically adapted to handle the new sequence length?$

Vision transformers Hard

A.

By reshaping the positional embeddings into a 2D grid matching the original patch layout, interpolating to the new grid size, and flattening back to 1D.

B.

By applying bicubic interpolation to the 1D flattened positional embeddings directly.

C.

By freezing the positional embeddings and applying a recurrent neural network to extrapolate the missing positions.

D.

By padding the new positional embeddings with zeros to preserve the learned magnitude.

55 $The Swin Transformer computes self-attention within local windows and introduces a shifted window partitioning in alternating layers. Mathematically, what is the primary purpose of this shifted window mechanism?$

Vision transformers Hard

A.

To allow the direct computation of multiscale features without needing to merge patches in deeper layers.

B.

To provide cross-window connections, expanding the receptive field hierarchically while maintaining linear computational complexity relative to image size.

C.

To reduce the computational complexity of self-attention from quadratic to strictly logarithmic.

D.

To augment the data by translation, rendering the transformer perfectly translation-invariant.

56 $CLIP (Contrastive Language-Image Pre-training) is trained using the InfoNCE loss to align image and text representations. If \(I_i\) and \(T_i\) are \(\ell_2\)-normalized image and text embeddings in a batch of size \(N\), and \(\tau\) is a learnable temperature parameter, what is the symmetric loss formulation minimized during training?$

Multimodal vision models Hard

A.

B.

C.

D.
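A symmetric contrastive loss of the kind this question describes can be sketched in plain Python. The sketch assumes the embeddings are already unit-normalized, that pair i in the image list matches pair i in the text list, and that the temperature is a fixed illustrative constant rather than a learned parameter; the function name `clip_symmetric_loss` is hypothetical.

```python
import math


def clip_symmetric_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch.

    image_emb[i] and text_emb[i] are assumed to be a matching, unit-norm pair.
    """
    n = len(image_emb)
    # Cosine similarities (dot products of unit vectors) scaled by 1 / temperature.
    logits = [
        [sum(a * b for a, b in zip(image_emb[i], text_emb[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # The correct class for row i is column i (the matching pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Hypothetical 2-pair batch: perfectly aligned versus swapped text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_symmetric_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = clip_symmetric_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image is most similar to its own text and grows large when the pairs are swapped.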

57 $In the Segment Anything Model (SAM), how does the architecture intrinsically handle the ambiguity of a single point prompt (e.g., a point placed on a person's shirt could mean the shirt, or the whole person)?$

Large vision models Hard

A.

By utilizing a recurrent neural network to request a second point from the user.

B.

By aggressively applying conditional random fields (CRFs) to force the mask to snap to the largest semantic boundary.

C.

By predicting multiple valid masks (e.g., whole, part, and subpart) along with a confidence score for each mask.

D.

By implicitly decoding the point into text using CLIP and relying on the language model to guess the intended scale.

58 $DINO, the self-supervised method that DINOv2 builds on, uses a student-teacher self-distillation framework without labels. To prevent the feature collapse problem (where the model outputs a constant or uniform vector), which two opposing mechanisms are explicitly applied to the teacher's outputs?$

Large vision models Hard

A.

Centering (subtracting a moving average) and Sharpening (using a low temperature in the softmax).

B.

Dropout and Stochastic Depth.

C.

Instance normalization and layer normalization.

D.

Weight decay and gradient clipping.

59 $In Vision-Language Models like Flamingo, how does the Perceiver Resampler module efficiently bridge the high-resolution, variable-length output of the vision encoder to the fixed context window of the Large Language Model?$

Vision-Language Models Hard

A.

By applying an average pooling operation over the spatial dimensions, reducing any image to a 1x1 embedding.

B.

By utilizing a dynamic routing capsule network that drops visual tokens with low semantic entropy.

C.

By taking a fixed number of learnable latent queries and applying cross-attention to the flattened vision features, yielding a fixed number of visual tokens.

D.

By interpolating the sequence of visual tokens to exactly match the length of the text prompt tokens.

60 $Models like BLIP-2 introduce a Q-Former to align visual features with the text space. During the representation learning pre-training stage of the Q-Former, three distinct objectives are optimized simultaneously. Which of the following is NOT one of these objectives?$

Vision-Language Models Hard

A.

Image-Grounded Text Generation (ITG)

B.

Image-Text Contrastive Learning (ITC)

C.

Image-Text Matching (ITM)

D.

Masked Image Modeling (MIM)

Unit 6 - Practice Quiz