Unit 6 - Practice Quiz

INT345 60 Questions

1 What is the primary goal of image segmentation?

Overview of image segmentation Easy
A. To compress the image file size for storage
B. To convert a color image into grayscale
C. To increase the resolution of a digital image
D. To divide an image into meaningful regions or objects

2 Which term describes the process of assigning a class label to every single pixel in an image?

Overview of image segmentation Easy
A. Object detection
B. Image classification
C. Semantic segmentation
D. Image compression

3 In thresholding-based segmentation, how are pixels typically separated into foreground and background?

thresholding based segmentation Easy
A. By connecting edges together to form a boundary
B. By calculating the optical flow of moving objects
C. By generating new pixels using a neural network
D. By comparing each pixel's intensity to a specific threshold value

4 Which well-known method is commonly used to automatically calculate the optimal threshold value?

thresholding based segmentation Easy
A. Vision Transformer
B. YOLO algorithm
C. Otsu's method
D. Canny edge detector

5 Edge-based segmentation primarily relies on identifying what property in an image?

edge-based segmentation Easy
A. The most frequently occurring color
B. Rapid changes or discontinuities in pixel intensity
C. Large, uniform regions of identical pixels
D. The overall brightness of the image

6 Which of the following is a widely used operator for detecting edges in an image?

edge-based segmentation Easy
A. Sobel operator
B. ReLU operator
C. K-means operator
D. Max-pooling operator

7 Which segmentation technique starts with a 'seed' pixel and iteratively adds neighboring pixels that share similar properties?

region based segmentation Easy
A. Image generation
B. Edge linking
C. Region growing
D. Global thresholding

8 What is the fundamental mechanism behind the 'split and merge' segmentation technique?

region based segmentation Easy
A. Training a neural network to classify the image into two halves
B. Detecting all edges and deleting the internal pixels
C. Dividing the image into quadrants and merging adjacent regions that are homogeneous
D. Applying a global threshold and discarding small pixels

9 Which standard machine learning algorithm is frequently applied for clustering-based image segmentation?

clustering based segmentation Easy
A. Logistic regression
B. K-means clustering
C. Linear regression
D. Decision trees

10 When applying K-means clustering to segment an RGB color image, what do the data points typically represent?

clustering based segmentation Easy
A. Text labels describing the image
B. Image file sizes
C. Bounding box coordinates around objects
D. Pixel color values (e.g., R, G, B vectors)

11 What is the primary function of generative models in computer vision?

Generative Models Easy
A. To detect faces in a crowded scene
B. To sort images into chronological order
C. To compress images without losing quality
D. To create new, synthetic images that resemble the original training data

12 Which of the following architectures is a famous example of a generative model?

Generative Models Easy
A. Support Vector Machines (SVMs)
B. Random Forests
C. K-Nearest Neighbors (KNN)
D. Generative Adversarial Networks (GANs)

13 Vision Transformers (ViT) process an input image by first dividing it into what?

Vision transformers Easy
A. Continuous edges
B. Single pixels
C. Fixed-size patches
D. Color channels only

14 What core mechanism allows Vision Transformers to determine the relationship and importance of different image patches?

Vision transformers Easy
A. Max-pooling
B. Otsu's thresholding
C. Self-attention mechanism
D. Gaussian blurring

15 What is the defining characteristic of a multimodal vision model?

multimodal vision models Easy
A. It only processes high-resolution black and white images
B. It processes and connects multiple types of data modalities, such as images and text
C. It uses only convolutional layers without any fully connected layers
D. It splits a single image into multiple color channels

16 Which of the following is a classic example of a task performed by a multimodal vision model?

multimodal vision models Easy
A. Image captioning (generating descriptive text from an image)
B. Grayscale conversion
C. Basic thresholding
D. Applying a median filter

17 Large vision models (LVMs) are primarily characterized by which of the following?

large vision models Easy
A. Being lightweight enough to run on basic digital watches
B. Having a massive number of parameters and being trained on vast datasets
C. Segmenting images using basic global thresholding only
D. Using only a single hidden layer of neurons

18 What is a major advantage of utilizing large vision models over traditional, smaller models?

large vision models Easy
A. Better generalization and strong performance on a wide variety of tasks (zero-shot learning)
B. They completely eliminate the need for computer memory
C. They can be trained from scratch in minutes on a standard laptop CPU
D. They require zero training data to function

19 CLIP (Contrastive Language-Image Pretraining) is a well-known example of which type of model?

Vision-Language Models Easy
A. Simple Edge Detector
B. K-means Clustering Model
C. Vision-Language Model
D. Region Growing Algorithm

20 Which of the following is a primary use case for Vision-Language Models?

Vision-Language Models Easy
A. Converting a color photograph into a grayscale array
B. Finding the optimal threshold limit for image binarization
C. Applying a Gaussian blur to reduce image noise
D. Searching for specific images using natural language text queries

21 In an autonomous driving application, a system needs to identify every individual car on the road and label the drivable road surface. Which combination of segmentation techniques is most appropriate?

Overview of image segmentation Medium
A. Semantic segmentation for cars, instance segmentation for the road
B. Instance segmentation for cars, semantic segmentation for the road
C. Panoptic segmentation for the road, bounding box detection for cars
D. Image classification for cars, semantic segmentation for the road

22 Why is generating superpixels often used as a preprocessing step in complex image segmentation tasks?

Overview of image segmentation Medium
A. To artificially increase the dataset size for training neural networks
B. To group perceptually similar pixels, reducing the complexity of subsequent algorithms
C. To increase the resolution of the image for finer details
D. To convert a colored image into a binary mask directly

23 Otsu's method is used to automatically find an optimal threshold value. Which of the following mathematical objectives does Otsu's method optimize?

thresholding based segmentation Medium
A. Maximizing the number of segmented regions
B. Maximizing the within-class variance
C. Minimizing the within-class variance
D. Minimizing the between-class variance
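
For reference, the search that Otsu's method performs can be sketched in a few lines. Because the total variance of the histogram is fixed, maximizing the between-class variance is equivalent to minimizing the within-class variance, so the sketch below scores each candidate threshold by the between-class term only. The function name `otsu_threshold` and the toy histogram are illustrative assumptions, not part of the quiz.

```python
def otsu_threshold(hist):
    """Return the threshold t that maximizes between-class variance.

    hist[i] is the number of pixels with intensity i.
    Pixels with intensity <= t form class 0; the rest form class 1.
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0        # pixel count of class 0 so far
    sum0 = 0.0    # intensity sum of class 0 so far
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance: p0 * p1 * (mu0 - mu1)^2
        between = (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Made-up 8-level histogram with one dark mode and one bright mode.
hist = [10, 30, 10, 0, 0, 10, 30, 10]
threshold = otsu_threshold(hist)
print(threshold)
```

On a clearly bimodal histogram like this one, the selected threshold falls in the empty valley between the two modes.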

24 An image of a document has uneven illumination, with the left side heavily shadowed. Applying a global threshold results in the shadowed background being misclassified as text. Which approach best resolves this?

thresholding based segmentation Medium
A. Histogram equalization followed by Otsu's method
B. Decreasing the global threshold value
C. Increasing the global threshold value
D. Adaptive (local) thresholding

25 In the Canny edge detection algorithm, what is the primary purpose of the hysteresis thresholding step?

edge-based segmentation Medium
A. To smooth the image using a Gaussian filter
B. To suppress non-maximum pixels to thin the edges
C. To link fragmented edges and eliminate noise-induced weak edges
D. To compute the gradient magnitude and direction

26 After applying an edge detector, a circular object in the image appears as a series of disconnected edge fragments. Which technique is most appropriate to fit a continuous boundary to these fragments?

edge-based segmentation Medium
A. Hough Transform for circles
B. Otsu's thresholding
C. Watershed algorithm
D. K-means clustering

27 In a region growing segmentation algorithm, starting from a set of seed points, how does the choice of the homogeneity criterion affect the final output?

region based segmentation Medium
A. A very relaxed criterion results in over-segmentation.
B. The criterion only affects the speed of the algorithm, not the output size.
C. A very strict criterion results in under-segmentation.
D. A very strict criterion results in small, disjointed regions.
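
The effect of the homogeneity criterion can be seen in a minimal region-growing sketch. It assumes a 4-connected neighborhood and a criterion of the form "intensity within `tol` of the seed intensity"; both choices, the helper name `region_grow`, and the toy image are assumptions made for illustration.

```python
from collections import deque


def region_grow(img, seed, tol):
    """Grow a 4-connected region from `seed`.

    A neighbor joins the region if its intensity differs from the
    seed pixel's intensity by at most `tol`.
    """
    rows, cols = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(img[nr][nc] - seed_val) <= tol:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Toy image: a smooth dark gradient on the left, a bright column on the right.
img = [
    [10, 12, 14, 90],
    [11, 13, 15, 92],
    [12, 14, 16, 95],
]
strict = region_grow(img, (0, 0), tol=1)     # stops almost immediately
relaxed = region_grow(img, (0, 0), tol=100)  # leaks across the bright edge
print(len(strict), len(relaxed))
```

With the strict tolerance the region stays tiny even inside the smooth dark area, while the relaxed tolerance swallows the bright column as well.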

28 The split-and-merge segmentation technique represents an image using which specific data structure?

region based segmentation Medium
A. Binary Search Tree
B. Hash Table
C. Quadtree
D. Linked List

29 When using K-means clustering for image segmentation based on color, why is it often advantageous to convert the image from RGB to the CIELAB ($L^*a^*b^*$) color space first?

clustering based segmentation Medium
A. RGB cannot be used with Euclidean distance metrics.
B. CIELAB automatically determines the optimal number of clusters ($K$).
C. CIELAB compresses the image, making K-means run faster.
D. CIELAB separates luminance ($L^*$) from chrominance ($a^*$, $b^*$), and its distances are more perceptually uniform.

30 Which of the following is a distinct advantage of Mean Shift clustering over K-means clustering for image segmentation?

clustering based segmentation Medium
A. Mean Shift is significantly computationally faster than K-means for large images.
B. Mean Shift requires defining the number of clusters a priori.
C. Mean Shift is strictly a parametric model.
D. Mean Shift does not require specifying the number of clusters in advance.

31 In a Generative Adversarial Network (GAN) trained to synthesize images, what does the minimax loss function conceptually represent?

Generative Models Medium
A. The Generator minimizing the Discriminator's accuracy, while the Discriminator maximizes it.
B. The Discriminator generating fake images to fool the Generator.
C. Both models cooperating to minimize the distance to the real data distribution.
D. The Generator maximizing the image resolution while the Discriminator minimizes noise.

32 In Denoising Diffusion Probabilistic Models (DDPMs), what is the primary function of the forward (diffusion) process?

Generative Models Medium
A. To compress an image into a lower-dimensional latent vector.
B. To gradually add Gaussian noise to an image until it becomes pure noise.
C. To train a neural network to remove noise from an image.
D. To map textual prompts to a latent image space.
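
The forward process has a closed form, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$. The sketch below assumes a linear $\beta$ schedule from $10^{-4}$ to $0.02$ over 1000 steps (the schedule used in the original DDPM paper); the helper names `alpha_bars` and `q_sample` are hypothetical.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): the signal fraction kept at step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) in closed form for a flat list of pixel values."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed linear schedule
abar = alpha_bars(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]            # made-up "image" of four pixels
x_early = q_sample(x0, 0, abar, rng)   # almost identical to x0
x_late = q_sample(x0, T - 1, abar, rng)  # almost pure Gaussian noise
print(abar[0], abar[-1])
```

The printed values show the signal coefficient starting near 1 and decaying toward 0, which is what "gradually adding noise until the image becomes pure noise" means quantitatively.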

33 How do Vision Transformers (ViTs) handle the 2D spatial structure of an image when processing it via a standard transformer architecture designed for 1D sequences?

Vision transformers Medium
A. By splitting the image into fixed-size patches, flattening them, and adding 1D positional embeddings.
B. By using 2D convolutional layers before the transformer blocks.
C. By flattening the entire image into a single long sequence of pixels.
D. By relying entirely on the self-attention mechanism to deduce spatial relationships dynamically.

34 If an image is divided into $N$ patches for a Vision Transformer, what is the computational complexity of the self-attention mechanism with respect to $N$?

Vision transformers Medium
A.
B.
C.
D.

35 In contrastive multimodal models like CLIP, how is the model trained to align images and text?

multimodal vision models Medium
A. By reconstructing the image from the text embedding using a decoder.
B. By maximizing the cosine similarity between the embeddings of paired images and texts while minimizing it for incorrect pairs.
C. By training a classifier to predict the text tokens directly from the image pixels.
D. By generating a caption for the image and calculating the BLEU score.

36 In a multimodal architecture that uses cross-attention to fuse image and text features, which component typically acts as the 'Query' in the cross-attention layer when generating text conditioned on an image?

multimodal vision models Medium
A. The image representations
B. The positional encodings
C. The classification token (CLS)
D. The text representations

37 The Segment Anything Model (SAM) is characterized as a 'promptable' segmentation model. What does this mean in practice?

large vision models Medium
A. It prompts the user to manually draw the segmentation boundaries.
B. It can output a segmentation mask based on various inputs like points, boxes, or text.
C. It generates natural language prompts to describe segments in an image.
D. It requires text prompts to generate an entirely new image.

38 Which of the following best describes the 'zero-shot' capability of large vision models?

large vision models Medium
A. The ability to predict object trajectories with zero initial data.
B. The ability to compress an image to zero loss.
C. The ability to train a model from scratch in zero time.
D. The ability to perform a task on a dataset or class it was never explicitly trained on, without further fine-tuning.

39 In a standard Visual Question Answering (VQA) system, how are the final predictions typically generated?

Vision-Language Models Medium
A. By converting the image to text via OCR and querying a database.
B. By fusing extracted image features and extracted text features, then passing them through a classifier or decoder.
C. By inputting only the image into a CNN and classifying the scene.
D. By ignoring the image and treating the task as standard text-based question answering.

40 Which objective function is most commonly used to train the language decoder portion of an Image Captioning model?

Vision-Language Models Medium
A. Contrastive loss between two different generated captions.
B. Intersection over Union (IoU) loss.
C. Mean Squared Error (MSE) loss between images.
D. Cross-entropy loss for next-token prediction, conditioned on the image and previous tokens.

41 In a complex urban scene containing mutually occluding cars (countable objects) and continuous regions of sky and road (amorphous regions), which of the following formulations perfectly describes the mathematical objective of panoptic segmentation?

Overview of image segmentation Hard
A. Generating bounding boxes for all objects and subsequently predicting binary masks independently for each bounding box.
B. Assigning an instance ID to every pixel, treating amorphous regions as a single large instance.
C. Mapping each pixel $i$ to a tuple $(l_i, z_i)$, where $l_i$ is the semantic class label and $z_i$ uniquely identifies 'thing' instances and is ignored or set to a null value for 'stuff' classes.
D. Assigning a single semantic class label to every pixel in the image, ignoring distinct object identities.

42 Otsu's method determines the optimal threshold by maximizing the between-class variance $\sigma_B^2$. Under which of the following conditions does Otsu's method mathematically fail to identify the ideal threshold separating an object and background?

Thresholding based segmentation Hard
A. When the image contains high-frequency salt-and-pepper noise.
B. When the image is illuminated by a uniform, spatially invariant light source.
C. When the histogram is perfectly bimodal with equal class probabilities.
D. When the variances of the object and background classes are drastically different, and the object occupies a disproportionately small fraction of the image area.

43 When computing a local adaptive threshold for a window of size $w \times w$ using the local mean and variance, what is the asymptotic time complexity per pixel if the algorithm utilizes integral images (summed-area tables)?

Thresholding based segmentation Hard
A.
B.
C.
D.
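
A summed-area table makes the cost of a window query independent of the window size: any rectangular sum needs four table lookups. The sketch below shows the local mean only; the local variance follows the same pattern with a second table built from squared intensities. The helper names are hypothetical, and windows are simply clipped at the image border.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    rows, cols = len(img), len(img[0])
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (
                img[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]
            )
    return sat


def window_sum(sat, r0, c0, r1, c1):
    """Sum over rows r0..r1-1 and columns c0..c1-1 using four lookups."""
    return sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]


def local_mean(sat, r, c, half, rows, cols):
    """Mean of the (2*half+1)-wide window centered at (r, c), clipped to the image."""
    r0, r1 = max(0, r - half), min(rows, r + half + 1)
    c0, c1 = max(0, c - half), min(cols, c + half + 1)
    area = (r1 - r0) * (c1 - c0)
    return window_sum(sat, r0, c0, r1, c1) / area


img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
sat = integral_image(img)
center_mean = local_mean(sat, 1, 1, 1, 3, 3)
print(center_mean)
```

Building the table costs one pass over the image; after that, each pixel's local mean is a constant number of operations regardless of `half`.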

44 In the Canny edge detector, hysteresis thresholding uses a high threshold () and a low threshold (). If a continuous edge segment consists of pixels with gradient magnitudes , , and connected in the order , which pixels are retained as strong edges in the final output?

Edge-based segmentation Hard
A. None of the pixels.
B. Pixel 1 only.
C. Pixels 1, 2, and 3.
D. Pixels 1 and 3 only.
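
The hysteresis rule itself can be sketched on a one-dimensional chain of edge pixels: pixels at or above the high threshold are kept outright, and pixels at or above the low threshold are kept only if they connect to a kept pixel. The magnitudes and thresholds below are made-up values for illustration, and `hysteresis_1d` is a hypothetical helper rather than a library function.

```python
def hysteresis_1d(mags, low, high):
    """Return a keep/discard flag per pixel along a 1D chain.

    Pixels >= high are strong and always kept. Pixels >= low are kept
    only if reachable from a strong pixel through neighbors >= low.
    """
    n = len(mags)
    keep = [m >= high for m in mags]
    stack = [i for i in range(n) if keep[i]]
    while stack:
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and not keep[j] and mags[j] >= low:
                keep[j] = True
                stack.append(j)
    return keep


# Illustrative values: one strong pixel, two connected weak pixels,
# one sub-threshold gap, and one isolated weak pixel.
mags = [120, 60, 70, 20, 65]
kept = hysteresis_1d(mags, low=50, high=100)
print(kept)
```

The isolated weak pixel at the end is discarded because the sub-threshold gap cuts it off from the strong pixel.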

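45 The Laplacian of Gaussian (LoG) filter is defined as $\nabla^2 G_\sigma(x, y)$, the Laplacian of a 2D Gaussian with standard deviation $\sigma$. In the frequency domain, what is the effect of the parameter $\sigma$ on the filter's magnitude response as a function of radial frequency $\omega$?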

Edge-based segmentation Hard
A. It acts as a low-pass filter where the cutoff frequency increases linearly with $\sigma$.
B. It acts as a high-pass filter where increasing amplifies high frequencies proportionally to .
C. It acts as a band-pass filter where the peak frequency response is inversely proportional to $\sigma$.
D. It acts as an all-pass filter, shifting the phase of edges but leaving magnitudes unchanged.

46 In the Split-and-Merge segmentation algorithm, a region $R$ is split into four sub-regions if a homogeneity predicate $P(R) = \text{FALSE}$. To prevent the generation of blocky boundaries, which data structure and subsequent process are critical for optimal merging?

Region based segmentation Hard
A. A Region Adjacency Graph (RAG) followed by merging adjacent regions $R_i$ and $R_j$ if $P(R_i \cup R_j) = \text{TRUE}$.
B. A minimum spanning tree followed by normalized cuts.
C. A quadtree followed by merging only siblings that share the same parent node if $P(R_i \cup R_j) = \text{TRUE}$.
D. A KD-Tree followed by K-means clustering.

47 The standard watershed transform often suffers from severe over-segmentation due to local minima caused by noise. In marker-controlled watershed segmentation, how is the topological surface (gradient image) modified to enforce markers as the only regional minima?

Region based segmentation Hard
A. By applying the Hough transform to force watershed lines into geometric shapes.
B. By applying a heavily regularized Gaussian blur to completely eliminate high-frequency noise.
C. By applying a global threshold to the gradient image before the flooding process.
D. By using morphological reconstruction (minima imposition) to modify the gradient image such that local minima only occur at marker locations.

48 In Mean Shift segmentation, pixels are mapped into a joint spatial-color feature space $(\mathbf{x}^s, \mathbf{x}^r)$. If the spatial bandwidth $h_s$ approaches infinity while the range bandwidth $h_r$ remains small, what is the asymptotic behavior of the segmentation?

Clustering based segmentation Hard
A. The algorithm reduces to a purely spatial clustering, resulting in uniformly sized superpixels regardless of color.
B. The algorithm ignores spatial constraints and reduces to exact K-means clustering in the RGB color space.
C. The algorithm fails to converge because the kernel density estimate becomes uniform everywhere.
D. The algorithm converges to density modes entirely based on color similarity, effectively performing global color quantization.

49 When applying Gaussian Mixture Models (GMM) with the EM algorithm for image segmentation, how does the covariance matrix type affect the resulting segment shapes in the color space?

Clustering based segmentation Hard
A. A spherical covariance matrix allows clusters to have arbitrary orientations as long as the volume is fixed.
B. Covariance constraints only affect the spatial domain, leaving color space clusters unaffected.
C. A diagonal covariance matrix restricts clusters to be perfectly spherical.
D. A full covariance matrix allows clusters to model elliptical distributions oriented at any angle.

50 In the context of Denoising Diffusion Probabilistic Models (DDPMs), the forward diffusion process can be formulated as a discrete Markov chain. In the continuous-time limit, this process transforms into which of the following mathematical constructs?

Generative Models Hard
A. A Stochastic Differential Equation (SDE).
B. A partial differential equation of the Navier-Stokes family.
C. A Hamiltonian Monte Carlo (HMC) trajectory.
D. A deterministic Ordinary Differential Equation (ODE) controlled by score matching without a Wiener process.

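51 Variational Autoencoders (VAEs) optimize the Evidence Lower Bound (ELBO), defined as $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$. What phenomenon occurs if the KL divergence term is heavily penalized (e.g., in a $\beta$-VAE with $\beta \gg 1$)?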

Generative Models Hard
A. The discriminator network overpowers the generator, leading to mode collapse.
B. The decoder deterministically maps every point in the latent space to the dataset mean.
C. The reconstruction quality becomes near-perfect, but the latent space becomes highly entangled.
D. The model suffers from posterior collapse, where the latent representation ignores the input and unconditionally matches the prior.

52 To solve the mode collapse issue and vanishing gradients in standard GANs, the Wasserstein GAN (WGAN) introduces the Earth Mover's Distance. What strict constraint must be enforced on the discriminator (critic) for the WGAN formulation to be mathematically valid?

Generative Models Hard
A. It must be an invertible normalizing flow network.
B. It must be a 1-Lipschitz continuous function.
C. It must output bounded probabilities in the range $[0, 1]$ using a sigmoid activation.
D. It must utilize Batch Normalization in all hidden layers to preserve variance.

53 A standard Vision Transformer (ViT) splits an image of size $H \times W$ into patches of size $P \times P$. What is the computational complexity of the global self-attention mechanism with respect to the input image dimensions, assuming embedding dimension $D$?

Vision transformers Hard
A.
B.
C.
D.
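
The scaling can be checked with simple arithmetic. Writing the image as $H \times W$, the patch as $P \times P$, and the embedding dimension as $D$ (symbols assumed here), the sequence length is $N = HW / P^2$ and the attention score matrix $QK^\top$ costs on the order of $N^2 D$ multiply-adds. The sketch below counts only that term and ignores the class token, projections, and MLP blocks; `vit_attention_cost` is a hypothetical helper.

```python
def vit_attention_cost(height, width, patch, dim):
    """Return (N, N^2 * D): patch count and approximate cost of the QK^T scores."""
    n = (height // patch) * (width // patch)
    return n, n * n * dim


n_small, cost_small = vit_attention_cost(224, 224, 16, 768)
n_large, cost_large = vit_attention_cost(448, 448, 16, 768)

# Doubling each side quadruples N and multiplies the attention cost by 16.
print(n_small, n_large, cost_large // cost_small)
```

This is why global self-attention becomes expensive quickly at high resolution: the cost is quadratic in the number of patches, hence quadratic in $HW$.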

54 When fine-tuning a Vision Transformer (ViT) on higher-resolution images than it was pre-trained on (keeping patch size constant), the sequence length increases. How is the pre-trained positional embedding typically adapted to handle the new sequence length?

Vision transformers Hard
A. By applying bicubic interpolation to the 1D flattened positional embeddings directly.
B. By reshaping the positional embeddings into a 2D grid matching the original patch layout, interpolating to the new grid size, and flattening back to 1D.
C. By padding the new positional embeddings with zeros to preserve the learned magnitude.
D. By freezing the positional embeddings and applying a recurrent neural network to extrapolate the missing positions.

55 The Swin Transformer computes self-attention within local windows and introduces a shifted window partitioning in alternating layers. Mathematically, what is the primary purpose of this shifted window mechanism?

Vision transformers Hard
A. To augment the data by translation, rendering the transformer perfectly translation-invariant.
B. To allow the direct computation of multiscale features without needing to merge patches in deeper layers.
C. To provide cross-window connections, expanding the receptive field hierarchically while maintaining linear computational complexity relative to image size.
D. To reduce the computational complexity of self-attention from quadratic to strictly logarithmic.

56 CLIP (Contrastive Language-Image Pre-training) is trained using the InfoNCE loss to align image and text representations. If $I_i$ and $T_i$ are $L_2$-normalized image and text embeddings in a batch of size $N$, and $\tau$ is a learnable temperature parameter, what is the symmetric loss formulation minimized during training?

Multimodal vision models Hard
A.
B.
C.
D.
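
A symmetric contrastive loss of this kind can be sketched in plain Python. The sketch assumes the embeddings are already $L_2$-normalized, that row `i` of each list is a matching image-text pair, and that similarities are divided by a temperature `tau` (one common convention); the function name `clip_loss` and the toy embeddings are illustrative.

```python
import math


def clip_loss(img_emb, txt_emb, tau):
    """Symmetric InfoNCE: mean of image-to-text and text-to-image cross-entropy."""
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/tau
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / tau for t_vec in txt_emb]
        for i_vec in img_emb
    ]

    def cross_entropy_diag(mat):
        """Average cross-entropy where the correct class of row k is column k."""
        total = 0.0
        for k, row in enumerate(mat):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[k]
        return total / len(mat)

    image_to_text = cross_entropy_diag(logits)
    text_to_image = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)


images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = clip_loss(images, texts_aligned, tau=0.1)
loss_swapped = clip_loss(images, texts_swapped, tau=0.1)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each image is most similar to its own text and large when the pairing is wrong, which is the behavior the training objective rewards.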

57 In the Segment Anything Model (SAM), how does the architecture intrinsically handle the ambiguity of a single point prompt (e.g., a point placed on a person's shirt could mean the shirt, or the whole person)?

Large vision models Hard
A. By utilizing a recurrent neural network to request a second point from the user.
B. By implicitly decoding the point into text using CLIP and relying on the language model to guess the intended scale.
C. By aggressively applying conditional random fields (CRFs) to force the mask to snap to the largest semantic boundary.
D. By predicting multiple valid masks (e.g., whole, part, and subpart) along with a confidence score for each mask.

58 DINOv2, a self-supervised large vision model, utilizes a student-teacher knowledge distillation framework without labels. To prevent the notorious feature collapse problem (where the model outputs a constant vector), which two opposing mechanisms are explicitly applied to the teacher's outputs?

Large vision models Hard
A. Centering (subtracting a moving average) and Sharpening (using a low temperature in the softmax).
B. Instance normalization and layer normalization.
C. Dropout and Stochastic Depth.
D. Weight decay and gradient clipping.

59 In Vision-Language Models like Flamingo, how does the Perceiver Resampler module efficiently bridge the high-resolution, variable-length output of the vision encoder to the fixed context window of the Large Language Model?

Vision-Language Models Hard
A. By interpolating the sequence of visual tokens to exactly match the length of the text prompt tokens.
B. By utilizing a dynamic routing capsule network that drops visual tokens with low semantic entropy.
C. By applying an average pooling operation over the spatial dimensions, reducing any image to a 1x1 embedding.
D. By taking a fixed number of learnable latent queries and applying cross-attention to the flattened vision features, yielding a fixed number of visual tokens.

60 Models like BLIP-2 introduce a Q-Former to align visual features with the text space. During the representation learning pre-training stage of the Q-Former, three distinct objectives are optimized simultaneously. Which of the following is NOT one of these objectives?

Vision-Language Models Hard
A. Masked Image Modeling (MIM)
B. Image-Text Contrastive Learning (ITC)
C. Image-Text Matching (ITM)
D. Image-Grounded Text Generation (ITG)