Unit6 - Subjective Questions
INT345 • Practice Questions with Detailed Answers
Define image segmentation. What are the primary objectives and common applications of image segmentation in computer vision?
Image Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects or superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze.

Primary Objectives:
- To separate objects of interest from the background.
- To group pixels that share similar characteristics (such as color, intensity, or texture).
- To find boundaries (lines, curves) in images.

Common Applications:
- Medical Imaging: Locating tumors, measuring tissue volume, and guiding computer-integrated surgery.
- Autonomous Vehicles: Identifying roads, pedestrians, traffic signs, and other vehicles.
- Facial Recognition: Segmenting the face from the background for feature extraction.
- Satellite Imaging: Land cover classification and object detection (e.g., buildings, roads).
Explain thresholding-based segmentation. Distinguish between global and local (adaptive) thresholding.
Thresholding-based Segmentation is one of the simplest techniques: pixels are partitioned according to their intensity relative to a threshold value $T$. If a pixel intensity $f(x, y) > T$, it is assigned to one class (e.g., object), otherwise to another (e.g., background).

Global Thresholding:
- Uses a single threshold value $T$ for the entire image.
- It is effective when there is a distinct contrast between the objects and the background, resulting in a bimodal histogram.
- Mathematical representation:
  $$g(x, y) = \begin{cases} 1, & f(x, y) > T \\ 0, & f(x, y) \le T \end{cases}$$

Local (Adaptive) Thresholding:
- Uses a threshold $T(x, y)$ that varies across different regions of the image.
- The threshold is calculated dynamically based on the local neighborhood of the pixel (e.g., local mean or variance).
- It is highly effective for images with varying illumination or shadows where global thresholding fails.
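The difference between the two schemes can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the odd `block` window size, and the `delta` margin are assumptions made for the example, and the local threshold here is simply the window mean plus `delta`.

```python
import numpy as np

def global_threshold(img, t):
    """Label a pixel 1 when its intensity exceeds the single threshold t."""
    return (img > t).astype(np.uint8)

def adaptive_mean_threshold(img, block=3, delta=5.0):
    """Label a pixel 1 when it exceeds its local block x block mean by more than delta."""
    pad = block // 2                     # assumes an odd block size
    padded = np.pad(img.astype(float), pad, mode="edge")
    h, w = img.shape
    local_mean = np.zeros((h, w))
    # Accumulate the shifted copies of the image, then divide to get the window mean.
    for dy in range(block):
        for dx in range(block):
            local_mean += padded[dy:dy + h, dx:dx + w]
    local_mean /= block * block
    return (img > local_mean + delta).astype(np.uint8)

# Background brightness ramps from 0 to 200; two thin objects are 40 levels brighter.
img = np.tile(np.linspace(0, 200, 8), (8, 1))
img[:, 1] += 40   # object in the dark part of the image
img[:, 6] += 40   # object in the bright part of the image

global_mask = global_threshold(img, 100)
adaptive_mask = adaptive_mean_threshold(img)
```

On this synthetic ramp the global mask marks the whole bright half and misses the object in the dark half, while the adaptive mask picks out only the two object columns.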
Describe Otsu's method for image thresholding. What is the mathematical objective it tries to optimize?
Otsu's Method is an algorithm used to automatically perform clustering-based image thresholding. It assumes that the image contains two classes of pixels following a bimodal histogram (foreground and background) and calculates the optimum threshold separating the two classes.

Mathematical Objective:
The algorithm searches for the threshold that minimizes the intra-class variance (the variance within the foreground and background pixels) or, equivalently, maximizes the inter-class variance.

Let the probabilities of the two classes separated by a threshold $t$ be $\omega_0(t)$ and $\omega_1(t)$, and their respective means be $\mu_0(t)$ and $\mu_1(t)$.

The inter-class variance is given by:

$$\sigma_b^2(t) = \omega_0(t)\,\omega_1(t)\,[\mu_0(t) - \mu_1(t)]^2$$

Otsu's method iterates through all possible threshold values (0 to 255 for an 8-bit image) and finds the threshold $t$ that maximizes $\sigma_b^2(t)$.
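The exhaustive search can be written compactly with cumulative sums over the histogram. The sketch below assumes an 8-bit integer image and treats pixels with intensity at or below $t$ as class 0; it uses the algebraically equivalent form $\sigma_b^2(t) = (\mu_T\,\omega_0(t) - m(t))^2 / (\omega_0(t)\,\omega_1(t))$, where $m(t)$ is the cumulative mean and $\mu_T$ the global mean. The function name is illustrative.

```python
import numpy as np

def otsu_threshold(img):
    """Return the threshold t in 0..255 that maximizes the inter-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                  # normalized histogram
    levels = np.arange(256)
    omega0 = np.cumsum(p)                  # probability of class 0 (intensity <= t)
    omega1 = 1.0 - omega0                  # probability of class 1 (intensity > t)
    cum_mean = np.cumsum(p * levels)       # m(t)
    total_mean = cum_mean[-1]              # mu_T
    denom = omega0 * omega1
    sigma_b2 = np.zeros(256)
    valid = denom > 0                      # skip thresholds that leave one class empty
    sigma_b2[valid] = (total_mean * omega0[valid] - cum_mean[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b2))

# Two-level test image: half the pixels at 50, half at 200.
img = np.array([[50, 50, 200, 200]] * 4, dtype=np.uint8)
t = otsu_threshold(img)
mask = img > t
```

Any threshold between the two intensity levels separates this image perfectly; `np.argmax` returns the first such value.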
What is edge-based segmentation? Discuss its advantages and limitations.
Edge-based Segmentation relies on identifying the edges or boundaries between different regions in an image. An edge is essentially a sharp change in pixel intensity. Edge detection filters (like Sobel, Prewitt, or Canny) are used to compute the gradient of the image.

Process:
1. Edge Detection: Apply derivative filters to find areas of rapid intensity change.
2. Edge Linking: Connect fragmented edges to form continuous boundaries enclosing regions.

Advantages:
- Preserves the structural properties of objects well.
- Works well in images with high contrast between regions.

Limitations:
- Highly susceptible to noise, as noise also produces sharp intensity changes.
- Often produces disconnected or broken edges (edge gaps), requiring complex edge-linking algorithms.
- Does not work well if object boundaries are blurred or have smooth intensity transitions.
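The edge-detection step can be illustrated with a hand-rolled Sobel filter. This is a sketch under simple assumptions (a grayscale image, edge-replicated borders, a fixed magnitude threshold chosen for the example); it covers only step 1 and does no edge linking.

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude from the 3x3 Sobel kernels (borders replicated)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Correlate with each kernel by summing weighted, shifted copies of the image.
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * window
            gy += ky[dy, dx] * window
    return np.hypot(gx, gy)

# Vertical step edge: left half 0, right half 100.
img = np.zeros((6, 6))
img[:, 3:] = 100
magnitude = sobel_magnitude(img)
edges = magnitude > 200   # illustrative threshold
```

The response is non-zero only in the two columns adjacent to the step, which is where the boundary lies.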
Explain the Region Growing algorithm in the context of region-based segmentation.
Region Growing is a region-based image segmentation technique that groups pixels or subregions into larger regions based on predefined criteria (such as intensity, color, or texture).

Algorithm Steps:
1. Seed Selection: Start with a set of "seed" points. These can be selected manually or automatically based on certain properties.
2. Similarity Criteria: Define a condition for including neighboring pixels (e.g., the difference in pixel intensity between the neighbor and the seed must be less than a threshold $T$).
3. Growing: Examine the neighboring pixels of the initial seeds. If they satisfy the similarity criteria, add them to the region.
4. Iteration: Repeat step 3 for the newly added pixels.
5. Stopping Rule: Stop when no more pixels can be added to any region.

Pros and Cons:
- Pros: Can correctly separate regions that have the same properties but are spatially separated.
- Cons: Computationally expensive and highly sensitive to noise and the initial choice of seeds.
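The steps above can be sketched as a breadth-first search from a single seed. The version below is a minimal example, assuming a grayscale image, 4-connectivity, and a similarity test against the seed intensity; the function and parameter names are chosen for illustration.

```python
from collections import deque
import numpy as np

def region_grow(img, seed, tol):
    """Grow a region from seed, adding 4-connected pixels within tol of the seed value."""
    h, w = img.shape
    seed_val = float(img[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:                                   # stops when no pixel can be added
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(float(img[ny, nx]) - seed_val) < tol:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask

# Two bright areas with the same intensity that do not touch each other.
img = np.full((5, 5), 10)
img[1:3, 1:3] = 200
img[4, 4] = 200
mask = region_grow(img, (1, 1), tol=20)
```

Only the 2x2 block containing the seed is returned; the isolated bright pixel at (4, 4) has the same intensity but is not connected, so it stays out of the region.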
Describe the Region Splitting and Merging technique. How does the Quadtree structure aid this process?
Region Splitting and Merging combines a top-down and a bottom-up approach to region-based segmentation.

Process:
1. Splitting (Top-down): Start with the entire image as a single region. Let $R$ represent the region and $P$ be a logical predicate (e.g., variance of intensity $< T$). If $P(R)$ is false, split the region into four equal non-overlapping sub-regions.
2. Quadtree Representation: This recursive splitting is naturally represented by a Quadtree structure, where the root is the whole image, and every node has exactly four children or no children.
3. Merging (Bottom-up): Once splitting is complete, adjacent regions $R_i$ and $R_j$ are merged if they satisfy the predicate, i.e., $P(R_i \cup R_j)$ is true.
4. Stopping: The process terminates when no further splitting or merging is possible.

Advantages: It does not require manual seed selection like region growing and guarantees connected regions.
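The splitting phase maps directly onto a recursive function whose call tree is the quadtree. The sketch below covers only the split step (no merging) and assumes a square image whose side is a power of two, with "variance at most `max_var`" as the predicate; these choices and the names are illustrative.

```python
import numpy as np

def quadtree_split(img, y=0, x=0, size=None, max_var=25.0, min_size=1):
    """Return the quadtree leaves as (row, col, size) blocks that satisfy the predicate."""
    if size is None:
        size = img.shape[0]
    block = img[y:y + size, x:x + size]
    # Predicate P(R): the block is homogeneous enough (or cannot be split further).
    if size <= min_size or block.var() <= max_var:
        return [(y, x, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):            # every internal node has exactly four children
        for dx in (0, half):
            leaves += quadtree_split(img, y + dy, x + dx, half, max_var, min_size)
    return leaves

# A flat 4x4 image with a single bright pixel in the top-left corner.
img = np.zeros((4, 4))
img[0, 0] = 100
leaves = quadtree_split(img)
```

Only the quadrant containing the bright pixel is subdivided down to single pixels; the three homogeneous quadrants remain 2x2 leaves.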
How is K-means clustering utilized for image segmentation? Discuss the steps involved.
K-means Clustering is an unsupervised learning algorithm used to partition an image into $K$ distinct regions based on pixel features like color or intensity.

Steps for Image Segmentation:
1. Initialization: Choose $K$ initial cluster centers (centroids) randomly. For a color image, a centroid is a point in the RGB color space.
2. Assignment: Calculate the distance (usually Euclidean) between every pixel and each of the $K$ centroids. Assign each pixel to the cluster with the nearest centroid.
3. Update: Recalculate the centroid of each cluster by taking the mean of all pixels currently assigned to that cluster. Let $C_k$ be the $k$-th cluster and $\mu_k$ its centroid:
   $$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
4. Iteration: Repeat the assignment and update steps until convergence (i.e., when centroids do not change significantly).

Pros/Cons: It is simple and fast but requires knowing $K$ in advance and is sensitive to the initial placement of centroids.
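A minimal NumPy sketch of the four steps follows. It assumes the image has already been flattened to an array of shape (number of pixels, 3); the function name, the iteration cap, and the fixed random seed are illustrative choices.

```python
import numpy as np

def kmeans(pixels, k, iters=20, seed=0):
    """Cluster rows of pixels into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct pixels as the starting centroids.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment: Euclidean distance from every pixel to every centroid.
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned pixels.
        new_centroids = np.array([
            pixels[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # convergence check
            break
        centroids = new_centroids
    return labels, centroids

# Three dark and three bright RGB pixels.
pixels = np.array([[10, 10, 10], [12, 12, 12], [8, 8, 8],
                   [200, 200, 200], [210, 210, 210], [190, 190, 190]], dtype=float)
labels, centroids = kmeans(pixels, k=2)
```

For an actual image of shape (H, W, 3), `img.reshape(-1, 3)` produces the input and `labels.reshape(H, W)` gives the segmentation map.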
Explain the concept of Mean Shift clustering and its application in image segmentation.
Mean Shift Clustering is a non-parametric, density-based clustering algorithm that does not require knowing the number of clusters beforehand. It works by shifting data points toward regions of higher density.

Algorithm Steps:
1. Feature Space: Map the image pixels into a feature space (e.g., spatial coordinates + color space, or RGB).
2. Window Initialization: For each data point, define a window (or kernel, usually Gaussian) of radius $h$ (bandwidth).
3. Mean Shift Calculation: Calculate the mean of all data points within the window.
4. Shifting: Shift the center of the window to this new mean.
5. Convergence: Repeat steps 3 and 4 until the window stops moving (converges to a local density maximum or mode).
6. Segmentation: Group all pixels that converge to the same mode into a single cluster/segment.

Advantages: Robust to outliers, handles arbitrarily shaped clusters, and automatically determines the number of segments.
Disadvantages: Computationally very expensive, especially for high-resolution images.
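The procedure can be sketched with a flat (uniform) kernel, which is simpler than the Gaussian kernel mentioned above and is used here purely for brevity. The grouping rule (modes closer than half the bandwidth are treated as the same mode) and all names are assumptions of this example.

```python
import numpy as np

def mean_shift(points, bandwidth, iters=50, tol=1e-3):
    """Flat-kernel mean shift; returns (labels, mode centers)."""
    modes = points.astype(float).copy()          # one window per data point
    for _ in range(iters):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            in_window = np.linalg.norm(points - m, axis=1) <= bandwidth
            shifted[i] = points[in_window].mean(axis=0)   # mean of points in the window
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted                                   # move each window to its mean
        if converged:
            break
    # Group points whose windows ended up at (nearly) the same mode.
    labels = np.full(len(points), -1)
    centers = []
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

points = np.array([[0, 0], [1, 0], [0, 1],
                   [10, 10], [11, 10], [10, 11]], dtype=float)
labels, centers = mean_shift(points, bandwidth=3.0)
```

The number of clusters (two here) is never passed in; it falls out of how many distinct modes the windows converge to.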
What are Generative Models in computer vision? Briefly distinguish between VAEs and GANs.
Generative Models in computer vision aim to learn the underlying true data distribution of a training set so they can generate new, previously unseen data points (images) with some variations.

Variational Autoencoders (VAEs):
- Concept: They use an encoder to map input images to a latent space distribution (described by mean and variance) and a decoder to reconstruct images by sampling from this distribution.
- Optimization: They maximize a variational lower bound (ELBO), balancing reconstruction accuracy and latent space regularization (KL divergence).
- Output: Tends to produce somewhat blurry images due to the probabilistic nature and reconstruction loss.

Generative Adversarial Networks (GANs):
- Concept: Consists of two networks: a Generator (creates fake images) and a Discriminator (tries to distinguish fake from real). They are trained simultaneously in a minimax game.
- Optimization: Implicitly models the distribution without defining an explicit density function.
- Output: Tends to produce highly realistic and sharp images but suffers from training instability and mode collapse.
Derive the loss function for Generative Adversarial Networks (GANs) and explain the minimax game concept.
GAN Loss Function and Minimax Game:
A GAN consists of a Generator $G$ and a Discriminator $D$.
- $D(x)$ represents the probability that $x$ came from the real data distribution $p_{data}(x)$.
- $G(z)$ is the output of the generator given noise $z$ from distribution $p_z(z)$.

Discriminator's Goal: Maximize the probability of correctly classifying real images as real ($D(x) \to 1$) and fake images as fake ($D(G(z)) \to 0$).
Generator's Goal: Minimize the probability that the discriminator gets it right. It wants to fool $D$, thereby minimizing $\log(1 - D(G(z)))$.

This creates a two-player minimax game with the value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

Explanation:
- The inner maximization over $D$ trains the discriminator to be as accurate as possible.
- The outer minimization over $G$ trains the generator to create samples that the discriminator evaluates as highly likely to be real (i.e., $D(G(z)) \to 1$).
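To complete the derivation, it helps to solve the inner maximization in closed form. The steps below are the standard argument; the symbol $p_g$ (the distribution of generated samples $G(z)$ with $z \sim p_z$) is notation introduced here and does not appear in the text above.

```latex
% Rewrite the value function as a single integral over x,
% using p_g for the distribution of G(z), z ~ p_z:
V(D, G) = \int_x \Big[ p_{data}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big] \, dx

% For fixed a, b > 0, the function y \mapsto a \log y + b \log(1 - y)
% attains its maximum on (0, 1) at y = a / (a + b). Applying this pointwise:
D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}

% Substituting D^* back into V gives the generator's objective:
V(D^*, G) = -\log 4 + 2 \, \mathrm{JSD}\big(p_{data} \,\|\, p_g\big)
```

Since the Jensen-Shannon divergence is non-negative and zero only when the two distributions are equal, the generator's optimum is $p_g = p_{data}$, where $D^*(x) = 1/2$ everywhere and the value of the game is $-\log 4$.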
Describe the architecture of Vision Transformers (ViT). How do they process an image compared to traditional CNNs?
Vision Transformers (ViT) adapt the Transformer architecture, originally designed for NLP, directly to images without using convolutions.

Architecture and Processing:
1. Patch Extraction: Unlike CNNs that process pixels through local receptive fields, ViT splits the input image into fixed-size, non-overlapping patches (e.g., $16 \times 16$ pixels).
2. Linear Projection (Patch Embedding): Each patch is flattened into a 1D vector and passed through a trainable linear projection layer to create "patch embeddings" (similar to word embeddings in NLP).
3. Positional Encoding: Since Transformers lack an inherent sense of order, learnable positional embeddings are added to the patch embeddings to retain spatial information.
4. Class Token: A learnable [CLS] token is prepended to the sequence of patches. Its state at the output of the Transformer encoder serves as the global image representation for classification.
5. Transformer Encoder: The sequence is processed through alternating layers of Multi-Head Self-Attention (MHSA) and Multi-Layer Perceptrons (MLP), with Layer Normalization and residual connections.

Comparison to CNNs: ViTs use global self-attention across the whole image from the very first layer, lacking the inductive biases (translation equivariance, local neighborhood structure) inherent to CNNs. Thus, ViTs usually require massive datasets for pre-training to perform well.
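Steps 1 to 4 amount to a reshape followed by a matrix multiply. The sketch below uses random stand-ins for the learned parameters (projection matrix, [CLS] token, positional embeddings) and a small 32x32 image; the embedding width of 64 and all variable names are illustrative assumptions.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # (grid_row, grid_col, py, px, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
patches = image_to_patches(img, 16)                # 4 patches of 16*16*3 = 768 values

# Random stand-ins for parameters that would be learned during training.
W_embed = rng.normal(size=(768, 64)) * 0.02        # linear projection
cls_token = np.zeros((1, 64))                      # [CLS] token
pos_embed = rng.normal(size=(5, 64)) * 0.02        # one position per token

tokens = np.concatenate([cls_token, patches @ W_embed], axis=0) + pos_embed
```

The resulting `tokens` array, one row per token, is what the Transformer encoder in step 5 would consume.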
What is the Self-Attention mechanism in Vision Transformers? Formulate it mathematically.
Self-Attention allows a model to weigh the importance of different parts of the input sequence (in ViT, image patches) relative to a specific patch, enabling the capture of long-range dependencies.

Mechanism:
1. Each input embedding is linearly projected to create three vectors: Query ($q$), Key ($k$), and Value ($v$).
2. The attention score between a query and a key determines how much focus to put on the corresponding value.

Mathematical Formulation:
Let the matrices $Q$, $K$, and $V$ represent the queries, keys, and values for all patches, and $d_k$ be the dimension of the keys.

The Attention output is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$

- $QK^{T}$ computes the raw similarity scores (dot products) between all pairs of patches.
- Scaling by $1/\sqrt{d_k}$ prevents the dot products from growing too large and pushing the softmax into regions with near-zero gradients.
- The softmax function normalizes each row of scores to a probability distribution (summing to 1).
- Multiplying by $V$ produces the final output, which is a weighted sum of the values.
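The formula translates almost line for line into NumPy. This is a single-head sketch with randomly initialized projection matrices standing in for learned weights; the sizes (5 tokens, width 8, key dimension 4) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (tokens, tokens), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                     # 5 patch embeddings of width 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Row `i` of `weights` shows how strongly patch `i` attends to every patch, including itself.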
What are Multimodal Vision Models? Discuss the primary strategies for fusing different modalities.
Multimodal Vision Models are AI architectures capable of processing and understanding multiple types of data modalities simultaneously, such as images, text, audio, and video, to build a holistic representation of the input.

Fusion Strategies:
1. Early Fusion (Data-level): Different modalities are combined at the raw data or feature extraction level before being processed by the model.
   - Pros: Captures low-level interactions.
   - Cons: Difficult to align vastly different data types (e.g., text vs pixels).
2. Late Fusion (Decision-level): Each modality is processed independently by separate networks to produce individual predictions or high-level features. These outputs are then combined (e.g., averaging, voting, or a linear layer) at the very end.
   - Pros: Easy to implement.
   - Cons: Ignores complex cross-modal interactions during feature learning.
3. Intermediate (Deep) Fusion: Modalities are processed independently initially, but their latent representations are fused iteratively at intermediate layers of the network (often using Cross-Attention mechanisms).
   - Pros: Highly effective at capturing deep cross-modal relationships (e.g., ViLBERT, Flamingo).
Explain the architecture and training objective of Contrastive Language-Image Pre-training (CLIP) as a Vision-Language Model.
CLIP (Contrastive Language-Image Pre-training) is a highly influential Vision-Language Model (VLM) developed by OpenAI. It learns to associate images and text by training on roughly 400 million image-text pairs collected from the internet.

Architecture:
- Image Encoder: Can be a ResNet or a Vision Transformer (ViT). It extracts feature embeddings from images.
- Text Encoder: A Transformer-based architecture that extracts feature embeddings from text descriptions.

Training Objective (Contrastive Learning):
CLIP is trained using a symmetric contrastive loss. Given a batch of $N$ image-text pairs, it computes the cosine similarities between all $N \times N$ combinations of image and text embeddings.
- Positive Pairs: The $N$ correct pairings on the diagonal of the similarity matrix.
- Negative Pairs: The $N^2 - N$ incorrect pairings off the diagonal.

The model is trained to maximize the cosine similarity of the positive pairs while minimizing the cosine similarity of the negative pairs. Mathematically, it applies a cross-entropy loss over both the rows (image-to-text) and columns (text-to-image) of the similarity matrix and averages the two.
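The symmetric loss can be sketched in NumPy as follows. The embeddings are toy stand-ins for encoder outputs, and the temperature is fixed at 0.07 for illustration, whereas CLIP treats it as a learned parameter; the function names are assumptions of this example.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text cosine similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (N, N) scaled cosine similarities
    idx = np.arange(len(logits))                    # positives sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # rows: image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # columns: text -> image
    return (loss_i2t + loss_t2i) / 2

emb = np.eye(4)                                     # four mutually orthogonal toy embeddings
aligned = clip_loss(emb, emb)                       # every image matches its own text
shuffled = clip_loss(emb, emb[[1, 2, 3, 0]])        # every image paired with the wrong text
```

The loss is close to zero when each image embedding lines up with its own caption and large when the pairings are scrambled.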
How does Zero-Shot learning work in the context of Vision-Language Models like CLIP?
Zero-Shot Learning refers to the ability of a model to perform a task (like classifying an image) on classes it has never explicitly seen during supervised training.

Process in CLIP:
1. Prompt Engineering: Instead of predicting a simple class label (e.g., "dog"), the target classes are transformed into text prompts, such as "A photo of a [CLASS]." If there are 100 possible classes, 100 text prompts are generated.
2. Text Embedding: These 100 prompts are passed through the pre-trained Text Encoder to generate 100 text embeddings.
3. Image Embedding: The target image is passed through the Image Encoder to generate a single image embedding.
4. Similarity Calculation: The cosine similarity is computed between the image embedding and all 100 text embeddings.
5. Prediction: The class whose text prompt yields the highest similarity score with the image embedding is selected as the prediction.

This allows the model to classify any image into any category without requiring re-training or fine-tuning, leveraging the broad semantic understanding gained during pre-training.
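Steps 4 and 5 reduce to a normalized dot product and an argmax. In the sketch below the embeddings are hand-written 3-dimensional stand-ins, not outputs of a real CLIP encoder, and the function name is hypothetical; only the similarity and prediction logic is illustrated.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names):
    """Pick the class whose prompt embedding has the highest cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                               # one cosine similarity per class
    return class_names[int(np.argmax(sims))], sims

class_names = ["dog", "cat", "car"]
prompts = [f"A photo of a {name}." for name in class_names]

# Stand-in embeddings; a real system would obtain these from the two encoders.
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.1, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

prediction, sims = zero_shot_classify(image_emb, text_embs, class_names)
```

Adding a new class only requires adding one more prompt and its embedding; no weights are updated.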
Discuss Large Vision Models (LVMs) / Vision Foundation Models. Provide examples of their capabilities.
Large Vision Models (LVMs), often referred to as Vision Foundation Models, are massive neural networks trained on vast amounts of unlabeled or weakly labeled visual data (often using self-supervised learning). They act as a foundational base that can be adapted (fine-tuned or prompted) for a wide variety of downstream computer vision tasks.

Capabilities and Examples:
- Segment Anything Model (SAM) by Meta: An LVM designed for promptable image segmentation. By providing a prompt (a click, a bounding box, or text), SAM can segment any object in an image zero-shot, even for objects it hasn't seen before.
- DINO / DINOv2: Foundation models trained via self-supervised learning that produce highly robust visual features. They excel at dense tasks like depth estimation and semantic segmentation without requiring dense annotations.
- Generalization: LVMs exhibit strong out-of-distribution robustness and can handle diverse visual domains (medical, satellite, natural images) far better than traditional task-specific models.
Explain the concept of Diffusion Models in generative computer vision. Discuss the forward and reverse processes.
Diffusion Models are a class of generative models that create data (images) by learning to reverse a gradual noising process. They have become the state-of-the-art for image generation (e.g., DALL-E 2, Stable Diffusion).

1. Forward Process (Diffusion):
- A Markov chain that gradually adds Gaussian noise to an original image $x_0$ over $T$ steps.
- At step $t$, the noisy image $x_t$ is defined as:
  $$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
  where $\beta_t$ is a variance schedule.
- At the final step $T$, the image $x_T$ is essentially pure isotropic Gaussian noise.

2. Reverse Process (Generation):
- The goal is to learn the reverse distribution $p_\theta(x_{t-1} \mid x_t)$ to denoise the image step-by-step.
- A neural network (often a U-Net) is trained to predict the noise $\epsilon$ that was added to the image at timestep $t$.
- By starting with random noise $x_T$ and iteratively applying the trained reverse process, the model gradually denoises the input, eventually generating a high-quality, realistic image $x_0$.
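Because every forward step is Gaussian, $x_t$ can be sampled directly from $x_0$ without iterating: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, one has $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The sketch below uses that closed form. The linear schedule endpoints, the 0-based timestep index, and the function names are assumptions of the example, and no denoising network is included.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule and the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot; also return the noise that was added."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas, alpha_bar = make_schedule()
rng = np.random.default_rng(0)
x0 = rng.random((4, 4))                    # toy "image"
x_t, eps = q_sample(x0, 10, alpha_bar, rng)

# Given the true noise, x_0 can be recovered exactly; a trained network
# would supply an estimate of eps instead.
x0_recovered = (x_t - np.sqrt(1.0 - alpha_bar[10]) * eps) / np.sqrt(alpha_bar[10])
```

The pair (`x_t`, `eps`) is exactly the kind of training example the noise-prediction network sees: a noisy input and the noise it should output.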
Compare Vision Transformers (ViT) and Convolutional Neural Networks (CNNs). Highlight the trade-offs between inductive biases and data requirements.
Comparison between ViTs and CNNs:

1. Inductive Biases:
- CNNs: Have strong inductive biases like translation equivariance (the same filter detects a feature wherever it appears) and locality (pixels close to each other are related). This makes CNNs highly sample-efficient and good at learning from smaller datasets.
- ViTs: Lack these inductive biases. They treat images as sequences of patches and use self-attention to learn global relationships from the start. They must learn translation equivariance and locality almost entirely from the data.

2. Data Requirements:
- CNNs: Perform well even on moderately sized datasets (e.g., ImageNet with about 1.3M images) due to their built-in priors.
- ViTs: Require massive datasets (e.g., JFT-300M) for pre-training to outperform CNNs. When trained on small data, they easily overfit.

3. Receptive Field:
- CNNs: Receptive field grows gradually layer by layer.
- ViTs: Have a global receptive field from the very first layer, capturing long-range dependencies better than CNNs.
What are the common evaluation metrics used for image segmentation tasks? Define Intersection over Union (IoU) and the Dice Coefficient.
Evaluation Metrics for Image Segmentation:
To measure how well a segmentation algorithm performs, we compare the predicted segmentation mask with the ground truth mask.

1. Intersection over Union (IoU) / Jaccard Index:
- IoU measures the overlap between the predicted mask ($A$) and the ground truth mask ($B$).
- Formula: $\mathrm{IoU} = \dfrac{|A \cap B|}{|A \cup B|}$
- It is the area of overlap divided by the area of union. A score of 1 indicates perfect overlap, and 0 indicates no overlap.

2. Dice Coefficient (F1 Score):
- The Dice coefficient is heavily used in medical image segmentation. It gives twice the weight to the intersection.
- Formula: $\mathrm{Dice} = \dfrac{2\,|A \cap B|}{|A| + |B|}$
- Like IoU, it ranges from 0 to 1.

Relationship: For a single pair of masks the two metrics are monotonically related by $\mathrm{Dice} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so Dice is never lower than IoU. When scores are averaged over many images, IoU tends to penalize single instances of bad classification more than the Dice coefficient.
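Both metrics are a few lines each for binary masks. The sketch below assumes NumPy arrays of equal shape and, as a convention chosen for this example, returns 1.0 when both masks are empty.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union (Jaccard index) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    """Dice coefficient of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0

# Two 8-pixel masks on a 4x4 grid that overlap in one row (4 pixels).
pred = np.zeros((4, 4), dtype=bool)
gt = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
gt[1:3, :] = True

iou_score = iou(pred, gt)      # 4 / 12
dice_score = dice(pred, gt)    # 8 / 16
```

Here the same prediction scores 1/3 on IoU and 1/2 on Dice, which is why reported numbers are only comparable when the metric is the same.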
Explain the Watershed algorithm for image segmentation. What is the common problem associated with it, and how is it resolved?
Watershed Algorithm is a region-based segmentation technique that treats pixel values as a local topography or elevation.

Concept:
- High-intensity pixels are considered "peaks" (ridges), and low-intensity pixels are "valleys" (catchment basins).
- The algorithm simulates flooding the topography from the local minima. As the "water" rises, water from neighboring basins eventually meets at the ridges between them.
- To prevent the water from different basins from mixing, "dams" (watershed lines) are built. These lines define the segmentation boundaries.

Common Problem: Oversegmentation
Because real-world images contain noise and local irregularities, the algorithm finds too many local minima, leading to severe oversegmentation (the image is shattered into hundreds of tiny regions).

Resolution: Marker-controlled Watershed
To solve this, markers are introduced to guide the flooding process:
- Internal markers: Placed inside the objects of interest (using morphology or thresholding).
- External markers: Placed in the background.
- Flooding only begins from these specific, predefined marker locations, effectively preventing oversegmentation.
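Marker-controlled flooding can be sketched with a priority queue that always expands the lowest pixel reached so far. This is a simplified priority-flood variant written for illustration: it uses 4-connectivity, assigns every pixel to one of the markers, and does not draw explicit dam pixels. The function name and the marker convention (0 means unlabeled) are assumptions.

```python
import heapq
import numpy as np

def marker_watershed(img, markers):
    """Flood outward from labeled markers in order of increasing intensity."""
    labels = markers.copy()
    h, w = img.shape
    heap, counter = [], 0                      # counter breaks ties in insertion order
    for y, x in np.argwhere(markers > 0):
        heapq.heappush(heap, (float(img[y, x]), counter, int(y), int(x)))
        counter += 1
    while heap:
        _, _, y, x = heapq.heappop(heap)       # lowest pixel on the current water front
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == 0:
                labels[ny, nx] = labels[y, x]  # the neighbor joins the basin that reached it
                heapq.heappush(heap, (float(img[ny, nx]), counter, ny, nx))
                counter += 1
    return labels

# Two valleys separated by a ridge in the middle column.
img = np.tile(np.array([0, 1, 2, 5, 2, 1, 0], dtype=float), (3, 1))
markers = np.zeros(img.shape, dtype=int)
markers[:, 0] = 1                              # marker for the left basin
markers[:, 6] = 2                              # marker for the right basin
labels = marker_watershed(img, markers)
```

Because flooding starts only from the two markers, the result has exactly two regions no matter how many small local minima the image contains; the ridge column goes to whichever basin reaches it first.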