Unit 6 - Notes

INT345

Unit 6: Image Segmentation and Advanced Topics in Computer Vision

Part 1: Image Segmentation

1. Overview of Image Segmentation

Image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, sometimes called image regions or, when the regions are small and uniform, superpixels). The goal is to simplify or change the representation of an image into something that is more meaningful and easier to analyze. It is heavily used in medical imaging, autonomous driving, and object detection.

Types of Segmentation:

Semantic Segmentation: Classifies each pixel into a predefined category (e.g., person, car, background). It does not distinguish between different instances of the same object.
Instance Segmentation: Identifies each distinct instance of an object (e.g., Car 1, Car 2, Car 3) and segments them individually.
Panoptic Segmentation: A hybrid of semantic and instance segmentation. It assigns a class label to every pixel (semantic) and uniquely identifies each instance of countable objects (instance).
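The difference between the three types is easiest to see on a tiny label map. The sketch below uses a made-up 4x6 scene with two cars; the class ids and the class_id * 1000 + instance_id packing for the panoptic map are illustrative assumptions, not a required format.

```python
import numpy as np

# Hypothetical 4x6 scene: class 0 = background, class 1 = car (two separate cars).
semantic = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance map: 0 = no instance, 1 = first car, 2 = second car.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Panoptic map: every pixel carries both a class and an instance id.
# The class_id * 1000 + instance_id packing is an assumption for illustration.
panoptic = semantic * 1000 + instance

print(np.unique(semantic))   # classes present in the scene
print(np.unique(panoptic))   # one id per (class, instance) segment
```

The semantic map cannot tell the two cars apart (both are class 1), while the panoptic map keeps the background label and still separates the two cars.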

2. Thresholding Based Segmentation

Thresholding is the simplest method of image segmentation. It converts a grayscale image into a binary image based on a threshold value.

Global Thresholding: A single threshold value is used for the entire image. If a pixel's intensity is greater than the threshold, it is set to the foreground (e.g., white); otherwise, it is set to the background (e.g., black).
Adaptive/Local Thresholding: Calculates thresholds for small regions of the image. This is highly effective for images with varying illumination conditions.
Otsu’s Method: An algorithm that automatically determines the optimal global threshold by maximizing the between-class (inter-class) variance of the background and foreground pixel intensities, which is equivalent to minimizing the within-class (intra-class) variance.

PYTHON

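# Example of global and Otsu's thresholding using OpenCV
import cv2

# Read in grayscale; imread returns None if the file cannot be loaded
image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)
if image is None:
    raise FileNotFoundError("Could not read 'image.jpg'")

# Global thresholding: pixels above 127 become 255, the rest become 0
ret, thresh_global = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)

# Otsu's thresholding: the threshold argument (0) is ignored and the
# automatically computed value is returned in ret_otsu
ret_otsu, thresh_otsu = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)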
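To make the "maximize the between-class variance" idea concrete, here is a minimal NumPy sketch of Otsu's search over all 256 candidate thresholds. The function name otsu_threshold and the synthetic test image are assumptions for illustration; in practice the OpenCV call above does the same job.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity <= t form one class, pixels > t the other.
    Expects an 8-bit grayscale image.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega0 = np.cumsum(prob)             # weight of the low class for each t
    mu_cum = np.cumsum(prob * levels)    # cumulative intensity mean up to t
    mu_total = mu_cum[-1]
    omega1 = 1.0 - omega0                # weight of the high class

    # Between-class variance; left at 0 where one class is empty.
    denom = omega0 * omega1
    sigma_b = np.zeros(256)
    valid = denom > 0
    sigma_b[valid] = (mu_total * omega0[valid] - mu_cum[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

# Synthetic bimodal image: dark background (50) with a bright square (200).
image = np.full((64, 64), 50, dtype=np.uint8)
image[16:48, 16:48] = 200

t = otsu_threshold(image)
mask = image > t          # foreground = pixels above the threshold
print(t, int(mask.sum()))
```

On this two-level image any threshold between the two intensities separates the classes equally well, so the search returns the first such value.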

3. Edge-Based Segmentation

Edge-based segmentation relies on the detection of edges (boundaries where there is a sharp change in pixel intensity) to identify objects.

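Edge Detectors: Operators such as Sobel and Prewitt approximate the image gradient with small convolution kernels; Canny builds on the gradient to produce thin, connected edges.
Canny Edge Detection: A widely used multi-stage algorithm. It involves Gaussian blurring to suppress noise, gradient calculation, non-maximum suppression (thinning edges to one-pixel width), and hysteresis thresholding (keeping weak edge pixels only when they connect to strong ones).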
Edge Linking: Once edges are detected, they are often disjointed. Algorithms are used to link these edges into continuous, closed boundaries to successfully segment the object.
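The gradient step shared by these detectors can be sketched in a few lines of NumPy. This is only the Sobel gradient followed by a single threshold, not the full Canny pipeline (no blurring, non-maximum suppression, or hysteresis); the helper names filter3x3 and sobel_edges are hypothetical.

```python
import numpy as np

def filter3x3(img, kernel):
    """Slide a 3x3 kernel over img (cross-correlation, valid region only)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def sobel_edges(gray, thresh):
    """Return a boolean edge map where the gradient magnitude exceeds thresh."""
    g = gray.astype(np.float64)
    gx = filter3x3(g, SOBEL_X)     # horizontal intensity change
    gy = filter3x3(g, SOBEL_Y)     # vertical intensity change
    magnitude = np.hypot(gx, gy)
    return magnitude > thresh

# Synthetic image: left half dark, right half bright -> one vertical edge.
image = np.zeros((8, 8), dtype=np.uint8)
image[:, 4:] = 255
edges = sobel_edges(image, thresh=100)
print(edges.astype(int))
```

Because only the valid region is computed, the edge map is two pixels smaller than the input in each dimension. The edge shows up as a two-pixel-wide band, which is exactly what non-maximum suppression in Canny is meant to thin.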

4. Region-Based Segmentation

Instead of finding boundaries, region-based methods group pixels together based on similarity criteria (e.g., color, intensity, or texture).

Region Growing: Starts with a set of "seed" points. The algorithm grows regions by appending neighboring pixels that have properties similar to the seed (e.g., intensity difference falls within a specified tolerance).
Region Splitting and Merging:
- Splitting: The entire image is initially treated as one region. If a region does not meet the homogeneity criterion, it is divided into four quadrants, and the test is applied recursively to each quadrant (a quadtree decomposition).
- Merging: Adjacent regions that meet the homogeneity criteria are merged back together.
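Region growing in particular maps directly onto a breadth-first search. The following is a minimal sketch under stated assumptions: a single seed, 4-connectivity, and similarity measured against the seed intensity (comparing against the running region mean is another common choice). The function name region_grow is hypothetical.

```python
from collections import deque
import numpy as np

def region_grow(gray, seed, tol):
    """Grow a region from seed, adding 4-connected neighbors whose
    intensity differs from the seed intensity by at most tol."""
    h, w = gray.shape
    seed_val = int(gray[seed])            # int() avoids uint8 wrap-around
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(int(gray[ny, nx]) - seed_val) <= tol:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask

image = np.full((10, 10), 20, dtype=np.uint8)
image[2:6, 3:8] = 180          # bright 4x5 object
image[8, 8] = 185              # similar intensity, but not connected to it
region = region_grow(image, seed=(3, 4), tol=10)
print(int(region.sum()))
```

The isolated pixel at (8, 8) passes the intensity test but is never reached, which illustrates that region growing enforces connectivity as well as similarity.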

5. Clustering-Based Segmentation

Clustering algorithms partition the image data into clusters based on feature similarity. In this context, features can be pixel intensities, RGB values, or spatial coordinates.

K-Means Clustering:
1. Select $K$ random cluster centers.
2. Assign each pixel to the nearest cluster center based on Euclidean distance in the feature space.
3. Recalculate the cluster centers as the mean of the assigned pixels.
4. Repeat until convergence.
Mean Shift: A non-parametric, density-based clustering algorithm. It works by iteratively shifting each data point toward the nearest mode (a local maximum of the data density) in the feature space. It preserves edge information well while smoothing uniform regions, and it does not require the number of clusters ( $K$ ) a priori, although it does require a bandwidth (window size) parameter.
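The four K-Means steps listed above can be sketched with NumPy as follows. This is a minimal illustration that uses RGB values as features; the function name kmeans, the fixed seed, and the iteration cap are assumptions, and library implementations add better initialization and stopping rules.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (N, D) float feature array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 2: assign each sample to the nearest center (Euclidean distance).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned samples.
        new_centers = centers.copy()
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic image: left half reddish, right half bluish.
image = np.zeros((6, 6, 3), dtype=np.uint8)
image[:, :3] = (250, 10, 10)
image[:, 3:] = (10, 10, 250)

pixels = image.reshape(-1, 3).astype(np.float64)   # one RGB feature vector per pixel
labels, centers = kmeans(pixels, k=2)
segmented = labels.reshape(image.shape[:2])
print(segmented)
```

Which half receives label 0 depends on the random initialization; only the grouping is meaningful, not the label values themselves.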

Part 2: Advanced Topics in Computer Vision

6. Generative Models

Generative models learn the underlying data distribution of the training set to generate new, synthetic data that resembles the original.

Variational Autoencoders (VAEs): Consist of an encoder that maps an image to a probability distribution over a latent space, and a decoder that reconstructs the image from a sample drawn from that distribution. They are optimized using a reconstruction loss plus a Kullback-Leibler (KL) divergence term that keeps the latent distribution close to a prior (typically a standard Gaussian).
Generative Adversarial Networks (GANs): Comprise two neural networks: a Generator (creates fake images) and a Discriminator (tries to distinguish real from fake). They are trained together in a minimax game. Examples include StyleGAN and CycleGAN.
Diffusion Models: Widely regarded as state of the art for image generation. They work by gradually adding Gaussian noise to an image (forward process) and training a neural network (usually a U-Net) to reverse this process and denoise the image step by step (reverse process).
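The forward (noising) process of a diffusion model has a closed form, which makes it easy to sketch. The example below assumes a DDPM-style linear noise schedule with 1000 steps; the schedule values, the function names, and the random stand-in image are illustrative assumptions. The learned reverse process is not shown.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t given x_0 in one shot; also return the noise that was added.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(32, 32))   # stand-in for a normalized image

x_early, _ = q_sample(x0, t=10, alpha_bar=alpha_bar, rng=rng)
x_late, _ = q_sample(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Correlation with the clean image drops as t grows.
corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
print(corr_early, corr_late)
```

During training, the network would typically be given x_t and t and asked to predict the returned noise, which is what lets it run the process in reverse at generation time.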

7. Vision Transformers (ViT)

Introduced by Dosovitskiy et al. (2020), the Vision Transformer adapts the Transformer architecture (originally designed for Natural Language Processing) for image classification, challenging traditional Convolutional Neural Networks (CNNs).

Mechanism:
1. The image is split into fixed-size patches (e.g., 16x16 pixels).
2. These patches are linearly embedded into flat vectors (tokens).
3. Positional embeddings are added to retain spatial information.
4. The sequence of tokens is passed through standard Transformer encoder blocks using self-attention, which lets every patch attend to every other patch. The model can therefore capture global dependencies across the entire image from the first layer, unlike CNNs, whose local receptive fields grow only gradually with depth.
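The four steps of the mechanism can be traced with plain NumPy to see the tensor shapes involved. This is a shape-level sketch with random weights, not a trained model: it uses a single attention head and omits the class token, layer normalization, and MLP blocks of a real ViT. The embedding size of 64 and all function names are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (N, N): each patch vs. every patch
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
dim = 64                                        # small embedding size for illustration

patches = patchify(image, patch=16)             # step 1: 14 x 14 = 196 patches
w_embed = rng.normal(0, 0.02, (patches.shape[1], dim))
tokens = patches @ w_embed                      # step 2: linear embedding
pos_embed = rng.normal(0, 0.02, (tokens.shape[0], dim))
tokens = tokens + pos_embed                     # step 3: add positional embeddings

wq, wk, wv = (rng.normal(0, 0.02, (dim, dim)) for _ in range(3))
out, attn = self_attention(tokens, wq, wk, wv)  # step 4: one attention layer
print(patches.shape, out.shape, attn.shape)
```

The (196, 196) attention matrix is the part that gives global context: row i holds the weights patch i assigns to every patch in the image, including distant ones.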

8. Multimodal Vision Models

Multimodal models process and relate information from multiple data modalities, such as vision (images/video), text, audio, and sensor data (e.g., LiDAR).

Core Concept: Learning a joint representation space where corresponding inputs from different modalities are mapped close together.
Applications: Audio-visual speech recognition, emotion detection (combining facial expressions with tone of voice), and embodied AI (where robots use vision, touch, and audio to navigate environments).

9. Large Vision Models (LVMs)

Following the success of Large Language Models (LLMs), Large Vision Models are very large neural networks trained on massive image datasets, often with self-supervised learning so that little or no manual labeling is needed.

Self-Supervised Pre-training: Often uses Masked Image Modeling (MIM), where patches of an image are masked, and the model must predict the missing content (similar to Masked Language Modeling in BERT).
Segment Anything Model (SAM): A prominent example of a vision foundation model by Meta. It is a promptable segmentation system capable of zero-shot segmentation (segmenting objects it has never explicitly seen during training) based on prompts such as point clicks and bounding boxes; text prompts were explored in the original paper as a proof of concept.
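The masking step of Masked Image Modeling is simple to sketch. The example below hides a random subset of patches and computes a reconstruction loss on the hidden patches only; the 0.75 mask ratio, the loss restricted to masked patches (as in masked autoencoders), the all-zeros placeholder prediction, and the function names are assumptions for illustration.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
patches = rng.random((196, 768))        # stand-in for flattened 16x16x3 image patches
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

visible = patches[~mask]                # what the encoder would see
pred = np.zeros_like(patches)           # placeholder for a model's reconstruction
loss = masked_mse(pred, patches, mask)
print(visible.shape, int(mask.sum()), loss)
```

No labels appear anywhere in this setup: the image itself supplies the training target, which is what makes the approach self-supervised.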

10. Vision-Language Models (VLMs)

VLMs are a specific, highly impactful subset of multimodal models that bridge computer vision and natural language processing.

Contrastive Language-Image Pretraining (CLIP): Developed by OpenAI, CLIP trains an image encoder (ResNet or ViT) and a text encoder jointly on hundreds of millions of (image, text) pairs (about 400 million in the original work). It uses a contrastive loss function to maximize the cosine similarity between the embeddings of matching image-text pairs while minimizing it for mismatched pairs within the same batch.
Capabilities:
- Zero-Shot Image Classification: CLIP can classify images into categories it was not explicitly trained on by comparing the image embedding with the embeddings of text prompts such as "a photo of a [class]" and choosing the closest match.
- Visual Question Answering (VQA): Answering natural language questions about the contents of an image.
- Image Captioning: Generating a descriptive text for a given image.
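- Text-to-Image Generation Guidance: The text-image alignment learned by VLMs is used to condition or guide text-to-image generators so that the generated image matches the text prompt. Stable Diffusion, for example, conditions its denoiser on embeddings from a CLIP text encoder; DALL-E and Midjourney are other well-known systems in this category.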
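The contrastive objective and zero-shot classification described in this section can be sketched on precomputed embeddings. The code below assumes the encoders already exist and works on random stand-in vectors; the fixed temperature of 0.07 (CLIP learns this value during training) and the function names are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) scaled cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # text -> image
    return float((loss_i2t + loss_t2i) / 2)

def zero_shot_classify(image_emb, class_text_emb):
    """For each image, pick the class prompt with the highest cosine similarity."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 32))                      # stand-in caption embeddings
aligned = text_emb + 0.01 * rng.normal(size=(4, 32))     # images close to their captions
shuffled = aligned[::-1]                                 # deliberately mismatched pairs

loss_aligned = clip_contrastive_loss(aligned, text_emb)
loss_shuffled = clip_contrastive_loss(shuffled, text_emb)
preds = zero_shot_classify(aligned, text_emb)
print(loss_aligned, loss_shuffled, preds)
```

The loss is low when matching pairs sit on the diagonal of the similarity matrix and high when they do not, and the same similarity computation doubles as a classifier once the text rows are class prompts.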
