Unit 5 - Notes

INT345

Unit 5: Color Processing and Range Image Processing

Part 1: Color Processing

Importance of Color in Computer Vision

Color is a powerful visual descriptor that drastically simplifies object identification and extraction from a scene. While early computer vision heavily relied on grayscale images due to computational constraints, modern vision systems utilize color for several critical reasons:

Discriminative Power: Color adds multiple dimensions of data (usually three channels instead of one). Two objects might share the same shape and grayscale intensity but can be easily distinguished by their color (e.g., an apple vs. an orange).
Object Tracking and Segmentation: Algorithms like Mean-Shift or CamShift rely heavily on color histograms to track objects across video frames. Skin-color segmentation is foundational for hand gesture recognition and facial tracking.
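Robustness to Occlusion: Global color features such as color histograms are largely invariant to in-plane rotation and translation, and they change only gradually under partial occlusion, so a partly hidden object can often still be matched.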
Biological Inspiration: Human vision heavily relies on color to distinguish boundaries, gauge material properties, and quickly identify semantic meaning (e.g., a red stop sign).
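
As a small illustration of the histogram property mentioned above, the sketch below builds a coarse color histogram for a tiny synthetic image and for a rotated copy of it. The pixel values, the bin count, and the helper names (`color_histogram`, `rotate_90`) are illustrative assumptions, not part of any library.

```python
from collections import Counter

def color_histogram(image, bins_per_channel=4):
    """Count pixels per quantized (R, G, B) bin; image is rows of 8-bit RGB tuples."""
    step = 256 // bins_per_channel
    hist = Counter()
    for row in image:
        for r, g, b in row:
            hist[(r // step, g // step, b // step)] += 1
    return hist

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

# Hypothetical 2x3 test image
RED, GREEN, BLUE = (220, 30, 30), (30, 200, 40), (20, 40, 210)
image = [
    [RED, RED, GREEN],
    [RED, BLUE, GREEN],
]
rotated = rotate_90(image)

# The pixel layout changes, but the histogram only counts colors
same_histogram = color_histogram(image) == color_histogram(rotated)
print(same_histogram)
```

Because the histogram discards pixel positions, any rearrangement of the same pixels gives the same result. This is also its weakness: two very different scenes can share a histogram.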

Color Models

A color model (or color space) is a mathematical system used to represent colors as tuples of numbers. Different models serve different hardware and processing needs.

RGB (Red, Green, Blue):
- Concept: An additive color model primarily used for digital displays and camera sensors. Colors are created by mixing red, green, and blue light.
- Drawback: High correlation between channels. A change in lighting affects all three channels, making it non-ideal for computer vision tasks like segmentation based strictly on "color."
HSV / HSI / HSL (Hue, Saturation, Value/Intensity/Lightness):
- Concept: Models color closer to human perception by separating chrominance (color information) from luminance (intensity).
- Components:
  - Hue: The dominant color (0-360 degrees).
  - Saturation: The purity or vividness of the color (0-100%).
  - Value/Lightness: The brightness of the color (0-100%).
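- Advantage: More robust to lighting changes than RGB, because brightness is isolated in its own channel. To track a red ball under varying shadows, thresholding is done mainly on the Hue channel, usually combined with a minimum Saturation and Value, since hue becomes unreliable for very dark or washed-out pixels. Red also sits at the wrap-around point of the hue circle (near 0 and 360 degrees), so it typically needs two hue ranges.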
CMYK (Cyan, Magenta, Yellow, Key/Black):
- Concept: A subtractive model used primarily in color printing.
YUV / YCbCr:
- Concept: Used in video broadcasting and image compression (like JPEG). It separates luminance (Y) from chrominance (U/Cb and V/Cr).
- Advantage: Human eyes are more sensitive to brightness (Y) than color. Compression algorithms downsample the color channels to save bandwidth without a noticeable loss in visual quality.
CIE L*a*b*:
- Concept: Designed to be perceptually uniform, meaning a given numerical change corresponds to a similar perceived change in color.
- L*: Lightness, a*: Green to Red, b*: Blue to Yellow.
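
To make the separation of chrominance and luminance concrete, here is a minimal sketch using Python's standard `colorsys` module. The two pixel values are hypothetical, and the "shadow" is idealized as an exact halving of all three channels, which real shadows only approximate.

```python
import colorsys

def rgb8_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation in %, value in %)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s * 100.0, v * 100.0

lit_orange = (200, 120, 40)
shadowed_orange = (100, 60, 20)   # same surface at half the brightness (assumed)

h1, s1, v1 = rgb8_to_hsv(*lit_orange)
h2, s2, v2 = rgb8_to_hsv(*shadowed_orange)
print(f"lit:      H={h1:.1f} S={s1:.1f} V={v1:.1f}")
print(f"shadowed: H={h2:.1f} S={s2:.1f} V={v2:.1f}")
```

All three RGB values change between the two pixels, while in HSV only the Value component changes. That is why a hue-based threshold is easier to tune than an RGB one.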

Conversion Between Color Spaces

Converting between color spaces involves mathematical transformations. In practice, this is crucial for extracting the most relevant features for a given machine learning or computer vision algorithm.

RGB to Grayscale Example:
A common conversion uses weighted averages based on human eye sensitivity (we see green better than blue):
Grayscale = 0.299*R + 0.587*G + 0.114*B
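
A quick worked check of the formula, written as a plain Python function (the name `rgb_to_gray` is only for illustration):

```python
def rgb_to_gray(r, g, b):
    """Weighted sum of one 8-bit RGB pixel, rounded to the nearest integer."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

pure_green = rgb_to_gray(0, 255, 0)
pure_blue = rgb_to_gray(0, 0, 255)
white = rgb_to_gray(255, 255, 255)
print(pure_green, pure_blue, white)
```

Pure green maps to a much lighter gray than pure blue, and because the three weights sum to 1, white stays at 255.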

Practical implementation using OpenCV:

PYTHON

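import cv2

# Load an image; OpenCV returns the channels in BGR order by default
image_bgr = cv2.imread('image.jpg')
if image_bgr is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("Could not read 'image.jpg'")

# Convert BGR to RGB (e.g., before displaying with Matplotlib)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# Convert BGR to HSV for color segmentation
# (for 8-bit images OpenCV stores H as 0-179 and S, V as 0-255)
image_hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)

# Convert BGR to Grayscale
image_gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)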

Color Augmentation Techniques

In modern Deep Learning, training sets must be artificially expanded to prevent models from overfitting to the specific lighting conditions of the training data. Color augmentation slightly alters the color properties of an image.

Color Jittering: Randomly perturbing the brightness, contrast, saturation, and hue of an image within a specified range.
Histogram Equalization: Enhancing contrast by redistributing intensity values so that the histogram is spread more evenly across the available range. Adaptive techniques such as CLAHE equalize local tiles and clip the histogram to limit noise amplification.
PCA Color Augmentation (Fancy PCA): Introduced in the AlexNet paper. It performs Principal Component Analysis on the RGB pixel values throughout the training set and adds multiples of the principal components to each image, scaled by the corresponding eigenvalues and a small random Gaussian factor. This ensures color shifts follow the natural variance of the dataset.
Channel Shuffling: Randomly swapping the R, G, and B channels to force neural networks to rely on edges and shapes rather than specific colors.
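
Color jittering can be sketched with the standard library alone. The version below perturbs hue, saturation, and brightness in HSV space and leaves contrast out for brevity. The ranges, the function names (`sample_jitter`, `apply_jitter`), and the sample image are assumptions for illustration, not the defaults of any framework.

```python
import colorsys
import random

def sample_jitter(rng, max_hue_shift=0.05, max_scale=0.3):
    """Draw one set of jitter parameters shared by every pixel of an image."""
    return {
        "hue_shift": rng.uniform(-max_hue_shift, max_hue_shift),
        "sat_scale": rng.uniform(1 - max_scale, 1 + max_scale),
        "val_scale": rng.uniform(1 - max_scale, 1 + max_scale),
    }

def apply_jitter(image, params):
    """Jitter an image given as rows of (r, g, b) floats in [0, 1]."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            h = (h + params["hue_shift"]) % 1.0       # hue is circular
            s = min(1.0, s * params["sat_scale"])     # keep within range
            v = min(1.0, v * params["val_scale"])
            new_row.append(colorsys.hsv_to_rgb(h, s, v))
        out.append(new_row)
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
image = [[(0.8, 0.2, 0.2), (0.2, 0.6, 0.3)],
         [(0.1, 0.1, 0.7), (0.9, 0.9, 0.9)]]
params = sample_jitter(rng)
augmented = apply_jitter(image, params)
print(params)
```

The parameters are drawn once per image rather than once per pixel, so the whole image shifts consistently, as it would under a real change in lighting or camera settings.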

Color Constancy

Color Constancy is the ability to perceive the intrinsic color of an object regardless of the color of the light source illuminating it. For example, a white piece of paper looks white under a yellow tungsten bulb and under blue sky, even though the light hitting the camera sensor is vastly different.

Computer Vision algorithms attempt to mimic this biological mechanism to remove color cast (white balancing).

Gray World Assumption: Assumes that the average color of an entire scene is neutral gray. If the average color of an image is skewed toward red, the algorithm infers that the scene is under reddish illumination and scales each channel inversely to balance it.
- Formula: $R_{new} = R \times \frac{Gray}{R_{avg}}$, $G_{new} = G \times \frac{Gray}{G_{avg}}$, $B_{new} = B \times \frac{Gray}{B_{avg}}$, where $R_{avg}$, $G_{avg}$, $B_{avg}$ are the channel means and $Gray$ is the target gray level, commonly taken as $\frac{R_{avg} + G_{avg} + B_{avg}}{3}$.
White Patch (Max-RGB) Retinex: Assumes that the brightest response in each channel corresponds to a perfectly reflecting white surface. The algorithm scales each channel so that its maximum value becomes full scale (255 for 8-bit images), which maps that surface to pure white.
Gamut Mapping: A more complex algorithm that restricts the observed colors (image gamut) to a known set of colors under standard illumination (canonical gamut).
Deep Learning Approaches: Convolutional Neural Networks (CNNs) can be trained to estimate the illuminant of a scene directly and apply the necessary chromatic adaptation matrix to correct the image.
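
The two simplest methods above, Gray World and White Patch, can be sketched in a few lines. The example works on a flat list of RGB pixels and assumes that the gray target is the mean of the three channel averages and that no channel is entirely zero. The pixel values are made up to simulate a reddish cast.

```python
def channel_means(pixels):
    """Mean of each channel over a list of (R, G, B) pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def gray_world(pixels):
    """Scale each channel so all three channel means equal their common average."""
    r_avg, g_avg, b_avg = channel_means(pixels)
    gray = (r_avg + g_avg + b_avg) / 3.0          # assumed gray target
    gains = (gray / r_avg, gray / g_avg, gray / b_avg)
    # Clip to the 8-bit range in case a gain pushes a value past 255
    return [tuple(min(255.0, p[c] * gains[c]) for c in range(3)) for p in pixels]

def white_patch(pixels):
    """Scale each channel so that its maximum value maps to 255."""
    maxima = tuple(max(p[c] for p in pixels) for c in range(3))
    return [tuple(p[c] * 255.0 / maxima[c] for c in range(3)) for p in pixels]

# Hypothetical pixels with a reddish color cast
pixels = [(200, 120, 100), (150, 90, 70), (100, 60, 50), (50, 30, 20)]

balanced = gray_world(pixels)
stretched = white_patch(pixels)
print(channel_means(pixels))
print(channel_means(balanced))
```

After Gray World the three channel means coincide, which removes the cast if the scene really does average to gray. A scene dominated by one color (a lawn, a sunset) violates that assumption and will be over-corrected.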

Part 2: Range Image Processing

Introduction to Range Images

A Range Image (also known as a depth map or depth image) is a 2D array of pixels where each pixel value represents the physical distance (depth) from the sensor to the surface of an object in the scene. Instead of capturing visual intensity or color, range images capture 3D geometric properties from a specific viewpoint. They are typically visualized in grayscale, where darker pixels are closer and lighter pixels are farther away, or via a false-color heat map.
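
The grayscale visualization described above amounts to a linear rescaling of depth. A minimal sketch, assuming depth in meters and a hypothetical display range of 0.5 m to 4.0 m (both values and the function name are illustrative):

```python
def depth_to_gray(depth_m, near=0.5, far=4.0):
    """Map metric depth to 8-bit gray: `near` -> 0 (dark), `far` -> 255 (light)."""
    out = []
    for row in depth_m:
        gray_row = []
        for z in row:
            z_clamped = min(max(z, near), far)   # clamp to the display range
            gray_row.append(round(255 * (z_clamped - near) / (far - near)))
        out.append(gray_row)
    return out

# Hypothetical 2x3 depth map in meters
depth_m = [[0.5, 1.0, 2.25],
           [4.0, 6.0, 0.2]]
gray = depth_to_gray(depth_m)
print(gray)
```

Values outside the chosen range saturate at black or white, so the range should be picked to match the working distance of the sensor.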

Difference Between 2D Intensity Images and 3D Range Images

Feature	2D Intensity Image	3D Range Image (Depth Map)
Pixel Value Meaning	Amount of reflected light (brightness/color).	Physical distance (Z-depth) from the sensor.
Invariance to Illumination	Highly sensitive to shadows, lighting, and reflections.	Largely insensitive to ambient lighting and shadows, although strong infrared sources such as sunlight can degrade active sensors.
Invariance to Texture	Sensitive to surface patterns (a printed photo of a face looks like a real face).	Ignores 2D texture (a printed photo of a face is totally flat).
Dimensionality	Captures X, Y space.	Captures X, Y, and Z (depth) space.
Processing Goal	Edge detection, object recognition, texture analysis.	Surface normal estimation, 3D reconstruction, obstacle avoidance.

Active Range Sensors

Range images are typically generated using active sensors, which emit energy (light or sound) and measure the reflection, as opposed to passive sensors (like standard cameras) that only receive ambient light.

Time-of-Flight (ToF) Cameras:
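- Mechanism: Emits an infrared light pulse and measures the time it takes for the light to travel to the object and bounce back to the sensor (many cameras instead measure the phase shift of continuously modulated light). With $t$ the round-trip time and $c$ the speed of light, distance is calculated as $d = \frac{c \times t}{2}$; the division by 2 accounts for the light covering the distance twice.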
- Pros/Cons: High frame rates, good for indoor use. Can suffer from multi-path interference (light bouncing off multiple surfaces).
LiDAR (Light Detection and Ranging):
- Mechanism: Uses a rotating laser beam to scan the environment, calculating distance via Time-of-Flight or phase shift. Generates a highly accurate 3D point cloud, which can be projected into a 2D range image.
- Pros/Cons: Very high accuracy, long-range, works outdoors. Historically expensive and mechanically complex (though solid-state LiDAR is changing this).
Structured Light (e.g., Microsoft Kinect v1, Apple FaceID):
- Mechanism: Projects a known infrared pattern (like a grid or dots) onto the scene. A secondary infrared camera views the pattern. The deformation of the pattern as it strikes 3D objects is used to calculate depth via triangulation.
- Pros/Cons: Excellent for near-field, indoor scanning. Struggles outdoors due to interference from the sun's infrared light.
Radar and Sonar: Use radio waves or sound waves, respectively. Useful for very long-range or underwater applications, though they typically yield much lower resolution than optical methods.
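
To get a feel for the numbers behind the Time-of-Flight relation $d = \frac{c \times t}{2}$, here is a short worked example. The round-trip time of 20 ns and the 1 cm resolution target are arbitrary illustrative values.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def tof_distance_m(round_trip_s):
    """Distance to the target from the measured round-trip time."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0

def round_trip_time_s(distance_m):
    """Round-trip time light needs to cover a given one-way distance."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S

d = tof_distance_m(20e-9)          # 20 ns round trip
dt = round_trip_time_s(0.01)       # extra time caused by 1 cm of extra depth
print(f"{d:.3f} m, {dt * 1e12:.1f} ps")
```

A 20 ns round trip corresponds to roughly 3 m, and a depth difference of 1 cm changes the round-trip time by only tens of picoseconds. This is why ToF sensors need very precise timing or phase measurement.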

Preprocessing of Range Data

Raw range data is notoriously noisy and prone to specific types of errors not found in standard cameras. Preprocessing is vital before extracting higher-level features.

Handling Missing Data (Dropouts): Some surfaces (like black objects, transparent glass, or highly reflective mirrors) absorb, transmit, or deflect the sensor's emitted light, so too little signal returns to the sensor. The result is "holes" in the depth map, commonly stored as NaN (Not a Number) or zero.
- Solution: Morphological closing operations, interpolation, or inpainting algorithms are used to estimate the missing depth based on neighboring pixels.
Noise Filtering: Range images often suffer from Gaussian noise or "flying pixels" (artifacts found at the sharp boundaries between foreground and background objects).
- Median Filtering: Excellent for removing salt-and-pepper noise and spikes without blurring sharp depth edges.
- Bilateral Filtering: Smooths flat surfaces while preserving depth discontinuities (edges), vital for keeping object boundaries sharp.
Coordinate Transformation: The 2D depth map (depth indexed by pixel coordinates) is converted into a true 3D point cloud using the camera's intrinsic matrix (focal length and optical center).
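
The preprocessing steps above can be sketched end to end on a toy depth map: a hole-aware 3x3 median filter followed by pinhole back-projection. This is a simplified sketch under stated assumptions: depth is the Z distance along the optical axis, dropouts are stored as 0.0 or NaN, and the intrinsics `fx`, `fy`, `cx`, `cy` are made-up values. The function names are illustrative only.

```python
import math
import statistics

def fill_and_median(depth, hole=0.0):
    """3x3 median over valid neighbors; holes/NaNs are ignored and filled if possible."""
    rows, cols = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for v in range(rows):
        for u in range(cols):
            window = [
                depth[j][i]
                for j in range(max(0, v - 1), min(rows, v + 2))
                for i in range(max(0, u - 1), min(cols, u + 2))
                if depth[j][i] != hole and not math.isnan(depth[j][i])
            ]
            if window:
                # median_low always returns an observed depth, never an average
                out[v][u] = statistics.median_low(window)
    return out

def back_project(depth, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with depth Z -> (X, Y, Z) in the camera frame."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0 and not math.isnan(z):
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

depth = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],   # 0.0 marks a dropout
    [1.0, 9.0, 1.0],   # 9.0 is an isolated spike
]
cleaned = fill_and_median(depth)
cloud = back_project(cleaned, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(cleaned)
print(cloud[0], len(cloud))
```

Picking an observed value with `median_low` avoids inventing intermediate depths at object boundaries, which is one source of "flying pixels". A production pipeline would typically use vectorized library routines rather than Python loops.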

Applications of Range Data

Autonomous Vehicles & ADAS: LiDAR and ToF sensors generate range maps to detect obstacles, pedestrians, and road boundaries in real time, largely independent of daytime or nighttime lighting.
Robotics and Automation: Used for robotic grasping (calculating the exact physical shape and orientation of a part) and Simultaneous Localization and Mapping (SLAM) for robot navigation.
Biometrics & Security: Facial recognition systems (like FaceID) use structured light range maps to prevent spoofing with 2D photographs.
Augmented Reality (AR): Depth maps allow digital objects to realistically interact with the physical world, including proper occlusion (e.g., a virtual character hiding behind a real desk).
3D Reconstruction: Combining range images from multiple viewpoints to create accurate 3D models for reverse engineering, medical imaging, or digital twin creation.
