Unit5 - Subjective Questions
INT345 • Practice Questions with Detailed Answers
Discuss the importance of color in computer vision applications. How does it enhance object recognition compared to grayscale images?
Importance of Color in Computer Vision:
Color is a fundamental attribute of the visual world that provides critical information for many computer vision tasks. Its importance can be summarized as follows:
- Enhanced Feature Extraction: Color provides an extra dimension of information (usually 3 channels instead of 1), making it easier to segment images, track objects, and extract distinguishing features.
- Robust Object Recognition: Many objects (e.g., a stop sign, an apple, a specific uniform) are heavily characterized by their color. Color helps differentiate objects that might have identical shapes or textures but different colors.
- Material and Surface Analysis: Variations in color and shading provide clues about material properties, lighting conditions, and surface geometries.
- Human-Computer Interaction: For applications mimicking human perception, processing color is essential because human vision relies heavily on chromatic information to interpret scenes.
Advantage over Grayscale:
While grayscale images rely solely on intensity (brightness) gradients to detect edges and shapes, color images allow algorithms to detect "color edges": boundaries where the intensity might be identical but the hue or saturation changes sharply. This resolves ambiguities that frequently occur in purely grayscale processing.
Explain the RGB and HSV color models. Why is the HSV color space often preferred over RGB in computer vision tasks like object tracking?
RGB Color Model:
- The RGB (Red, Green, Blue) model is an additive color model primarily used for displaying images on screens.
- Colors are created by combining varying intensities of red, green, and blue light.
- It is represented as a cube in a 3D Cartesian coordinate system, with one axis each for the R, G, and B values (typically 0-255 per channel for 8-bit images).
HSV Color Model:
- The HSV (Hue, Saturation, Value) model represents color in a way that aligns closer to human perception.
- Hue: Represents the color type (e.g., red, blue, yellow) as an angle from 0 to 360 degrees on a color circle.
- Saturation: Represents the purity or intensity of the color (0% is gray, 100% is the pure color).
- Value: Represents the brightness of the color (0% is black, 100% is full brightness).
Why HSV is Preferred in Computer Vision:
- Illumination Robustness: In the RGB model, a change in lighting intensity affects all three channels (R, G, and B) simultaneously. In HSV, a change in brightness mainly affects the 'Value' channel, while Hue stays comparatively stable. This makes it much easier to isolate colors under varying lighting conditions by thresholding the 'Hue' channel.
- Intuitive Color Specification: It is easier for humans to specify a color range (e.g., "find all green objects") by selecting a range of Hue values, rather than guessing the correct combinations of R, G, and B.
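The illumination argument can be illustrated with a small sketch using Python's standard `colorsys` module. The sample color, the dimming factor, and the hue range for "green" are arbitrary assumptions chosen for illustration. Real lighting changes are rarely a pure scaling of the RGB values, so hue is only approximately stable in practice.

```python
import colorsys

def dim(rgb, factor):
    """Simulate darker lighting by scaling all three channels equally."""
    return tuple(channel * factor for channel in rgb)

bright = (0.2, 0.8, 0.4)      # an arbitrary green, RGB in [0, 1]
dark = dim(bright, 0.5)       # the same surface under half the light

h_bright, s_bright, v_bright = colorsys.rgb_to_hsv(*bright)
h_dark, s_dark, v_dark = colorsys.rgb_to_hsv(*dark)

# All three RGB channels changed, but in HSV essentially only V changed.
print("RGB:  ", bright, "->", dark)
print("Hue:  ", round(h_bright * 360, 1), "->", round(h_dark * 360, 1))
print("Value:", v_bright, "->", v_dark)

def is_green(rgb, hue_range=(90.0, 150.0)):
    """Threshold on hue only; the range is a hypothetical choice for 'green'."""
    hue_deg = colorsys.rgb_to_hsv(*rgb)[0] * 360.0
    return hue_range[0] <= hue_deg <= hue_range[1]
```

Both `is_green(bright)` and `is_green(dark)` pass the same hue test, whereas an equivalent RGB test would need different thresholds for each lighting level.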
Describe the CMY and CMYK color models. What is the mathematical relationship between RGB and CMY?
CMY and CMYK Color Models:
- CMY (Cyan, Magenta, Yellow): This is a subtractive color model used primarily in color printing. Unlike RGB, which starts with black and adds light, CMY starts with white (e.g., paper) and subtracts light through ink. Cyan absorbs red, Magenta absorbs green, and Yellow absorbs blue.
- CMYK (Cyan, Magenta, Yellow, Key/Black): In practice, mixing pure C, M, and Y inks rarely produces a perfect, deep black; it usually results in a muddy brown. Therefore, a fourth "Key" (Black) ink is added to produce true blacks, improve image contrast, and save on expensive colored inks.
Mathematical Relationship (RGB to CMY):
Assuming RGB values are normalized to the range $[0, 1]$, the conversion to CMY is a simple subtraction from 1:
$$C = 1 - R, \quad M = 1 - G, \quad Y = 1 - B$$
Conversely, converting back from CMY to RGB is:
$$R = 1 - C, \quad G = 1 - M, \quad B = 1 - Y$$
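As a minimal sketch, the subtraction from 1 in both directions looks like this in Python. The function names are illustrative, and inputs are assumed to be already normalized to [0, 1].

```python
def rgb_to_cmy(r, g, b):
    """Convert normalized RGB in [0, 1] to CMY by subtracting each channel from 1."""
    return (1.0 - r, 1.0 - g, 1.0 - b)

def cmy_to_rgb(c, m, y):
    """Invert the conversion: each RGB channel is 1 minus the matching CMY value."""
    return (1.0 - c, 1.0 - m, 1.0 - y)

red = (1.0, 0.0, 0.0)
print(rgb_to_cmy(*red))   # (0.0, 1.0, 1.0)
```

Pure red needs no cyan ink (cyan absorbs red) and full magenta and yellow, which matches the subtractive description above.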
Derive or state the standard mathematical formula for converting an RGB image to a grayscale intensity image. Explain the significance of the specific weights used in the formula.
Conversion Formula:
To convert a true-color RGB image to a grayscale intensity image, a weighted sum of the Red, Green, and Blue channels is calculated for each pixel. The standard (ITU-R BT.601) formula is:
$$Y = 0.299\,R + 0.587\,G + 0.114\,B$$
Where:
- $Y$ is the resulting grayscale luminance (intensity).
- $R$, $G$, $B$ are the respective red, green, and blue pixel values.
Significance of the Weights:
The weights are not arbitrary. They sum to $1.0$, so a neutral pixel with equal R, G, and B values keeps the same intensity, and their relative sizes are chosen based on human visual perception.
- The human eye's retina contains different types of cone cells (L, M, S) that are sensitive to long, medium, and short wavelengths.
- Human vision is overwhelmingly most sensitive to green light, moderately sensitive to red light, and least sensitive to blue light.
- Therefore, the Green channel is given the highest weight ($0.587$), Red gets the middle weight ($0.299$), and Blue gets the lowest weight ($0.114$). This ensures that the resulting grayscale image matches the perceived brightness of the original color image as seen by a human.
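A minimal per-pixel sketch of the weighted sum, using the weights 0.299, 0.587, and 0.114 quoted above. The function name is illustrative, and a real pipeline would apply this to whole image arrays rather than single pixels.

```python
def rgb_to_gray(r, g, b):
    """Weighted sum of the channels with the BT.601 weights quoted in the text."""
    return 0.299 * r + 0.587 * g + 0.114 * b

# Pure green contributes far more to perceived brightness than pure blue.
print(rgb_to_gray(0, 255, 0))
print(rgb_to_gray(0, 0, 255))
```

Because the weights sum to 1, a white pixel (255, 255, 255) stays at essentially 255 after conversion.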
Detail the algorithmic steps to convert a pixel from the RGB color space to the HSV color space.
RGB to HSV Conversion Algorithm:
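Given an RGB color $(R, G, B)$ where values are normalized to the range $[0, 1]$:
Step 1: Find the maximum and minimum RGB values.
$$C_{max} = \max(R, G, B), \quad C_{min} = \min(R, G, B), \quad \Delta = C_{max} - C_{min}$$
Step 2: Calculate Hue (H).
Hue represents the angle on the color wheel $[0^\circ, 360^\circ)$.
- If $\Delta = 0$, then $H = 0^\circ$ (the color is a shade of gray).
- If $C_{max} = R$, then $H = 60^\circ \times \frac{G - B}{\Delta}$
- If $C_{max} = G$, then $H = 60^\circ \times \left(\frac{B - R}{\Delta} + 2\right)$
- If $C_{max} = B$, then $H = 60^\circ \times \left(\frac{R - G}{\Delta} + 4\right)$
(Note: If $H < 0$, add $360^\circ$ to make it positive.)
Step 3: Calculate Saturation (S).
Saturation represents color purity, range $[0, 1]$.
- If $C_{max} = 0$, then $S = 0$
- If $C_{max} \neq 0$, then $S = \frac{\Delta}{C_{max}}$
Step 4: Calculate Value (V).
Value represents brightness, range $[0, 1]$.
$$V = C_{max}$$
The resulting tuple $(H, S, V)$ represents the color in the HSV color space.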
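The four steps above can be sketched as a single Python function. The names `c_max`, `c_min`, and `delta` are illustrative, and hue is returned in degrees. The last lines cross-check one color against the standard library's `colorsys.rgb_to_hsv`, which returns hue as a fraction of a full turn rather than in degrees.

```python
import colorsys

def rgb_to_hsv(r, g, b):
    """Convert normalized RGB in [0, 1] to (H in degrees, S, V)."""
    # Step 1: extremes and their difference
    c_max = max(r, g, b)
    c_min = min(r, g, b)
    delta = c_max - c_min

    # Step 2: hue, as an angle on the color wheel
    if delta == 0:
        h = 0.0                                   # gray: hue is undefined, use 0
    elif c_max == r:
        h = 60.0 * ((g - b) / delta)
    elif c_max == g:
        h = 60.0 * ((b - r) / delta + 2.0)
    else:
        h = 60.0 * ((r - g) / delta + 4.0)
    if h < 0:
        h += 360.0                                # wrap negative angles

    # Step 3: saturation
    s = 0.0 if c_max == 0 else delta / c_max

    # Step 4: value
    v = c_max
    return h, s, v

# Cross-check one color against the standard library (hue scaled to degrees).
h, s, v = rgb_to_hsv(0.2, 0.8, 0.4)
h_ref, s_ref, v_ref = colorsys.rgb_to_hsv(0.2, 0.8, 0.4)
print(h, h_ref * 360.0)
```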
What is color augmentation in computer vision? List and explain three common color augmentation techniques used to train deep learning models.
Color Augmentation:
Color augmentation is a data preprocessing technique used in deep learning to artificially expand the size and diversity of a training dataset by applying random color transformations to images. This helps the model generalize better and become invariant to lighting and camera sensor variations.
Three Common Techniques:
- Color Jittering: This involves randomly changing the brightness, contrast, saturation, and hue of an image by small margins. For example, multiplying the RGB channels by a random scalar alters brightness, teaching the model to recognize objects in both well-lit and dim environments.
- Channel Shuffling: This technique randomly swaps the RGB channels of an image (e.g., turning an RGB image into BGR or GRB). While it changes the semantic color of objects (making a red car blue), it forces the network to rely on structural features (edges, shapes) rather than memorizing color associations, which is useful for certain shape-recognition tasks.
- Grayscale Conversion: Randomly converting a percentage of the training batch to grayscale. This forces the neural network to learn spatial and textural features that do not rely on chromatic information, making the model robust against color noise and variations.
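The three techniques can be sketched in plain Python on a tiny image stored as nested lists of RGB tuples in [0, 1]. This is an illustrative sketch only: the function names, the jitter range, and the grayscale probability are assumptions, and deep learning frameworks ship their own tuned versions of these transforms.

```python
import random

def jitter_brightness(image, rng, max_change=0.2):
    """Scale every channel by one random factor in [1 - max_change, 1 + max_change]."""
    factor = rng.uniform(1.0 - max_change, 1.0 + max_change)
    return [[tuple(min(1.0, c * factor) for c in px) for px in row] for row in image]

def shuffle_channels(image, rng):
    """Apply one random permutation of the channel order to every pixel."""
    order = [0, 1, 2]
    rng.shuffle(order)
    return [[tuple(px[i] for i in order) for px in row] for row in image]

def random_grayscale(image, rng, probability=0.1):
    """With the given probability, replace each pixel by its luma in all channels."""
    if rng.random() >= probability:
        return image
    def gray(px):
        y = 0.299 * px[0] + 0.587 * px[1] + 0.114 * px[2]
        return (y, y, y)
    return [[gray(px) for px in row] for row in image]

rng = random.Random(0)                          # fixed seed so runs are repeatable
image = [[(1.0, 0.0, 0.0), (0.2, 0.6, 0.4)]]    # a tiny 1x2 RGB image
augmented = random_grayscale(shuffle_channels(jitter_brightness(image, rng), rng), rng)
```

In training, a fresh random draw would be made for every image in every epoch, so the network rarely sees exactly the same colors twice.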
Define color constancy. Why is achieving color constancy a computationally challenging problem in computer vision?
Definition:
Color constancy is the ability of a visual system (human or machine) to perceive the true color of an object despite variations in the color of the light source illuminating it. For example, a white piece of paper looks white to humans whether it is viewed under yellowish incandescent indoor light or bluish natural daylight.
Why it is Computationally Challenging:
The color recorded by a camera sensor is a product of three factors:
- The spectral power distribution of the illuminant (light source).
- The surface reflectance properties of the object.
- The spectral sensitivity of the camera sensor.
Mathematically, the sensor response is an integral:
$$\rho = \int_{\lambda} E(\lambda)\, R(\lambda)\, S(\lambda)\, d\lambda$$
Where $E(\lambda)$ is illumination, $R(\lambda)$ is reflectance, and $S(\lambda)$ is sensor sensitivity across wavelength $\lambda$.
The challenge is an ill-posed inverse problem: computer vision algorithms only receive the final pixel value $\rho$. They must estimate the object's true reflectance $R(\lambda)$ without knowing the illuminant $E(\lambda)$. Because multiple combinations of lighting and surface colors can produce the exact same pixel value, the system must rely on assumptions and heuristics (like the Grey World assumption) to estimate and discount the illuminant.
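The ambiguity can be made concrete with a discretized sketch in which the integral becomes a sum over three wavelength bins. The spectra and the idealized sensor (each channel sees exactly one bin) are made-up assumptions chosen so that the arithmetic is easy to follow.

```python
def sensor_response(illuminant, reflectance, sensitivities):
    """Approximate the integral by a sum over wavelength bins, one value per channel."""
    return tuple(
        sum(e * r * s for e, r, s in zip(illuminant, reflectance, channel))
        for channel in sensitivities
    )

# Idealized sensor: each channel responds to one bin (long, medium, short).
sensitivities = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Scene A: neutral white light on a reddish surface.
white_light = (1.0, 1.0, 1.0)
reddish_surface = (0.5, 0.25, 0.125)

# Scene B: reddish light on a neutral gray surface.
reddish_light = (1.0, 0.5, 0.25)
gray_surface = (0.5, 0.5, 0.5)

pixel_a = sensor_response(white_light, reddish_surface, sensitivities)
pixel_b = sensor_response(reddish_light, gray_surface, sensitivities)
print(pixel_a, pixel_b)   # both scenes yield the same pixel value
```

From the pixel alone, there is no way to tell whether the surface or the light is reddish, which is why extra assumptions are needed.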
Explain the 'Grey World' and 'White Patch' algorithms used for achieving computational color constancy.
1. Grey World Algorithm:
- Assumption: The Grey World algorithm is based on the assumption that given an image with a sufficient amount of color variations, the average color of the entire scene should be a neutral, achromatic grey.
- Mechanism: If the average color of an image deviates from this neutral grey, the algorithm assumes this deviation is entirely caused by the color of the illuminant (the light source).
- Correction: It calculates the average values for the R, G, and B channels. It then applies a scaling factor to each channel independently so that the new averages equal a predefined grey value. This effectively shifts the color balance to cancel out a color cast.
2. White Patch (Max-RGB) Algorithm:
- Assumption: The White Patch algorithm assumes that the brightest patch or pixel in an image corresponds to a glossy surface or a white object that is reflecting the maximum amount of light, and therefore reflects the true color of the light source.
- Mechanism: It finds the maximum pixel value in each of the R, G, and B channels across the image.
- Correction: It scales all pixels in the image such that the maximum R, G, and B values map to pure white (e.g., 255 in an 8-bit image). By normalizing the image relative to these maximum values, the color cast introduced by the illuminant is removed.
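Both corrections reduce to one gain per channel, as in the minimal sketch below on a flattened list of RGB pixels. As an assumption, the Grey World target is taken to be the average of the three channel means (the text only says "a predefined grey value"), and the sample pixel values are invented.

```python
def grey_world(pixels):
    """Scale each channel so that all three channel means equal a common grey level."""
    n = len(pixels)
    means = [sum(px[ch] for px in pixels) / n for ch in range(3)]
    target = sum(means) / 3.0                        # assumed neutral grey level
    gains = [target / m if m > 0 else 1.0 for m in means]
    return [tuple(px[ch] * gains[ch] for ch in range(3)) for px in pixels]

def white_patch(pixels, white=255.0):
    """Scale each channel so that its maximum value maps to the white level."""
    maxima = [max(px[ch] for px in pixels) for ch in range(3)]
    gains = [white / m if m > 0 else 1.0 for m in maxima]
    return [tuple(px[ch] * gains[ch] for ch in range(3)) for px in pixels]

# A tiny flattened image with a yellowish cast (the blue channel is suppressed).
pixels = [(200.0, 180.0, 90.0), (100.0, 90.0, 45.0), (50.0, 60.0, 30.0)]
balanced_gw = grey_world(pixels)
balanced_wp = white_patch(pixels)
```

In both cases the weak blue channel receives the largest gain, which counteracts the yellowish cast. White Patch is sensitive to a single saturated or noisy pixel, so practical variants often use a high percentile instead of the strict maximum.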
What are range images? Distinguish between a 2D intensity image and a 3D range image based on what their pixel values represent.
Range Images (Definition):
A range image (also known as a depth map or depth image) is a 2D array of pixels where each pixel value represents the physical distance (depth or range) from a reference point (usually the sensor/camera) to a specific point on an object in the 3D scene.
Distinction:
| Feature | 2D Intensity Image | 3D Range Image |
|---|---|---|
| Pixel Value Meaning | Represents the amount of light reflected from the object's surface (brightness/color). | Represents the physical geometric distance (depth) from the sensor to the object surface. |
| Illumination Dependency | Highly dependent on external lighting, shadows, and surface texture. | Generally independent of ambient lighting and surface texture (especially if active sensors are used). |
| Geometry | Projects 3D space onto a 2D plane, losing scale and depth information. | Directly captures the 3D geometric structure and scale of the scene. |
| Visualization | Viewed naturally as a photograph. | Viewed as a grayscale map (closer objects are brighter/darker depending on mapping) or as a 3D point cloud. |
Describe the basic concept of Active Range Sensors. List the three broad categories of active range sensing techniques.
Concept of Active Range Sensors:
Active range sensors are devices that do not rely on naturally occurring ambient light to capture scene information. Instead, they emit their own controlled source of energy (such as laser light, infrared, or ultrasonic waves) into the environment and detect the reflection or backscatter from objects. By analyzing the properties of the returned signal (e.g., time taken, phase shift, or spatial distortion), the sensor calculates the precise distance to the object.
Three Broad Categories of Active Range Sensing Techniques:
- Time of Flight (ToF): Measures the time it takes for an emitted pulse of light/sound to travel to an object and reflect back to the sensor (e.g., LiDAR, Ultrasonic sensors).
- Triangulation: Emits a beam of light (like a laser) at a known angle. A camera at a known baseline distance captures the reflection. The depth is calculated using geometric triangles.
- Structured Light: Projects a known pattern (often grids or parallel stripes) of light onto the scene. The deformation of this pattern on the surfaces of objects is captured by a camera to calculate depth.
Explain the working principle of a Time-of-Flight (ToF) range sensor. Derive the fundamental equation used to calculate the distance.
Working Principle of ToF Sensors:
A Time-of-Flight (ToF) sensor works by illuminating a scene with a modulated light source (usually a laser or infrared LED) and measuring the time it takes for the light to travel to an object and reflect back to a receiver (sensor).
There are two main types: direct ToF (dToF), which measures the exact time of a single pulse, and indirect ToF (iToF), which measures the phase shift of a continuous modulated light wave.
Derivation of Fundamental Equation (Direct ToF):
- Let $d$ be the distance from the sensor to the object.
- Let $c$ be the speed of the emitted signal (for light, $c \approx 3 \times 10^8$ m/s).
- The emitted pulse travels to the object (distance $d$) and reflects back to the sensor (distance $d$). Thus, the total distance traveled by the pulse is $2d$.
- Let $t$ be the total round-trip time measured by the sensor.
Since Velocity = Distance / Time, we have:
$$c = \frac{2d}{t}$$
Rearranging to solve for distance $d$:
$$d = \frac{c \cdot t}{2}$$
By measuring the extremely short time interval $t$ with high precision, the sensor accurately calculates the depth $d$ for each pixel, creating a range image.
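A short numeric sketch of the direct ToF relation (distance equals signal speed times round-trip time, divided by two). The function name and the example timings are illustrative.

```python
SPEED_OF_LIGHT = 299_792_458.0   # metres per second

def tof_distance(round_trip_time_s, speed=SPEED_OF_LIGHT):
    """Distance to the target: half of the round-trip path length."""
    return speed * round_trip_time_s / 2.0

# A 10 ns round trip corresponds to roughly 1.5 m.
print(tof_distance(10e-9))

# The same relation holds for ultrasound, with the speed of sound in air
# (about 343 m/s at room temperature).
print(tof_distance(0.01, speed=343.0))
```

The example shows why timing precision matters: at the speed of light, a timing error of one nanosecond shifts the computed distance by about 15 cm.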
How does a Structured Light scanner acquire a range image? Explain how it solves the correspondence problem inherent in stereo vision.
Acquisition of Range Image via Structured Light:
- A structured light scanner consists of a projector and a camera positioned at a known baseline distance from each other.
- The projector projects a sequence of known 2D patterns (usually alternating black and white stripes, grids, or phase-shifted sinusoids) onto the target object.
- Due to the 3D surface geometry of the object, these projected patterns appear distorted or displaced when viewed from the perspective of the camera.
- The camera captures images of these distorted patterns.
- Using the known geometry of the system (baseline, angles) and the amount of distortion in the pattern, the system calculates the depth of each pixel using optical triangulation.
Solving the Correspondence Problem:
In traditional passive stereo vision (using two cameras), the "correspondence problem" refers to the difficulty of matching a pixel in the left camera's image with the exact same physical point in the right camera's image, especially on textureless surfaces.
Structured light solves this by replacing one camera with a projector. The projector actively projects a unique, uniquely identifiable code (via the light pattern) onto every point in the scene. The camera simply reads this code. Because the pattern is artificially overlaid, even completely plain, textureless objects acquire distinguishable features, making correspondence mapping direct and computationally efficient.
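One common way to give every projector column a unique code is a sequence of binary stripe patterns, where the on/off history seen by a camera pixel spells out the column index. The sketch below uses a Gray code for this. That choice is an assumption, since the text above does not name a coding scheme, and all function names are illustrative.

```python
def gray_encode(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_decode(g):
    """Invert gray_encode by folding the shifted bits back in."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def stripe_patterns(num_columns, num_bits):
    """Pattern k holds, for each projector column, bit k (MSB first) of its Gray code."""
    return [
        [(gray_encode(col) >> (num_bits - 1 - k)) & 1 for col in range(num_columns)]
        for k in range(num_bits)
    ]

def decode_column(observed_bits):
    """Recover the projector column from the bits one camera pixel saw, MSB first."""
    code = 0
    for bit in observed_bits:
        code = (code << 1) | bit
    return gray_decode(code)

patterns = stripe_patterns(num_columns=8, num_bits=3)
# A camera pixel lit by projector column 5 sees this on/off sequence:
seen = [pattern[5] for pattern in patterns]
print(seen, "->", decode_column(seen))
```

Once a camera pixel knows which projector column illuminated it, the correspondence is established without any texture matching, and depth follows from triangulation between the camera ray and the projector column.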
What are the common sources of noise in range images acquired by active sensors? Mention two preprocessing techniques to mitigate them.
Common Sources of Noise in Range Images:
- Sensor Noise/Thermal Noise: Inherent electronic noise in the photodetectors (e.g., shot noise in ToF sensors).
- Multipath Interference: Light bounces off multiple surfaces (e.g., in corners) before returning to the sensor, causing the sensor to calculate an artificially long distance.
- Specular Reflections: Highly reflective (shiny) surfaces bounce the emitted signal away from the sensor, resulting in "missing" data (holes) or false depths.
- Ambient Light Interference: Strong ambient light (like direct sunlight) can overwhelm the active signal (especially IR light), reducing the signal-to-noise ratio.
- Motion Blur: If the sensor or the object moves during the scan, edge artifacts and depth smearing occur.
Preprocessing Techniques:
- Median Filtering: Replaces each pixel's depth value with the median of its neighbors. Excellent for removing salt-and-pepper (spiky) noise without severely blurring sharp geometric edges.
- Statistical Outlier Removal (SOR): Computes the mean distance of each point to its nearest neighbors. Points whose mean neighbor distance exceeds the global mean by more than a chosen multiple of the standard deviation are considered outliers (noise) and are removed.
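Median filtering of a depth map can be sketched in plain Python as follows. The window radius, the border handling (using only in-image neighbours), and the sample depth values are assumptions. `statistics.median_low` is used so that every output is an actual measured depth rather than an average of two depths.

```python
from statistics import median_low

def median_filter(depth, radius=1):
    """Replace each depth value by the median of its (2*radius+1)^2 neighbourhood.

    Border pixels use only the neighbours that fall inside the image.
    """
    rows, cols = len(depth), len(depth[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                depth[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(median_low(window))
        filtered.append(out_row)
    return filtered

# A flat surface at 2.0 m with one spike, next to a step edge to 5.0 m.
depth = [
    [2.0, 2.0, 2.0, 5.0, 5.0],
    [2.0, 9.9, 2.0, 5.0, 5.0],
    [2.0, 2.0, 2.0, 5.0, 5.0],
]
cleaned = median_filter(depth)
```

In this example the 9.9 m spike is replaced by 2.0 m, while the step between the 2.0 m and 5.0 m surfaces stays sharp, which a mean filter would have smeared.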
Explain the process of point cloud/range image filtering using Statistical Outlier Removal (SOR).
Statistical Outlier Removal (SOR) Process:
SOR is a widely used preprocessing technique to clean noisy range data (often represented as 3D point clouds). The process operates on the spatial distribution of the data:
- Neighborhood Selection: For every point $p_i$ in the range image/point cloud, the algorithm identifies its $k$ nearest neighbors (where $k$ is a user-defined parameter).
- Distance Calculation: It calculates the average Euclidean distance from point $p_i$ to all its $k$ neighbors. Let's call this mean distance $\bar{d}_i$.
- Global Statistics: Once $\bar{d}_i$ is computed for all points, the algorithm calculates the global mean ($\mu$) and standard deviation ($\sigma$) of these distances across the entire dataset.
- Thresholding: A distance threshold is defined as $T = \mu + \alpha \cdot \sigma$, where $\alpha$ is a multiplier (e.g., 1.0 or 2.0).
- Outlier Rejection: The algorithm iterates through the points again. If a point's average neighborhood distance $\bar{d}_i$ is greater than the threshold $T$, it is deemed an isolated outlier (noise) and is removed from the dataset. Otherwise, it is kept as a valid point (inlier).
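The five steps above can be sketched in plain Python. This is a brute-force illustration under stated assumptions: the parameter names `k` and `alpha`, the use of the population standard deviation, and the test cloud are choices made for this example, and practical implementations use a k-d tree for the neighbour search.

```python
import math
from statistics import mean, pstdev

def statistical_outlier_removal(points, k=3, alpha=1.0):
    """Keep points whose mean distance to their k nearest neighbours is at most
    mu + alpha * sigma, with mu and sigma computed over all points."""
    mean_dists = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_dists.append(mean(dists[:k]))

    mu = mean(mean_dists)              # global mean of the per-point mean distances
    sigma = pstdev(mean_dists)         # global (population) standard deviation
    threshold = mu + alpha * sigma
    return [p for p, d in zip(points, mean_dists) if d <= threshold]

# Eight points on the corners of a small cube plus one far-away stray measurement.
cloud = [(x, y, z) for x in (0.0, 0.1) for y in (0.0, 0.1) for z in (0.0, 0.1)]
cloud.append((5.0, 5.0, 5.0))
inliers = statistical_outlier_removal(cloud, k=3, alpha=1.0)
print(len(cloud), "->", len(inliers))
```

A smaller multiplier removes points more aggressively, and a larger one keeps more of the sparse but valid regions of the cloud.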
Discuss the application of range data in autonomous robotics and navigation. How does it improve upon traditional 2D vision?
Applications in Autonomous Robotics and Navigation:
Range data (such as that generated by LiDAR or RGB-D cameras) is foundational to modern robotics.
- Obstacle Avoidance: Robots can instantly detect the exact distance to obstacles in their path, preventing collisions.
- Simultaneous Localization and Mapping (SLAM): Robots use range data to continuously map an unknown environment in 3D while simultaneously determining their own location within that map.
- Terrain Assessment: Autonomous rovers and drones analyze depth maps to assess terrain roughness, finding safe paths over uneven surfaces.
- Grasping and Manipulation: Robotic arms use short-range depth sensors to precisely calculate the geometry, orientation, and distance of objects they need to pick up.
Improvement over Traditional 2D Vision:
Standard 2D cameras struggle with scale ambiguity (a small object close up looks identical to a large object far away). They also fail easily in poor lighting or when encountering uniform, textureless walls. Range data provides absolute metric scale (exact measurements in meters), works independently of ambient lighting (if active sensors are used), and explicitly defines free space versus occupied space, making navigation vastly safer and more reliable.
How are range images utilized in 3D object modeling and biometric face recognition?
1. 3D Object Modeling:
- Range images provide a direct mapping of surface geometry. To create a full 3D model (like in reverse engineering or cultural heritage preservation), range images of an object are captured from multiple different angles.
- These overlapping depth maps are aligned and merged (using algorithms like Iterative Closest Point - ICP) to form a complete 3D point cloud or mesh.
- Because range data captures actual physical dimensions, the resulting 3D models are metrically accurate.
2. Biometric Face Recognition:
- Traditional 2D face recognition is vulnerable to "spoofing" (e.g., presenting a photograph or video to the camera) and is sensitive to changes in lighting and makeup.
- Range images capture the 3D topology of a face (the depth of eye sockets, the protrusion of the nose, jawline contours).
- This 3D geometric signature is highly distinctive between individuals and is not fooled by flat photographs. It works reliably in the dark (using IR depth sensors) and provides high-security authentication (e.g., Apple's Face ID).
What is the tristimulus theory of color perception? How does this biological concept form the basis of the RGB color model in computer vision?
Tristimulus Theory of Color Perception:
The tristimulus theory (or Young-Helmholtz theory) posits that human color vision is based on three distinct types of photoreceptor cells (cones) in the retina. Each type is sensitive to a different range of the visible light spectrum:
- S-cones: Sensitive to Short wavelengths (peak in blue).
- M-cones: Sensitive to Medium wavelengths (peak in green).
- L-cones: Sensitive to Long wavelengths (peak in red).
Any color perceived by a human is essentially a combination of the specific stimulation levels of these three cone types.
Basis of the RGB Color Model:
The RGB color model in computer vision and digital displays mirrors this biological mechanism. Because the human eye reduces the continuous spectrum of light to just three cone signals, a display only needs three primaries (Red, Green, and Blue) at varying intensities to make the brain perceive a wide range of colors. That range is limited by the display's gamut and does not cover every visible color. Likewise, most digital cameras capture images through a Bayer filter array of Red, Green, and Blue filtered pixel sensors, forming the fundamental 3-channel (RGB) image matrix used in computer vision.
Explain the principle of active triangulation used in laser scanners for depth measurement. Provide the mathematical formulation to calculate depth.
Principle of Active Triangulation:
Active triangulation systems use a laser emitter and a camera separated by a known distance (the baseline). The laser emits a point or line of light onto an object, and the camera views this point. Because the baseline distance, the emission angle of the laser, and the focal length of the camera are known, the depth to the object can be calculated using the geometry of triangles. As the object moves closer or farther away, the position of the laser dot on the camera's image sensor shifts horizontally.
Mathematical Formulation:
Let:
- $b$ = baseline distance between the laser emitter and the camera lens center.
- $f$ = focal length of the camera.
- $\theta$ = angle of the laser emitter relative to the baseline.
- $x$ = the horizontal displacement of the laser dot on the camera's image sensor from the optical center (taken as positive toward the laser side).
- $Z$ = the depth (distance to the object) we want to find.
Using similar triangles, the depth can be found. The lateral offset $X$ of the object point from the camera center is related to depth by:
$$\frac{X}{Z} = \frac{x}{f} \quad\Rightarrow\quad X = \frac{x Z}{f}$$
From the laser side, the geometry dictates that:
$$\tan\theta = \frac{Z}{b - X} \quad\Rightarrow\quad X = b - Z\cot\theta$$
Substituting $X$ into the equation:
$$\frac{x Z}{f} = b - Z\cot\theta$$
Rearranging to solve for depth $Z$:
$$Z = \frac{b\, f}{x + f\cot\theta}$$
Thus, by measuring $x$ on the image sensor, the exact depth $Z$ is calculated.
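A numeric sketch of this triangulation, under an assumed sign convention: the camera sits at the origin looking along the depth axis, the laser sits one baseline away along the positive lateral axis, and the sensor offset is positive toward the laser side. The forward model `project_spot` is a hypothetical helper used only to generate a consistent measurement for the check.

```python
import math

def triangulation_depth(x, f, b, theta):
    """Depth Z of the laser spot.

    x: offset of the spot on the sensor from the optical centre (same unit as f)
    f: focal length, b: baseline, theta: beam angle to the baseline in radians
    Implements Z = b * f / (x + f * cot(theta)).
    """
    return b * f / (x + f / math.tan(theta))

def project_spot(z, f, b, theta):
    """Forward model: where a laser spot at depth z lands on the sensor."""
    lateral = b - z / math.tan(theta)     # lateral position X of the spot in the scene
    return f * lateral / z                # pinhole projection x = f * X / Z

f, b, theta = 0.008, 0.10, math.radians(60)   # 8 mm lens, 10 cm baseline (assumed)
x = project_spot(0.5, f, b, theta)
print(triangulation_depth(x, f, b, theta))     # recovers a depth close to 0.5 m
```

Because depth depends on the reciprocal of the sensor offset term, a fixed pixel error causes a depth error that grows with distance, which is why triangulation is most accurate at short range.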
Compare Time-of-Flight (ToF) sensors and Structured Light sensors. Discuss their advantages and disadvantages for acquiring range images.
Comparison:
1. Time-of-Flight (ToF) Sensors:
- Pros:
- Excellent for long-range scanning (LiDAR can scan hundreds of meters).
- Computationally lightweight; depth is calculated directly from time measurements.
- Less affected by ambient light if narrow-band IR filters are used.
- Cons:
- Lower spatial resolution compared to structured light.
- Suffer from multi-path interference (light bouncing in corners).
2. Structured Light Sensors:
- Pros:
- High spatial resolution and extreme precision (sub-millimeter accuracy is possible), making them ideal for 3D modeling and face scanning.
- Captures the entire scene depth at once using the projected pattern (if using single-shot patterns).
- Cons:
- Limited to short ranges (projector light fades quickly over distance).
- Highly susceptible to ambient light (struggles immensely outdoors in bright sunlight).
- Computationally heavy to decode the projected patterns to find correspondences.
What is the Retinex theory in the context of color constancy? Briefly explain how Retinex algorithms attempt to restore true colors.
Retinex Theory:
Proposed by Edwin Land, the Retinex (a portmanteau of "Retina" and "Cortex") theory models how the human visual system perceives color and lightness robustly under varying illumination. It posits that color perception is not determined merely by the absolute light entering the eye, but by the relative reflectance of surfaces calculated locally across the scene.
How Retinex Algorithms Work:
Retinex algorithms aim to separate the image into two components: the illumination (the light falling on the scene) and the reflectance (the true physical color of the object).
- Assumption: Illumination varies smoothly and slowly across an image, while reflectance changes sharply at object edges.
- Estimation: The algorithm estimates the illumination map by heavily blurring the original image (often using multi-scale Gaussian filters) to remove sharp details and retain only the smooth lighting gradient.
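- Restoration: It then isolates the true reflectance by dividing the original image by the estimated illumination map (often done by subtracting in the logarithmic domain):
  $$\log R(x, y) = \log I(x, y) - \log\big[G_\sigma(x, y) * I(x, y)\big]$$
  where $I$ is the input image, $G_\sigma$ is the Gaussian blur kernel, $*$ denotes convolution, and $R$ is the estimated reflectance.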
- The result is an image where shadows and colored light casts are removed, restoring color constancy.
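A single-scale version of this idea can be sketched on one scan line in plain Python. The sketch makes several assumptions: a 1-D signal stands in for the image, a single Gaussian scale is used instead of several, borders are handled by clamping, and all values are strictly positive so that the logarithm is defined.

```python
import math

def gaussian_blur_1d(signal, sigma):
    """Blur a 1-D signal with a normalized Gaussian kernel (indices clamped at the ends)."""
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for offset, weight in zip(offsets, kernel):
            j = min(max(i + offset, 0), n - 1)      # clamp index at the borders
            acc += weight * signal[j]
        blurred.append(acc / total)
    return blurred

def single_scale_retinex(signal, sigma):
    """Log reflectance estimate: log(image) minus log(blurred image)."""
    illumination = gaussian_blur_1d(signal, sigma)
    return [math.log(s) - math.log(l) for s, l in zip(signal, illumination)]

# One scan line: a uniform surface (reflectance 0.5) under light that
# falls off smoothly from left to right.
n = 200
light = [1.0 - 0.8 * i / (n - 1) for i in range(n)]
scanline = [0.5 * e for e in light]
output = single_scale_retinex(scanline, sigma=5.0)
```

Away from the borders, the output is nearly constant even though the input brightness drops by a factor of five across the line: the smooth lighting gradient has been divided out. The constant is zero here because this form recovers reflectance only relative to its local surroundings, not its absolute value.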