Unit4 - Subjective Questions
INT345 • Practice Questions with Detailed Answers
Define feature detection in computer vision. What are the key characteristics of a 'good' feature?
Feature detection is the process of automatically identifying specific points or regions of interest in an image, such as corners, edges, or blobs, which can be tracked or compared across different images.
A 'good' feature should possess the following characteristics:
- Repeatability: The same feature should be detectable in two or more images of the same scene, despite variations in viewing conditions.
- Distinctiveness (Saliency): The feature should stand out from its background and be easily distinguishable from other features.
- Invariance: The detection should be robust to transformations such as translation, rotation, scale changes, and affine distortions.
- Robustness: It should be insensitive to noise, changes in illumination, and minor blur.
- Locality: The feature should occupy a relatively small area to minimize the chance of occlusion.
Explain the mathematical formulation of the Harris Corner Detector. How does it distinguish between flat regions, edges, and corners?
The Harris Corner Detector is based on the local auto-correlation function. It considers a small window and measures the change in intensity when the window is shifted by a small amount .
The change in intensity is given by:
Using Taylor series expansion, this can be approximated as:
where is the structure tensor (second-moment matrix):
Here, and are image derivatives.
The eigenvalues of , and , determine the region type:
- Flat region: Both and are small. is almost constant in all directions.
- Edge: One eigenvalue is large, the other is small. changes only in one direction.
- Corner: Both and are large. increases significantly in all directions.
Harris proposed a response function to avoid explicit eigenvalue computation:
where , , and is an empirical constant (usually 0.04 to 0.06). A local maximum of above a threshold indicates a corner.
Describe the main steps involved in the Scale Invariant Feature Transform (SIFT) algorithm.
The Scale Invariant Feature Transform (SIFT) extracts highly distinctive features that are invariant to scale and rotation. The algorithm consists of four main steps:
-
Scale-Space Extrema Detection:
The image is convolved with Gaussian filters at different scales. The Difference of Gaussian (DoG) is computed by subtracting adjacent blurred images. Local extrema (maxima or minima) in the DoG across space and scales are identified as potential keypoints. -
Keypoint Localization:
The exact position and scale of each candidate are determined using a Taylor series expansion of scale-space. Low-contrast points and poorly localized edge responses (checked using the Hessian matrix) are eliminated, leaving only stable keypoints. -
Orientation Assignment:
To achieve rotation invariance, a dominant orientation is assigned to each keypoint. A histogram of gradient orientations is built from the neighborhood of the keypoint, weighted by gradient magnitudes and a Gaussian window. The highest peak dictates the keypoint's orientation. -
Keypoint Descriptor Generation:
A neighborhood around the keypoint is taken, divided into sixteen sub-blocks. For each sub-block, an 8-bin orientation histogram is created. This results in a -dimensional feature vector. The vector is normalized to enhance robustness against illumination changes.
How does SIFT achieve scale and rotation invariance?
Scale Invariance in SIFT:
SIFT achieves scale invariance by searching for keypoints across multiple scales using a scale-space representation. It builds a pyramid of blurred images using a Gaussian function and computes the Difference of Gaussian (DoG). Keypoints are selected as the local extrema across both spatial dimensions and the scale dimension . Because the feature is extracted at the scale where it exhibits a strong response, the descriptor remains invariant if the image is scaled.
Rotation Invariance in SIFT:
Rotation invariance is achieved by assigning a dominant orientation to each keypoint based on local image gradient directions. A histogram of local gradient orientations is computed around the keypoint. The peak of this histogram is assigned as the keypoint's reference orientation. When constructing the final descriptor, the local neighborhood grid and the gradient orientations are rotated relative to this assigned dominant orientation, making the resulting 128-dimensional vector invariant to image rotation.
Compare and contrast SURF (Speeded Up Robust Features) with SIFT. How does SURF achieve its computational speedup?
Comparison of SURF and SIFT:
- Feature: Both are scale and rotation invariant. SIFT is generally considered slightly more robust to severe transformations, while SURF is significantly faster.
- Descriptor Size: SIFT typically uses a 128-dimensional vector. SURF uses a 64-dimensional vector (with an extended 128-D version available).
- Mathematical Basis: SIFT uses the Difference of Gaussians (DoG) to approximate the Laplacian of Gaussian (LoG). SURF relies on the determinant of the Hessian matrix and uses integer approximation of determinants.
How SURF achieves computational speedup:
- Integral Images: SURF heavily utilizes integral images, which allow the sum of pixel values within any rectangular area to be computed in just four array references, regardless of the area's size. This drastically speeds up box filter operations.
- Box Filters: Instead of standard Gaussian filters, SURF approximates the second-order Gaussian derivatives with simple 2D box filters (Haar wavelets). Because of integral images, scaling the filter size has no computational penalty, eliminating the need to iteratively downsample the image.
- Descriptor Calculation: SURF uses Haar wavelet responses in horizontal and vertical directions to construct its descriptor, which is faster to compute than SIFT's gradient histograms.
Explain the concept of binary feature detectors. Describe the FAST (Features from Accelerated Segment Test) algorithm in detail.
Binary Feature Detectors:
Binary feature detectors and descriptors represent visual features using binary strings (0s and 1s) instead of floating-point vectors (like SIFT or SURF). They compare pixel intensities in a local patch. This makes them highly memory-efficient and incredibly fast to match using the Hamming distance (computed via a simple XOR operation followed by a bit count).
FAST (Features from Accelerated Segment Test):
FAST is a corner detection algorithm focused on real-time performance.
- Select a pixel: Choose a candidate pixel with intensity .
- Set a threshold: Choose an appropriate threshold value .
- Consider a circle: Consider a circle of 16 pixels surrounding the pixel (usually a Bresenham circle of radius 3).
- Corner Condition: The pixel is a corner if there exists a set of contiguous pixels in the circle (typically ) which are all brighter than or all darker than .
- High-speed test: To speed up the algorithm, a quick test is performed by examining only pixels 1, 9, 5, and 13. If is a corner, at least three of these four must be brighter than or darker than . If they are not, is rejected immediately.
- Non-maximal suppression: Applied later to remove adjacent corners.
What are the limitations of the standard FAST algorithm, and how is machine learning used to improve it?
Limitations of standard FAST:
- The high-speed test (checking pixels 1, 9, 5, 13) does not generalize well to all environments and corners; it assumes allows rejection, which isn't mathematically guaranteed for all cases.
- The efficiency depends heavily on the ordering of the questions (which pixels are tested first).
- Multiple adjacent features are often detected, requiring non-maximal suppression.
- It only detects corners; it does not produce a descriptor or handle scale/rotation invariance natively.
Improvement using Machine Learning:
Machine learning is used to optimize the decision tree for testing pixels.
- A large set of training images is taken.
- The 16 pixels around a candidate are classified into three states: darker, similar, or brighter compared to the center pixel.
- The ID3 algorithm (a decision tree algorithm) is applied to this data to select the pixel that yields the most information gain (highest entropy reduction) to test next.
- This creates an optimized decision tree for the specific domain, resulting in a much faster and more accurate evaluation than the heuristic 1-9-5-13 check.
Describe the BRIEF (Binary Robust Independent Elementary Features) descriptor. Why is it considered highly efficient?
BRIEF (Binary Robust Independent Elementary Features) is a feature descriptor (not a detector; it requires a detector like FAST). It relies on simple binary tests between pixels in a smoothed image patch.
Algorithm:
- Smoothing: The image patch is heavily smoothed (usually with a Gaussian kernel) to reduce noise sensitivity.
- Selecting Pairs: A set of pairs of pixels is chosen within the patch based on a predefined spatial distribution (e.g., an isotropic Gaussian distribution around the center).
- Binary Test: For each pair, their intensities are compared. If intensity at < intensity at , the bit is set to 1; otherwise, it is 0.
- Forming String: This is repeated for all pairs (usually 128, 256, or 512 times), resulting in an -bit binary string.
Efficiency:
BRIEF is highly efficient for two reasons:
- Creation: It only requires smoothing and simple pixel intensity comparisons, bypassing the costly gradient computations and histogram bins used in SIFT/SURF.
- Matching: Descriptors are matched using the Hamming distance (XORing two binary strings and counting the set bits using
popcount), which is executed natively and extremely fast on modern CPUs compared to Euclidean distance calculations for floating-point vectors.
What is ORB (Oriented FAST and Rotated BRIEF)? How does it build upon FAST and BRIEF to create a robust feature detector and descriptor?
ORB (Oriented FAST and Rotated BRIEF) is a fast, robust, and open-source alternative to SIFT and SURF. As the name implies, it combines a modified FAST detector and a modified BRIEF descriptor.
Building upon FAST (Oriented FAST):
Standard FAST lacks scale and rotation invariance.
- Scale Invariance: ORB achieves scale invariance by applying FAST across a multi-scale image pyramid.
- Rotation Invariance: ORB computes the intensity centroid of the patch (the center of mass of pixel intensities). The vector from the center of the patch to the centroid defines the orientation of the keypoint.
Building upon BRIEF (Rotated BRIEF):
Standard BRIEF is highly sensitive to in-plane rotation.
- Steering BRIEF: ORB rotates the set of predefined BRIEF pixel pairs according to the orientation angle found by Oriented FAST before performing the binary tests.
- rBRIEF (Learning good binary tests): Because rotating the pairs makes them more correlated (losing variance), ORB uses a greedy learning algorithm on a training dataset to select a set of 256 pixel pairs that have high variance and are largely uncorrelated.
By combining these modifications, ORB delivers performance comparable to SIFT and SURF but is orders of magnitude faster and patent-free.
Explain the computation steps for the Histogram of Oriented Gradients (HOG) descriptor. What is its primary application?
Histogram of Oriented Gradients (HOG) is a feature descriptor widely used for object detection.
Computation Steps:
- Preprocessing: The image is typically resized (e.g., to for pedestrian detection) and optionally normalized for gamma/color.
- Gradient Computation: The image gradients (magnitude and orientation) are calculated using simple derivative masks (like and its transpose).
- Cell Histograms: The image is divided into small spatial regions called "cells" (e.g., pixels). For each cell, a 1-D histogram of gradient directions (usually 9 bins from 0 to 180 degrees) is compiled. Each pixel votes for a bin based on its gradient angle, weighted by its gradient magnitude.
- Block Normalization: To account for changes in illumination, cells are grouped into larger, overlapping "blocks" (e.g., cells, or pixels). The histograms within the block are concatenated and normalized (using L2-norm).
- Feature Vector: The normalized block vectors are concatenated to form the final HOG descriptor. For a image with cells and blocks, this yields a vector of 3780 dimensions.
Primary Application:
HOG's primary and most famous application is pedestrian detection (human detection), usually paired with a Support Vector Machine (SVM) classifier, because it captures local shape and edge information excellently while ignoring flat regions.
Discuss various applications of feature descriptors in computer vision.
Feature descriptors provide a numerical representation of image patches, allowing computers to recognize and match scenes. Key applications include:
- Image Stitching (Panoramas): Extracting features from overlapping images, matching them, and finding the geometric transformation (homography) to warp and blend images together.
- Object Recognition and Detection: Training classifiers (like SVMs with HOG) to detect specific objects (cars, faces, pedestrians) or matching descriptors (SIFT/ORB) against a database to recognize a specific book cover or product.
- 3D Reconstruction and Stereo Vision: Matching features between two cameras (stereo) or moving cameras to estimate depth (disparity) and reconstruct 3D models of environments using Structure from Motion (SfM).
- Visual Tracking: Following specific features across video frames to track objects, vehicles, or camera movement.
- Simultaneous Localization and Mapping (SLAM): Used extensively in robotics and AR/VR. Robots extract features from their sensors to build a map of an unknown environment while simultaneously keeping track of their location within it.
- Image Retrieval: Finding similar images in a large database (Reverse Image Search) using Bag-of-Visual-Words models built on top of feature descriptors.
Define feature matching. Discuss the common similarity (or distance) measures used to compare feature descriptors.
Feature Matching is the process of establishing correspondences between features extracted from two or more images. Given a feature in Image A, matching finds the most similar feature in Image B based on their descriptor vectors.
Common Similarity/Distance Measures:
- Euclidean Distance (L2 Norm): Used for floating-point descriptors like SIFT and SURF. It is the straight-line distance between two vectors and :
Lower distance implies higher similarity. - Manhattan Distance (L1 Norm): Sometimes used as a faster alternative to L2. It is the sum of absolute differences:
- Hamming Distance: Used strictly for binary descriptors (BRIEF, ORB, FAST). It measures the number of positions at which the corresponding bits are different. It is calculated incredibly fast using XOR and a bit-count operation.
- Cosine Similarity: Measures the cosine of the angle between two vectors. Useful when magnitude does not matter, only the direction. A value of 1 means perfectly similar.
What is Brute-Force matching in feature matching? Discuss its advantages and its primary drawback.
Brute-Force (BF) Matching is the simplest method for feature matching. For every feature descriptor in the first set, the algorithm calculates the distance to every single feature descriptor in the second set. The closest one (the one with the minimum distance) is returned as the match.
Advantages:
- Accuracy: It guarantees finding the absolute closest mathematical match between the two sets because it exhaustively checks all possibilities. It yields the exact nearest neighbor.
- Simplicity: Extremely easy to implement. It does not require building any complex data structures prior to matching.
Primary Drawback:
- Computational Complexity (Speed): Its major flaw is scalability. If Image 1 has features and Image 2 has features, the time complexity is . In high-resolution images with thousands of features (e.g., ), it requires 25 million distance calculations, which is too slow for real-time applications, especially with high-dimensional descriptors like 128-D SIFT.
Explain the structure of a K-D tree. How is it used to speed up feature matching?
A K-D Tree (K-Dimensional Tree) is a space-partitioning data structure used for organizing points in a -dimensional space. It is essentially a binary search tree extended to multiple dimensions.
Structure:
- At the root, the space is divided into two halves by a hyperplane perpendicular to one of the coordinate axes (e.g., the 1st dimension of a descriptor).
- The points are split based on a median value along that dimension; points with smaller values go to the left child, larger to the right.
- At the next level, the algorithm chooses a different axis (e.g., the 2nd dimension) and splits the subspaces again.
- This process continues recursively, cycling through the dimensions, until the leaf nodes contain a small, predefined number of points.
Speeding up Feature Matching:
Instead of brute-force checking points (complexity ), finding a match for a query feature using a K-D tree drops the average time complexity to .
To find a match, the query descriptor traverses down the tree by comparing its values at the splitting dimensions, quickly narrowing down the search space to a specific region (leaf node). Only a small subset of features within that leaf (and sometimes adjacent leaves) are evaluated, drastically reducing the number of distance calculations required.
Describe the process of finding the nearest neighbor in a K-D tree. What is the main challenge when using K-D trees with high-dimensional descriptors like SIFT?
Finding the Nearest Neighbor in a K-D Tree:
- Traverse down: Start at the root and move down the tree just like a binary search, comparing the query point's coordinate with the node's splitting value until a leaf node is reached.
- Current Best: The point in this leaf node is saved as the "current best" nearest neighbor.
- Backtrack: Unwind the recursion. At each step up, check if the distance to the "current best" is greater than the perpendicular distance from the query point to the splitting hyperplane of the current node.
- Cross boundary: If it is greater, it means there could be a closer point on the other side of the hyperplane. The algorithm must cross over and search the other branch.
- If it is smaller, the other branch is safely ignored.
Challenge with High-Dimensional Descriptors (The Curse of Dimensionality):
When the dimensionality is high (like 128 for SIFT), the space becomes so vast that the query point's distance to the splitting hyperplanes is almost always smaller than the distance to the current best match. This forces the algorithm to backtrack and search almost every branch of the tree. Consequently, the performance degrades rapidly, and in high dimensions, an exact K-D tree search is often no faster than a Brute-Force search.
What is Locality-Sensitive Hashing (LSH)? How does it address the curse of dimensionality in nearest neighbor search?
Locality-Sensitive Hashing (LSH) is an algorithmic technique that hashes input items in such a way that similar items map to the same "buckets" with high probability (the number of buckets being much smaller than the universe of possible input items). Unlike cryptographic hashes where small changes cause drastically different hashes, LSH aims for hash collisions for similar items.
Addressing the Curse of Dimensionality:
Traditional tree-based structures like K-D trees fail in high-dimensional spaces because they rely on exact geometric partitioning, leading to exhaustive searches (the curse of dimensionality).
LSH addresses this by giving up the guarantee of finding the exact nearest neighbor in favor of an Approximate Nearest Neighbor (ANN).
- High-dimensional descriptors are passed through a family of locality-sensitive hash functions.
- These functions assign similar descriptors to the same hash buckets.
- At query time, the query descriptor is hashed, and the algorithm only computes the exact distance against the items residing in the same (or neighboring) buckets.
This dramatically reduces the search space in or sub-linear time, making it highly scalable for massive datasets of high-dimensional vectors, at the cost of occasionally missing the absolute best match.
Explain how hash functions are designed for LSH (Locality-Sensitive Hashing) to ensure similar items map to the same bucket.
Hash functions in LSH are specifically designed based on the distance metric being used (e.g., Cosine similarity, Euclidean distance, or Hamming distance).
Example: Random Projection for Cosine Similarity:
To hash a high-dimensional continuous vector:
- Generate a set of random hyperplanes (random vectors) that pass through the origin.
- For a given data point (feature descriptor), take the dot product with each random hyperplane.
- If the dot product is positive (the point is on one side of the hyperplane), output a '1'. If negative, output a '0'.
- Repeating this for random hyperplanes yields a -bit hash signature.
Why it works: If two vectors are very similar (angle between them is small), it is highly unlikely that a random hyperplane will fall exactly between them. Therefore, they will end up on the same side of most hyperplanes, resulting in identical or very similar hash signatures. Thus, similar items naturally fall into the same "bucket" (hash value).
Describe the RANSAC (Random Sample Consensus) algorithm. Why is it essential for robust feature matching and model fitting?
RANSAC (Random Sample Consensus) is an iterative algorithm used to estimate parameters of a mathematical model from a set of observed data that contains outliers.
Algorithm:
- Random Sampling: Randomly select the minimum number of data points required to define the mathematical model (e.g., 2 points for a line, 4 points for a homography).
- Model Fitting: Compute the model parameters using only this small, randomly selected subset.
- Consensus Counting: Test all other data points in the entire dataset against this fitted model. If a point fits the model within a predefined error tolerance, it is counted as an "inlier".
- Iteration: Repeat steps 1-3 for a fixed number of iterations.
- Best Model: Keep the model that has the highest number of inliers (the largest consensus set). Finally, recompute the model using all the inliers from the best set to get a refined result.
Why it is essential:
Feature matching algorithms (like SIFT + Brute Force) are not perfect and always produce false positive matches (outliers). If we attempt to fit a model (like a homography for image stitching) using least squares on all matches, even a few severe outliers will completely skew the result. RANSAC is robust; it can tolerate a dataset with up to 50% (and sometimes more) outliers, effectively ignoring them and calculating the geometry based strictly on the true matches.
Walk through the step-by-step process of using RANSAC to fit a line to a 2D dataset containing significant outliers.
Fitting a line using RANSAC follows these steps:
- Initialization: Assume a 2D dataset with noisy points and distinct outliers. Set a maximum number of iterations , and an error threshold (distance from a point to the line).
- Select Subset: Randomly pick exactly two points from the dataset (since 2 points define a line).
- Fit Model: Calculate the equation of the line passing through these two points: .
- Evaluate Inliers: For every other point in the dataset, calculate its perpendicular distance to this line. If the distance is less than the threshold , classify the point as an "inlier".
- Score Model: Record the number of inliers for this line. If this count is higher than any previous iteration, save this line as the "best model".
- Repeat: Repeat steps 2-5 for iterations.
- Final Refinement: After iterations, take the best model (the one with the most inliers). Optionally, perform a standard linear regression (least squares) using all the inliers from this best set to refine the final line parameters.
Compare Brute-Force matching, K-D tree-based matching, and LSH in the context of matching binary versus floating-point descriptors.
The choice of matching algorithm heavily depends on the type of descriptor (Floating-point like SIFT/SURF vs. Binary like ORB/BRIEF).
-
Brute-Force Matching:
- Floating-Point: Computes L2 (Euclidean) distance. Slow for large datasets (), but exact.
- Binary: Computes Hamming distance (XOR + bit count). Incredibly fast due to hardware-level optimizations, making Brute-Force perfectly viable for real-time binary descriptor matching on moderate datasets.
-
K-D Tree (e.g., FLANN with K-D trees):
- Floating-Point: Ideal for floating-point vectors if dimensions are low. For 128-D SIFT, randomized K-D trees are used to give approximate nearest neighbors, offering massive speedups over Brute Force L2 searches.
- Binary: Generally not suited for binary descriptors. The discrete nature of binary strings doesn't partition well geometrically in a K-D tree structure.
-
Locality-Sensitive Hashing (LSH):
- Floating-Point: Can be adapted (e.g., random projections), but K-D trees/clustering are often preferred in standard libraries for continuous vectors.
- Binary: Highly effective for binary descriptors. Multi-probe LSH is the standard approximate nearest neighbor search method for ORB/BRIEF in large databases, as hashing binary strings to find collisions is computationally natural and extremely fast.