Unit 4 - Notes

INT345 11 min read

Unit 4: Feature Detection and Description, Feature Matching and model Fitting

Part 1: Feature Detection and Description

Overview of Feature Detection

Feature detection is the process of identifying points of interest (or "features") in an image that carry distinct and crucial information. Good features should be easily recognizable, distinct from their surroundings, and repeatable across different images regardless of variations in scale, rotation, illumination, or viewing angle.

Keypoints (Interest Points): Specific spatial locations (e.g., corners, edges, blobs) in an image.
Descriptors: Mathematical vectors that describe the visual appearance of the neighborhood around a keypoint.
Properties of ideal features:
- Repeatability: The same feature can be found in two images of the same scene.
- Distinctiveness: The feature descriptor should vary significantly from others to allow correct matching.
- Locality: Features should occupy a small image patch, reducing the chance of occlusion.
- Invariance: Robustness to affine transformations, scale changes, rotation, and illumination.

Harris Corner Detector

The Harris Corner Detector is a mathematical operator that identifies corners based on the variation of intensity. A corner is defined as a point where the image gradient exhibits significant change in two orthogonal directions.

Mathematical Foundation:
Let $I(x, y)$ be the image intensity. The change in intensity $E(u, v)$ for a shift $(u, v)$ is:
$E(u, v) = \sum_{x, y} w(x, y) [I(x+u, y+v) - I(x, y)]^2$
Using Taylor expansion, this can be approximated as:
$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$
Where $M$ is the structure tensor (or second-moment matrix):
$M = \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$
( $I_x$ and $I_y$ are image derivatives, $w(x, y)$ is a window function like a Gaussian).
Corner Response Function ( $R$ ):
Instead of calculating the exact eigenvalues ( $\lambda_1, \lambda_2$ ) of $M$ , Harris and Stephens proposed the response function:
$R = \det(M) - k(\text{trace}(M))^2$
Where $\det(M) = \lambda_1 \lambda_2$ , $\text{trace}(M) = \lambda_1 + \lambda_2$ , and $k$ is an empirical constant (usually 0.04 to 0.06).
Interpretation:
- $R > 0$ (large): Corner (both eigenvalues are large).
- $R < 0$ : Edge (one eigenvalue is large, the other is small).
- $|R|$ is small: Flat region.
Properties: Rotation invariant, but not scale-invariant.

Scale-Invariant Feature Transform (SIFT)

SIFT addresses the scale-variance of Harris corners. It extracts highly distinctive, scale- and rotation-invariant features.

Scale-Space Extrema Detection:
- Constructs a "scale space" by convolving the image with Gaussian blurs of increasing standard deviations ( $\sigma$ ).
- Computes the Difference of Gaussians (DoG) between adjacent blurred images to approximate the Laplacian of Gaussian (LoG).
- Finds local extrema (minima and maxima) in the 3D DoG space (2D spatial + 1D scale).
Keypoint Localization:
- Refines the location of extrema using sub-pixel interpolation.
- Eliminates low-contrast points and edge responses (using a Hessian matrix) to retain only stable keypoints.
Orientation Assignment:
- Computes gradient magnitude and direction for pixels around the keypoint.
- Creates an orientation histogram (36 bins covering 360 degrees). The peak determines the keypoint's dominant orientation, ensuring rotation invariance.
Keypoint Descriptor:
- Takes a $16 \times 16$ neighborhood around the keypoint, divided into $4 \times 4$ sub-blocks.
- For each sub-block, computes an 8-bin orientation histogram.
- Results in a $4 \times 4 \times 8 = 128$ -dimensional floating-point vector.
- Normalized to achieve illumination invariance.

Speeded Up Robust Features (SURF)

SURF is heavily inspired by SIFT but is designed to be significantly faster to compute, while maintaining comparable performance.

Detector: Uses the Fast Hessian matrix. Instead of creating a Gaussian scale space, SURF uses box filters (approximations of Gaussian second order derivatives) that can be computed incredibly fast using Integral Images.
Descriptor:
- Extracts Haar wavelet responses in $x$ and $y$ directions.
- Divides the neighborhood into $4 \times 4$ sub-regions.
- For each sub-region, computes a 4D vector: $( \sum dx, \sum dy, \sum |dx|, \sum |dy| )$ .
- Results in a 64-dimensional descriptor, making it faster to match than SIFT's 128-dimensional descriptor.

Binary Feature Detectors

Floating-point descriptors (like SIFT and SURF) require significant memory and matching time (computing Euclidean distances). Binary descriptors represent features as strings of 1s and 0s.

Advantages:
- Extremely fast matching using the Hamming distance (computable via bitwise XOR followed by a bit-count).
- Highly memory-efficient.
- Suitable for real-time applications on low-power devices (e.g., smartphones, drones).

FAST (Features from Accelerated Segment Test)

FAST is a corner detection algorithm optimized for real-time performance.

Algorithm:
1. Select a pixel $p$ with intensity $I_p$ .
2. Consider a circle of 16 pixels surrounding $p$ (radius of 3).
3. $p$ is a corner if there exists a set of $N$ contiguous pixels in the circle that are all brighter than $I_p + t$ , or all darker than $I_p - t$ (where $t$ is a threshold).
4. Usually, $N=12$ or $N=9$ is used.
High-Speed Test: Checks pixels 1, 9, 5, and 13 first to quickly reject non-corners.
Limitation: FAST does not produce a descriptor, nor is it scale or rotation invariant. It only detects keypoints.

BRIEF (Binary Robust Independent Elementary Features)

BRIEF is a feature descriptor (not a detector). It relies on a pre-existing detector (like FAST) to find keypoints.

Algorithm:
- Smooths the image patch around the keypoint to reduce noise sensitivity.
- Selects pairs of pixels $(X, Y)$ within the patch based on a predefined random spatial distribution (typically a Gaussian distribution centered on the keypoint).
- For each pair, if intensity $I(X) < I(Y)$ , the bit is 1; otherwise, 0.
- Repeats this $N$ times (e.g., 256 or 512 times) to create a 256-bit or 512-bit binary string.
Limitation: Not rotation invariant.

ORB (Oriented FAST and Rotated BRIEF)

ORB is a fusion of FAST and BRIEF, with added modifications to make it rotation invariant and resistant to noise. It was developed in OpenCV labs as a free, efficient alternative to SIFT and SURF (which were historically patented).

Detector (oFAST): Uses FAST to find keypoints, applies a Harris corner measure to rank and select the top $N$ features. Computes the intensity centroid of the patch to find the orientation of the keypoint.
Descriptor (rBRIEF): Steers (rotates) the BRIEF sampling pattern according to the orientation calculated by oFAST. It also uses a greedy search algorithm to find pixel pairs that have high variance and are uncorrelated, maximizing the distinctiveness of the binary string.

Histogram of Oriented Gradients (HOG)

Primarily used for object detection (especially human/pedestrian detection) rather than matching sparse keypoints. HOG captures local shape information by analyzing gradient distributions.

Algorithm Steps:
1. Gradient Computation: Calculate $x$ and $y$ gradients using simple masks (e.g., [-1, 0, 1]). Find gradient magnitude and orientation.
2. Cell Histograms: Divide the image into small spatial cells (e.g., $8 \times 8$ pixels). Accumulate gradient magnitudes into an orientation histogram (typically 9 bins spanning 0-180 degrees) for each cell.
3. Block Normalization: Group cells into larger, overlapping blocks (e.g., $2 \times 2$ cells). Normalize the histograms within the block to account for changes in lighting and contrast.
4. Feature Vector: Concatenate all normalized block histograms into a single dense feature vector, which is typically fed into a Support Vector Machine (SVM) classifier.

Applications of Descriptors

Image Stitching / Panoramas: Finding overlapping features between multiple images to warp and align them.
Object Recognition: Storing features of an object in a database and finding those same features in a target image.
Visual Tracking: Tracking features from frame to frame in video sequences (e.g., SLAM in robotics, optical flow).
3D Reconstruction: Using multi-view feature correspondences to estimate depth and structure (Structure from Motion).

Part 2: Feature Matching and Model Fitting

Overview of Feature Matching

Once descriptors are extracted from two images, Feature Matching is the process of establishing correspondences between them. This is done by measuring the "distance" or "similarity" between the descriptor vectors.

Similarity Measures

The choice of distance metric depends heavily on the type of descriptor used.

Sum of Squared Differences (SSD) / $L_2$ Norm (Euclidean Distance):
- Standard for floating-point descriptors (SIFT, SURF).
- $D(A, B) = \sqrt{\sum (A_i - B_i)^2}$
Sum of Absolute Differences (SAD) / $L_1$ Norm (Manhattan Distance):
- Faster to compute than SSD, sometimes used for performance optimization.
- $D(A, B) = \sum |A_i - B_i|$
Normalized Cross-Correlation (NCC):
- Measures the linear correlation between two vectors. Values range from -1 to 1. Higher is better.
Hamming Distance:
- Standard for binary descriptors (BRIEF, ORB).
- Defined as the number of positions at which the corresponding bits are different.
- Computed via XOR followed by POPCOUNT (population count of 1s).

Brute Force Matching

The simplest matching approach. For every feature in Image A, compute the distance to every feature in Image B, and select the one with the minimum distance.

Complexity: $O(M \times N)$ , where $M$ and $N$ are the number of features in images A and B. Slow for large feature sets.
Refinement Techniques:
- Cross-Check: A match $(i, j)$ is only accepted if feature $i$ in image A matches feature $j$ in image B, AND feature $j$ in image B matches feature $i$ in image A.
- Lowe’s Ratio Test: Proposed by David Lowe (creator of SIFT). Instead of just taking the nearest neighbor, take the closest and the second closest. If the ratio $D_{closest} / D_{second\_closest} < \text{threshold}$ (e.g., 0.75), accept the match. This eliminates ambiguous matches that happen in repetitive textures.

K-D Tree (K-Dimensional Tree)

To avoid the $O(M \times N)$ cost of Brute Force, we use spatial data structures for nearest-neighbor search.

Structure: A binary tree where every node is a $k$ -dimensional point. Every non-leaf node generates a splitting hyperplane that divides the space into two half-spaces based on a specific dimension.
Search Complexity: $O(\log N)$ on average for low dimensions.
Limitations: The "Curse of Dimensionality." When the dimensionality $k$ is high (e.g., 128 for SIFT), the search performance degrades to $O(N)$ , no better than brute force.
FLANN (Fast Library for Approximate Nearest Neighbors): Uses randomized k-d trees or hierarchical k-means to find approximate nearest neighbors much faster in high dimensions.

Locality-Sensitive Hashing (LSH)

LSH is an algorithm for approximate nearest neighbor search, highly effective for high-dimensional binary descriptors (like ORB).

Concept: Standard hash functions try to minimize collisions. LSH does the opposite: it designs hash functions such that similar items map to the same "buckets" with high probability.
Mechanism: It uses random projections to map high-dimensional data into lower-dimensional binary codes. If two descriptors are close in the high-dimensional space, their hash values will likely be identical.
Matching: Only search for matches within the same (or neighboring) hash buckets, drastically reducing the search space and time.

RANSAC for Robust Matching

Even with ratio tests and cross-checks, feature matching will always produce outliers (incorrect matches). When estimating a geometric model (like a homography matrix for image stitching or a fundamental matrix in stereo vision), even a single severe outlier can ruin the result using standard least-squares fitting.
RANSAC (Random Sample Consensus) is an iterative algorithm designed to robustly estimate mathematical model parameters from a dataset containing outliers.

RANSAC Algorithm Steps:
1. Hypothesize: Randomly select the minimum number of data points required to define the mathematical model (e.g., 4 point pairs for a Homography).
2. Estimate: Calculate the parameters of the model using only this minimal subset.
3. Evaluate: Apply the calculated model to the entire dataset. Data points that fit the model within a predefined tolerance (error threshold) are counted as inliers.
4. Iterate: Repeat steps 1-3 for a set number of iterations $k$ .
5. Select Best Model: The model with the highest number of inliers is selected as the best fit.
6. Refine (Optional): Recompute the model using standard least-squares fitting on all the inliers found by the best model.
Why RANSAC works: By selecting random minimal subsets, there is a high probability that eventually, a subset consisting entirely of true matches (inliers) will be chosen, resulting in an accurate model that aligns with the majority of the good data.

PYTHON

# Pseudo-code for RANSAC Homography Estimation
best_model = None
best_inlier_count = 0

for i in range(num_iterations):
    # 1. Randomly sample 4 corresponding point pairs
    sample_points = get_random_samples(matches, 4)
    
    # 2. Compute Homography matrix H from the 4 points
    H = compute_homography(sample_points)
    
    # 3. Evaluate model against all matches
    inliers = 0
    for match in matches:
        error = compute_reprojection_error(match, H)
        if error < threshold:
            inliers += 1
            
    # 4. Update best model
    if inliers > best_inlier_count:
        best_inlier_count = inliers
        best_model = H

return best_model

Unit 3

Unit 5