Unit 2 - Notes

INT345

Unit 2: Camera Geometry and 2-D Projective Geometry

1. Camera Models and Geometry

1.1 Pinhole Cameras

The pinhole camera is the simplest mathematical model used to describe the geometric mapping from 3D space to a 2D image plane.

Concept: A light-proof box with a small hole (the center of projection). Light rays from a 3D object pass through this hole and form an inverted image on the image plane (the back of the box).
Geometry: If the center of projection is at the origin $(0,0,0)$ and the image plane lies at distance $f$ (the focal length) along the optical axis $Z$, a 3D point $P = (X, Y, Z)$ projects to a 2D point $p = (x, y)$ on the image plane.
Equations: By similar triangles, with the image plane taken at $Z = f$ in front of the center of projection, the projection equations are:
$x = f \frac{X}{Z}, \quad y = f \frac{Y}{Z}$
Virtual Image Plane: The physical image plane sits behind the pinhole at $Z = -f$, where the same derivation gives $x = -f \frac{X}{Z}, \; y = -f \frac{Y}{Z}$ (an inverted image). To avoid the sign flip, we mathematically place a "virtual" image plane in front of the camera center at $Z = f$, which keeps the image upright.
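The projection equations above can be tried out numerically. The following is a minimal sketch, assuming points are given in camera coordinates with the virtual image plane at $Z = f$; the function name `project_pinhole` is illustrative only.

```python
def project_pinhole(point, f):
    """Project a 3D point (X, Y, Z), given in camera coordinates,
    onto the virtual image plane Z = f using x = f*X/Z, y = f*Y/Z."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (f * X / Z, f * Y / Z)


# The same lateral offset seen at twice the depth appears half as large.
near = project_pinhole((1.0, 2.0, 4.0), f=2.0)  # (0.5, 1.0)
far = project_pinhole((1.0, 2.0, 8.0), f=2.0)   # (0.25, 0.5)
```

The division by $Z$ is what produces perspective foreshortening: doubling the depth halves both image coordinates.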

1.2 Cameras with Lenses

Real cameras use lenses to gather more light than a tiny pinhole allows. The larger aperture shortens the exposure time and avoids the diffraction blur caused by a very small hole.

Thin Lens Equation: Relates the focal length $f$ , the distance to the object $Z$ , and the distance to the image plane $z$ :
$\frac{1}{f} = \frac{1}{Z} + \frac{1}{z}$
Depth of Field: Because the lens aperture has a finite size, only objects at one specific distance are perfectly in focus; points at other distances are imaged as small blur circles. The range of distances over which the image is acceptably sharp is the depth of field.
Lens Distortions:
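- Radial Distortion: Light rays bend more near the edges of the lens than at the optical center, leading to barrel or pincushion distortion. Corrected using polynomial models (e.g., $x_{corrected} = x (1 + k_1 r^2 + k_2 r^4)$, and likewise for $y$, where $r^2 = x^2 + y^2$ is the squared distance from the distortion center and $k_1, k_2$ are the radial distortion coefficients).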
- Tangential Distortion: Occurs when the lens and the image sensor are not perfectly parallel.
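A small numerical sketch of the thin lens equation and the radial polynomial model may help here. It assumes consistent units for $f$ and $Z$, and image coordinates measured relative to the distortion center; the function names are hypothetical.

```python
def image_distance(f, Z):
    """Solve the thin lens equation 1/f = 1/Z + 1/z for the image distance z."""
    if Z == f:
        raise ValueError("an object at the focal distance is imaged at infinity")
    return 1.0 / (1.0 / f - 1.0 / Z)


def radial_model(x, y, k1, k2):
    """Scale (x, y) by the radial polynomial 1 + k1*r^2 + k2*r^4.
    Coordinates are assumed to be relative to the distortion center."""
    r2 = x * x + y * y
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return (x * factor, y * factor)


# A 50 mm lens focused on an object 1000 mm away: z is slightly larger than f.
z = image_distance(50.0, 1000.0)

# The scaling grows with distance from the center; the center itself is unchanged.
center = radial_model(0.0, 0.0, k1=-0.2, k2=0.0)
edge = radial_model(0.5, 0.0, k1=-0.2, k2=0.0)
```

As the object distance $Z$ grows, $z$ approaches $f$, which is why distant scenes are in focus with the sensor at the focal length.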

1.3 CCD Cameras

Charge-Coupled Device (CCD) cameras digitize the light falling on the image plane.

Sensor Array: The image plane consists of a rectangular grid of photosensitive elements (pixels).
Digitization: Involves spatial sampling (pixels) and quantization (analog light intensity to digital pixel value).
Pixel Coordinates: Real sensors have a physical pixel size (e.g., $s_x$ and $s_y$ mm/pixel). Mapping a projected point onto the CCD involves dividing its metric image-plane coordinates by the pixel size and shifting the origin from the principal point (where the optical axis meets the sensor, at pixel $(c_x, c_y)$) to the top-left corner of the image array.
Color representation: Most CCDs capture color using a Bayer filter mosaic, interpolating raw data to get RGB values for each pixel.
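The pixel-coordinate conversion described above can be sketched as follows. This assumes sensor axes aligned with the image-plane axes, zero skew, and $s_x, s_y$ given in mm/pixel; `metric_to_pixel` is a hypothetical helper name.

```python
def metric_to_pixel(x_mm, y_mm, s_x, s_y, c_x, c_y):
    """Convert image-plane coordinates in mm (origin at the principal point)
    to pixel coordinates (origin at the top-left corner of the array)."""
    u = x_mm / s_x + c_x  # divide by the pixel width, then shift the origin
    v = y_mm / s_y + c_y  # divide by the pixel height, then shift the origin
    return (u, v)


# A point on the optical axis lands exactly on the principal point.
on_axis = metric_to_pixel(0.0, 0.0, s_x=0.005, s_y=0.005, c_x=320, c_y=240)

# A point 0.5 mm to the right of the axis, with 0.005 mm pixels,
# lands about 100 pixels to the right of the principal point.
off_axis = metric_to_pixel(0.5, 0.0, s_x=0.005, s_y=0.005, c_x=320, c_y=240)
```

Dividing the focal length by the pixel size in the same way gives focal lengths expressed in pixel units.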

2. Advanced Camera Models

2.1 General Projective Cameras

The general projective camera model describes the full mapping from 3D world coordinates to 2D image pixel coordinates using a $3 \times 4$ projection matrix, $P$ .

Equation: $\mathbf{x} = P \mathbf{X}$ , where $\mathbf{X}$ is a 3D point in homogeneous coordinates $(X, Y, Z, 1)^T$ , and $\mathbf{x} = (wx, wy, w)^T$ is the 2D image point in homogeneous coordinates. The pixel coordinates $(x, y)$ are recovered by dividing by $w$ .
Decomposition ( $P = K[R | \mathbf{t}]$ ):
- Intrinsic Parameters ( $K$ ): A $3 \times 3$ upper triangular matrix describing internal camera geometry.
  $K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$
  Where $f_x, f_y$ are focal lengths in pixel units, $c_x, c_y$ are the coordinates of the principal point, and $s$ is the skew coefficient (often zero).
- Extrinsic Parameters ( $[R | \mathbf{t}]$ ): A $3 \times 4$ matrix describing the camera's pose in the 3D world, i.e., the rigid transformation that takes world coordinates into camera coordinates. $R$ is a $3 \times 3$ rotation matrix, and $\mathbf{t}$ is a $3 \times 1$ translation vector.
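To make the decomposition $P = K[R | \mathbf{t}]$ concrete, here is a minimal pure-Python sketch that assembles $P$ and projects a world point. The numeric values of $K$, $R$, and $\mathbf{t}$ are made up for illustration, and matrices are plain nested lists.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def projection_matrix(K, R, t):
    """Build the 3x4 matrix P = K [R | t]."""
    Rt = [R[i] + [t[i]] for i in range(3)]  # append t as a fourth column
    return matmul(K, Rt)


def project(P, X):
    """Project a 3D world point X = (X, Y, Z) to pixel coordinates."""
    Xh = list(X) + [1.0]  # homogeneous coordinates (X, Y, Z, 1)
    wx, wy, w = (sum(P[i][j] * Xh[j] for j in range(4)) for i in range(3))
    return (wx / w, wy / w)  # divide out the scale w


# Illustrative values: 800-pixel focal length, principal point (320, 240),
# zero skew, camera frame aligned with the world frame.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
P = projection_matrix(K, R, t)

pixel = project(P, (0.5, -0.25, 2.0))    # (520.0, 140.0)
on_axis = project(P, (0.0, 0.0, 5.0))    # (320.0, 240.0), the principal point
```

With $R = I$ and $\mathbf{t} = 0$ the result reduces to the pinhole equations scaled by $f_x, f_y$ and shifted by $(c_x, c_y)$.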

2.2 Affine Cameras

Affine cameras are a simplification of the projective camera model where the center of projection is assumed to be at infinity. This implies that light rays are parallel (orthographic or parallel projection).

Mathematical Representation: The last row of the projection matrix $P$ is $(0, 0, 0, 1)$ .
$P_{affine} = \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ 0 & 0 & 0 & 1 \end{bmatrix}$
Types of Affine Cameras:
- Orthographic Projection: Drops the Z-coordinate.
- Weak Perspective (Scaled Orthographic): Assumes all points on a 3D object are at roughly the same depth $Z_0$ . Magnification is uniform across the object.
Characteristics: Parallelism is preserved (parallel lines in 3D remain parallel in the 2D image). Perspective effects (farther objects appearing smaller) are ignored.
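The difference between full perspective and weak perspective can be seen in a short sketch. It assumes camera coordinates, a focal length $f$, and a reference depth $Z_0$ chosen by the user; both function names are illustrative.

```python
def perspective(point, f):
    """Full perspective: each point is divided by its own depth Z."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)


def weak_perspective(point, f, Z0):
    """Weak perspective: every point is scaled by the same factor f / Z0,
    so the individual depth Z is ignored."""
    X, Y, _ = point
    s = f / Z0
    return (s * X, s * Y)


# Two points with the same (X, Y) but slightly different depths around Z0 = 10.
front = (1.0, 1.0, 9.0)
back = (1.0, 1.0, 11.0)

# Under full perspective they project to different image points...
p_front, p_back = perspective(front, 1.0), perspective(back, 1.0)

# ...under weak perspective they coincide, since magnification is uniform.
w_front = weak_perspective(front, 1.0, Z0=10.0)
w_back = weak_perspective(back, 1.0, Z0=10.0)
```

The approximation is reasonable when the depth variation within the object is small compared with $Z_0$.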

3. Camera Calibration

Camera calibration is the process of estimating the intrinsic ( $K$ ) and extrinsic ( $R, t$ ) parameters of a camera, as well as lens distortion coefficients.

Purpose: Required for 3D reconstruction, metric measurements from 2D images, and stereo vision.
Methodology (Zhang's Method): The standard modern technique involves taking multiple images of a known planar pattern (e.g., a checkerboard) from different angles.
1. Detect feature points (corners) in the 2D images.
2. Establish correspondences between 3D world points (known from the checkerboard geometry) and 2D image points.
3. Compute a closed-form solution for the parameters.
4. Refine the parameters (including non-linear lens distortion) using optimization techniques like Levenberg-Marquardt to minimize the reprojection error.
Direct Linear Transform (DLT): A fundamental algorithm used in calibration to solve for the $3 \times 4$ camera matrix $P$ given a set of known 3D-2D point correspondences.
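The reprojection error minimized in the refinement step can be computed as below. This is a sketch of the error measure only, not of the calibration procedure itself; it assumes a known $3 \times 4$ matrix $P$ stored as nested lists, and the sample points are invented for illustration.

```python
import math


def project(P, X):
    """Project a 3D point with a 3x4 camera matrix and divide by w."""
    Xh = list(X) + [1.0]
    wx, wy, w = (sum(P[i][j] * Xh[j] for j in range(4)) for i in range(3))
    return (wx / w, wy / w)


def rms_reprojection_error(P, world_points, image_points):
    """Root-mean-square pixel distance between the observed image points
    and the projections of the corresponding 3D points."""
    total = 0.0
    for X, (u, v) in zip(world_points, image_points):
        pu, pv = project(P, X)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total / len(world_points))


# Illustrative camera: focal length 800 px, principal point (320, 240).
P = [[800.0, 0.0, 320.0, 0.0],
     [0.0, 800.0, 240.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
world = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]

# The second detected corner is off by (3, 4) pixels, i.e. 5 pixels.
observed = [(320.0, 240.0), (523.0, 244.0)]
error = rms_reprojection_error(P, world, observed)  # sqrt(25 / 2)
```

An optimizer such as Levenberg-Marquardt would adjust the camera parameters to drive this quantity down over all calibration images.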

4. 2-D Projective Geometry

4.1 Planar Geometry

Planar geometry deals with points and lines on a 2D plane. In standard Euclidean geometry, we encounter issues like parallel lines never intersecting, which requires special case handling. Projective geometry resolves these exceptions.

4.2 Projective Spaces ( $\mathbb{P}^n$ )

Projective space extends Euclidean space by adding "points at infinity."
$\mathbb{P}^2$ is the 2D projective plane. In $\mathbb{P}^2$ , every pair of distinct lines intersects at exactly one point (parallel lines intersect at a point at infinity).
$\mathbb{P}^3$ is the 3D projective space, used to represent our 3D world in computer vision.

4.3 Representation in Projective Coordinates

Projective geometry relies heavily on homogeneous coordinates.

Points: A 2D point $(x, y)$ in Cartesian coordinates is represented as a 3-vector $(x, y, 1)$ or, more generally, $(wx, wy, w)$ where $w \neq 0$ . To convert back to Cartesian, divide by $w$ : $(x/w, y/w)$ .
Points at Infinity (Ideal Points): Represented as $(x, y, 0)$ . These indicate direction rather than a finite location.
Lines: A line $ax + by + c = 0$ is represented by the 3-vector $\mathbf{l} = (a, b, c)^T$ .
Point-Line Duality: In $\mathbb{P}^2$ , points and lines are mathematically interchangeable.
- A point $\mathbf{x}$ lies on a line $\mathbf{l}$ if and only if $\mathbf{x}^T \mathbf{l} = 0$ (the dot product is zero).
- Intersection of two lines $\mathbf{l}_1, \mathbf{l}_2$ : The point of intersection is given by the cross product: $\mathbf{x} = \mathbf{l}_1 \times \mathbf{l}_2$ .
- Line through two points $\mathbf{x}_1, \mathbf{x}_2$ : The line is given by the cross product: $\mathbf{l} = \mathbf{x}_1 \times \mathbf{x}_2$ .
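The incidence and cross-product relations above are easy to verify by hand or in code. The sketch below uses plain 3-tuples for homogeneous points and lines; the helper names are illustrative.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))


# Lines x = 1 and y = 2, written as (a, b, c) with ax + by + c = 0.
l1 = (1, 0, -1)
l2 = (0, 1, -2)
x = cross(l1, l2)  # (1, 2, 1), i.e. the Cartesian point (1, 2)

# The intersection lies on both lines: x^T l = 0.
on_both = dot(x, l1) == 0 and dot(x, l2) == 0

# Parallel lines x = 1 and x = 3 meet at an ideal point (last coordinate 0).
ideal = cross((1, 0, -1), (1, 0, -3))  # (0, 2, 0): the direction of the y-axis

# Line through the points (0, 0) and (1, 1).
line = cross((0, 0, 1), (1, 1, 1))  # (-1, 1, 0), i.e. -x + y = 0
```

No special case is needed for the parallel lines: the same cross product simply returns a point with $w = 0$.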

5. Homography and its Properties

5.1 Definition

A Homography (or projective transformation) is an invertible mapping from a projective plane to a projective plane that maps straight lines to straight lines.

Mathematical formulation: Given a 2D point $\mathbf{x}$ in image 1 and its corresponding point $\mathbf{x}'$ in image 2, the homography is a $3 \times 3$ matrix $H$ such that:
$\mathbf{x}' = H \mathbf{x}$
(Note: This equality is up to a non-zero scale factor, usually written as $\lambda \mathbf{x}' = H \mathbf{x}$ ).
Degrees of Freedom (DoF): The matrix $H$ has 9 entries, but because it is defined up to a scale factor, it has 8 degrees of freedom.
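The scale ambiguity behind the 8 degrees of freedom can be checked directly. In this sketch, $H$ is a nested list with arbitrary illustrative entries, and `apply_homography` is a hypothetical helper that applies $H$ and then divides by the third coordinate.

```python
def apply_homography(H, point):
    """Map a Cartesian point (x, y) through H and return Cartesian coordinates."""
    x, y = point
    xh = [H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3)]
    if xh[2] == 0:
        raise ValueError("point maps to a point at infinity")
    return (xh[0] / xh[2], xh[1] / xh[2])


H = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.5, 1.0]]

# Multiplying every entry of H by the same non-zero constant...
H_scaled = [[4.0 * h for h in row] for row in H]

# ...leaves the mapping unchanged, because the constant cancels in the division.
a = apply_homography(H, (1.0, 2.0))         # (1.5, 2.0)
b = apply_homography(H_scaled, (1.0, 2.0))  # (1.5, 2.0)
```

This is why only the ratios of the 9 entries matter.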

5.2 Properties of Homography

Collinearity Preserved: Points that lie on a line in the first image will map to points that lie on a line in the second image.
Cross-Ratio Invariance: The cross-ratio of four collinear points is preserved under projective transformations.
Composition: The product of two homographies is another homography ( $H_3 = H_2 H_1$ ).
Invertibility: Since it is a bijective mapping, the inverse $H^{-1}$ exists and maps $\mathbf{x}'$ back to $\mathbf{x}$ .
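Cross-ratio invariance can be checked numerically. The sketch below uses one common convention for the cross-ratio of four ordered collinear points, $(|AC| \cdot |BD|) / (|BC| \cdot |AD|)$, and an arbitrary example matrix $H$; both are assumptions made for illustration.

```python
import math


def apply_homography(H, point):
    """Map a Cartesian point (x, y) through H and return Cartesian coordinates."""
    x, y = point
    xh = [H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3)]
    return (xh[0] / xh[2], xh[1] / xh[2])


def cross_ratio(a, b, c, d):
    """Cross-ratio (|AC| * |BD|) / (|BC| * |AD|) of four collinear points
    given in order along the line."""
    return (math.dist(a, c) * math.dist(b, d)) / (math.dist(b, c) * math.dist(a, d))


# Four equally spaced points on the line y = x.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]

# A projective (non-affine) transformation: the last row is not (0, 0, 1).
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.1, 0.2, 1.0]]
mapped = [apply_homography(H, p) for p in points]

before = cross_ratio(*points)  # 4/3
after = cross_ratio(*mapped)   # equal to `before` up to rounding
```

The spacing between the mapped points is no longer uniform, so distances and ratios of distances change, yet the cross-ratio stays the same.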

5.3 Estimation

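To compute a homography $H$ , we need at least 4 point correspondences between two images, with no three of the points collinear. Each point correspondence provides 2 linear equations in the entries of $H$ , so 4 correspondences give the 8 equations needed to fix its 8 degrees of freedom. The Direct Linear Transform (DLT) algorithm stacks these equations into a homogeneous system $A\mathbf{h} = 0$ , where $\mathbf{h}$ holds the 9 entries of $H$ , and solves it with the Singular Value Decomposition (SVD): $\mathbf{h}$ is the right singular vector associated with the smallest singular value. With more than 4 correspondences the same procedure yields a least-squares estimate.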
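For exactly 4 correspondences, the estimation can be sketched without an SVD routine. The version below is a simplification, not the SVD-based DLT: it fixes $h_{33} = 1$ and solves the resulting $8 \times 8$ linear system by Gaussian elimination, which assumes the source origin maps to a finite point (so the true $h_{33}$ is non-zero). All function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (e.g. collinear points)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def homography_from_4_points(src, dst):
    """Estimate H (with h33 fixed to 1) from 4 correspondences src[i] -> dst[i]."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence.
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(A, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Map a Cartesian point through H and divide by the third coordinate."""
    x, y = point
    xh = [H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3)]
    return (xh[0] / xh[2], xh[1] / xh[2])


# Unit square mapped to an arbitrary convex quadrilateral.
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dst = [(10.0, 20.0), (110.0, 30.0), (100.0, 130.0), (5.0, 120.0)]
H = homography_from_4_points(src, dst)
```

Fixing $h_{33} = 1$ keeps the example short; the SVD formulation avoids that restriction and extends naturally to more than 4 points.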

6. Applications

6.1 Image Stitching (Panoramas)

When multiple images are taken from the exact same camera center but looking in different directions, or if the scene is perfectly planar, the images are related by homographies.

Process:
1. Detect feature points (e.g., SIFT, SURF) in both images.
2. Match features to find correspondences.
3. Compute the Homography matrix $H$ using RANSAC to reject outliers.
4. Warp one image into the coordinate frame of the other using $H$ .
5. Blend the overlapping regions to create a seamless panoramic image.
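One practical detail of step 4 is deciding how large the output canvas must be. The sketch below is an illustrative assumption, not part of the process listed above: it warps the four corners of the source image with $H$ and takes the bounding box together with the reference image. It presumes the warped image stays finite (no corner maps to infinity), and the function names are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map a Cartesian point through H and divide by the third coordinate."""
    x, y = point
    xh = [H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3)]
    return (xh[0] / xh[2], xh[1] / xh[2])


def warped_bounds(H, width, height):
    """Bounding box (x_min, y_min, x_max, y_max) of the warped image corners."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    warped = [apply_homography(H, c) for c in corners]
    xs = [p[0] for p in warped]
    ys = [p[1] for p in warped]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def panorama_canvas(H, src_size, ref_size):
    """Canvas size covering the reference image and the warped source image,
    plus the offset at which the reference image is placed on the canvas."""
    x0, y0, x1, y1 = warped_bounds(H, *src_size)
    x0, y0 = min(x0, 0), min(y0, 0)
    x1, y1 = max(x1, ref_size[0] - 1), max(y1, ref_size[1] - 1)
    return (x1 - x0 + 1, y1 - y0 + 1), (-x0, -y0)


# A pure 300-pixel shift to the right between two 640x480 images.
H = [[1, 0, 300], [0, 1, 0], [0, 0, 1]]
size, offset = panorama_canvas(H, (640, 480), (640, 480))  # (940, 480), (0, 0)
```

The offset matters when the warped image extends to negative coordinates, since both images must then be shifted before blending.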

6.2 Perspective Correction

Also known as document scanning or "keystone correction."

When a camera captures a rectangular object (like a building facade, a painting, or a document) from an angle, perspective distortion causes it to appear as a trapezoid or, more generally, an arbitrary quadrilateral.
Process: By selecting the four corners of the distorted quadrilateral in the image and defining their desired coordinates as a perfect rectangle, a homography can be computed from these four correspondences. Applying this homography to the image warps it, making the object appear as though it were photographed perfectly straight-on.
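Applying a homography to an image is usually done by inverse mapping: for each output pixel, look up where it came from in the input. The sketch below assumes $H$ is already known, represents a grayscale image as a nested list, and uses nearest-neighbor sampling for brevity; the function names are illustrative.

```python
def invert_3x3(H):
    """Invert a 3x3 matrix using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("H is singular and cannot be inverted")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def warp_image(image, H, out_w, out_h, fill=0):
    """Warp `image` with H by inverse mapping and nearest-neighbor sampling."""
    H_inv = invert_3x3(H)
    in_h, in_w = len(image), len(image[0])
    out = [[fill] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            # Map the output pixel back into the input image.
            xh = [H_inv[r][0] * u + H_inv[r][1] * v + H_inv[r][2] for r in range(3)]
            if xh[2] == 0:
                continue  # maps to infinity; leave the fill value
            x, y = round(xh[0] / xh[2]), round(xh[1] / xh[2])
            if 0 <= x < in_w and 0 <= y < in_h:
                out[v][u] = image[y][x]
    return out


# Tiny check with a one-pixel shift to the right.
image = [[1, 2, 3],
         [4, 5, 6]]
shift = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
shifted = warp_image(image, shift, out_w=3, out_h=2)  # [[0, 1, 2], [0, 4, 5]]
```

Iterating over output pixels rather than input pixels guarantees that every output pixel receives a value, so the result has no holes.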

6.3 Rectification

Rectification is a crucial preprocessing step in stereo computer vision.

Concept: Given two images of the same scene taken from different positions, searching for corresponding points (to compute depth) usually requires searching across the entire 2D image.
Epipolar Geometry: Simplifies this search to a 1D line (the epipolar line).
Rectification Process: Homographies are applied to both images to project them onto a common image plane. This warps the images such that corresponding epipolar lines become perfectly horizontal and collinear. Consequently, to find a matching point in the right image for a pixel in the left image, one only needs to search along the same horizontal row, drastically speeding up depth estimation algorithms.
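Once the images are rectified, matching reduces to a search along one row. The sketch below illustrates that 1D search with a sum-of-absolute-differences (SAD) cost over a small window. It assumes a single row of grayscale intensities per image and that a feature at column $x$ in the left image appears at column $x - d$ in the right image for some disparity $d \geq 0$; the function name and parameters are hypothetical.

```python
def match_along_row(left_row, right_row, x_left, half_window, max_disparity):
    """Return the disparity d in [0, max_disparity] that minimizes the SAD
    between the window around x_left (left image) and x_left - d (right image)."""
    if x_left - half_window < 0 or x_left + half_window >= len(left_row):
        raise ValueError("window around x_left does not fit inside the row")
    best_d, best_cost = None, float("inf")
    for d in range(max_disparity + 1):
        x_right = x_left - d
        if x_right - half_window < 0:
            break  # window would leave the right image
        cost = sum(abs(left_row[x_left + k] - right_row[x_right + k])
                   for k in range(-half_window, half_window + 1))
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d


# The pattern 9, 5, 7 sits 3 columns further right in the left image.
right_row = [0, 0, 9, 5, 7, 0, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 0, 9, 5, 7, 0, 0]
disparity = match_along_row(left_row, right_row, x_left=6,
                            half_window=1, max_disparity=5)  # 3
```

Without rectification the same search would have to follow a slanted epipolar line, or in the worst case cover the whole image.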
