Unit 1 - Notes
INT255
Unit 1: Linear Algebra Foundations for Machine Learning
1. Vectors, Matrices, and Tensors in Machine Learning
Linear algebra is the study of vectors, matrices, and their transformations. In machine learning, these are the fundamental data structures used to represent and process data.
1.1. Vectors
- Definition: A vector is an ordered, one-dimensional array of numbers. It can be interpreted geometrically as a point in space or a directed line segment from the origin to that point.
- Notation: Typically denoted by a lowercase, bold letter (e.g., v). A vector with n elements is said to be in an n-dimensional space, written v ∈ ℝⁿ.
- Column Vector: v = [v₁, v₂, ..., vₙ]ᵀ (an n × 1 array)
- Row Vector: v = [v₁, v₂, ..., vₙ] (a 1 × n array)
- Machine Learning Context:
- Feature Vectors: The most common use. A single data point (an observation or sample) is represented as a vector. Each element of the vector corresponds to a feature.
- Example: A model predicting house prices might represent a single house with a feature vector:
[area_sq_ft, num_bedrooms, age_years, distance_to_city_center].
- Model Parameters: Weights in models like linear regression or neural networks are stored in vectors.
- Example: In the linear regression model ŷ = wᵀx + b, the weights w form a vector.
- Word Embeddings: In Natural Language Processing (NLP), words are represented as dense vectors in a high-dimensional space (e.g., Word2Vec, GloVe). Words with similar meanings are located close to each other in this vector space.
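The feature-vector and weight-vector ideas above can be sketched in NumPy. The feature names follow the house-price example; all numeric values (weights, bias) are made up for illustration:

```python
import numpy as np

# Hypothetical feature vector for one house:
# [area_sq_ft, num_bedrooms, age_years, distance_to_city_center]
house = np.array([1500.0, 3.0, 20.0, 4.2])

# Hypothetical weight vector and bias for a linear price model
w = np.array([120.0, 5000.0, -300.0, -1500.0])
b = 10000.0

# Linear regression prediction: y_hat = w . x + b
y_hat = np.dot(w, house) + b
print(y_hat)
```

The dot product of two vectors of the same dimension is the core operation here; every prediction of a linear model is one such product plus a bias.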
1.2. Matrices
- Definition: A matrix is a two-dimensional array of numbers arranged in rows and columns.
- Notation: Denoted by an uppercase, bold letter (e.g., A). A matrix with m rows and n columns is an m × n matrix, written A ∈ ℝᵐˣⁿ.
- Machine Learning Context:
- Datasets: An entire dataset can be represented as a design matrix X, where each row is a feature vector for a single data point, and each column represents a single feature across all data points.
- Weight Matrices in Neural Networks: In a neural network, the weights connecting one layer of neurons to the next are stored in a matrix. If a layer has n inputs and m outputs, the weights form an m × n matrix.
- Image Representation: A grayscale image is naturally a matrix, where each element represents the intensity of a pixel.
- Covariance Matrix: A square matrix that describes the pairwise covariance between different features in a dataset. It is fundamental to algorithms like Principal Component Analysis (PCA).
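A minimal sketch of the design-matrix and covariance-matrix ideas, with a hypothetical dataset of 4 samples and 2 features:

```python
import numpy as np

# Hypothetical design matrix X: 4 samples (rows) x 2 features (columns)
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 6.2],
    [4.0, 7.9],
])

# Covariance matrix of the features (rowvar=False: columns are variables)
C = np.cov(X, rowvar=False)
print(C.shape)  # (2, 2): pairwise covariances between the 2 features
```

The covariance matrix is always square and symmetric; PCA (discussed later) works by eigendecomposing exactly this matrix.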
1.3. Tensors
- Definition: A tensor is a generalization of vectors and matrices to an arbitrary number of dimensions (also called axes).
- A scalar is a rank-0 tensor.
- A vector is a rank-1 tensor.
- A matrix is a rank-2 tensor.
- Notation: Denoted by a bold, calligraphic, or sans-serif uppercase letter (e.g., T). The shape of a tensor describes the size of each of its dimensions.
- Machine Learning Context:
- Core Data Structure in Deep Learning: Frameworks like TensorFlow and PyTorch are built around tensor operations.
- Color Images: A color image is represented as a rank-3 tensor with shape (height, width, channels), where the channels are typically Red, Green, and Blue (RGB).
- Batches of Data: When training models, data is often processed in batches. A batch of data is represented by a tensor with an additional "batch" dimension.
- A batch of 10 color images of size 256×256 would be a rank-4 tensor of shape (10, 256, 256, 3).
- A batch of 32 sentences, each with 20 words represented by 100-dimensional embeddings, would be a rank-3 tensor of shape (32, 20, 100).
- Convolutional Neural Network (CNN) Kernels: The filters or kernels used in CNNs are rank-4 tensors with shape (output_channels, input_channels, kernel_height, kernel_width).
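The tensor shapes above can be checked directly in NumPy (used here as a stand-in for TensorFlow or PyTorch tensors):

```python
import numpy as np

# A batch of 10 color images, each 256x256 with 3 RGB channels:
# a rank-4 tensor of shape (batch, height, width, channels)
images = np.zeros((10, 256, 256, 3))
print(images.ndim)   # 4: the rank is the number of dimensions

# A batch of 32 sentences, 20 words each, 100-dim embeddings:
# a rank-3 tensor of shape (batch, sequence_length, embedding_dim)
sentences = np.zeros((32, 20, 100))
print(sentences.shape)
```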
2. Vector Spaces and Subspaces
2.1. Vector Spaces
- Definition: A vector space is a set of vectors V along with a field of scalars F (usually the real numbers ℝ) that is closed under two operations: vector addition and scalar multiplication. This means that for any vectors u, v in V and any scalar c in F:
- u + v is also in V.
- cu is also in V.
- These operations must satisfy a set of axioms (associativity, commutativity, existence of identity and inverse elements, etc.), which essentially ensure they behave as expected.
- ML Relevance: The set of all possible feature vectors of a given dimension, ℝⁿ, forms a vector space. This provides a formal geometric framework for understanding data. Concepts like distance, similarity, and transformation are all defined within this space.
2.2. Linear Independence, Span, and Basis
- Linear Combination: A vector v is a linear combination of a set of vectors {u₁, u₂, ..., uₖ} if it can be expressed as a weighted sum of those vectors: v = c₁u₁ + c₂u₂ + ... + cₖuₖ, where c₁, ..., cₖ are scalars.
- Span: The span of a set of vectors is the set of all possible linear combinations of those vectors. The span of a set of vectors always forms a vector space (or a subspace of the original space).
- Linear Independence: A set of vectors {u₁, u₂, ..., uₖ} is linearly independent if no vector in the set can be written as a linear combination of the others. Formally, the only solution to the equation c₁u₁ + c₂u₂ + ... + cₖuₖ = 0 is c₁ = c₂ = ... = cₖ = 0.
- Basis and Dimension:
- A basis of a vector space is a set of linearly independent vectors that span the entire space.
- The dimension of a vector space is the number of vectors in its basis. The standard basis for ℝⁿ is the set of vectors where one element is 1 and all others are 0 (e.g., [1,0,0], [0,1,0], [0,0,1] for ℝ³).
- ML Relevance:
- Feature Redundancy: If one feature in a dataset can be expressed as a linear combination of others, the features are linearly dependent. This redundant feature adds no new information to a linear model.
- Dimensionality Reduction (PCA): PCA finds a new basis for the data. This new basis is chosen such that the first few basis vectors (the principal components) capture the maximum variance in the data. By representing the data using only these first few basis vectors, we achieve a lower-dimensional representation.
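Linear dependence among features can be detected numerically via matrix rank. A small sketch with made-up columns, where the third column is deliberately constructed as 2·(column 0) + (column 1):

```python
import numpy as np

# Columns of A are three feature vectors; the third is 2*col0 + col1,
# so the set is linearly dependent (a redundant feature)
A = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 3.0],
])

rank = np.linalg.matrix_rank(A)
print(rank)  # 2, which is less than the 3 columns: dependence
```

A rank lower than the number of columns means at least one column adds no new direction, which is exactly the feature-redundancy situation described above.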
2.3. Subspaces
- Definition: A subspace is a subset of a vector space that is itself a vector space. To verify if a subset S is a subspace, one must check three conditions:
- It contains the zero vector (0 ∈ S).
- It is closed under vector addition (if u, v ∈ S, then u + v ∈ S).
- It is closed under scalar multiplication (if u ∈ S and c is a scalar, then cu ∈ S).
- ML Relevance:
- Manifold Hypothesis: Many complex, high-dimensional datasets in machine learning do not fill the entire ambient space. Instead, they lie on or near a lower-dimensional structure called a manifold. A subspace is the simplest form of a manifold (it's flat). PCA works by finding the best-fit linear subspace for the data.
- Solution Space for Linear Regression: The predicted values in a linear regression model, ŷ = Xβ, must lie in the subspace spanned by the columns of the feature matrix X (the "column space"). The OLS algorithm finds the vector in this subspace that is closest to the actual target vector y.
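The column-space picture can be verified numerically: after an ordinary least squares fit, the residual y − ŷ is orthogonal to every column of X. A minimal sketch with an invented 3-sample dataset:

```python
import numpy as np

# Hypothetical design matrix with an intercept column and one feature
X = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 2.0],
])
y = np.array([0.0, 1.0, 3.0])

# OLS fit: y_hat = X @ beta lies in the column space of X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# The residual is orthogonal to the column space: X^T (y - y_hat) = 0
residual = y - y_hat
print(X.T @ residual)
```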
3. Norms and Projections
3.1. Norms
A norm is a function that assigns a non-negative length or size to a vector. It formalizes the intuitive notion of distance. A norm must satisfy:
- ||v|| ≥ 0 (Non-negativity)
- ||v|| = 0 if and only if v = 0
- ||cv|| = |c| · ||v|| for any scalar c (Absolute homogeneity)
- ||u + v|| ≤ ||u|| + ||v|| (Triangle inequality)
The most common norms in machine learning are from the family of Lᵖ norms, defined as:
||v||ₚ = (|v₁|ᵖ + |v₂|ᵖ + ... + |vₙ|ᵖ)^(1/p)
3.1.1. L2 Norm (Euclidean Norm)
- Formula: ||v||₂ = √(v₁² + v₂² + ... + vₙ²). This is the standard Euclidean distance from the origin.
- Geometric Interpretation: A circle (in 2D), sphere (in 3D), or hypersphere represents the set of all vectors with a constant L2 norm.
- ML Relevance:
- Loss Functions: The Mean Squared Error (MSE) loss function is derived from the squared L2 norm of the error vector: MSE = (1/n)||y − ŷ||₂². It heavily penalizes large errors.
- L2 Regularization (Ridge Regression / Weight Decay): Adds a penalty to the loss function proportional to the squared L2 norm of the model's weight vector:
Loss = MSE + λ||w||₂². This encourages the model to learn smaller, more diffuse weights, which helps prevent overfitting by making the model less sensitive to individual data points.
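The ridge objective above can be written out directly. This is a minimal sketch of the loss computation only (not the optimizer); `ridge_loss` and all numeric values are illustrative:

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Hypothetical objective: MSE plus an L2 penalty on the weights."""
    residual = y - X @ w
    mse = np.mean(residual ** 2)
    return mse + lam * np.sum(w ** 2)   # lam * ||w||_2^2

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, 0.1])

# Increasing lambda raises the cost of large weights,
# pushing the optimum toward smaller, more diffuse weight vectors
print(ridge_loss(w, X, y, lam=0.0))
print(ridge_loss(w, X, y, lam=1.0))
```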
3.1.2. L1 Norm (Manhattan Norm)
- Formula: ||v||₁ = |v₁| + |v₂| + ... + |vₙ|, the sum of the absolute values of the vector components.
- Geometric Interpretation: A diamond (in 2D) or cross-polytope represents the set of all vectors with a constant L1 norm.
- ML Relevance:
- Loss Functions: The Mean Absolute Error (MAE) loss function is the L1 norm of the error vector, divided by the number of samples. It is less sensitive to outliers than MSE.
- L1 Regularization (Lasso Regression): Adds a penalty proportional to the L1 norm of the weight vector:
Loss = MSE + λ||w||₁. The "sharp corners" of the L1 norm's geometric shape make it likely that the optimal solution will lie on an axis, forcing some weights to be exactly zero. This induces sparsity and effectively performs automatic feature selection.
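The sparsity-inducing behavior of the L1 norm can be seen by comparing a dense and a sparse weight vector (values chosen for illustration): both have the same L2 norm, but the sparse one has a strictly smaller L1 norm, so an L1 penalty prefers it.

```python
import numpy as np

# A dense weight vector vs a sparse one with the same L2 "size"
w_dense = np.array([0.5, 0.5, 0.5, 0.5])
w_sparse = np.array([1.0, 0.0, 0.0, 0.0])

print(np.linalg.norm(w_dense, 2), np.linalg.norm(w_sparse, 2))  # 1.0 1.0
print(np.linalg.norm(w_dense, 1), np.linalg.norm(w_sparse, 1))  # 2.0 1.0
```

Under an L2 penalty the two vectors cost the same; under an L1 penalty the sparse vector is cheaper, which is the geometric "sharp corners" argument in numerical form.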
3.2. Projections
- Definition: The projection of a vector v onto a vector u finds the vector in the direction of u that is closest to v. This is like finding the "shadow" of v cast on the line defined by u.
- Formula (Vector Projection): The projection of vector v onto vector u is given by:
projᵤ(v) = ((v · u) / (u · u)) u
- The term in parentheses is a scalar value.
- The term v · u is the dot product.
- ML Relevance:
- Principal Component Analysis (PCA): After identifying the principal components (which form a new basis), the original data is projected onto the subspace spanned by the first k principal components. This projection is the new, lower-dimensional representation of the data.
- Ordinary Least Squares (OLS) Regression: The goal of OLS is to find the coefficients β that minimize the squared error ||y − Xβ||₂². The solution, ŷ = Xβ̂, is geometrically the orthogonal projection of the true target vector y onto the subspace spanned by the columns of the feature matrix X.
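The projection formula translates directly into code. A minimal sketch (the helper name `project` is ours, not a library function):

```python
import numpy as np

def project(v, u):
    """Projection of v onto the line spanned by u: ((v.u)/(u.u)) * u."""
    return (np.dot(v, u) / np.dot(u, u)) * u

v = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])

p = project(v, u)
print(p)                 # [3. 0.]: the "shadow" of v on the x-axis
print(np.dot(v - p, u))  # 0.0: the residual is orthogonal to u
```

The orthogonality of the residual v − p to u is the same geometric fact that makes ŷ the closest point to y in OLS.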
4. Linear Operators and Transformations in ML
4.1. Linear Transformations
- Definition: A transformation (or mapping) T is a function that takes a vector as input and produces another vector as output. T is a linear transformation if it satisfies two properties for any vectors u, v and any scalar c:
- T(u + v) = T(u) + T(v) (Additivity)
- T(cu) = cT(u) (Homogeneity)
- Matrix Representation: Any linear transformation T from an n-dimensional space to an m-dimensional space can be represented by multiplication with an m × n matrix A.
- ML Relevance:
- Neural Network Layers: A fully connected layer in a neural network performs a linear transformation on its input, followed by a non-linear activation. The equation for this transformation is z = Wx + b, where W is the weight matrix representing the linear transformation.
- Data Preprocessing: Operations like scaling, rotation, and shearing data are all linear transformations that can be represented by matrices.
- Change of Basis: A linear transformation can be viewed as changing the coordinate system. PCA finds a transformation matrix that rotates the data into a new basis (the principal components) where the axes align with the directions of maximum variance.
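Both linearity properties can be checked numerically for any matrix. As a sketch, a 2D rotation by 90 degrees (a standard rotation matrix; the test vectors are arbitrary):

```python
import numpy as np

# Rotation by 90 degrees as a matrix A; the transformation is T(x) = A @ x
theta = np.pi / 2
A = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta),  np.cos(theta)],
])

u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])
c = 3.0

# Additivity and homogeneity hold for matrix-vector multiplication
assert np.allclose(A @ (u + v), A @ u + A @ v)
assert np.allclose(A @ (c * u), c * (A @ u))

print(A @ u)  # ~ [0, 1]: the x-axis basis vector rotated onto the y-axis
```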
4.2. Eigenvalues and Eigenvectors
- Definition: For a square matrix A, an eigenvector v is a non-zero vector that, when transformed by A, only changes in scale, not in direction: Av = λv. The scalar scaling factor λ is the eigenvalue.
- Intuition: Eigenvectors are the "axes of transformation" for the matrix A. They are the directions that remain unchanged (up to scaling) by the transformation. The eigenvalue tells you how much the eigenvector is stretched or compressed.
- λ > 1: Stretching
- 0 < λ < 1: Compression
- λ < 0: Stretching/compression with a direction reversal
- ML Relevance:
- Principal Component Analysis (PCA): This is the canonical example. The principal components of a dataset are the eigenvectors of its covariance matrix. The eigenvalue corresponding to each eigenvector represents the amount of variance in the data along that principal component's direction. To perform dimensionality reduction, we keep the eigenvectors with the largest eigenvalues, as they capture the most information (variance).
- Spectral Clustering: This clustering technique uses the eigenvectors of a graph's Laplacian matrix to embed the data points into a lower-dimensional space where they are more easily separated by traditional clustering algorithms like K-Means.
- PageRank Algorithm: Google's original algorithm for ranking web pages can be formulated as an eigenvector problem. The PageRank vector is the principal eigenvector of a modified adjacency matrix of the web graph.
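The PCA connection can be sketched end to end: generate toy 2D data whose variance is concentrated along the y = x direction, eigendecompose its covariance matrix, and check that the top eigenvector recovers that direction. The data and seed are invented for illustration:

```python
import numpy as np

# Toy 2D data with most variance along the y = x direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

# PCA: the principal components are the eigenvectors of the covariance matrix
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric

# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the direction of maximum variance (the first principal component)
top = eigvecs[:, -1]
print(top)  # roughly [0.707, 0.707] up to sign: the y = x direction
```

Keeping only the projection of the data onto `top` would give a 1D representation that preserves almost all of the variance, which is dimensionality reduction in its simplest form.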