Unit1 - Subjective Questions
INT255 • Practice Questions with Detailed Answers
Define a vector, a matrix, and a tensor in the context of machine learning. Provide a brief example application of each.
In machine learning, these fundamental structures are used to organize and represent data:
- Vector: A vector is a one-dimensional array of numbers, representing a single feature or a collection of features for a data point. It has one dimension (rank 1).
- Example: A single data sample, such as the features of a house (area, number of bedrooms, age), can be represented as a vector like $\mathbf{x} = (120, 3, 10)$.
- Matrix: A matrix is a 2D array of numbers, typically used to represent a dataset where rows are data samples and columns are features, or to represent transformations.
- Example: A dataset of multiple houses, where each row is a house and each column is a feature, would be a matrix. For instance, a $100 \times 3$ matrix for 100 houses with 3 features.
- Tensor: A tensor is a generalization of scalars (rank 0), vectors (rank 1), and matrices (rank 2) to an arbitrary number of dimensions (rank $n$). Tensors are crucial in deep learning.
- Example: An RGB image is a rank-3 tensor with dimensions (height, width, color channels), e.g., $224 \times 224 \times 3$. A batch of images would be a rank-4 tensor with dimensions (batch_size, height, width, color channels), e.g., $32 \times 224 \times 224 \times 3$.
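The three structures above can be sketched in NumPy, where a tensor's rank corresponds to `ndim`. The specific shapes (a 3-feature house vector, 224×224 images, a batch of 32) are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

scalar = np.array(3.5)                  # rank 0: a single number
house = np.array([120.0, 3.0, 10.0])    # rank 1: (area, bedrooms, age)
dataset = np.zeros((100, 3))            # rank 2: 100 houses x 3 features
image = np.zeros((224, 224, 3))         # rank 3: one RGB image
batch = np.zeros((32, 224, 224, 3))     # rank 4: a batch of RGB images

# ndim reports the rank of each array
print(scalar.ndim, house.ndim, dataset.ndim, image.ndim, batch.ndim)
```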
Explain how data is typically represented using vectors and matrices in machine learning. Provide a concrete example involving a tabular dataset.
In machine learning, data is fundamentally structured using vectors and matrices:
- Vectors for Single Samples: Each individual data point or observation is commonly represented as a vector. If a data point has $n$ features, it can be represented as an $n$-dimensional vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$, where $x_i$ is the value of the $i$-th feature.
- Matrices for Datasets: A collection of multiple data points forms a dataset, which is typically represented as a matrix. If we have $m$ data points, each with $n$ features, the dataset can be organized into an $m \times n$ matrix $X$ where:
- Each row of $X$ corresponds to a single data sample (a vector).
- Each column of $X$ corresponds to a specific feature across all data samples.
Concrete Example (Tabular Dataset):
Consider a dataset of customer information for a marketing campaign, with features like 'Age', 'Income', and 'Number of Purchases'.
| Customer ID | Age | Income ($) | Number of Purchases |
|---|---|---|---|
| 001 | 30 | 50000 | 5 |
| 002 | 45 | 75000 | 12 |
| 003 | 22 | 30000 | 2 |
This tabular data can be represented as a $3 \times 3$ matrix $X$ (dropping the ID column):
$$X = \begin{pmatrix} 30 & 50000 & 5 \\ 45 & 75000 & 12 \\ 22 & 30000 & 2 \end{pmatrix}$$
Here:
- The first row, $(30, 50000, 5)$, is a vector representing Customer 001.
- The second column, $(50000, 75000, 30000)^T$, is a vector representing the 'Income' feature for all customers.
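The customer table maps directly onto a NumPy array; rows index samples and columns index features, so slicing recovers either view:

```python
import numpy as np

# The customer table as a 3x3 matrix X (rows = customers, columns = features:
# Age, Income, Number of Purchases).
X = np.array([
    [30, 50000,  5],   # Customer 001
    [45, 75000, 12],   # Customer 002
    [22, 30000,  2],   # Customer 003
])

customer_001 = X[0, :]   # row vector: one data sample
income = X[:, 1]         # column vector: 'Income' across all customers
```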
What is a vector space? List the ten axioms (properties) that a set must satisfy to be considered a vector space over a field of scalars.
A vector space (also called a linear space) is a collection of objects called vectors, which can be added together and multiplied (scaled) by numbers, called scalars. The scalars typically come from the field of real numbers $\mathbb{R}$ or complex numbers $\mathbb{C}$. These operations must satisfy ten specific axioms to ensure a consistent algebraic structure.
Let $V$ be a set of vectors and $F$ be a field of scalars (e.g., $\mathbb{R}$).
Axioms for Vector Addition (Closure under addition, associativity, commutativity, identity, inverse):
- Closure under Addition: For any $\mathbf{u}, \mathbf{v} \in V$, their sum $\mathbf{u} + \mathbf{v}$ is also in $V$.
- Commutativity of Addition: For any $\mathbf{u}, \mathbf{v} \in V$, $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$.
- Associativity of Addition: For any $\mathbf{u}, \mathbf{v}, \mathbf{w} \in V$, $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$.
- Additive Identity (Zero Vector): There exists a unique zero vector $\mathbf{0} \in V$ such that for any $\mathbf{v} \in V$, $\mathbf{v} + \mathbf{0} = \mathbf{v}$.
- Additive Inverse: For every $\mathbf{v} \in V$, there exists a unique vector $-\mathbf{v} \in V$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$.
Axioms for Scalar Multiplication (Closure under scalar multiplication, associativity, identity):
- Closure under Scalar Multiplication: For any scalar $c \in F$ and any $\mathbf{v} \in V$, the product $c\mathbf{v}$ is also in $V$.
- Associativity of Scalar Multiplication: For any scalars $a, b \in F$ and any $\mathbf{v} \in V$, $a(b\mathbf{v}) = (ab)\mathbf{v}$.
- Multiplicative Identity (Scalar Unit): For the multiplicative identity $1 \in F$ and any $\mathbf{v} \in V$, $1\mathbf{v} = \mathbf{v}$.
Distributive Axioms (Connecting addition and scalar multiplication):
- Distributivity over Vector Addition: For any scalar $c \in F$ and any $\mathbf{u}, \mathbf{v} \in V$, $c(\mathbf{u} + \mathbf{v}) = c\mathbf{u} + c\mathbf{v}$.
- Distributivity over Scalar Addition: For any scalars $a, b \in F$ and any $\mathbf{v} \in V$, $(a + b)\mathbf{v} = a\mathbf{v} + b\mathbf{v}$.
These axioms ensure that the fundamental operations behave predictably, mirroring the properties of vector arithmetic in Euclidean space.
Explain what a subspace is. Provide an example of a subspace of $\mathbb{R}^3$ that is neither $\mathbb{R}^3$ itself nor the zero subspace $\{\mathbf{0}\}$.
A subspace $W$ of a vector space $V$ is a subset of $V$ that is itself a vector space under the same operations of vector addition and scalar multiplication defined on $V$. To prove a subset $W$ is a subspace, we only need to check three conditions:
- Contains the Zero Vector: The zero vector of $V$, $\mathbf{0}$, must be in $W$.
- Closed under Vector Addition: If $\mathbf{u}, \mathbf{v} \in W$, then $\mathbf{u} + \mathbf{v} \in W$.
- Closed under Scalar Multiplication: If $\mathbf{v} \in W$ and $c$ is any scalar, then $c\mathbf{v} \in W$.
Example of a Subspace of $\mathbb{R}^3$:
Consider the set $W$ of all vectors in $\mathbb{R}^3$ whose third component is zero:
$$W = \{(x, y, 0) : x, y \in \mathbb{R}\}$$
Geometrically, this represents the $xy$-plane in $\mathbb{R}^3$.
Let's check the three conditions:
- Contains Zero Vector: The zero vector $(0, 0, 0) \in W$ since its third component is 0. (Condition met)
- Closed under Vector Addition: Let $\mathbf{u} = (u_1, u_2, 0) \in W$ and $\mathbf{v} = (v_1, v_2, 0) \in W$.
Then $\mathbf{u} + \mathbf{v} = (u_1 + v_1, u_2 + v_2, 0)$.
Since the third component is still 0, $\mathbf{u} + \mathbf{v} \in W$. (Condition met)
- Closed under Scalar Multiplication: Let $\mathbf{u} = (u_1, u_2, 0) \in W$ and $c \in \mathbb{R}$.
Then $c\mathbf{u} = (cu_1, cu_2, 0)$.
Since the third component is still 0, $c\mathbf{u} \in W$. (Condition met)
Since all three conditions are met, $W$ (the $xy$-plane) is a subspace of $\mathbb{R}^3$.
Compare and contrast the L1 norm (Manhattan norm) and the L2 norm (Euclidean norm) of a vector. Include their mathematical definitions and typical applications in machine learning.
The L1 and L2 norms are two of the most common ways to measure the 'size' or 'magnitude' of a vector. They differ in their mathematical definition and their implications for machine learning.
1. L1 Norm (Manhattan Norm / Taxicab Norm):
- Mathematical Definition: For a vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$, the L1 norm is defined as the sum of the absolute values of its components:
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$$
- Geometric Interpretation: Represents the sum of distances along each axis. Imagine navigating a city grid (like Manhattan); it's the total distance traveled along the streets.
- Properties: Well suited to sparse settings; its penalty grows linearly, so each unit of error contributes the same marginal penalty regardless of the error's magnitude.
- Typical Applications in ML:
- L1 Regularization (Lasso Regression): Encourages sparsity in model coefficients, effectively performing feature selection by driving some coefficients exactly to zero. This is useful for building simpler, more interpretable models.
- Robustness to Outliers: Less sensitive to outliers compared to L2, as it doesn't heavily penalize large errors quadratically.
2. L2 Norm (Euclidean Norm):
- Mathematical Definition: For a vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$, the L2 norm is defined as the square root of the sum of the squares of its components:
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
- Geometric Interpretation: Represents the straight-line (Euclidean) distance from the origin to the point represented by the vector in an $n$-dimensional space.
- Properties: Smooth and differentiable; heavily penalizes large errors. The squared L2 norm ($\|\mathbf{x}\|_2^2$) is often used because it simplifies calculations by removing the square root.
- Typical Applications in ML:
- L2 Regularization (Ridge Regression): Prevents overfitting by penalizing large model coefficients, shrinking them towards zero but rarely making them exactly zero. It helps to reduce variance and improve generalization.
- Error Measurement: Used as the basis for Mean Squared Error (MSE), a common loss function in regression tasks.
- Distance Metric: Often used as the standard distance metric between two vectors (Euclidean distance).
- Vector Normalization: Used to normalize vectors to unit length, making them comparable regardless of magnitude.
Comparison Summary:
| Feature | L1 Norm ($\|\cdot\|_1$) | L2 Norm ($\|\cdot\|_2$) |
|---|---|---|
| Definition | Sum of absolute values | Square root of sum of squares |
| Geometric Path | Manhattan (taxicab) distance | Euclidean (straight-line) distance |
| Sparsity | Promotes sparsity (feature selection) | Does not promote sparsity |
| Outlier Effect | Less sensitive | More sensitive (heavy penalty) |
| Regularization | Lasso Regression | Ridge Regression |
| Differentiability | Not differentiable at 0 | Differentiable (squared L2) |
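The two definitions can be checked numerically; a quick sketch with the classic $(3, -4)$ vector (an illustrative choice), comparing hand-computed sums against `np.linalg.norm`:

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = np.abs(x).sum()           # |3| + |-4| = 7
l2 = np.sqrt((x ** 2).sum())   # sqrt(9 + 16) = 5

# np.linalg.norm computes the same quantities via the `ord` parameter.
assert np.isclose(l1, np.linalg.norm(x, ord=1))
assert np.isclose(l2, np.linalg.norm(x, ord=2))
print(l1, l2)  # 7.0 5.0
```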
Define linear independence of a set of vectors. Why is linear independence a crucial concept in the context of vector spaces and basis formation?
A set of vectors $\{\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k\}$ in a vector space is said to be linearly independent if the only solution to the vector equation:
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k\mathbf{v}_k = \mathbf{0}$$
is the trivial solution, i.e., $c_1 = c_2 = \dots = c_k = 0$. If there is any non-trivial solution (where at least one $c_i \neq 0$), the vectors are linearly dependent.
In simpler terms, linearly independent vectors each contribute unique directional information; none of them can be expressed as a linear combination of the others.
Crucial Role in Vector Spaces and Basis Formation:
Linear independence is fundamental for several reasons:
- Basis Formation: A basis for a vector space $V$ is a set of vectors that are both:
- Linearly Independent: No vector in the set can be expressed as a linear combination of the others.
- Span $V$: Every vector in $V$ can be expressed as a linear combination of the vectors in the set.
Linear independence ensures that each basis vector is essential and non-redundant in spanning the space. Without it, the set would contain redundant vectors.
- Uniqueness of Representation: If a set of vectors forms a basis, then every vector in the space can be uniquely expressed as a linear combination of these basis vectors. This uniqueness is guaranteed by linear independence.
- Dimension of a Vector Space: The number of vectors in any basis for a given vector space is always the same. This number is called the dimension of the vector space. Linear independence is key to establishing this consistent count.
- Efficient Representation: In machine learning, a basis provides the most concise way to represent all possible data points within a vector space. Linearly dependent features introduce redundancy and can lead to issues like multicollinearity in models.
- Understanding Transformations: The properties of linear transformations (like injectivity) are often related to the linear independence of their column vectors (basis vectors after transformation).
In summary, linear independence ensures that the components of a vector space are non-redundant and that we have a minimal set of vectors capable of generating the entire space.
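One practical way to test linear independence numerically is to stack the vectors as columns of a matrix and compare its rank to the number of columns; the specific vectors below are an illustrative choice:

```python
import numpy as np

# Columns of A: (1,0,0), (0,1,0), (1,1,0). The third is the sum of the
# first two, so the set is linearly DEPENDENT and the rank is only 2.
A = np.column_stack([(1, 0, 0), (0, 1, 0), (1, 1, 0)])
print(np.linalg.matrix_rank(A))  # 2 < 3 columns -> dependent

# The standard basis vectors of R^3 are linearly independent: rank 3.
B = np.eye(3)
print(np.linalg.matrix_rank(B))  # 3
```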
Explain the concept of orthogonal projection of a vector onto another vector. Provide the formula and briefly describe its utility in machine learning.
The orthogonal projection of a vector $\mathbf{a}$ onto another non-zero vector $\mathbf{b}$ is the component of $\mathbf{a}$ that lies in the direction of $\mathbf{b}$. It's essentially the 'shadow' of $\mathbf{a}$ cast onto the line defined by $\mathbf{b}$.
Let $\text{proj}_{\mathbf{b}}\mathbf{a}$ denote the orthogonal projection of $\mathbf{a}$ onto $\mathbf{b}$.
Formula:
$$\text{proj}_{\mathbf{b}}\mathbf{a} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{b}\|_2^2}\,\mathbf{b}$$
where:
- $\mathbf{a} \cdot \mathbf{b}$ is the dot product (or inner product) of $\mathbf{a}$ and $\mathbf{b}$.
- $\|\mathbf{b}\|_2^2$ is the squared L2 norm of $\mathbf{b}$.
Geometric Interpretation:
Imagine a vector $\mathbf{a}$ and a line through the origin in the direction of $\mathbf{b}$. The projection is the vector on that line that is closest to $\mathbf{a}$. The difference vector $\mathbf{a} - \text{proj}_{\mathbf{b}}\mathbf{a}$ is orthogonal (perpendicular) to $\mathbf{b}$.
Utility in Machine Learning:
- Least Squares Regression: The core idea behind least squares is to project the target variable vector onto the column space spanned by the feature vectors. The projected vector gives the best linear approximation of the target variable by the features, minimizing the squared error.
- Dimensionality Reduction (e.g., PCA): Principal Component Analysis (PCA) identifies principal components (directions of maximum variance) and then projects the data onto these lower-dimensional subspaces. This projection retains the most significant information while reducing noise and computational cost.
- Orthogonalization: Techniques like Gram-Schmidt process use projections to create orthogonal bases, which are useful for simplifying many linear algebra problems.
- Signal Processing: Used to extract components of a signal that are correlated with a known basis signal.
In essence, orthogonal projection allows us to decompose a vector into components parallel and perpendicular to another vector or subspace, which is fundamental for understanding relationships and reducing complexity in data.
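The projection formula translates directly into a few lines of NumPy; the helper name `project` and the sample vectors are illustrative:

```python
import numpy as np

def project(a, b):
    """Orthogonal projection of a onto the (non-zero) vector b."""
    return (a @ b) / (b @ b) * b

a = np.array([2.0, 3.0])
b = np.array([1.0, 0.0])   # direction of the x-axis

p = project(a, b)          # component of a along b
residual = a - p           # component of a perpendicular to b

# The residual is orthogonal to b, as the geometry predicts.
assert np.isclose(residual @ b, 0.0)
```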
Define a linear transformation (or linear operator). List the two key properties it must satisfy and provide a simple example of a linear transformation in $\mathbb{R}^2$.
A linear transformation (also known as a linear operator or linear map) $T: V \to W$ is a function between two vector spaces $V$ and $W$ (over the same field of scalars) that preserves the operations of vector addition and scalar multiplication.
Two Key Properties:
For all vectors $\mathbf{u}, \mathbf{v} \in V$ and all scalars $c$ in the field:
- Additivity (Preservation of Vector Addition): $T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v})$
This means that the transformation of a sum of vectors is equal to the sum of their individual transformations.
- Homogeneity of Degree 1 (Preservation of Scalar Multiplication): $T(c\mathbf{v}) = cT(\mathbf{v})$
This means that the transformation of a scaled vector is equal to the scaled transformation of the vector.
Simple Example in $\mathbb{R}^2$ (Rotation):
Consider a transformation $T$ that rotates a vector by $90°$ counter-clockwise. This transformation can be represented by the matrix:
$$R = \begin{pmatrix} \cos 90° & -\sin 90° \\ \sin 90° & \cos 90° \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
So, $T(\mathbf{v}) = R\mathbf{v}$.
Let $\mathbf{u}$ and $\mathbf{v}$ be two vectors in $\mathbb{R}^2$, and $c$ be a scalar.
- Additivity Check: $R(\mathbf{u} + \mathbf{v}) = R\mathbf{u} + R\mathbf{v}$ (matrix multiplication is distributive over vector addition).
- Homogeneity Check: $R(c\mathbf{u}) = c(R\mathbf{u})$ (scalar multiplication commutes with matrix multiplication).
Since both properties hold, a counter-clockwise rotation is a linear transformation. For instance, $T(1, 0) = (0, 1)$ and $T(0, 1) = (-1, 0)$.
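Both properties can be verified numerically for a rotation matrix; a small sketch assuming a 90° counter-clockwise rotation and arbitrary test vectors:

```python
import numpy as np

theta = np.pi / 2   # 90 degrees counter-clockwise
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
c = 3.0

# Additivity and homogeneity both hold for matrix-vector products.
assert np.allclose(R @ (u + v), R @ u + R @ v)
assert np.allclose(R @ (c * u), c * (R @ u))
print(R @ u)   # the x-axis unit vector rotates onto the y-axis
```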
Discuss the importance of eigenvalues and eigenvectors in understanding linear transformations and their applications in dimensionality reduction techniques like Principal Component Analysis (PCA).
Eigenvalues and eigenvectors are fundamental concepts that reveal the intrinsic properties of linear transformations. For a linear transformation represented by a square matrix $A$, an eigenvector $\mathbf{v}$ is a non-zero vector that, when $A$ acts upon it, only changes by a scalar factor. This scalar factor is called the eigenvalue $\lambda$.
Mathematically, this relationship is expressed as:
$$A\mathbf{v} = \lambda\mathbf{v}$$
Importance in Understanding Linear Transformations:
- Invariant Directions: Eigenvectors define the 'special' directions along which a linear transformation acts merely by stretching or shrinking, without any rotation or shear. The eigenvalue $\lambda$ indicates the scaling factor in that specific direction.
- Decomposition: Square matrices can often be decomposed into a set of their eigenvectors and eigenvalues, which simplifies many calculations involving powers of matrices or solving differential equations.
- Stability Analysis: In dynamic systems, eigenvalues determine the stability and behavior of the system over time.
Applications in Dimensionality Reduction (PCA):
Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that leverages eigenvalues and eigenvectors to transform high-dimensional data into a lower-dimensional space while retaining as much variance (information) as possible.
- Covariance Matrix: PCA starts by computing the covariance matrix of the dataset. This symmetric matrix captures the relationships and variances between different features.
- Eigen-Decomposition: The eigenvectors of the covariance matrix represent the principal components of the data. These principal components are orthogonal (uncorrelated) directions in the feature space.
- Variance Explained: The corresponding eigenvalues indicate the amount of variance explained by each principal component. A larger eigenvalue means the corresponding eigenvector (principal component) captures more variance in the data.
- Dimensionality Reduction: To reduce dimensionality, we select the top $k$ principal components (eigenvectors) corresponding to the largest eigenvalues. These components represent the directions along which the data varies the most.
- Data Projection: The original data is then projected onto the subspace spanned by these selected principal components, effectively reducing the number of dimensions while preserving the most significant information.
In summary, eigenvalues and eigenvectors allow us to identify the inherent structure and dominant directions of variation within a dataset, making them invaluable for tasks like dimensionality reduction where we seek to simplify data while retaining its core characteristics.
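The PCA steps above can be sketched with NumPy's eigendecomposition; the synthetic data (200 samples, 3 features with very unequal variances) and the choice of keeping 2 components are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 features with standard deviations roughly 3, 1, 0.1.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                # 1. center the data
C = np.cov(Xc, rowvar=False)           # 2. covariance matrix (3x3)
eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigendecomposition (C is symmetric)

order = np.argsort(eigvals)[::-1]      # 4. sort components by variance explained
top2 = eigvecs[:, order[:2]]           #    keep the top-2 principal components
Z = Xc @ top2                          # 5. project data onto the 2-D subspace
print(Z.shape)                         # reduced dataset: 200 samples, 2 dims
```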
Describe the Null Space (Kernel) and the Column Space (Image) of a matrix. Explain their significance in understanding the properties of a linear transformation.
For an $m \times n$ matrix $A$, which represents a linear transformation $T: \mathbb{R}^n \to \mathbb{R}^m$:
1. Null Space (Kernel) - $\text{Null}(A)$ or $\ker(A)$:
- Definition: The null space of matrix $A$ is the set of all vectors $\mathbf{x}$ that are mapped to the zero vector in $\mathbb{R}^m$ when multiplied by $A$. That is:
$$\text{Null}(A) = \{\mathbf{x} \in \mathbb{R}^n : A\mathbf{x} = \mathbf{0}\}$$
- The null space is a subspace of the domain $\mathbb{R}^n$.
- Significance:
- Uniqueness of Solutions: If the null space contains only the zero vector ($\text{Null}(A) = \{\mathbf{0}\}$), the linear transformation is injective (one-to-one). This implies that if $A\mathbf{x}_1 = A\mathbf{x}_2$, then $\mathbf{x}_1 = \mathbf{x}_2$. For a system $A\mathbf{x} = \mathbf{b}$, if a solution exists, it is unique.
- Redundancy: A non-trivial null space indicates that multiple input vectors can map to the same output vector. The dimension of the null space is called the nullity.
2. Column Space (Image) - $\text{Col}(A)$ or $\text{Im}(A)$:
- Definition: The column space of matrix $A$ is the set of all possible linear combinations of the column vectors of $A$. It is equivalent to the set of all vectors $\mathbf{b}$ for which the system $A\mathbf{x} = \mathbf{b}$ has at least one solution. That is:
$$\text{Col}(A) = \{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$$
- The column space is a subspace of the codomain $\mathbb{R}^m$.
- Significance:
- Existence of Solutions: The column space tells us which vectors $\mathbf{b}$ can actually be reached by the transformation $T$. A system $A\mathbf{x} = \mathbf{b}$ has a solution if and only if $\mathbf{b}$ is in the column space of $A$.
- Rank: The dimension of the column space is called the rank of the matrix $A$. The rank indicates the 'effective' dimension of the output space.
- Surjectivity: If the column space spans the entire codomain (i.e., $\text{rank}(A) = m$), then the linear transformation is surjective (onto).
Relationship (Rank-Nullity Theorem):
These two spaces are intimately related by the Rank-Nullity Theorem, which states that for an $m \times n$ matrix $A$:
$$\text{rank}(A) + \text{nullity}(A) = n$$
This theorem provides a powerful link between the output (column space) and input (null space) properties of a linear transformation.
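The theorem can be checked numerically on a small example; the matrix below (a 3×4 matrix whose last column is the sum of the first two) is an illustrative choice:

```python
import numpy as np

# A 3x4 matrix: the 4th column equals column 1 + column 2,
# so one direction in the domain is "collapsed" to zero.
A = np.array([[1., 0., 0., 1.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

rank = np.linalg.matrix_rank(A)   # dimension of the column space
n = A.shape[1]                    # number of columns (domain dimension)
nullity = n - rank                # Rank-Nullity: rank + nullity = n
print(rank, nullity)              # 3 1

# A non-zero null-space vector: x = (1, 1, 0, -1) satisfies Ax = 0.
assert np.allclose(A @ np.array([1., 1., 0., -1.]), 0.0)
```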
How can tensors be seen as a generalization of scalars, vectors, and matrices? Provide examples of where higher-order tensors are used in deep learning.
Tensors are mathematical objects that generalize scalars, vectors, and matrices to an arbitrary number of dimensions (or ranks).
- Scalar: A single number (e.g., 5, -3.14). It has zero dimensions, hence it's a rank-0 tensor.
- Vector: An ordered list of numbers (e.g., $(1, 2, 3)$). It has one dimension, hence it's a rank-1 tensor.
- Matrix: A 2D array of numbers (e.g., $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$). It has two dimensions, hence it's a rank-2 tensor.
- Tensor: Any array of numbers with three or more dimensions. For example, a 3D array of numbers is a rank-3 tensor, a 4D array is a rank-4 tensor, and so on. The 'rank' or 'order' of a tensor refers to the number of indices required to identify each element.
Examples of Higher-Order Tensors in Deep Learning:
- Image Data (Rank-3 or Rank-4):
- A single RGB image is commonly represented as a rank-3 tensor with dimensions (height, width, color channels), e.g., $224 \times 224 \times 3$. Each element corresponds to the intensity of a specific pixel in a specific color channel.
- A batch of RGB images used in deep learning models (e.g., convolutional neural networks) is typically a rank-4 tensor with dimensions (batch size, height, width, color channels), e.g., $32 \times 224 \times 224 \times 3$.
- Video Data (Rank-5):
- A video sequence can be represented as a rank-5 tensor: (batch size, number of frames, height, width, color channels), e.g., $8 \times 16 \times 224 \times 224 \times 3$.
- Recurrent Neural Networks (RNNs) and Sequence Data (Rank-3):
- Input to an RNN for natural language processing might be a sequence of word embeddings. If each word is represented by a vector of dimension $d$, a sequence of $L$ words in a batch of $N$ samples would be a rank-3 tensor: (batch size, sequence length, embedding dimension), e.g., $32 \times 50 \times 300$.
- Feature Maps in Convolutional Layers (Rank-4):
- Intermediate outputs in convolutional neural networks, often called 'feature maps', are rank-4 tensors (batch size, height, width, number of filters/channels). For example, after a convolutional layer, the output might be $32 \times 28 \times 28 \times 128$, representing 128 feature maps of size $28 \times 28$ for each of the 32 images in the batch.
Tensors provide a flexible and powerful way to handle complex, multi-dimensional data structures prevalent in modern AI and deep learning.
What is a basis for a vector space? How does it differ from a spanning set? Illustrate with an example in $\mathbb{R}^2$.
For a vector space $V$, both a spanning set and a basis are crucial concepts, but a basis has additional properties.
1. Spanning Set:
A set of vectors $S = \{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ in a vector space $V$ is called a spanning set (or generating set) for $V$ if every vector in $V$ can be expressed as a linear combination of the vectors in $S$. In other words, $\text{span}(S) = V$.
- Key Idea: It means that the vectors in $S$ are 'enough' to generate all vectors in $V$.
- Redundancy: A spanning set can contain linearly dependent vectors; some vectors might be redundant (can be formed from others).
2. Basis:
A set of vectors $B$ in a vector space $V$ is called a basis for $V$ if it satisfies two conditions:
- $B$ is a linearly independent set.
- $B$ spans $V$.
- Key Idea: A basis is a minimal spanning set. It contains just enough vectors to span the space, without any redundancy. This minimality means that every vector in $V$ has a unique representation as a linear combination of the basis vectors.
- The number of vectors in a basis is unique for a given vector space and is called its dimension.
Difference: The key difference is linear independence. A spanning set may have redundant vectors, while a basis is always linearly independent and thus non-redundant.
Example in $\mathbb{R}^2$:
Let $\mathbf{e}_1 = (1, 0)$ and $\mathbf{e}_2 = (0, 1)$.
- Basis Example: The set $\{\mathbf{e}_1, \mathbf{e}_2\}$ is a basis for $\mathbb{R}^2$ (the standard basis).
- Linearly Independent: $c_1\mathbf{e}_1 + c_2\mathbf{e}_2 = (c_1, c_2) = (0, 0)$ forces $c_1 = c_2 = 0$. So, it's linearly independent.
- Spans $\mathbb{R}^2$: Any vector $(x, y)$ in $\mathbb{R}^2$ can be written as $x\mathbf{e}_1 + y\mathbf{e}_2$, so it spans $\mathbb{R}^2$.
- Spanning Set Example (not a Basis):
Consider the set $S = \{\mathbf{e}_1, \mathbf{e}_2, \mathbf{v}\}$, where $\mathbf{v} = (1, 1)$.
- Spans $\mathbb{R}^2$: Yes, because $\mathbf{e}_1$ and $\mathbf{e}_2$ alone span $\mathbb{R}^2$. Adding $\mathbf{v}$ (which is $\mathbf{e}_1 + \mathbf{e}_2$) doesn't change the span.
- Linearly Independent: No. $\mathbf{e}_1 + \mathbf{e}_2 - \mathbf{v} = \mathbf{0}$. Since we found non-zero coefficients ($1, 1, -1$) that result in the zero vector, $S$ is linearly dependent. Thus, $S$ is a spanning set but not a basis.
Explain the geometric interpretation of the L2 norm and its relation to Euclidean distance. How is it applied in machine learning for tasks like classification?
The L2 norm (or Euclidean norm) of a vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$ is defined as:
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$$
Geometric Interpretation of L2 Norm:
Geometrically, the L2 norm of a vector represents the Euclidean distance from the origin to the point specified by the vector's coordinates in an $n$-dimensional Euclidean space. It's the 'straight-line' distance.
- For $n = 2$, $\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2}$, which is the length of the hypotenuse of a right triangle.
- For $n = 3$, $\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + x_3^2}$, the length of the diagonal of a rectangular prism.
Relation to Euclidean Distance:
The Euclidean distance between two vectors (or points) $\mathbf{u}$ and $\mathbf{v}$ is defined as the L2 norm of their difference:
$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}$$
So, the L2 norm directly computes the length of a vector, and this concept extends to measuring the straight-line distance between any two points in space.
Application in Machine Learning (e.g., K-Nearest Neighbors - KNN):
The L2 norm (or Euclidean distance) is widely used as a distance metric in machine learning, particularly in algorithms that rely on similarity or proximity:
- K-Nearest Neighbors (KNN): In a classification (or regression) task using KNN, to classify a new data point, the algorithm finds the $k$ training data points that are 'closest' to the new point. The 'closeness' is typically measured using Euclidean distance.
- For example, if we have a new customer and we want to predict if they will buy a product, KNN would find the $k$ most similar customers (based on features like age, income, etc.) using Euclidean distance. The new customer's class would then be determined by the majority class among those $k$ neighbors.
- Clustering (K-Means): K-Means clustering uses Euclidean distance to assign data points to the nearest cluster centroid and to update centroids (minimizing the sum of squared distances).
- Loss Functions: The squared L2 norm is often used in loss functions, such as Mean Squared Error (MSE), to quantify the difference between predicted and actual values in regression models. This is because minimizing MSE is equivalent to minimizing the Euclidean distance between the predicted and true values.
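The KNN idea above reduces to a few lines for $k = 1$; the training features, the hypothetical 'buys product' labels, and the helper name `nearest_label` are all illustrative assumptions:

```python
import numpy as np

# Toy training set: (age, income) features with made-up binary labels.
X_train = np.array([[30., 50000.], [45., 75000.], [22., 30000.]])
y_train = np.array([0, 1, 0])   # hypothetical 'buys product' labels

def nearest_label(x):
    """1-nearest-neighbour prediction using Euclidean (L2) distance."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to each sample
    return y_train[np.argmin(dists)]              # label of the closest one

# A query close to the second customer inherits that customer's label.
print(nearest_label(np.array([44., 74000.])))
```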
Discuss how L1 and L2 regularization (Lasso and Ridge regression) utilize these norms to prevent overfitting in machine learning models.
L1 and L2 regularization are techniques used in machine learning, particularly in linear models, to prevent overfitting by adding a penalty term to the loss function. This penalty term is based on the L1 or L2 norm of the model's coefficient vector.
1. L2 Regularization (Ridge Regression):
- Penalty Term: Adds a penalty proportional to the squared L2 norm of the coefficient vector $\mathbf{w}$ to the loss function:
$$\text{Loss}_{\text{Ridge}} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} w_j^2$$
where $\lambda \geq 0$ is the regularization strength.
- Effect:
- Shrinks Coefficients: It drives the coefficients towards zero, but rarely exactly to zero. All features tend to be retained but with smaller magnitudes.
- Reduces Variance: By penalizing large coefficients, Ridge regression reduces the model's complexity and sensitivity to noise in the training data, thereby reducing variance and improving generalization.
- Handles Multicollinearity: It's effective when there are highly correlated features, as it can distribute the impact across them rather than picking just one.
- Why it works: By penalizing the sum of squared coefficients, it prevents any single coefficient from becoming too large, which helps to stabilize the model and prevent it from fitting the training data too perfectly (overfitting).
2. L1 Regularization (Lasso Regression):
- Penalty Term: Adds a penalty proportional to the L1 norm of the coefficient vector $\mathbf{w}$ to the loss function:
$$\text{Loss}_{\text{Lasso}} = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |w_j|$$
where $\lambda \geq 0$ is the regularization strength.
- Effect:
- Promotes Sparsity (Feature Selection): It has the unique property of driving some coefficients exactly to zero. This effectively performs automatic feature selection, as features with zero coefficients are excluded from the model.
- Simpler Models: By selecting a subset of important features, Lasso creates simpler and more interpretable models.
- Handles High-Dimensional Data: Particularly useful when dealing with datasets with a very large number of features, many of which might be irrelevant.
- Why it works: The 'diamond-shaped' constraint region of the L1 norm often causes the optimal solution (where the contour of the original loss function touches the constraint boundary) to occur at the corners, which correspond to some coefficients being zero.
Summary:
| Feature | L1 Regularization (Lasso) | L2 Regularization (Ridge) |
|---|---|---|
| Norm Used | L1 norm (sum of absolute values) | L2 norm (sum of squares) |
| Coefficient Effect | Shrinks some coefficients to exactly zero (sparsity) | Shrinks coefficients towards zero |
| Feature Selection | Yes (automatic) | No (all features typically retained) |
| Model Complexity | Reduces complexity by selecting features | Reduces complexity by shrinking coefficients |
| Use Case | When feature selection is desired, or high-dimensional data | When all features are relevant, or multicollinearity is present |
Both techniques are vital tools for controlling model complexity and improving generalization performance, preventing models from memorizing the training data.
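The shrinking effect of the L2 penalty can be seen directly from the closed-form ridge solution $\mathbf{w} = (X^TX + \lambda I)^{-1}X^T\mathbf{y}$; the synthetic data and the helper name `ridge` are illustrative assumptions (Lasso has no closed form and is usually solved iteratively, e.g. by coordinate descent):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([2., 0., 0., -1., 0.])        # sparse "true" coefficients
y = X @ w_true + 0.1 * rng.normal(size=50)      # noisy targets

def ridge(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

w_ols = ridge(X, y, 0.0)     # lam = 0: ordinary least squares
w_reg = ridge(X, y, 100.0)   # strong L2 penalty shrinks the weights

# The penalized solution always has a smaller (or equal) L2 norm.
assert np.linalg.norm(w_reg) < np.linalg.norm(w_ols)
```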
Describe the effects of common linear transformations such as scaling, rotation, and reflection on vectors in . Provide the corresponding transformation matrices.
Linear transformations are operations that map vectors from one vector space to another, preserving vector addition and scalar multiplication. In , common linear transformations include scaling, rotation, and reflection, each represented by a specific matrix.
1. Scaling:
- Effect: Stretches or shrinks a vector along the coordinate axes. It can be uniform (same scaling factor in all directions) or non-uniform.
- Transformation Matrix: For scaling by a factor $s_x$ along the x-axis and $s_y$ along the y-axis:
$$S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}$$
- Example: If $s_x = 2$ and $s_y = 3$, then $S\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$. For uniform scaling, $s_x = s_y$.
2. Rotation:
- Effect: Rotates a vector around the origin by a certain angle (typically counter-clockwise).
- Transformation Matrix: For a counter-clockwise rotation by angle $\theta$:
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
- Example: Rotating by $90°$ counter-clockwise ($\cos 90° = 0$, $\sin 90° = 1$):
$$R_{90°} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$$
So, $R_{90°}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$.
3. Reflection:
- Effect: Flips a vector across a line (the axis of reflection). Common reflections are across the x-axis, y-axis, or the line .
- Transformation Matrices (Examples):
- Reflection across the x-axis: $\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ (flips the y-component)
- Reflection across the y-axis: $\begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}$ (flips the x-component)
- Reflection across the line $y = x$: $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ (swaps x and y components)
- Example: Reflecting $(2, 3)$ across the x-axis gives $(2, -3)$.
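All three transformations can be applied to the same test vector to compare their effects; the vector $(2, 3)$ and the scaling factors are illustrative choices:

```python
import numpy as np

v = np.array([2., 3.])

scale = np.array([[2., 0.], [0., 3.]])        # s_x = 2, s_y = 3
rot90 = np.array([[0., -1.], [1., 0.]])       # 90 deg counter-clockwise
reflect_x = np.array([[1., 0.], [0., -1.]])   # flip across the x-axis

print(scale @ v)      # stretched along each axis
print(rot90 @ v)      # rotated a quarter-turn
print(reflect_x @ v)  # y-component sign flipped
```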
Define the general concept of a vector norm. Explain its purpose and list three properties that any valid vector norm must satisfy.
A vector norm is a function $\|\cdot\|: V \to \mathbb{R}$ that assigns a strictly positive 'length' or 'magnitude' to each non-zero vector in a vector space, and assigns zero to the zero vector. It provides a way to quantify the 'size' of a vector, generalizing the intuitive notion of length in Euclidean space.
Purpose:
- Measuring Magnitude: To quantify the 'length' or 'size' of a vector, regardless of its direction.
- Measuring Distance: To define a distance metric between two vectors (e.g., $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$), which is crucial for similarity measures, clustering, and evaluating model errors in machine learning.
- Regularization: Used in machine learning to penalize large model coefficients (e.g., L1 and L2 regularization), promoting simpler models and preventing overfitting.
- Optimization: Many optimization algorithms rely on norms to define convergence criteria or step sizes.
Three Properties of Any Valid Vector Norm (for vectors $\mathbf{u}, \mathbf{v}$ in a vector space $V$ and a scalar $c$):
- Non-negativity (or Positivity): $\|\mathbf{v}\| \geq 0$ for all $\mathbf{v} \in V$, and $\|\mathbf{v}\| = 0$ if and only if $\mathbf{v} = \mathbf{0}$.
(The length of a vector cannot be negative, and only the zero vector has zero length.)
- Absolute Homogeneity (or Scalar Multiplicity): $\|c\mathbf{v}\| = |c|\,\|\mathbf{v}\|$ for all $\mathbf{v} \in V$ and scalars $c$.
(Scaling a vector by a scalar $c$ scales its length by the absolute value of $c$.)
- Triangle Inequality: $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ for all $\mathbf{u}, \mathbf{v} \in V$.
(The length of the sum of two vectors is less than or equal to the sum of their individual lengths. This is analogous to the geometric principle that the shortest distance between two points is a straight line.)
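The three axioms can be spot-checked numerically for the L2 norm on random vectors; this verifies the properties on samples, it is of course not a proof:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=4)
v = rng.normal(size=4)
c = -2.5

norm = np.linalg.norm   # L2 norm satisfies all three axioms

assert norm(u) >= 0 and norm(np.zeros(4)) == 0     # non-negativity
assert np.isclose(norm(c * u), abs(c) * norm(u))   # absolute homogeneity
assert norm(u + v) <= norm(u) + norm(v) + 1e-12    # triangle inequality
```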
Explain the concept of an invertible linear transformation and its associated matrix. Under what conditions is a square matrix invertible?
An invertible linear transformation is a linear transformation $T: V \to W$ for which there exists another linear transformation $T^{-1}: W \to V$ such that applying $T$ and then $T^{-1}$ (or vice-versa) returns the original vector. In other words, $T^{-1}(T(\mathbf{v})) = \mathbf{v}$ for all $\mathbf{v} \in V$, and $T(T^{-1}(\mathbf{w})) = \mathbf{w}$ for all $\mathbf{w} \in W$.
Its associated matrix (if $V$ and $W$ are finite-dimensional vector spaces and $T$ is represented by a square matrix $A$) is called an invertible matrix or non-singular matrix. The inverse transformation is represented by the inverse matrix $A^{-1}$. This means $AA^{-1} = A^{-1}A = I$, where $I$ is the identity matrix.
Conditions for a Square Matrix to be Invertible:
A square $n \times n$ matrix $A$ is invertible if and only if any (and thus all) of the following equivalent conditions are met:
- Determinant is Non-Zero: $\det(A) \neq 0$.
- Full Rank: The rank of $A$ is $n$ (i.e., $\text{rank}(A) = n$).
- Linearly Independent Columns: The column vectors of $A$ are linearly independent.
- Linearly Independent Rows: The row vectors of $A$ are linearly independent.
- Trivial Null Space: The null space of $A$ contains only the zero vector ($\text{Null}(A) = \{\mathbf{0}\}$). Equivalently, $\text{nullity}(A) = 0$.
- Surjective (Onto): The linear transformation $\mathbf{x} \mapsto A\mathbf{x}$ is surjective (its column space spans the entire codomain $\mathbb{R}^n$).
- Injective (One-to-One): The linear transformation $\mathbf{x} \mapsto A\mathbf{x}$ is injective (maps distinct vectors to distinct vectors).
- Existence of Inverse: There exists an $n \times n$ matrix $B$ such that $AB = I$ and $BA = I$ (where $B$ is then denoted $A^{-1}$).
- Non-Zero Eigenvalues: Zero is not an eigenvalue of $A$.
In machine learning, non-invertible matrices can pose problems, for example, when solving systems of linear equations (like in some regression contexts) or when trying to find unique solutions for transformations. An invertible matrix guarantees that every output vector has a unique input vector, ensuring a unique mapping.
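Several of these equivalent conditions can be checked side by side on a small example; the particular invertible and singular matrices below are illustrative choices:

```python
import numpy as np

A = np.array([[2., 1.], [1., 1.]])   # det = 2*1 - 1*1 = 1 -> invertible
B = np.array([[1., 2.], [2., 4.]])   # second row = 2 * first row -> singular

# A is invertible: non-zero determinant, and A @ A_inv recovers I.
assert not np.isclose(np.linalg.det(A), 0.0)
A_inv = np.linalg.inv(A)
assert np.allclose(A @ A_inv, np.eye(2))

# B is singular: zero determinant and rank below n.
assert np.isclose(np.linalg.det(B), 0.0)
assert np.linalg.matrix_rank(B) == 1
```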
What are the common operations performed on vectors and matrices relevant to machine learning? Illustrate with simple examples for at least four operations.
Vectors and matrices are the fundamental data structures in machine learning, and various operations on them are crucial for model development and data manipulation. Here are some common operations:
-
Vector Addition / Subtraction:
- Description: Element-wise addition or subtraction of two vectors of the same dimension.
- Example: For $\mathbf{u} = (1, 2)$ and $\mathbf{v} = (3, 4)$: $\mathbf{u} + \mathbf{v} = (1+3, 2+4) = (4, 6)$.
-
Scalar Multiplication (Vector/Matrix):
- Description: Multiplying every element of a vector or matrix by a single scalar value.
- Example: For $\mathbf{v} = (1, 2)$ and scalar $c = 3$: $c\mathbf{v} = (3 \cdot 1, 3 \cdot 2) = (3, 6)$.
-
Dot Product (Inner Product) of Vectors:
- Description: A scalar quantity obtained by multiplying corresponding elements of two vectors of the same dimension and summing the products. It measures the projection of one vector onto another and relates to the angle between them.
- Example: For $\mathbf{u} = (1, 2)$ and $\mathbf{v} = (3, 4)$: $\mathbf{u} \cdot \mathbf{v} = 1 \cdot 3 + 2 \cdot 4 = 11$.
- ML Relevance: Used in calculating similarity (cosine similarity), neural network activations, and projections.
-
Matrix Addition / Subtraction:
- Description: Element-wise addition or subtraction of two matrices of the same dimensions.
- Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$: $A + B = \begin{pmatrix} 6 & 8 \\ 10 & 12 \end{pmatrix}$.
-
Matrix Multiplication:
- Description: A more complex operation. If $A$ is an $m \times n$ matrix and $B$ is an $n \times p$ matrix, their product $AB$ is an $m \times p$ matrix. Each element $(AB)_{ij}$ is the dot product of the $i$-th row of $A$ and the $j$-th column of $B$.
- Example: For $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$: $AB = \begin{pmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$.
- ML Relevance: Fundamental for neural networks (layer transformations), linear regression ($\hat{\mathbf{y}} = X\mathbf{w}$), and applying linear transformations.
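All of the operations above are one-liners in NumPy. A short sketch (the specific values are illustrative assumptions):

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, 4])

add = u + v       # vector addition: element-wise -> [4, 6]
scaled = 3 * u    # scalar multiplication -> [3, 6]
dot = u @ v       # dot product: 1*3 + 2*4 -> 11

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

mat_sum = A + B   # element-wise matrix addition -> [[6, 8], [10, 12]]
mat_prod = A @ B  # matrix multiplication -> [[19, 22], [43, 50]]
```

Note the distinction between `*` (element-wise) and `@` (matrix/dot product), a common source of bugs when translating linear-algebra formulas into code.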
What is the rank of a matrix? How does it relate to the concepts of column space and null space?
The rank of a matrix is a fundamental property that quantifies the 'number of linearly independent rows' or 'number of linearly independent columns' in the matrix. These two numbers are always equal.
Formally, for an $m \times n$ matrix $A$:
- The column rank of $A$ is the maximum number of linearly independent column vectors in $A$.
- The row rank of $A$ is the maximum number of linearly independent row vectors in $A$.
The rank of $A$, denoted $\text{rank}(A)$, is equal to both the column rank and the row rank.
Relation to Column Space:
- The column space of $A$, denoted $C(A)$, is the span of the column vectors of $A$. It represents the set of all possible vectors that can be formed by linear combinations of $A$'s columns.
- The rank of $A$ is equal to the dimension of its column space ($\text{rank}(A) = \dim(C(A))$). This means the rank tells us the effective dimensionality of the output space of the linear transformation represented by $A$.
Relation to Null Space:
- The null space of $A$, denoted $N(A)$, is the set of all vectors $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{0}$.
- The dimension of the null space is called the nullity of $A$, denoted $\text{nullity}(A)$.
- The Rank-Nullity Theorem explicitly links the rank and nullity of a matrix. For an $m \times n$ matrix $A$: $\text{rank}(A) + \text{nullity}(A) = n$, where $n$ is the number of columns (the dimension of the domain).
Significance:
- System Solvability: The rank determines whether a system of linear equations $A\mathbf{x} = \mathbf{b}$ has solutions (a solution exists if and only if $\text{rank}(A) = \text{rank}([A \mid \mathbf{b}])$).
- Invertibility: A square $n \times n$ matrix $A$ is invertible if and only if $\text{rank}(A) = n$.
- Dimensionality Reduction: In PCA, the rank of the covariance matrix indicates the intrinsic dimensionality of the data, and retaining components up to the rank effectively captures all variance.
- Data Compression: Low-rank approximations of matrices are used for data compression, where redundant information is removed.
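The rank and the Rank-Nullity Theorem can be checked directly in NumPy. In the sketch below (the matrix is an assumption, chosen so its third row equals the sum of the first two), the rank is 2 and the nullity follows as $n - \text{rank}(A)$:

```python
import numpy as np

# 3x4 matrix whose third row is the sum of the first two,
# so only two rows are linearly independent
A = np.array([[1, 0, 2, 1],
              [0, 1, 1, 3],
              [1, 1, 3, 4]])

rank = np.linalg.matrix_rank(A)  # 2
n = A.shape[1]                   # 4 columns = dimension of the domain
nullity = n - rank               # Rank-Nullity: rank + nullity = n -> 2
```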
What is the difference between an orthogonal vector and an orthonormal vector? Why are orthonormal bases particularly useful in linear algebra and machine learning?
The terms orthogonal and orthonormal describe relationships between vectors, specifically related to their dot product and magnitude.
1. Orthogonal Vectors:
- Two non-zero vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal if their dot product is zero: $\mathbf{u} \cdot \mathbf{v} = 0$.
- Geometrically, orthogonal vectors are perpendicular to each other. The zero vector is considered orthogonal to every vector.
- A set of vectors $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ is an orthogonal set if every pair of distinct vectors in the set is orthogonal, i.e., $\mathbf{v}_i \cdot \mathbf{v}_j = 0$ for $i \neq j$.
2. Orthonormal Vectors:
- A set of vectors $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ is orthonormal if it is an orthogonal set, AND each vector in the set is a unit vector (i.e., has an L2 norm of 1).
- This can be compactly written using the Kronecker delta as $\mathbf{v}_i \cdot \mathbf{v}_j = \delta_{ij}$ (which equals 1 if $i = j$ and 0 otherwise).
- A set of orthonormal vectors is always linearly independent.
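A quick NumPy illustration of the distinction (the vectors $(1,1)$ and $(1,-1)$ are an illustrative assumption): they are orthogonal but not unit length, and dividing each by its norm produces an orthonormal pair.

```python
import numpy as np

v1 = np.array([1.0, 1.0])
v2 = np.array([1.0, -1.0])

dot = v1 @ v2                # 0.0 -> the pair is orthogonal
len_v1 = np.linalg.norm(v1)  # sqrt(2) != 1 -> not orthonormal yet

# Normalizing each vector turns the orthogonal set into an orthonormal one
u1 = v1 / np.linalg.norm(v1)
u2 = v2 / np.linalg.norm(v2)
```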
Why Orthonormal Bases are Useful:
An orthonormal basis is a basis consisting of orthonormal vectors. They offer significant advantages:
-
Simplified Coordinates: Expressing a vector $\mathbf{x}$ in terms of an orthonormal basis $\{\mathbf{q}_1, \dots, \mathbf{q}_n\}$ is very simple: the coefficients are just the dot products of $\mathbf{x}$ with the basis vectors:
$\mathbf{x} = \sum_{i=1}^{n} (\mathbf{x} \cdot \mathbf{q}_i)\,\mathbf{q}_i$.
This means computing coordinates requires no matrix inversion. -
Preservation of Norm and Dot Product: Linear transformations represented by orthogonal matrices (whose columns form an orthonormal basis) preserve vector lengths (norms) and angles (dot products). This is crucial in applications where geometric properties of data must be maintained, such as rotations in computer graphics.
-
Numerical Stability: Computations involving orthonormal bases are generally more numerically stable because the vectors are 'well-separated' and scaled.
-
Diagonalization: For symmetric matrices (common as covariance matrices and kernel matrices), eigenvectors corresponding to distinct eigenvalues are orthogonal. Normalizing these eigenvectors yields an orthonormal basis, which allows for simplified diagonalization ($A = Q \Lambda Q^T$, where $Q$ is orthogonal and $\Lambda$ is diagonal).
-
Dimensionality Reduction (PCA): The principal components derived from PCA are mutually orthogonal (and often normalized to be orthonormal). Projecting data onto these components simplifies the data representation and ensures that the new features are uncorrelated.
-
Easy Inverse/Transpose: For an orthogonal matrix $Q$, its inverse is simply its transpose ($Q^{-1} = Q^T$), simplifying many calculations.
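Several of these properties can be verified with NumPy's QR decomposition, which produces a matrix whose columns form an orthonormal basis (a sketch; the input matrix is an arbitrary full-rank assumption):

```python
import numpy as np

# QR decomposition of a full-rank matrix yields Q with orthonormal columns
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Q, _ = np.linalg.qr(M)

# Easy inverse: Q^T Q = I, so the inverse of Q is just its transpose
QtQ = Q.T @ Q

# Simplified coordinates: coefficients are dot products, no inversion needed
x = np.array([3.0, 4.0])
coords = Q.T @ x            # c_i = x . q_i
reconstructed = Q @ coords  # sum of c_i * q_i recovers x exactly

# Norm preservation: orthogonal transformations do not change lengths
norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(Q @ x)
```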