Unit 5 - Notes
INT234
Unit 5: Dimensionality Reduction and Neural Networks
1. Dimensionality Reduction
1.1 Overview
In predictive analytics, datasets often contain a large number of features (variables). While more data can be beneficial, too many dimensions can lead to the Curse of Dimensionality. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
Why is it necessary?
- Computational Efficiency: Fewer features mean faster training and inference times.
- Visualization: Data is easier to visualize in 2D or 3D.
- Overfitting: High-dimensional data often creates sparse datasets, leading models to learn noise rather than patterns.
- Collinearity: Removes redundant or highly correlated features.
1.2 Approaches
- Feature Selection: Selecting a subset of the original features (e.g., Lasso Regression, Recursive Feature Elimination).
- Feature Extraction: Transforming data into a new space with fewer dimensions. The new features are combinations of original features (e.g., PCA, t-SNE).
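To make the distinction concrete, here is a minimal scikit-learn sketch of both approaches (the iris dataset, the RFE/LogisticRegression combination, and the choice of 2 features are illustrative assumptions, not requirements):

# Feature selection vs. feature extraction (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
X, y = load_iris(return_X_y=True)
# Feature selection: keep 2 of the original features
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_selected = selector.fit_transform(X, y)
# Feature extraction: build 2 new features as combinations of all originals
X_extracted = PCA(n_components=2).fit_transform(X)
print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)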
2. Principal Component Analysis (PCA)
PCA is a linear, unsupervised feature extraction technique used to project high-dimensional data onto a lower-dimensional subspace while preserving the maximum variance.
2.1 Core Concepts
- Principal Components (PCs): New, uncorrelated variables that maximize variance.
- PC1: Captures the most variance in the data.
- PC2: Orthogonal (perpendicular) to PC1 and captures the second most variance.
- Orthogonality: Ensures that Principal Components are uncorrelated.
2.2 The PCA Algorithm (Step-by-Step)
- Standardization: Scale the data so that each feature has a mean of 0 and a variance of 1. This prevents variables with large scales from dominating the analysis.
- Covariance Matrix Computation: Calculate the covariance matrix to understand how variables vary from the mean with respect to each other.
- Eigendecomposition: Compute Eigenvectors and Eigenvalues of the covariance matrix.
- Eigenvectors: Determine the direction of the new feature space.
- Eigenvalues: Determine the magnitude (amount of variance explained) of those directions.
- Sort and Select: Sort the (Eigenvalue, Eigenvector) pairs by Eigenvalue in descending order. Select the top k Eigenvectors (where k is the desired number of dimensions).
- Projection: Construct a projection matrix W from the selected k Eigenvectors and transform the original dataset X to obtain the new k-dimensional subspace Y = X · W.
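A minimal NumPy sketch of these steps (the function and variable names are illustrative; X is assumed to be a samples-by-features array):

import numpy as np

def pca_from_scratch(X, k=2):
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix (features x features)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition (eigh: the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue in descending order and keep the top k eigenvectors
    order = np.argsort(eigenvalues)[::-1]
    W = eigenvectors[:, order[:k]]
    # 5. Project onto the new k-dimensional subspace
    return X_std @ W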
2.3 Explained Variance Ratio
This metric quantifies how much information (variance) is attributed to each principal component. A cumulative plot is often used to determine the optimal number of components (e.g., retaining 95% of the variance).
# Conceptual Python Code for PCA
from sklearn.datasets import load_iris  # example dataset so the snippet runs end-to-end
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X = load_iris().data  # X: samples-by-features matrix
# 1. Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)
# 3. Check Variance
print(pca.explained_variance_ratio_)
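To decide how many components to keep for a given variance target, the cumulative explained variance can be inspected. A minimal sketch continuing from the code above (the 0.95 threshold is illustrative):

import numpy as np
pca_full = PCA().fit(X_scaled)  # fit with all components retained
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1  # smallest k reaching 95% variance
print(cumulative, n_components_95)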
3. Feedforward Neural Networks (FFNN)
A Feedforward Neural Network is the foundational architecture of deep learning. It is a biologically inspired model in which connections between nodes do not form a cycle, so information flows in one direction from input to output.
3.1 Architecture
- Input Layer: Receives the raw data features. No computation happens here.
- Hidden Layer(s): Where the computation and feature transformation occur.
- Output Layer: Produces the final prediction (e.g., class probability or continuous value).
3.2 The Perceptron (The Building Block)
A single neuron processes information via the following steps:
- Weighted Sum: Inputs (x_i) are multiplied by weights (w_i) and summed with a bias (b): z = Σ (w_i · x_i) + b.
- Activation Function: The sum is passed through a non-linear function to determine the output.
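A single-neuron forward step sketched in NumPy (the step activation and the specific weight, bias, and input values are illustrative):

import numpy as np

def step(z):
    return 1 if z > 0 else 0  # simple threshold activation

x = np.array([0.5, -1.2, 3.0])  # inputs x_i
w = np.array([0.4, 0.7, -0.2])  # weights w_i
b = 0.1                          # bias b
z = np.dot(w, x) + b             # weighted sum
y = step(z)                      # activation
print(z, y)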
3.3 Activation Functions
Non-linearity is crucial; without it, a neural network is essentially just a linear regression model, regardless of depth.
- Sigmoid: σ(x) = 1 / (1 + e^(−x)). Maps any input to the range (0, 1). Used in binary classification output layers. Susceptible to the "vanishing gradient" problem.
- ReLU (Rectified Linear Unit): f(x) = max(0, x). Most common in hidden layers. Computationally efficient and mitigates vanishing gradients.
- Softmax: Used in the output layer for multi-class classification to create a probability distribution.
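Minimal NumPy versions of these three functions (a sketch; the sample inputs are arbitrary):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes values into (0, 1)

def relu(x):
    return np.maximum(0, x)          # max(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max for numerical stability
    return e / e.sum()               # probabilities that sum to 1

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
print(relu(np.array([-2.0, 0.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 3.0])))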
4. Multi-layer Perceptron (MLP)
An MLP is a specific type of Feedforward Neural Network that consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.
4.1 Characteristics
- Fully Connected (Dense): Each node in one layer connects to every node in the subsequent layer.
- Universal Approximation Theorem: An MLP with a single hidden layer (containing sufficient neurons) can approximate any continuous function.
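A minimal MLP using scikit-learn's MLPClassifier (the iris dataset, a single hidden layer of 16 neurons, and the other parameter choices are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# One hidden layer with 16 fully connected neurons
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))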
4.2 Training MLPs: Backpropagation
The goal of training is to minimize a Loss Function (e.g., Mean Squared Error for regression, Cross-Entropy for classification).
The Training Loop:
- Forward Pass: Input data flows forward through the network to generate a prediction.
- Calculate Loss: Compare prediction with the actual target.
- Backward Pass (Backpropagation):
- Compute the gradient of the loss function with respect to the weights using the Chain Rule of calculus.
- This determines how much each weight contributed to the error.
- Optimizer Update: Update the weights using an optimization algorithm (e.g., Stochastic Gradient Descent - SGD, Adam): w := w − η · ∂L/∂w (where η is the learning rate).
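A stripped-down version of this loop for a single linear neuron trained with Mean Squared Error and plain gradient descent (a sketch under simplifying assumptions, not a full backpropagation implementation; all data here is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # toy targets

w = np.zeros(3)
lr = 0.1                                       # learning rate (eta)
for epoch in range(200):
    y_pred = X @ w                             # forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)                 # MSE loss
    grad = 2 * X.T @ error / len(y)            # gradient of the loss w.r.t. w
    w -= lr * grad                             # optimizer update: w := w - eta * grad
print(w, loss)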
5. Convolutional Neural Networks (CNN)
CNNs are specialized neural networks designed for processing data with a grid-like topology, most notably images.
5.1 Why not MLP for Images?
- Parameter Explosion: Connecting every pixel in an image to every neuron creates massive computational overhead; for example, a 224×224×3 image fully connected to just 1,000 hidden neurons already requires roughly 150 million weights.
- Loss of Spatial Structure: Flattening an image into a vector for an MLP destroys the 2D spatial relationships between pixels.
5.2 Key Architecture Components
A. Convolutional Layer
The core building block. It uses Filters (Kernels) to scan the input.
- Filter/Kernel: A small matrix of weights (e.g., 3×3) that slides over the image.
- Operation: Performs element-wise multiplication and summation (dot product) to produce a Feature Map.
- Stride: How many pixels the filter moves at a time.
- Padding: Adding zero-pixels around the border to preserve input dimensions.
- Function: Detects features like edges, textures, and shapes.
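A minimal NumPy sketch of the convolution operation with no padding (the 6×6 input and the vertical-edge kernel are illustrative):

import numpy as np

def conv2d(image, kernel, stride=1):
    # Valid (no-padding) 2D convolution producing a feature map
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply + sum
    return feature_map

image = np.random.rand(6, 6)
edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # vertical edge detector
print(conv2d(image, edge_kernel).shape)  # (4, 4)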
B. Pooling Layer (Downsampling)
Reduces the spatial dimensions (width and height) of the input volume for the next convolutional layer.
- Max Pooling: Selects the maximum value in a window (retains the most prominent feature).
- Average Pooling: Calculates the average value.
- Benefit: Reduces computation and controls overfitting.
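A corresponding max-pooling sketch in NumPy (the 2×2 window and stride of 2 are the usual illustrative choice):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()  # keep the most prominent activation
    return pooled

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fm))  # [[5., 7.], [13., 15.]]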
C. Fully Connected Layer (FC)
After several convolution and pooling layers, the high-level reasoning is done via fully connected layers (like an MLP) to output class scores.
5.3 Hierarchical Learning
- Early Layers: Detect simple features (edges, lines).
- Middle Layers: Detect shapes (circles, squares).
- Deep Layers: Detect complex objects (faces, cars).
6. Recurrent Neural Networks (RNN)
RNNs are designed for sequential data where the order matters, such as Time Series Analysis, Natural Language Processing (NLP), and Speech Recognition.
6.1 The Memory Concept
Unlike Feedforward networks, RNNs have a "memory" which captures information about what has been calculated so far.
- Sequential Processing: Inputs are processed one by one.
- Hidden State: The output of the current step depends on the current input (x_t) AND the previous hidden state (h_(t-1)).
The RNN Equation: h_t = tanh(W_h · h_(t-1) + W_x · x_t + b)
- h_t: New hidden state.
- h_(t-1): Previous hidden state.
- x_t: Current input.
- W_h, W_x: Recurrent and input weight matrices; b: bias.
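A single RNN cell unrolled over a toy sequence in NumPy (the tanh non-linearity follows the equation above; the sizes and random weights are illustrative):

import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # initial hidden state h_0
sequence = rng.normal(size=(5, input_size))  # 5 time steps of toy input
for x_t in sequence:
    h = np.tanh(W_h @ h + W_x @ x_t + b)     # h_t = tanh(W_h·h_(t-1) + W_x·x_t + b)
print(h)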
6.2 Backpropagation Through Time (BPTT)
Training an RNN involves unfolding the network through time steps. BPTT is essentially backpropagation applied to this unfolded graph. Errors are propagated backward from the last time step to the first.
6.3 Limitations: The Vanishing Gradient Problem
In long sequences, gradients calculated during BPTT can become extremely small as they are repeatedly multiplied backward through many time steps.
- Consequence: The network stops learning from earlier time steps (short-term memory only).
- Solution: Advanced architectures like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Units) utilize "gates" to control the flow of information, allowing the network to decide what to remember and what to forget over long durations.