Unit 4 - Notes
INT428
Unit 4: Introduction to deep neural networks / Modern NLP
1. Introduction to Neural Networks
Artificial Neural Networks (ANNs) are computational models inspired by the human brain's biological neural networks. They are designed to identify patterns in data (such as images, sound, and text) that are too complex for rule-based programming.
The Perceptron
The Perceptron is the fundamental building block of a neural network, often called a single-layer neural network. It mimics a biological neuron.
- Structure:
- Inputs (x1, x2, ..., xn): Data fed into the neuron.
- Weights (w1, w2, ..., wn): Determine the importance of each input.
- Bias (b): Allows the activation function to be shifted (conceptually similar to the intercept in a linear equation).
- Weighted Sum: z = w1·x1 + w2·x2 + ... + wn·xn + b
- Activation Function: A mathematical function (e.g., Step, Sigmoid, ReLU) that determines if the neuron "fires" (outputs a signal).
- Limitation: A single perceptron can only solve linearly separable problems (e.g., AND/OR gates, but not XOR).
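A minimal NumPy sketch (not from the notes) of a perceptron computing the weighted sum followed by a step activation; the weights and bias are hand-picked to realise an AND gate, which is linearly separable:

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum followed by a step activation."""
    z = np.dot(w, x) + b          # weighted sum: z = w1*x1 + ... + wn*xn + b
    return 1 if z > 0 else 0      # step activation: the neuron "fires" or not

# Hand-picked weights that realise a (linearly separable) AND gate.
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
# Prints 1 only for (1, 1). No single choice of w and b can reproduce XOR.
```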
Multi-Layer Perceptron (MLP)
To solve complex, non-linear problems, neurons are stacked into layers.
- Input Layer: Receives the raw data.
- Hidden Layers: Intermediate layers where computation and feature extraction occur. "Deep Learning" refers to networks with multiple hidden layers.
- Output Layer: Produces the final prediction or classification.
- Backpropagation: The core training algorithm. It computes the error (loss) at the output, propagates gradients backward through the network, and updates the weights (typically via gradient descent) to minimize the loss.
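As an illustration, a minimal PyTorch sketch (assuming PyTorch is installed) of an MLP trained with backpropagation on XOR, the classic problem a single perceptron cannot solve; the layer sizes and hyperparameters below are arbitrary choices and training usually converges within a couple of thousand steps:

```python
import torch
import torch.nn as nn

# XOR: not linearly separable, so it needs at least one hidden layer.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(              # input -> hidden -> output
    nn.Linear(2, 4), nn.Tanh(),     # hidden layer with tanh activation
    nn.Linear(4, 1), nn.Sigmoid(),  # output layer
)
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)

for _ in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)     # forward pass: compute the loss
    loss.backward()                 # backpropagation: gradients of loss w.r.t. weights
    opt.step()                      # update weights to reduce the loss

print(model(X).detach().round())    # should approximate [[0], [1], [1], [0]]
```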

2. Advanced Neural Architectures
Different types of data require specialized network architectures.
Convolutional Neural Networks (CNN)
Primarily used for image processing and computer vision, though applicable to NLP.
- Concept: Uses "filters" (kernels) to scan inputs and detect spatial hierarchies of features (edges -> shapes -> objects).
- Key Layers: Convolutional Layer (feature extraction), Pooling Layer (dimensionality reduction), Fully Connected Layer (classification).
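A hedged PyTorch sketch of this layer ordering for a 28x28 grayscale input; the channel counts and kernel sizes are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# A tiny CNN for 28x28 grayscale images (e.g., MNIST-sized input), 10 classes.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution: learn 8 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                            # pooling: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                  # fully connected: classification
)

x = torch.randn(1, 1, 28, 28)   # one fake image (batch, channels, height, width)
print(cnn(x).shape)             # torch.Size([1, 10]): one score per class
```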
Recurrent Neural Networks (RNN)
Designed for sequential data (time series, text, audio).
- Memory: Unlike feed-forward networks (MLPs), RNNs have an internal loop: the hidden state from the previous step is fed, together with the current input, into the current step.
- Use Case: Predicting the next word in a sentence based on previous words.
- Limitation: Vanishing Gradient Problem—RNNs struggle to remember information from long sequences (e.g., the beginning of a long paragraph).
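A toy NumPy sketch of the recurrence: the hidden state `h` is carried from step to step, which is the "memory" described above (the weights here are random and untrained):

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Vanilla RNN: the hidden state carries memory from step to step."""
    h = np.zeros(W_h.shape[0])                 # initial hidden state
    for x_t in inputs:                         # process the sequence one step at a time
        h = np.tanh(W_x @ x_t + W_h @ h + b)   # mix current input with previous state
    return h                                   # summary of the whole sequence

rng = np.random.default_rng(0)
seq = [rng.standard_normal(4) for _ in range(6)]   # 6 time steps, 4 features each
W_x = rng.standard_normal((8, 4))
W_h = rng.standard_normal((8, 8))
b = np.zeros(8)
print(rnn_forward(seq, W_x, W_h, b).shape)         # (8,)
```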
Transformer Architecture
The foundation of modern NLP (introduced in the paper "Attention Is All You Need", 2017).
- Parallelism: Unlike RNNs, Transformers process the entire sequence of data simultaneously, allowing for massive parallelization and faster training.
- Encoder-Decoder Structure:
- Encoder: Processes the input text to understand context.
- Decoder: Generates output text based on the encoder's understanding.

3. Introduction to Natural Language Processing (NLP)
NLP is a subfield of AI focused on the interaction between computers and human language. The goal is to read, decipher, understand, and make sense of human language in a valuable way.
NLP Phases
- Lexical Analysis: Analysis of structure and words (identifying paragraphs, sentences, words).
- Syntactic Analysis (Parsing): Analyzing grammar and sentence structure.
- Semantic Analysis: Determining the meaning of words and sentences.
- Discourse Integration: Understanding the sentence in the context of preceding sentences.
- Pragmatic Analysis: Deriving meaning from real-world knowledge and intent.
Core Components
Tokenization
The process of breaking down text into smaller units called tokens.
- Word Tokenization: "AI is great" → ["AI", "is", "great"]
- Sub-word Tokenization: Used in modern models (BERT/GPT) to handle unknown words. "Playing" → ["Play", "##ing"]
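A toy illustration of both ideas; real tokenizers (WordPiece for BERT, BPE for GPT) learn their vocabularies from large corpora, whereas the tiny vocabulary below is made up:

```python
def word_tokenize(text):
    return text.split()                       # "AI is great" -> ["AI", "is", "great"]

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match sub-word splitting (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:                     # continuation pieces are prefixed with ##
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]                  # no piece matched: unknown token
    return pieces

vocab = {"Play", "##ing", "AI", "great"}
print(word_tokenize("AI is great"))           # ['AI', 'is', 'great']
print(wordpiece_tokenize("Playing", vocab))   # ['Play', '##ing']
```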
Embeddings
Converting tokens into continuous vector representations (numbers) where similar words have similar vector values.
- One-Hot Encoding (Old): Sparse, high-dimensional vectors (mostly zeros).
- Word Embeddings (New): Dense vectors (e.g., Word2Vec, GloVe). They capture semantic relationships.
- Analogy: Vector("King") - Vector("Man") + Vector("Woman") ≈ Vector("Queen").
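A toy sketch of the analogy arithmetic with hand-made 3-dimensional vectors; real embeddings (Word2Vec, GloVe) are learned from corpora and typically have 100-300 dimensions:

```python
import numpy as np

# Hand-made toy vectors; dimensions loosely encode "royalty", "male", "female".
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]       # the analogy arithmetic
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)   # 'queen': the nearest vector to king - man + woman
# (Real analogy benchmarks exclude the query words; here 'queen' wins anyway.)
```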

4. Modern NLP and Language Models
Attention Mechanism
The breakthrough that allows models to focus on specific parts of the input sequence when generating output, regardless of the distance between words.
- Self-Attention: Allows a word to look at other words in the same sentence to determine its own context (e.g., resolving what "it" refers to in "The animal didn't cross the street because it was too tired").
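A NumPy sketch of scaled dot-product self-attention, the core computation behind this mechanism; the projection matrices here are random stand-ins for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how strongly each token attends to each other token
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))          # 5 tokens, 16-d embeddings
W_q, W_k, W_v = (rng.standard_normal((16, 16)) for _ in range(3))
out, attn = self_attention(X, W_q, W_k, W_v)
print(out.shape, attn.shape)              # (5, 16) (5, 5)
```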
Large Language Models (LLMs)
BERT (Bidirectional Encoder Representations from Transformers)
- Architecture: Encoder-only.
- Mechanism: It reads text bidirectionally, so every token attends to context on both its left and right simultaneously.
- Training Objective:
- Masked LM: Randomly masks about 15% of the input tokens and trains the model to predict them.
- Next Sentence Prediction: Predicts if sentence B naturally follows sentence A.
- Application: Excellent for understanding, classification, and question answering.
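A short example of the masked-LM idea using the Hugging Face `transformers` fill-mask pipeline (this assumes the library is installed and downloads a pretrained checkpoint on first use):

```python
from transformers import pipeline

# Masked LM in action: BERT predicts the hidden token from context on both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))   # top candidate tokens with scores
```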
GPT (Generative Pre-trained Transformer)
- Architecture: Decoder-only.
- Mechanism: Autoregressive (predicts the next token based on previous tokens).
- Training Objective: Next-token prediction.
- Application: Text generation, code generation, creative writing.
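An analogous sketch of autoregressive generation with the `text-generation` pipeline and GPT-2 (again assuming `transformers` is installed; the sampled continuation will vary between runs):

```python
from transformers import pipeline

# Autoregressive generation: GPT-2 repeatedly predicts the next token.
generate = pipeline("text-generation", model="gpt2")
print(generate("Deep learning is", max_new_tokens=20)[0]["generated_text"])
```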

5. Applications and Use Cases
Building Chatbots and Digital Assistants
Modern chatbots move beyond rule-based keyword matching to contextual AI.
- Pipeline: User Input → ASR (Speech to Text) → NLP (Intent Recognition + Entity Extraction) → Dialog Management → NLG (Natural Language Generation) → Output.
- RAG (Retrieval-Augmented Generation): Connecting LLMs to external knowledge sources (e.g., private databases) so answers are grounded in retrieved documents, which reduces hallucination.
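A toy, self-contained RAG sketch: retrieval here is naive word overlap standing in for vector search, and `generate` is a placeholder for a real LLM call:

```python
import re

docs = [
    "The warranty covers manufacturing defects for 24 months.",
    "Returns are accepted within 30 days with a receipt.",
    "Our support line is open Monday to Friday, 9am to 5pm.",
]

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by naive word overlap (stand-in for embedding similarity search)."""
    q = words(query)
    return sorted(documents, key=lambda d: -len(q & words(d)))[:top_k]

def generate(prompt):
    """Placeholder for an LLM call; a real system would send `prompt` to a model."""
    return f"[LLM would answer using this grounded prompt]\n{prompt}"

query = "How long is the warranty?"
context = "\n".join(retrieve(query, docs))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```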
Key NLP Use Cases
1. Sentiment Analysis
Determining the emotional tone behind words.
- Type: Classification task.
- Example: Analyzing tweets to see if users are happy or angry about a product launch.
- Output: Positive, Negative, Neutral.
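A quick example with the default Hugging Face sentiment pipeline (an assumption; any fine-tuned classifier could be substituted, and a pretrained checkpoint is downloaded on first use):

```python
from transformers import pipeline

# Classify the emotional tone of short texts as POSITIVE or NEGATIVE.
classify = pipeline("sentiment-analysis")
for tweet in ["The launch event was amazing!", "My order arrived broken again."]:
    print(tweet, "->", classify(tweet)[0])   # e.g. {'label': 'POSITIVE', 'score': 0.99}
```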
2. Machine Translation
Translating text from one language to another automatically.
- Model: Sequence-to-Sequence (Seq2Seq) Transformers.
- Example: Google Translate, DeepL.
3. Text Summarization
Reducing a text document to a short version while preserving key information.
- Extractive Summarization: Selects and stitches together important sentences from the original text (like a highlighter).
- Abstractive Summarization: Generates entirely new sentences that capture the essence of the text (like a human explanation).
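A toy extractive summarizer that scores sentences by word frequency, illustrating the "highlighter" idea; abstractive summarization would instead require a generative (Seq2Seq) model:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Toy extractive summarizer: keep the sentences with the most frequent words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"[a-z]+", s.lower())),
    )
    picked = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in picked)   # keep original order

text = ("Transformers process sequences in parallel. "
        "Attention lets each token weigh every other token. "
        "This design replaced recurrent networks in most NLP systems.")
print(extractive_summary(text, n_sentences=2))
```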