Unit 4 - Subjective Questions
INT428 • Practice Questions with Detailed Answers
Define a Perceptron. Explain its mathematical model with a diagram and equations.
Definition:
A Perceptron is the simplest type of artificial neural network and a fundamental building block of deep learning. It is a binary classifier that maps input vectors to an output value using weights and a bias.
Mathematical Model:
- Inputs (x_1, x_2, …, x_n): The perceptron takes n inputs x_1, x_2, …, x_n.
- Weights (w_1, w_2, …, w_n): Each input x_i is multiplied by a corresponding weight w_i, representing the importance of that input.
- Weighted Sum (z): The weighted inputs are summed up along with a bias term (b): z = w_1·x_1 + w_2·x_2 + … + w_n·x_n + b = Σ w_i·x_i + b.
- Activation Function: A step function (usually Heaviside or Sign function) is applied to z to determine the output (y), e.g., y = 1 if z ≥ 0, otherwise y = 0.
Diagram Description:
Input nodes feed into a summation node (neuron), which passes the result through an activation function to produce the output.
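A minimal sketch of this model, assuming numpy is available; the weights and bias are hand-picked (not learned) so that the perceptron computes the logical AND function:

```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    """Weighted sum of inputs plus bias, passed through the step function."""
    z = np.dot(w, x) + b
    return step(z)

# Hand-picked weights and bias that realise the logical AND gate.
w = np.array([1.0, 1.0])
b = -1.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, b))
# Only (1, 1) produces 1, as expected for AND.
```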
Explain the architecture and working mechanism of a Multi-Layer Perceptron (MLP). How does it solve the limitations of a single Perceptron?
Architecture of MLP:
An MLP is a feedforward artificial neural network consisting of at least three layers of nodes:
- Input Layer: Receives the initial data signal.
- Hidden Layer(s): One or more layers where computational processing and feature extraction occur via weighted sums and non-linear activation functions.
- Output Layer: Produces the final prediction or classification.
Working Mechanism:
- Forward Propagation: Data flows from input to output. Each neuron calculates a = f(Σ w_i·x_i + b), where f is a non-linear activation function (like Sigmoid or ReLU).
- Backpropagation: The network calculates the error (Loss) between the predicted output and actual target. This error is propagated backward to update weights using an optimization algorithm (like Gradient Descent) to minimize the loss.
Solving Limitations:
A single Perceptron can only classify linearly separable data (it cannot solve the XOR problem). An MLP introduces non-linearity through hidden layers and activation functions, allowing it to learn complex decision boundaries and solve non-linearly separable problems.
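A minimal numpy sketch of forward propagation through a 2-2-1 MLP whose weights are hand-picked (not learned) so that it computes XOR, illustrating how the hidden layer adds the needed non-linearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights (illustrative, not learned) for a 2-2-1 MLP that computes XOR.
W1 = np.array([[20.0, 20.0],    # hidden unit 1 ~ OR(x1, x2)
               [20.0, 20.0]])   # hidden unit 2 ~ AND(x1, x2)
b1 = np.array([-10.0, -30.0])
W2 = np.array([20.0, -20.0])    # output ~ OR AND (NOT AND) = XOR
b2 = -10.0

def forward(x):
    h = sigmoid(W1 @ x + b1)    # hidden layer: non-linear feature extraction
    y = sigmoid(W2 @ h + b2)    # output layer
    return y

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", round(float(forward(np.array(x, dtype=float))), 3))
# XOR pattern: ~0, ~1, ~1, ~0 -- impossible for a single perceptron.
```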
Compare Biological Neurons with Artificial Neurons.
Comparison:
| Feature | Biological Neuron | Artificial Neuron |
|---|---|---|
| Input mechanism | Dendrites: Receive electrochemical signals from other neurons. | Inputs (x_1, x_2, …, x_n): Numerical data provided to the model. |
| Processing | Soma (Cell Body): Sums up incoming signals. | Summation Function: Calculates the weighted sum z = Σ w_i·x_i + b. |
| Output Trigger | Action Potential: Fires if threshold is reached. | Activation Function: Maps the sum to an output (e.g., sigmoid, ReLU). |
| Transmission | Axon: Transmits the signal to synapses. | Output (y): Passed to the next layer or used as the result. |
| Learning | Synaptic Plasticity: Strengthening/weakening connections physically. | Weight Adjustment: Updating numerical weights via backpropagation. |
Describe the common Activation Functions used in Neural Networks: Sigmoid, ReLU, and Tanh.
1. Sigmoid Function:
- Formula: σ(x) = 1 / (1 + e^(−x))
- Range: (0, 1)
- Usage: Often used in the output layer for binary classification. It suffers from the vanishing gradient problem in deep networks.
2. ReLU (Rectified Linear Unit):
- Formula: f(x) = max(0, x)
- Range: [0, ∞)
- Usage: Most popular for hidden layers. It is computationally efficient and reduces the vanishing gradient problem, though it can suffer from the "dying ReLU" problem (neurons outputting zero forever).
3. Tanh (Hyperbolic Tangent):
- Formula: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
- Range: (-1, 1)
- Usage: Zero-centered, making it generally better than Sigmoid for hidden layers as optimization is easier, but still susceptible to vanishing gradients.
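The three functions are easy to sketch with numpy (a minimal illustration, not a library implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)          # clips negatives to 0, range [0, inf)

def tanh(x):
    return np.tanh(x)                  # zero-centred, range (-1, 1)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("sigmoid:", np.round(sigmoid(x), 3))
print("relu:   ", relu(x))
print("tanh:   ", np.round(tanh(x), 3))
```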
Explain the architecture of Convolutional Neural Networks (CNN) and describe the role of Convolutional and Pooling layers.
CNN Architecture:
CNNs are specialized neural networks used primarily for processing grid-like data, such as images. The architecture typically consists of a stack of layers:
- Convolutional Layer (The Core):
- Role: Feature extraction. It uses a set of learnable filters (kernels) that slide (convolve) over the input image.
- Operation: Performs element-wise multiplication and summation to produce a Feature Map. This allows the network to detect edges, textures, and patterns.
- Key Concepts: Stride (step size of the filter), Padding (adding borders to preserve dimension).
- Pooling Layer (Downsampling):
- Role: Dimensionality reduction. It reduces the spatial size of the representation to decrease the number of parameters and computation in the network.
- Types:
- Max Pooling: Takes the maximum value from a specific window (captures most prominent features).
- Average Pooling: Takes the average value.
- Fully Connected (FC) Layer:
- Role: Classification. After feature extraction and downsampling, the data is flattened into a vector and passed through standard dense layers to output the final class probabilities.
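A possible sketch of this layer stack in PyTorch, assuming torch is installed; the layer sizes below are arbitrary illustrative choices for a 1×28×28 grayscale input and 10 output classes:

```python
import torch
import torch.nn as nn

# Illustrative CNN: Conv -> ReLU -> Pool (twice), then Flatten -> FC classifier.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # feature extraction: 8 filters -> 8x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsampling -> 8x14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # deeper features -> 16x14x14
    nn.ReLU(),
    nn.MaxPool2d(2),                             # -> 16x7x7
    nn.Flatten(),                                # -> vector of 16*7*7 = 784 values
    nn.Linear(16 * 7 * 7, 10),                   # fully connected classification head
)

x = torch.randn(1, 1, 28, 28)   # a dummy batch of one grayscale image
print(model(x).shape)           # torch.Size([1, 10]) -> class scores
```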
Why are CNNs preferred over standard MLPs for image processing tasks?
CNNs are preferred over MLPs for image tasks due to:
- Spatial Invariance: CNNs can recognize features (like a cat's ear) regardless of where they appear in the image, thanks to weight sharing in filters. MLPs treat every pixel as a separate input tied to a specific location.
- Parameter Efficiency: MLPs connect every input pixel to every neuron in the hidden layer, leading to an explosion of parameters for large images (e.g., a 1000×1000 pixel image has 1 million inputs). CNNs share weights (filters), drastically reducing the parameter count.
- Local Connectivity: CNNs capture local spatial dependencies (neighboring pixels usually relate to each other), whereas MLPs ignore the 2D structure of the image by flattening it immediately.
What are Recurrent Neural Networks (RNNs)? Explain the Vanishing Gradient problem associated with them.
Recurrent Neural Networks (RNNs):
RNNs are designed for sequential data (time series, text, audio) where the order matters. Unlike feedforward networks, RNNs have a "memory" loop.
- The output of a hidden layer at time step t depends not only on the input at t (x_t) but also on the hidden state from the previous time step (h_(t−1)).
- Formula: h_t = f(W_x·x_t + W_h·h_(t−1) + b), where f is typically tanh.
The Vanishing Gradient Problem:
- Cause: During training (Backpropagation through Time), gradients are calculated by applying the chain rule across many time steps.
- Effect: If the weights are small (less than 1) or the activation function (like Sigmoid/Tanh) has small derivatives, the gradients shrink exponentially as they propagate back to earlier time steps.
- Result: The weights in the earlier layers do not update effectively, causing the RNN to forget long-range dependencies (e.g., forgetting the subject of a sentence appearing at the start of a long paragraph).
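A small numpy simulation of this effect: with a recurrent weight below 1 and tanh derivatives below 1, the backpropagated gradient factor shrinks exponentially (the recurrence and weight value are illustrative, not from a trained model):

```python
import numpy as np

np.random.seed(0)
w_rec = 0.5           # recurrent weight with magnitude < 1 (illustrative)
h, grad = 0.0, 1.0    # hidden state and a gradient factor flowing backward

for t in range(1, 31):
    h = np.tanh(w_rec * h + np.random.randn())  # simple recurrence h_t = tanh(w*h_(t-1) + x_t)
    grad *= w_rec * (1.0 - h ** 2)              # chain-rule factor dh_t/dh_(t-1)
    if t % 10 == 0:
        print(f"after {t:2d} steps, gradient magnitude ~ {abs(grad):.2e}")
# The product shrinks exponentially, so early time steps barely influence learning.
```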
Define Natural Language Processing (NLP) and list its five major phases.
Definition:
Natural Language Processing (NLP) is a branch of Artificial Intelligence that gives computers the ability to understand, interpret, and generate human language (text or speech) in a way that is meaningful.
Major Phases of NLP:
- Lexical Analysis: The study of the structure of words (morphology) and breaking text into paragraphs, sentences, and words (tokenization).
- Syntactic Analysis (Parsing): Analyzing the grammatical structure of the sentence to check for conformity to the rules of grammar (e.g., creating parse trees).
- Semantic Analysis: Determining the meaning of the words and sentences. It focuses on the literal meaning irrespective of context.
- Discourse Integration: Analyzing the meaning of a sentence based on the sentences preceding it (contextual flow).
- Pragmatic Analysis: Deriving knowledge from external scenarios and understanding the intent or effect of the language in a real-world situation (reading between the lines).
What is Tokenization in NLP? Differentiate between Word Tokenization and Subword Tokenization.
Tokenization:
Tokenization is the process of breaking down a stream of text into smaller units called tokens. These tokens can be words, characters, or subwords. It is the first step in most NLP pipelines.
Difference:
- Word Tokenization:
- Splits text based on whitespace or punctuation.
- Example: "I'm running" ["I", "'m", "running"]
- Limitation: Large vocabulary size; struggles with out-of-vocabulary (OOV) words.
- Subword Tokenization (e.g., BPE, WordPiece):
- Splits words into frequent sub-units.
- Example: "running" ["run", "##ning"]
- Advantage: Handles rare words by breaking them into known parts; balances vocabulary size and semantic representation. Used in modern models like BERT and GPT.
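A toy sketch contrasting the two approaches; the subword vocabulary and the greedy longest-match rule mimic the WordPiece style but are purely illustrative:

```python
import re

def word_tokenize(text):
    """Naive word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Toy subword vocabulary in the WordPiece style (illustrative, not a real model's vocab).
VOCAB = {"run", "##ning", "jump", "##ed", "the", "dog"}

def subword_tokenize(word):
    """Greedy longest-match-first split of a single word into known subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                tokens.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]   # no known subword matches
    return tokens

print(word_tokenize("The dog jumped!"))   # ['The', 'dog', 'jumped', '!']
print(subword_tokenize("running"))        # ['run', '##ning']
print(subword_tokenize("jumped"))         # ['jump', '##ed']
```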
Explain the concept of Word Embeddings. How is it superior to One-Hot Encoding?
Word Embeddings:
Word embeddings are dense vector representations of words where words with similar meanings are located close to each other in the vector space. Examples include Word2Vec, GloVe, and FastText.
Superiority over One-Hot Encoding:
- Dimensionality:
- One-Hot: Produces high-dimensional sparse vectors (size equals vocabulary size, e.g., 50,000 dimensions).
- Embeddings: Produces low-dimensional dense vectors (e.g., 100-300 dimensions).
- Semantic Meaning:
- One-Hot: Vectors are orthogonal; there is no mathematical relationship between "King" and "Queen".
- Embeddings: Capture semantic relationships. Operations like King − Man + Woman ≈ Queen are possible.
- Generalization: Embeddings allow models to generalize better to unseen words if they are semantically similar to known words.
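A toy numpy comparison; the 3-dimensional "embeddings" below are hand-made for illustration, not trained vectors:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors over a tiny 4-word vocabulary: every pair is orthogonal.
king_oh, queen_oh = np.eye(4)[0], np.eye(4)[1]
print("one-hot similarity(king, queen):", cosine(king_oh, queen_oh))   # 0.0

# Made-up 3-dimensional dense "embeddings" (purely illustrative).
king  = np.array([0.9, 0.8, 0.1])
queen = np.array([0.9, 0.2, 0.8])
man   = np.array([0.1, 0.9, 0.1])
woman = np.array([0.1, 0.3, 0.8])

print("embedding similarity(king, queen):", round(cosine(king, queen), 2))
# The classic analogy: king - man + woman lands near queen.
print("similarity(king - man + woman, queen):", round(cosine(king - man + woman, queen), 2))
```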
Explain the Transformer architecture with a focus on the Encoder and Decoder blocks.
Transformer Architecture:
Introduced in the paper "Attention is All You Need" (2017), the Transformer relies entirely on attention mechanisms, discarding recurrence and convolutions.
- Encoder Block:
- Purpose: Processes the input sequence to create a deep understanding of the context.
- Components: A stack of identical layers. Each layer has two sub-layers:
- Multi-Head Self-Attention: allows the model to associate each word with every other word in the input.
- Feed-Forward Network: processes the attention outputs.
- Includes residual connections and layer normalization.
- Decoder Block:
- Purpose: Generates the output sequence one element at a time.
- Components: Similar to the encoder but with an extra sub-layer:
- Masked Self-Attention: Ensures the prediction for position i can only depend on the known outputs at positions less than i (prevents looking ahead).
- Encoder-Decoder Attention: Helps the decoder focus on relevant parts of the input sequence (from the Encoder stack).
- Positional Encoding: Since Transformers process data in parallel, positional encodings are added to embeddings to retain the order of words.
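A numpy sketch of the sinusoidal positional encoding used in the original paper (the sequence length and model dimension below are arbitrary choices):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from "Attention is All You Need":
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
print(pe.shape)        # (50, 16) -- added element-wise to the token embeddings
```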
What is the 'Attention Mechanism' in Deep Learning? Explain Self-Attention briefly.
Attention Mechanism:
The Attention mechanism mimics human cognitive attention. Instead of processing the whole input as a fixed context vector (as in older RNNs), it allows the model to assign different "weights" or importance to different parts of the input sequence when generating a specific part of the output.
Self-Attention:
Self-attention (or intra-attention) relates different positions of a single sequence to compute a representation of the sequence.
- Example: In the sentence "The animal didn't cross the street because it was too tired", self-attention allows the model to associate "it" strongly with "animal" rather than "street".
- Mechanism: It computes three vectors for each word: Query (Q), Key (K), and Value (V). The attention score is derived by comparing the Query of one word with the Keys of others.
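A numpy sketch of scaled dot-product self-attention; the projection matrices and token embeddings below are random placeholders rather than learned values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single sequence X of shape (seq_len, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare each Query with every Key
    weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
    return weights @ V, weights                # weighted mix of Values

rng = np.random.default_rng(0)
seq_len, d = 4, 8                              # e.g., 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)                # (4, 8) (4, 4)
```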
Discuss BERT (Bidirectional Encoder Representations from Transformers). How is it trained?
BERT:
BERT is a transformer-based machine learning technique for NLP pre-training. It uses only the Encoder stack of the Transformer architecture. Its key innovation is bidirectionality, meaning it looks at the text context from both left-to-right and right-to-left simultaneously.
Training Objectives:
BERT is pre-trained on two unsupervised tasks:
- Masked Language Modeling (MLM): Randomly masks 15% of the words in the input (e.g., "The [MASK] sat on the mat") and asks the model to predict the masked word based on context.
- Next Sentence Prediction (NSP): The model receives pairs of sentences and predicts if the second sentence logically follows the first. This helps in understanding relationships between sentences (crucial for QA and summarization).
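Assuming the Hugging Face transformers library is installed and the bert-base-uncased checkpoint can be downloaded, the MLM behaviour can be probed directly with the fill-mask pipeline:

```python
from transformers import pipeline

# Probe BERT's Masked Language Modeling head (downloads bert-base-uncased on first run).
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The [MASK] sat on the mat.")[:3]:
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
# Typically predicts words like "cat" or "dog" from the bidirectional context.
```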
Compare BERT and GPT models in terms of architecture and directionality.
Comparison:
| Feature | BERT (Bidirectional Encoder Representations) | GPT (Generative Pre-trained Transformer) |
|---|---|---|
| Architecture | Encoder-only blocks of the Transformer. Designed for understanding. | Decoder-only blocks of the Transformer. Designed for generation. |
| Directionality | Bidirectional: Reads text from both directions simultaneously to understand full context. | Unidirectional (Autoregressive): Reads text left-to-right. Predicts the next token based only on previous tokens. |
| Objective | Masked Language Modeling (Fill in the blanks). | Causal Language Modeling (Predict the next word). |
| Primary Use | Classification, QA, Sentiment Analysis, Named Entity Recognition. | Text Generation, Chatbots, Code generation, creative writing. |
Explain the workflow of building a Chatbot or Digital Assistant.
Building a modern AI chatbot involves the following workflow:
- Define Scope: Determine the purpose (e.g., Customer Service, Booking) and domain.
- Data Collection: Gather conversation logs, FAQs, and intent examples.
- Preprocessing: Clean text, tokenize, and remove stop words.
- NLU (Natural Language Understanding) Component:
- Intent Recognition: Classify what the user wants (e.g., "BookFlight", "CheckBalance").
- Entity Extraction (NER): Extract specific variables (e.g., "New York", "tomorrow").
- Dialogue Management: Maintain the state of the conversation and determine the next action based on the intent and context.
- Response Generation:
- Retrieval-based: Select a pre-written response.
- Generative: Use LLMs (like GPT) to generate a dynamic response.
- Testing & Deployment: Test for edge cases and deploy on platforms (Web, WhatsApp, Slack).
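The NLU and retrieval-based response steps can be sketched with scikit-learn; the utterances, intents, and canned responses below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: (utterance, intent) pairs -- illustrative only.
utterances = ["book a flight to paris", "i want to fly tomorrow",
              "what is my account balance", "how much money do i have"]
intents = ["BookFlight", "BookFlight", "CheckBalance", "CheckBalance"]

# NLU step: TF-IDF features + a simple intent classifier.
nlu = make_pipeline(TfidfVectorizer(), LogisticRegression())
nlu.fit(utterances, intents)

# Retrieval-based response generation: one canned reply per intent.
responses = {"BookFlight": "Sure, where would you like to fly?",
             "CheckBalance": "Your current balance is shown in the app."}

user_input = "can you book me a flight"
intent = nlu.predict([user_input])[0]
print(intent, "->", responses[intent])
```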
What is Sentiment Analysis? How is it implemented using NLP?
Sentiment Analysis:
Also known as opinion mining, it is the process of determining the emotional tone behind a body of text. It classifies text as Positive, Negative, or Neutral (and sometimes intensity).
Implementation:
- Preprocessing: Tokenization, stop-word removal, lemmatization.
- Feature Extraction: Convert text to numbers using Bag of Words, TF-IDF, or Word Embeddings.
- Model Classification:
- Traditional: Use Naive Bayes, SVM, or Logistic Regression on extracted features.
- Deep Learning: Feed embeddings into RNNs (LSTMs) or Transformers (BERT) to capture context and sarcasm.
- Output: The model outputs a probability score for each sentiment class.
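A minimal sketch of the traditional pipeline (TF-IDF features + Naive Bayes), assuming scikit-learn is installed; the training texts are made-up toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (real systems use thousands of labelled reviews).
texts = ["I love this movie, it was fantastic",
         "Absolutely wonderful experience",
         "This was a terrible waste of time",
         "I hate the ending, very disappointing"]
labels = ["positive", "positive", "negative", "negative"]

# Feature extraction (TF-IDF) + traditional classifier (Naive Bayes).
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)

test = "what a wonderful and fantastic film"
print(model.predict([test])[0])                      # expected: positive
print(dict(zip(model.classes_, model.predict_proba([test])[0].round(2))))
```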
Explain the difference between Extractive and Abstractive Text Summarization.
1. Extractive Summarization:
- Method: Selects important sentences or phrases directly from the original text and stitches them together.
- Analogy: Like highlighting key sentences in a book.
- Technique: Uses statistical scoring (e.g., TF-IDF, Graph-based ranking) to identify high-importance sentences.
- Pros/Cons: Grammatically safer but may lack coherence or flow.
2. Abstractive Summarization:
- Method: Generates new sentences that capture the core meaning of the source text using new words and phrasing.
- Analogy: Like reading a book and writing a summary in your own words.
- Technique: Requires advanced Deep Learning models (Seq2Seq RNNs, Transformers like T5 or BART) to understand semantics and generate language.
- Pros/Cons: More human-like and concise but computationally expensive and prone to "hallucinations" (inventing facts).
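A toy extractive summarizer, assuming scikit-learn and numpy; it scores sentences by their summed TF-IDF weights, one simple heuristic among many:

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the sum of its TF-IDF weights and keep the top ones,
    preserving their original order (a simple frequency-based heuristic)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-n_sentences:])
    return " ".join(sentences[i] for i in top)

text = ("Deep learning has transformed natural language processing. "
        "Transformers rely on attention instead of recurrence. "
        "The weather was pleasant yesterday. "
        "Attention lets models weigh the most relevant words in a sentence.")
print(extractive_summary(text, n_sentences=2))
```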
Distinguish between Feed Forward Neural Networks (FFNN) and Recurrent Neural Networks (RNN).
| Feature | Feed Forward NN (FFNN) | Recurrent NN (RNN) |
|---|---|---|
| Data Flow | Unidirectional: Input → Hidden → Output. No loops. | Cyclic: Loops exist. Output of a layer is fed back as input to the same layer. |
| Memory | No memory of previous inputs. | Has internal memory (Hidden State) to store information about previous inputs. |
| Input Type | Fixed-size input vectors. | Variable-length sequences (Text, Audio, Time Series). |
| Use Cases | Image classification, tabular data regression. | Language translation, Speech recognition, Text generation. |
| Training | Standard Backpropagation. | Backpropagation Through Time (BPTT). |
What are the common challenges faced in Natural Language Processing?
NLP is difficult because human language is ambiguous and complex. Common challenges include:
- Ambiguity:
- Lexical: Words with multiple meanings (e.g., "Bank" - river vs. finance).
- Syntactic: Sentences with multiple parse structures (e.g., "I saw the man with the telescope").
- Sarcasm and Irony: Machines struggle to detect when the literal meaning is opposite to the intended meaning.
- Slang and Idioms: Informal language and expressions (e.g., "It's raining cats and dogs") are hard to translate literally.
- Coreference Resolution: Understanding which noun, mentioned sentences ago, a pronoun like "He" or "It" refers to.
- Domain Specificity: A model trained on news articles may fail to understand medical or legal documents.
Explain the concept of Loss Function and Optimizers in the context of Neural Network training.
Loss Function (Cost Function):
- It is a mathematical function that quantifies the error between the model's prediction (ŷ) and the actual target (y).
- Goal: The objective of training is to minimize this value.
- Examples:
- Mean Squared Error (MSE): Used for regression.
- Cross-Entropy Loss: Used for classification.
Optimizer:
- It is an algorithm used to update the weights of the neural network to minimize the Loss Function.
- It decides how much to change the weights and in which direction based on the gradients calculated via backpropagation.
- Examples:
- Gradient Descent: Updates weights by moving in the opposite direction of the gradient.
- Adam (Adaptive Moment Estimation): A popular optimizer that adjusts learning rates adaptively for each parameter, combining advantages of other extensions like RMSprop and Momentum.
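A minimal numpy sketch tying both ideas together: MSE as the loss and plain gradient descent as the optimizer, fitting a toy linear model on synthetic data:

```python
import numpy as np

# A tiny gradient-descent loop fitting y = w*x + b with Mean Squared Error.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # synthetic data, true w=3, b=0.5

w, b, lr = 0.0, 0.0, 0.1            # initial parameters and learning rate
for epoch in range(200):
    y_pred = w * x + b
    loss = np.mean((y_pred - y) ** 2)          # MSE loss: quantifies prediction error
    dw = np.mean(2 * (y_pred - y) * x)         # gradients via the chain rule
    db = np.mean(2 * (y_pred - y))
    w -= lr * dw                               # optimizer step: move against the gradient
    b -= lr * db

print(f"learned w={w:.2f}, b={b:.2f}, final MSE={loss:.4f}")   # ~3.0 and ~0.5
```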