1. What is the fundamental building block of a neural network, inspired by the human brain?
Introduction to Neural Networks
Easy
A.Pixel
B.Algorithm
C.Transistor
D.Neuron (or Node)
Correct Answer: Neuron (or Node)
Explanation:
A neural network is composed of interconnected units called neurons or nodes, which process and transmit information, similar to biological neurons in the brain.
Incorrect! Try again.
2. What kind of problems can a single-layer Perceptron solve?
Perceptron
Easy
A.Linearly separable problems
B.Non-linearly separable problems
C.All classification problems
D.Image recognition problems
Correct Answer: Linearly separable problems
Explanation:
A single-layer Perceptron can only learn to separate data that is linearly separable, meaning it can be divided by a single straight line or hyperplane.
Incorrect! Try again.
3. What does MLP stand for in the context of neural networks?
MLP
Easy
A.Maximum Likelihood Program
B.Main Logic Processor
C.Multiple Linear Progression
D.Multi-Layer Perceptron
Correct Answer: Multi-Layer Perceptron
Explanation:
MLP stands for Multi-Layer Perceptron, which is a type of feedforward artificial neural network with one or more hidden layers between the input and output layers.
Incorrect! Try again.
4. Which type of Deep Neural Network is primarily designed for processing grid-like data, such as images?
CNN
Easy
A.Recurrent Neural Network (RNN)
B.Multi-Layer Perceptron (MLP)
C.Convolutional Neural Network (CNN)
D.Transformer
Correct Answer: Convolutional Neural Network (CNN)
Explanation:
CNNs use special layers called convolutional layers that are highly effective at detecting patterns, features, and spatial hierarchies within images.
Incorrect! Try again.
5. Recurrent Neural Networks (RNNs) are best suited for what type of data?
RNN
Easy
A.Sequential data (e.g., time series, text)
B.Image data
C.Static, independent data points
D.Tabular data
Correct Answer: Sequential data (e.g., time series, text)
Explanation:
RNNs have internal memory (loops) that allow them to process sequences of data, making them ideal for tasks where the order of information is important, like language or stock prices.
Incorrect! Try again.
6. What is the key innovation in the Transformer architecture that allows it to process entire sequences at once and handle long-range dependencies effectively?
Transformer Architecture and Applications
Easy
A.Recurrent Loops
B.The Attention Mechanism
C.The Sigmoid Function
D.Convolutional Layers
Correct Answer: The Attention Mechanism
Explanation:
The self-attention mechanism is the core component of the Transformer, allowing the model to weigh the importance of different words in the input sequence when processing a specific word.
Incorrect! Try again.
7. What is the main goal of Natural Language Processing (NLP)?
Modern NLP: Introduction to NLP
Easy
A.To enable computers to understand, interpret, and generate human language
B.To build faster computer hardware
C.To create realistic computer graphics
D.To optimize database queries
Correct Answer: To enable computers to understand, interpret, and generate human language
Explanation:
NLP is a field of AI focused on the interaction between computers and humans using natural language. Its primary goal is to make computers capable of processing and analyzing large amounts of language data.
Incorrect! Try again.
8. Which phase of NLP involves analyzing the grammatical structure of a sentence and the relationships between words?
NLP phases
Easy
A.Semantic Analysis
B.Syntactic Analysis (Parsing)
C.Pragmatic Analysis
D.Lexical Analysis (Tokenization)
Correct Answer: Syntactic Analysis (Parsing)
Explanation:
Syntactic Analysis, or parsing, is the process of analyzing a string of symbols (a sentence) according to the rules of a formal grammar to understand its structure.
Incorrect! Try again.
9. In NLP, what is the process of breaking down a text into smaller units like words or sentences called?
Tokenization
Easy
A.Classification
B.Summarization
C.Embedding
D.Tokenization
Correct Answer: Tokenization
Explanation:
Tokenization is a fundamental first step in many NLP tasks. It splits a piece of text into smaller pieces, called tokens, which can be words, characters, or subwords.
Incorrect! Try again.
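To make the idea concrete, here is a minimal word-level tokenizer sketch in Python; the regex and function name are ours, not a standard API, and real systems today usually tokenize into subwords rather than whole words:

```python
import re

def tokenize(text):
    # Split text into word tokens and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP breaks text into tokens!"))
# ['NLP', 'breaks', 'text', 'into', 'tokens', '!']
```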
10. What is the purpose of a word embedding in NLP?
Embeddings
Easy
A.To represent words as dense numerical vectors
B.To correct spelling mistakes
C.To translate words into another language
D.To count the frequency of each word
Correct Answer: To represent words as dense numerical vectors
Explanation:
Word embeddings capture the semantic meaning and relationships between words by mapping them to vectors of real numbers in a multi-dimensional space.
Incorrect! Try again.
11. In the context of deep learning models like Transformers, what does the 'attention' mechanism help the model to do?
Attention
Easy
A.Reduce the number of layers in the network
B.Focus on the most relevant parts of the input sequence
C.Increase the speed of model training
D.Convert text to speech
Correct Answer: Focus on the most relevant parts of the input sequence
Explanation:
The attention mechanism allows the model to dynamically weigh the importance of different parts of the input when making a prediction or generating output, improving its performance on long sequences.
Incorrect! Try again.
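A minimal sketch of this weighting, in the style of scaled dot-product attention for a single query; the toy lists and function names are illustrative, not a library API:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Score each key against the query, normalize the scores into
    # weights, then return the weighted mix of the value vectors.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the second key most strongly, so the output
# leans toward the second value vector.
out = attention([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]])
print(out)
```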
12. What are BERT and GPT well-known examples of?
language models (BERT, GPT)
Easy
A.Image Classification Models
B.Large Language Models (LLMs)
C.Speech Recognition APIs
D.Database Management Systems
Correct Answer: Large Language Models (LLMs)
Explanation:
BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) are powerful, pre-trained large language models that form the basis for many modern NLP applications.
Incorrect! Try again.
13. What is the primary function of a chatbot?
Building chatbots and digital assistants
Easy
A.To perform complex mathematical simulations
B.To simulate conversation with human users
C.To manage computer hardware resources
D.To analyze and visualize data
Correct Answer: To simulate conversation with human users
Explanation:
A chatbot is a software application designed to conduct a conversation with a human user in natural language through text or speech, for purposes like customer service or information retrieval.
Incorrect! Try again.
14. Determining if a customer review is positive, negative, or neutral is an example of which NLP task?
NLP use cases (sentiment analysis, translation, summarization)
Easy
A.Sentiment Analysis
B.Named Entity Recognition
C.Text Summarization
D.Machine Translation
Correct Answer: Sentiment Analysis
Explanation:
Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text to determine the writer's attitude towards a particular topic.
Incorrect! Try again.
15. Why is an activation function, such as ReLU or Sigmoid, necessary in a Multi-Layer Perceptron (MLP)?
MLP
Easy
A.To reduce the number of neurons
B.To make the model run faster
C.To introduce non-linearity into the model
D.To only work with positive numbers
Correct Answer: To introduce non-linearity into the model
Explanation:
Without a non-linear activation function, an MLP, no matter how many layers it has, would behave just like a single-layer perceptron because a series of linear transformations is equivalent to a single linear transformation.
Incorrect! Try again.
16. In a CNN, what is the primary purpose of a 'pooling' layer (e.g., MaxPooling)?
CNN
Easy
A.To classify the image
B.To reduce the spatial dimensions (width and height) of the input volume
C.To increase the number of features
D.To apply a non-linear transformation
Correct Answer: To reduce the spatial dimensions (width and height) of the input volume
Explanation:
Pooling layers are used to progressively reduce the spatial size of the representation, which decreases the number of parameters and the amount of computation in the network, and also helps to control overfitting.
Incorrect! Try again.
17. The task of automatically converting text from one language to another, like from English to Spanish, is called:
NLP use cases (sentiment analysis, translation, summarization)
Easy
A.Language Detection
B.Machine Translation
C.Sentiment Analysis
D.Text Generation
Correct Answer: Machine Translation
Explanation:
Machine Translation is a classic NLP use case that involves using software to translate text or speech from a source language to a target language.
Incorrect! Try again.
18. The 'P' in GPT stands for 'Pre-trained'. What does this mean?
language models (BERT, GPT)
Easy
A.The model requires a person to train it manually
B.The model can only predict one word at a time
C.The model's parameters are permanently fixed
D.The model is trained on a massive dataset before being fine-tuned for specific tasks
Correct Answer: The model is trained on a massive dataset before being fine-tuned for specific tasks
Explanation:
Pre-training involves training the model on a vast corpus of text data to learn general language patterns. This pre-trained model can then be adapted (fine-tuned) for more specific downstream tasks with much less data.
Incorrect! Try again.
19. Which NLP task is focused on creating a shorter version of a long document while retaining its most important information?
NLP use cases (sentiment analysis, translation, summarization)
Easy
A.Text Summarization
B.Question Answering
C.Machine Translation
D.Part-of-Speech Tagging
Correct Answer: Text Summarization
Explanation:
Text Summarization is the process of condensing a source text into a shorter version, providing a quick overview of the main points of the original document.
Incorrect! Try again.
20. The 'vanishing gradient problem' is a common issue that can make it difficult to train which type of neural network on long sequences?
RNN
Easy
A.Autoencoder
B.Multi-Layer Perceptron (MLP)
C.Convolutional Neural Network (CNN)
D.Recurrent Neural Network (RNN)
Correct Answer: Recurrent Neural Network (RNN)
Explanation:
In deep networks or RNNs, gradients can become extremely small as they are propagated backward through time, making it hard for the network to learn long-range dependencies. Architectures like LSTMs and GRUs were developed to mitigate this issue.
Incorrect! Try again.
21. A single-layer perceptron is a linear classifier. Which of the following problems can it not solve, and why?
Perceptron
Medium
A.The AND problem, because it involves multiple true conditions.
B.The XOR (exclusive OR) problem, because the data points are not linearly separable.
C.The OR problem, because the decision boundary is diagonal.
D.The NOT problem, because it involves inverting the input.
Correct Answer: The XOR (exclusive OR) problem, because the data points are not linearly separable.
Explanation:
A single-layer perceptron can only learn linearly separable patterns. The XOR function's data points (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0 cannot be separated by a single straight line in a 2D plane. This limitation was crucial in demonstrating the need for multi-layer networks.
Incorrect! Try again.
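A brute-force check makes this concrete: searching a small grid of weights and biases finds a single step-activation neuron that computes AND, but no setting that computes XOR. The grid range and step activation here are arbitrary illustrative choices:

```python
import itertools

def step(z):
    return 1 if z >= 0 else 0

def solves(gate, w1, w2, b):
    # Does a single step-activation neuron reproduce the whole truth table?
    return all(step(w1 * x1 + w2 * x2 + b) == y
               for (x1, x2), y in gate.items())

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [x / 2 for x in range(-4, 5)]  # weights/bias from -2.0 to 2.0 in 0.5 steps
print(any(solves(AND, *p) for p in itertools.product(grid, repeat=3)))  # True
print(any(solves(XOR, *p) for p in itertools.product(grid, repeat=3)))  # False
```

No grid, however fine, would find a solution for XOR; the impossibility is geometric, not a matter of search resolution.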
22. In a Multi-Layer Perceptron (MLP), what is the primary consequence of removing all non-linear activation functions (like ReLU or sigmoid) from the hidden layers?
MLP
Medium
A.The network becomes unable to perform regression tasks.
B.The number of trainable parameters in the network is significantly reduced.
C.The network will train much faster but lose all accuracy.
D.The network collapses into a single linear transformation, making it no more powerful than a single-layer network.
Correct Answer: The network collapses into a single linear transformation, making it no more powerful than a single-layer network.
Explanation:
A composition of linear functions is itself a linear function. Without non-linear activation functions, the entire MLP, regardless of its depth, can be mathematically simplified to a single linear layer. The non-linearities are essential for learning complex, non-linear patterns.
Incorrect! Try again.
23. You are training a very deep MLP and observe that the gradients for the earliest layers are almost zero, causing training to stall. What is this phenomenon called, and which activation function is known to help mitigate it?
MLP
Medium
A.Exploding Gradient Problem; Tanh
B.Vanishing Gradient Problem; ReLU
C.Overfitting; Sigmoid
D.Saddle Point Problem; Softmax
Correct Answer: Vanishing Gradient Problem; ReLU
Explanation:
The Vanishing Gradient Problem occurs when gradients become progressively smaller as they are backpropagated to earlier layers. Activation functions like sigmoid and tanh have derivatives that are less than 1, and multiplying these small numbers repeatedly causes the gradient to vanish. The Rectified Linear Unit (ReLU) has a derivative of 1 for positive inputs, which helps prevent the gradient from shrinking.
Incorrect! Try again.
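The arithmetic behind this is simple: sigmoid's derivative never exceeds 0.25, so even the best-case gradient through many sigmoid layers shrinks geometrically, while ReLU's derivative of 1 (for positive inputs) leaves it intact. A back-of-envelope comparison, with the depth of 20 chosen arbitrarily:

```python
depth = 20

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25 when z = 0,
# so 0.25 per layer is the *best* case for sigmoid.
sigmoid_gradient = 0.25 ** depth
relu_gradient = 1.0 ** depth  # ReLU's derivative is 1 for positive inputs

print(sigmoid_gradient)  # ~9.1e-13 -- effectively vanished
print(relu_gradient)     # 1.0
```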
24. A 2D convolutional layer is applied to a 64x64 pixel grayscale image. The layer uses a 5x5 kernel, a stride of 2, and no padding. What will be the spatial dimensions (height x width) of the output feature map?
CNN
Medium
A.30x30
B.59x59
C.60x60
D.32x32
Correct Answer: 30x30
Explanation:
The formula for calculating the output size is floor((W - K + 2P) / S) + 1, where W is the input size, K is the kernel size, P is the padding, and S is the stride. For this case: floor((64 - 5 + 0) / 2) + 1 = floor(29.5) + 1 = 29 + 1 = 30. So the output is 30x30.
Incorrect! Try again.
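The output-size formula can be wrapped in a small helper to check questions like this quickly; the function name and defaults are ours:

```python
def conv_output_size(w, k, p=0, s=1):
    # floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

print(conv_output_size(64, 5, p=0, s=2))  # 30
print(conv_output_size(64, 5, p=2, s=1))  # 64 ("same" padding for a 5x5 kernel)
```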
25. Beyond reducing computational complexity, what is a key benefit of using a max-pooling layer in a Convolutional Neural Network (CNN)?
CNN
Medium
A.It provides a degree of translation invariance.
B.It introduces non-linearity into the network.
C.It increases the receptive field of subsequent layers.
D.It normalizes the feature map activations.
Correct Answer: It provides a degree of translation invariance.
Explanation:
Max-pooling takes the maximum value from a patch of the feature map. This means that even if a feature shifts slightly in position within the patch, the output of the max-pooling layer can remain the same. This makes the network more robust to small translations of features in the input image.
Incorrect! Try again.
26. Why is a Long Short-Term Memory (LSTM) network often preferred over a standard Recurrent Neural Network (RNN) for tasks involving long sequences, such as paragraph-level text analysis?
RNN
Medium
A.LSTMs can be parallelized during training, unlike standard RNNs.
B.LSTMs use a more complex activation function that captures more features.
C.LSTMs use gating mechanisms to control information flow, mitigating the vanishing gradient problem.
D.LSTMs have fewer parameters, making them faster to train on long sequences.
Correct Answer: LSTMs use gating mechanisms to control information flow, mitigating the vanishing gradient problem.
Explanation:
Standard RNNs suffer from the vanishing gradient problem, making it difficult for them to learn dependencies between elements that are far apart in a sequence. LSTMs introduce a 'cell state' and 'gates' (input, forget, output) that regulate what information is stored, forgotten, or outputted, allowing gradients to flow over longer durations and enabling the network to capture long-range dependencies.
Incorrect! Try again.
27. In a many-to-one RNN architecture used for text classification, what is the typical role of the final hidden state?
RNN
Medium
A.It is used to generate the first word of the output sequence.
B.It serves as a summary vector of the entire input sequence, which is then fed into a final classification layer.
C.It is discarded, as only the outputs from each time step are relevant.
D.It is averaged with the initial hidden state to normalize the network's memory.
Correct Answer: It serves as a summary vector of the entire input sequence, which is then fed into a final classification layer.
Explanation:
In a many-to-one architecture, the RNN processes the entire sequence of inputs (many), and the hidden state at the final time step is expected to have encoded a meaningful summary of the whole sequence. This final hidden state vector is then used as the input to a feed-forward layer (like a softmax classifier) to produce a single output (one), such as a class label.
Incorrect! Try again.
28. What is the primary advantage of the self-attention mechanism in Transformers over the sequential processing of RNNs in terms of computational efficiency?
Transformer Architecture and Applications
Medium
A.It uses a simpler update rule than the gating mechanisms in LSTMs.
B.It requires fewer matrix multiplications per layer.
C.It has a constant-length path for information to travel between any two positions, preventing vanishing gradients.
D.It allows for parallel computation across all tokens in a sequence, as the relationship between any two tokens is calculated independently of their distance.
Correct Answer: It allows for parallel computation across all tokens in a sequence, as the relationship between any two tokens is calculated independently of their distance.
Explanation:
RNNs must process sequences token by token, as the hidden state of time step 't' depends on the hidden state of 't-1'. In contrast, self-attention calculates a score between every pair of tokens in the sequence simultaneously. This lack of sequential dependency allows for massive parallelization on modern hardware like GPUs, making it much faster to train on large sequences.
Incorrect! Try again.
29. In the Transformer architecture, what is the specific purpose of the Positional Encoding step?
Transformer Architecture and Applications
Medium
A.To normalize the word embeddings before they enter the attention layers.
B.To reduce the dimensionality of the input embeddings to save computation.
C.To inject information about the relative or absolute position of tokens, since the self-attention mechanism itself is permutation-invariant.
D.To convert the input tokens into a continuous vector representation.
Correct Answer: To inject information about the relative or absolute position of tokens, since the self-attention mechanism itself is permutation-invariant.
Explanation:
The self-attention mechanism treats the input as a set of vectors, meaning it has no inherent sense of order. The sentences "The cat chased the dog" and "The dog chased the cat" would look identical to the attention mechanism without positional information. Positional Encodings are vectors added to the input embeddings to provide the model with a signal about the position of each token in the sequence.
Incorrect! Try again.
30. You are developing an NLP model and must choose a tokenization strategy. Why might a subword tokenization algorithm like Byte-Pair Encoding (BPE) be superior to simple word-based tokenization with a fixed vocabulary?
Tokenization
Medium
A.It is a lossless compression algorithm that reduces model size.
B.It can handle out-of-vocabulary (OOV) words by breaking them into known subword units.
C.It always results in a shorter sequence of tokens, reducing computation time.
D.It guarantees that every word is broken into its morphological root and affixes.
Correct Answer: It can handle out-of-vocabulary (OOV) words by breaking them into known subword units.
Explanation:
Word-based tokenization fails when it encounters a word not in its vocabulary, mapping it to an 'UNK' token and losing its meaning. Subword tokenization, like BPE, can represent any word by breaking it down into smaller, learned pieces. For example, 'tokenization' might become 'token' and '##ization', allowing the model to reason about novel words and handle OOV issues gracefully.
Incorrect! Try again.
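A greedy longest-match segmenter in the WordPiece style (which uses the '##' continuation prefix mentioned in the explanation) shows the OOV-handling idea; the tiny vocabulary here is invented purely for illustration:

```python
# Invented toy vocabulary; real subword vocabularies are learned from a corpus.
vocab = {"token", "##ization", "##ize", "un", "##known"}

def segment(word):
    # Greedily take the longest known piece at each position;
    # non-initial pieces carry the '##' continuation prefix.
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            cand = word[start:end] if start == 0 else "##" + word[start:end]
            if cand in vocab:
                pieces.append(cand)
                start = end
                break
        else:
            return ["[UNK]"]  # no known piece covers this span
    return pieces

print(segment("tokenization"))  # ['token', '##ization']
print(segment("unknown"))       # ['un', '##known']
```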
31. If a word embedding model learns vectors such that vector('Paris') - vector('France') + vector('Italy') results in a vector very close to vector('Rome'), what does this demonstrate about the learned embedding space?
Embeddings
Medium
A.The model has captured semantic relationships (like capital city of a country) as geometric relationships in the vector space.
B.The model is only effective for proper nouns and geographical locations.
C.The vectors for all countries are parallel to each other.
D.The model has simply memorized geographical facts from the training data.
Correct Answer: The model has captured semantic relationships (like capital city of a country) as geometric relationships in the vector space.
Explanation:
This phenomenon, known as semantic arithmetic, is a key property of well-trained word embeddings. It shows that the spatial arrangement of vectors is not random but encodes complex semantic and syntactic relationships. The vector difference between 'Paris' and 'France' captures the concept of 'is the capital of', which can then be applied to 'Italy' to find its capital.
Incorrect! Try again.
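The arithmetic can be sketched with hand-picked toy vectors and cosine similarity; real embeddings are learned and have hundreds of dimensions, so everything below is illustrative only:

```python
import math

# Hand-picked 3-d vectors chosen so the analogy works; not learned embeddings.
vecs = {
    "Paris":  [1.0, 2.0, 0.0],
    "France": [1.0, 0.0, 0.0],
    "Italy":  [0.0, 1.0, 0.0],
    "Rome":   [0.0, 3.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# vector('Paris') - vector('France') + vector('Italy')
query = [p - f + i for p, f, i in
         zip(vecs["Paris"], vecs["France"], vecs["Italy"])]

# Nearest neighbour by cosine similarity, excluding the query word itself.
best = max((w for w in vecs if w != "Italy"),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # Rome
```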
32. What is the primary motivation for using pre-trained embeddings (like GloVe or Word2Vec) when building an NLP model for a task with a relatively small dataset?
Embeddings
Medium
A.To ensure the model's vocabulary is limited to only the most common words in the English language.
B.To leverage the rich semantic knowledge learned from a massive text corpus, which provides a better model initialization and improves generalization.
C.To reduce the number of layers required in the neural network.
D.To completely eliminate the need for any task-specific training (fine-tuning).
Correct Answer: To leverage the rich semantic knowledge learned from a massive text corpus, which provides a better model initialization and improves generalization.
Explanation:
Training word embeddings from scratch requires a very large dataset to learn meaningful representations. By using pre-trained embeddings, a model can start with a high-quality understanding of word meanings and relationships. This 'transfer learning' is highly effective for improving performance and convergence speed, especially when the target task has limited training data.
Incorrect! Try again.
33. In a sequence-to-sequence model with attention for machine translation, if a high attention weight is placed on an input word, what does it signify for the current output step?
Attention
Medium
A.That the input word is a grammatical stop word that should be ignored.
B.That the input word has the highest frequency in the training corpus.
C.That the input word is highly relevant for predicting the current output word.
D.That the input word is the last word of the source sentence.
Correct Answer: That the input word is highly relevant for predicting the current output word.
Explanation:
The attention mechanism computes a set of weights for each output step, indicating how much 'attention' to pay to each input word. A high weight means the model has determined that the context and meaning of that specific input word are most important for generating the current word in the translated sequence.
Incorrect! Try again.
34. Which statement best describes a key architectural difference between BERT and GPT that influences their primary applications?
language models (BERT, GPT)
Medium
A.GPT uses an attention mechanism while BERT relies on recurrent layers, making GPT better for longer sequences.
B.GPT is trained on a larger vocabulary than BERT, making it more knowledgeable.
C.BERT is an encoder-only model that processes text bidirectionally, making it ideal for language understanding tasks, while GPT is a decoder-only model that processes text auto-regressively, making it ideal for language generation.
D.BERT must be fine-tuned for specific tasks, whereas GPT can be used directly for any task without fine-tuning.
Correct Answer: BERT is an encoder-only model that processes text bidirectionally, making it ideal for language understanding tasks, while GPT is a decoder-only model that processes text auto-regressively, making it ideal for language generation.
Explanation:
BERT's architecture (based on the Transformer encoder) allows it to see the entire input sentence at once, creating deep bidirectional representations perfect for tasks like sentiment analysis or question answering. GPT's architecture (based on the Transformer decoder) is designed to predict the next word given the previous words, making it a natural fit for generative tasks like writing essays or code.
Incorrect! Try again.
35. What is the core idea behind the Masked Language Model (MLM) pre-training objective used for BERT?
language models (BERT, GPT)
Medium
A.To mask all nouns in a sentence and have the model predict them based on the verbs and adjectives.
B.To predict the next token in a sequence using only the previous tokens, which is a unidirectional approach.
C.To predict randomly masked tokens in a sequence by using both left and right context, forcing the model to learn a deep bidirectional understanding of language.
D.To predict the next sentence in a document, teaching the model about discourse coherence.
Correct Answer: To predict randomly masked tokens in a sequence by using both left and right context, forcing the model to learn a deep bidirectional understanding of language.
Explanation:
Unlike traditional language models that predict the next word (left-to-right), MLM allows the model to learn context from both directions. By masking about 15% of the tokens and training the model to predict them, BERT learns rich representations of words based on their full surrounding context.
Incorrect! Try again.
36. An NLP system is designed to analyze a legal contract to identify the parties involved, their obligations, and the effective dates. This task goes beyond just parsing grammar. Which NLP phase is most central to this goal?
Modern NLP: Introduction to NLP, NLP phases
Medium
A.Syntactic Analysis (Parsing)
B.Semantic Analysis
C.Lexical Analysis
D.Morphological Analysis
Correct Answer: Semantic Analysis
Explanation:
Lexical, morphological, and syntactic analysis deal with words, their forms, and sentence structure, respectively. Semantic Analysis is the phase concerned with understanding the meaning of the text. Identifying entities like 'parties' and their 'obligations' requires interpreting the meaning and relationships within the text, which is the core of semantic analysis.
Incorrect! Try again.
37. In the architecture of a task-oriented chatbot, what is the primary responsibility of the Dialogue Management (DM) component?
Building chatbots and digital assistants
Medium
A.To convert the chatbot's planned response into natural language (Text-to-Speech or NLG).
B.To convert the user's spoken words into text (Speech-to-Text).
C.To maintain the state of the conversation and decide the chatbot's next action.
D.To extract the user's intent and entities from their message.
Correct Answer: To maintain the state of the conversation and decide the chatbot's next action.
Explanation:
The Natural Language Understanding (NLU) component extracts intent and entities. The Dialogue Manager takes this structured information, tracks what has happened so far in the conversation (state tracking), and decides what to do next. This could be asking a clarifying question, querying a database via an API, or providing a final answer.
Incorrect! Try again.
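The decide-next-action role of a dialogue manager can be sketched as a simple function over conversation state; the intent and slot names below are invented for illustration, not any framework's API:

```python
def dialogue_manager(state, intent, entities):
    # Merge newly extracted entities into the tracked conversation state,
    # then pick the chatbot's next action based on what is still missing.
    state = dict(state)
    state.update(entities)
    if intent == "book_flight" and "destination" not in state:
        return state, "ask_destination"
    if intent == "book_flight":
        return state, "confirm_booking"
    return state, "fallback"

state, action = dialogue_manager({}, "book_flight", {})
print(action)  # ask_destination
state, action = dialogue_manager(state, "book_flight", {"destination": "Rome"})
print(action)  # confirm_booking
```

Real dialogue managers range from hand-written rules like this to learned policies, but the responsibility is the same: track state, choose the next action.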
38. You are building a system to generate a short, one-paragraph summary of a long news article. This is an example of which NLP use case, and what is a primary challenge?
NLP use cases (sentiment analysis, translation, summarization)
Medium
A.Named Entity Recognition; identifying all the people and organizations mentioned.
B.Sentiment Analysis; determining if the article's tone is positive or negative.
C.Text Summarization; ensuring the summary is coherent and factually consistent with the source text.
D.Machine Translation; converting the article to another language.
Correct Answer: Text Summarization; ensuring the summary is coherent and factually consistent with the source text.
Explanation:
This task is a direct application of text summarization. A major challenge, especially with abstractive summarization (where the model generates new sentences), is avoiding 'hallucinations'—generating plausible but factually incorrect information that was not in the original article. Ensuring factual consistency is a key area of research.
Incorrect! Try again.
39. A movie review website wants to automatically assign a 'thumbs up' or 'thumbs down' rating to user-submitted reviews based on the text. Which NLP task is most appropriate for this problem?
NLP use cases (sentiment analysis, translation, summarization)
Medium
A.Sentiment Analysis
B.Text Summarization
C.Topic Modeling
D.Question Answering
Correct Answer: Sentiment Analysis
Explanation:
Sentiment analysis (or opinion mining) is the NLP task focused on identifying and extracting subjective information from source materials. Classifying a text as positive ('thumbs up') or negative ('thumbs down') is a classic binary sentiment classification problem.
Incorrect! Try again.
40. A neuron in a neural network has an input vector x = [2.0, 3.0], a weight vector w = [0.5, -1.5], and a bias b = 1.0. What is the output of this neuron if it uses a ReLU (Rectified Linear Unit) activation function?
Introduction to Neural Networks
Medium
A.0.0
B.1.0
C.-3.5
D.-2.5
Correct Answer: 0.0
Explanation:
First, calculate the pre-activation value z = w . x + b. This is z = (0.5)(2.0) + (-1.5)(3.0) + 1.0 = 1.0 - 4.5 + 1.0 = -2.5. The ReLU activation function is defined as ReLU(z) = max(0, z). Since z is -2.5, the output is max(0, -2.5) = 0.0.
Incorrect! Try again.
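The computation in the explanation is two lines of Python; the function name is ours:

```python
def neuron(x, w, b):
    z = sum(xi * wi for xi, wi in zip(x, w)) + b  # pre-activation: w . x + b
    return max(0.0, z)                            # ReLU

print(neuron([2.0, 3.0], [0.5, -1.5], 1.0))  # 0.0  (z = -2.5, clipped by ReLU)
```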
41. A Multi-Layer Perceptron (MLP) is constructed with 5 hidden layers, but exclusively uses the linear activation function in all layers, including the output layer. The network is trained on a complex, non-linear classification task. What is the effective representational power of this network?
MLP
Hard
A.It will behave like a deep autoencoder, compressing and decompressing the input linearly.
B.It can approximate any continuous function, as per the Universal Approximation Theorem.
C.It can model complex non-linear functions, but training will be unstable due to the depth.
D.It is equivalent to a single-layer perceptron and can only model linearly separable data.
Correct Answer: It is equivalent to a single-layer perceptron and can only model linearly separable data.
Explanation:
A composition of linear functions is itself a linear function. No matter how many layers a neural network has, if it only uses linear activation functions, the entire network collapses into a single linear transformation (equivalent to a single layer). Therefore, its representational power is limited to that of a single-layer perceptron, which can only solve linearly separable problems.
Incorrect! Try again.
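This collapse can be verified numerically: applying two weight matrices in sequence, with no activation in between, gives exactly the same result as applying their single matrix product. The small matrices below are arbitrary examples:

```python
def matmul(a, b):
    # Multiply two small matrices represented as lists of rows.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

W1 = [[1.0, 2.0], [0.0, 1.0]]   # hypothetical hidden-layer weights
W2 = [[0.5, -1.0], [2.0, 0.0]]  # hypothetical output-layer weights
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # layer by layer, no activation
one_layer = matvec(matmul(W2, W1), x)   # single collapsed linear map
print(two_layers == one_layer)          # True
```

A non-linearity between the layers is exactly what breaks this equivalence and gives depth its power.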
42. In a Convolutional Neural Network, what is the primary purpose of using a 1x1 convolution (also known as a pointwise convolution), and how does it achieve this without altering the spatial dimensions (height and width) of the feature map?
CNN
Hard
A.To perform dimensionality reduction or expansion across the channel dimension while preserving spatial information.
B.To increase the receptive field of subsequent layers by combining information from a 1x1 spatial area.
C.To introduce non-linearity by applying an activation function to each pixel independently.
D.To act as a spatial pooling layer, reducing the height and width of the feature maps.
Correct Answer: To perform dimensionality reduction or expansion across the channel dimension while preserving spatial information.
Explanation:
A 1x1 convolution operates across all channels at a single pixel location. It's essentially a fully connected layer applied at every spatial position. Its main use is to change the number of channels (depth) of the feature map. By using fewer 1x1 filters than input channels, it performs dimensionality reduction (a 'bottleneck' layer). By using more, it expands the dimensionality. This is done without affecting the spatial dimensions because the kernel size is 1x1 and stride is typically 1.
Incorrect! Try again.
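A shape-level sketch makes the point: a 1x1 convolution is a per-pixel linear map across channels, so the spatial dimensions pass through unchanged while the channel count follows the number of filters. The sizes and constant values below are arbitrary:

```python
# Input: H x W x C_in feature map; a 1x1 conv with C_out filters is a
# C_out x C_in weight matrix applied independently at every pixel.
H, W, C_in, C_out = 4, 4, 8, 2
fmap = [[[1.0] * C_in for _ in range(W)] for _ in range(H)]
weights = [[0.5] * C_in for _ in range(C_out)]

out = [[[sum(w_c * px[c] for c, w_c in enumerate(row)) for row in weights]
        for px in line] for line in fmap]

print(len(out), len(out[0]), len(out[0][0]))  # 4 4 2: spatial dims preserved,
                                              # channels reduced from 8 to 2
```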
43. Both LSTMs and GRUs are designed to mitigate the vanishing gradient problem in RNNs. Which statement accurately describes a key architectural difference and its performance implication?
RNN
Hard
A.GRUs have three gates (input, forget, output) while LSTMs have only two (reset, update), making LSTMs simpler and faster to train.
B.LSTMs have a separate cell state and hidden state, while GRUs combine them. This makes GRUs computationally more efficient but potentially less expressive for complex sequences.
C.The cell state in a GRU acts as a long-term memory conveyor belt, a feature that is absent in LSTMs.
D.LSTMs use a reset gate to discard irrelevant past information, whereas GRUs use a forget gate, which is a less effective mechanism.
Correct Answer: LSTMs have a separate cell state and hidden state, while GRUs combine them. This makes GRUs computationally more efficient but potentially less expressive for complex sequences.
Explanation:
The core difference is that an LSTM maintains two vectors passed between timesteps: the hidden state (h_t) and the cell state (c_t). The cell state acts as a memory conveyor. A GRU combines these into a single hidden state vector (h_t). It uses an update gate to control how much past information to keep versus new information to add, and a reset gate to control how much of the past state to forget. This simplification (fewer gates, one state vector) makes GRUs computationally cheaper, but the dedicated cell state in LSTMs can sometimes offer more powerful modeling capabilities.
Incorrect! Try again.
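A rough parameter count (input/hidden sizes here are hypothetical) illustrates why GRUs are cheaper: an LSTM cell has four weight blocks (three gates plus the cell candidate) while a GRU has three:

```python
# Gate-parameter count for one recurrent cell (sizes are hypothetical).
def rnn_cell_params(input_dim, hidden_dim, num_blocks):
    # Each gate/candidate block has a weight matrix over [input; hidden]
    # plus a bias vector.
    return num_blocks * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)

x_dim, h_dim = 256, 512
lstm = rnn_cell_params(x_dim, h_dim, 4)  # input, forget, output gates + candidate
gru = rnn_cell_params(x_dim, h_dim, 3)   # update, reset gates + candidate
print(lstm, gru, gru / lstm)             # GRU needs 3/4 of the LSTM's parameters
```

The exact counts depend on the implementation, but the 4-vs-3 block ratio holds in general.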
44. The self-attention mechanism in the original Transformer model has a computational complexity of O(n²·d), where n is the sequence length and d is the model dimension. This makes it challenging for very long sequences. Which of the following is NOT a valid and commonly researched approach to mitigate this quadratic complexity?
Transformer Architecture and Applications
Hard
A.Drastically increasing the number of attention heads while reducing the dimension per head, such that the total computation becomes linear with respect to sequence length.
B.Replacing the Softmax function with a linear kernel, allowing the order of matrix multiplication to be rearranged and computed in O(n) time with respect to sequence length (e.g., Linear Transformers).
C.Using a sliding window attention mechanism where each token only attends to a fixed number of neighboring tokens (e.g., Longformer).
D.Applying a fixed, sparse attention pattern (e.g., strided or dilated patterns) to reduce the number of attended-to tokens (e.g., Sparse Transformers).
Correct Answer: Drastically increasing the number of attention heads while reducing the dimension per head, such that the total computation becomes linear with respect to sequence length.
Explanation:
While multi-head attention is a core part of the Transformer, simply increasing the number of heads does not change the fundamental complexity with respect to sequence length. Each head still computes an n × n attention matrix. The other options are all valid and popular research directions for creating more efficient Transformers: sliding window (Longformer), sparse patterns (Sparse Transformer), and kernel-based methods (Linear Transformers) all aim to break the quadratic bottleneck.
Incorrect! Try again.
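A back-of-the-envelope sketch (pure Python, illustrative only) shows why adding heads cannot remove the quadratic term — each head contributes its own n × n score matrix:

```python
# Entries in the per-layer attention score matrices (illustrative count).
def attn_matrix_entries(n, num_heads=1):
    # Every head computes its own n x n score matrix, so adding heads
    # never removes the quadratic dependence on sequence length n.
    return num_heads * n * n

for n in (1_000, 10_000):
    print(n, attn_matrix_entries(n, num_heads=8))
# A 10x longer sequence costs 100x more entries, regardless of head count.
```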
45. What is the primary architectural reason that a model like BERT is considered an 'encoder-only' architecture, while a model like GPT-3 is considered a 'decoder-only' architecture, and how does this influence their ideal use cases?
language models (BERT, GPT)
Hard
A.BERT uses bidirectional self-attention (seeing the whole sentence at once), making it an encoder ideal for understanding context (e.g., NLU tasks). GPT uses masked, unidirectional self-attention (seeing only past tokens), making it a decoder ideal for generating text (e.g., NLG tasks).
B.BERT processes tokens in parallel, which is characteristic of encoders, while GPT processes tokens sequentially, which is characteristic of decoders.
C.BERT uses absolute positional embeddings, suitable for encoding, while GPT uses relative positional embeddings, which are better for decoding sequences of varying lengths.
D.BERT is pre-trained with a Masked Language Model objective, which is an encoding task, while GPT is pre-trained with a Causal Language Model objective, a decoding task.
Correct Answer: BERT uses bidirectional self-attention (seeing the whole sentence at once), making it an encoder ideal for understanding context (e.g., NLU tasks). GPT uses masked, unidirectional self-attention (seeing only past tokens), making it a decoder ideal for generating text (e.g., NLG tasks).
Explanation:
The key distinction lies in the attention mask. In BERT's self-attention layers, a token can attend to all other tokens in the sequence (both to its left and right), which is why it's called bidirectional. This allows it to build a deep understanding of the full context, perfect for Natural Language Understanding (NLU) tasks like classification or question answering. In GPT, the self-attention is masked so that a token can only attend to previous tokens (and itself). This autoregressive property is essential for generating text one token at a time, making it a natural fit for Natural Language Generation (NLG).
Incorrect! Try again.
46. Static word embeddings like Word2Vec or GloVe suffer from the problem of polysemy (a word having multiple meanings). How do contextualized embedding models like ELMo and BERT fundamentally address this limitation?
Embeddings
Hard
A.By training on a much larger corpus, they learn a single, more robust vector that averages all meanings of a word.
B.They use character-level convolutions to build word embeddings, which helps differentiate meanings based on morphology.
C.They maintain a predefined dictionary of vectors for each possible meaning of a word and use a classifier to select the correct one.
D.They generate a different embedding vector for a word each time it appears, based on its specific context in the sentence.
Correct Answer: They generate a different embedding vector for a word each time it appears, based on its specific context in the sentence.
Explanation:
Static embeddings assign a single, fixed vector to each word in the vocabulary (e.g., the vector for 'bank' is the same in 'river bank' and 'investment bank'). Contextualized models are deep neural networks (LSTMs for ELMo, Transformers for BERT) that process the entire input sentence. The embedding for a word is derived from the internal state of the network at that word's position. Since the internal state is influenced by all other words in the sentence, the resulting embedding is context-dependent. Therefore, 'bank' will have different vectors in the two example sentences, resolving the polysemy issue.
Incorrect! Try again.
47. In the standard scaled dot-product attention formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, what is the critical purpose of the scaling factor 1/√d_k, where d_k is the dimension of the key vectors?
Attention
Hard
A.It normalizes the variance of the dot products to prevent the softmax function from saturating into regions with extremely small gradients.
B.It acts as a regularization term to prevent overfitting by penalizing large dot product values.
C.It is a temperature parameter that controls the sharpness of the attention distribution, with larger values making the distribution softer.
D.It ensures that the dot product values are positive before being passed to the softmax function.
Correct Answer: It normalizes the variance of the dot products to prevent the softmax function from saturating into regions with extremely small gradients.
Explanation:
The authors of 'Attention Is All You Need' observed that for large values of d_k, the dot products grow large in magnitude. When these large values are fed into the softmax function, the gradients can become vanishingly small, making learning difficult. By scaling the dot products by 1/√d_k, the variance of the inputs to the softmax is kept at 1, regardless of d_k. This ensures that the softmax function operates in a region with healthier gradients, stabilizing the training process.
Incorrect! Try again.
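A small numpy experiment (random unit-variance vectors, illustrative only) confirms the variance argument — raw dot products have variance near d_k, and scaling restores it to about 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10_000, d_k))  # random queries, unit-variance entries
k = rng.standard_normal((10_000, d_k))  # random keys

scores = np.sum(q * k, axis=1)          # raw dot products q . k
print(np.var(scores))                   # close to d_k: grows with dimension
print(np.var(scores / np.sqrt(d_k)))    # close to 1 after scaling by 1/sqrt(d_k)
```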
48. Consider Byte-Pair Encoding (BPE) and WordPiece tokenization strategies. A key difference lies in how they select the next pair of tokens to merge during vocabulary creation. Which statement accurately describes this difference and its implication?
Tokenization
Hard
A.BPE always splits words at the rarest character pair, while WordPiece splits based on a predefined vocabulary of common prefixes and suffixes.
B.BPE merges the most frequently occurring pair of adjacent tokens, which can sometimes lead to suboptimal segmentation of common words. WordPiece merges the pair that maximizes the likelihood of the training data, often resulting in more intuitive subwords.
C.WordPiece is a character-based tokenizer, while BPE is subword-based, making WordPiece immune to out-of-vocabulary issues.
D.BPE merges based on raw frequency counts, while WordPiece uses a complex scoring system based on mutual information between token pairs.
Correct Answer: BPE merges the most frequently occurring pair of adjacent tokens, which can sometimes lead to suboptimal segmentation of common words. WordPiece merges the pair that maximizes the likelihood of the training data, often resulting in more intuitive subwords.
Explanation:
While both are subword tokenization algorithms, their merge criterion differs. BPE is simpler: it iteratively counts all adjacent symbol pairs and merges the most frequent one. WordPiece, used by BERT, is slightly more sophisticated. It builds a vocabulary and then picks the merge that increases the likelihood of the training corpus given a unigram language model. This likelihood-based approach often leads to subwords that align better with meaningful morphological units. For example, WordPiece is more likely to keep word stems intact and segment suffixes like '##ing' or '##ly'.
Incorrect! Try again.
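A minimal sketch of a single BPE merge step (toy corpus, illustrative only) shows the frequency-based criterion in action:

```python
from collections import Counter

# One BPE merge step: count all adjacent symbol pairs in the corpus
# and merge the most frequent pair. (Toy corpus, illustrative only.)
corpus = [list(w) for w in ("low", "lower", "lowest", "slow", "lot")]

pairs = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += 1

best = pairs.most_common(1)[0][0]
print(best)  # ('l', 'o'): the most frequent adjacent pair is merged first
```

WordPiece would instead score each candidate merge by how much it raises the corpus likelihood, rather than by this raw count.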
49. A convolutional layer has an input volume of 64×64×16, uses 32 filters of size 5×5, a stride of 2, and padding of 1. What is the total number of learnable parameters (weights and biases) in this layer?
CNN
Hard
A.12,832
B.40,992
C.1,310,720
D.12,800
Correct Answer: 12,832
Explanation:
The number of parameters in a convolutional layer is independent of the input volume's spatial dimensions (height and width). It depends on the filter size, the number of input channels, and the number of filters.
Number of weights = (filter_height × filter_width × num_input_channels) × num_filters
Number of weights = (5 × 5 × 16) × 32 = 400 × 32 = 12,800.
Each filter has a single bias term associated with it.
Number of biases = num_filters = 32.
Total learnable parameters = Number of weights + Number of biases = 12,800 + 32 = 12,832.
Incorrect! Try again.
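The arithmetic can be checked directly; the kernel and channel sizes below are the 5×5-over-16-channels combination consistent with the stated answer of 12,832:

```python
# Parameter count for the layer: it depends only on kernel size, input
# channels, and filter count -- never on the input's height and width.
kernel_h, kernel_w, in_channels, num_filters = 5, 5, 16, 32  # assumed sizes

weights = kernel_h * kernel_w * in_channels * num_filters  # 400 * 32 = 12,800
biases = num_filters                                       # one bias per filter
print(weights + biases)  # 12832
```

Note that stride and padding affect the output size, not the parameter count.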
50. When evaluating machine translation systems, the BLEU score is a common metric. However, it can be misleadingly high for a translation that is grammatically correct but semantically nonsensical or inaccurate. What fundamental limitation of the BLEU score causes this discrepancy?
NLP use cases (sentiment analysis, translation, summarization)
Hard
A.It requires multiple human-generated reference translations, which are often unavailable or inconsistent.
B.It is based on n-gram precision and a brevity penalty, rewarding lexical overlap with reference translations but failing to capture semantic meaning or sentence structure.
C.It primarily measures recall, checking if all words from the reference translation are present, but ignores precision.
D.It is computationally expensive and cannot be used during the training process as a loss function.
Correct Answer: It is based on n-gram precision and a brevity penalty, rewarding lexical overlap with reference translations but failing to capture semantic meaning or sentence structure.
Explanation:
BLEU (Bilingual Evaluation Understudy) works by matching n-grams (contiguous sequences of n words) in the candidate translation against the n-grams in one or more reference translations. It measures precision—how many of the candidate's n-grams appear in the references. While it includes a penalty for translations that are too short, its core mechanism is lexical overlap. It cannot understand if synonyms are used, if the sentence structure is logical, or if the core meaning is preserved. A translation could have high n-gram overlap with a reference but completely miss the nuance or even state the opposite meaning.
Incorrect! Try again.
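A toy clipped n-gram precision (the core quantity behind BLEU; simplified, with no brevity penalty or multi-reference handling) shows how a negated sentence can still score highly:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    # Clipped n-gram precision: the core quantity behind BLEU.
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hits = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return hits / len(cand)

reference = "the cat is on the mat".split()
candidate = "the cat is not on the mat".split()  # opposite meaning, high overlap

p1 = ngram_precision(candidate, reference, 1)
print(p1)  # 6/7: only 'not' fails to match, despite the reversed meaning
```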
51. A single perceptron is trained on a 2D dataset using the standard perceptron learning rule. The dataset is NOT linearly separable. What will be the behavior of the learning algorithm during training?
Perceptron
Hard
A.The algorithm will quickly converge to a random decision boundary and stop updating.
B.The algorithm will never converge, and the weights of the perceptron will continue to be updated indefinitely.
C.The algorithm will raise a mathematical error because the loss function cannot be calculated for non-separable data.
D.The algorithm will converge to a decision boundary that minimizes the number of misclassified points.
Correct Answer: The algorithm will never converge, and the weights of the perceptron will continue to be updated indefinitely.
Explanation:
The Perceptron Convergence Theorem guarantees that the learning algorithm will find a separating hyperplane in a finite number of steps if and only if the data is linearly separable. If the data is not linearly separable, there is no solution for the algorithm to converge to. It will continue to find misclassified points and update its weights in an attempt to correct for them, effectively cycling through different decision boundaries forever without ever satisfying the condition of zero misclassifications.
Incorrect! Try again.
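A short simulation (plain Python, illustrative) of the perceptron rule on XOR shows the updates never stop — every epoch still finds misclassified points:

```python
# Perceptron learning rule on XOR (not linearly separable).
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]  # XOR labels

w, b = [0.0, 0.0], 0.0
updates_per_epoch = []
for epoch in range(50):
    updates = 0
    for (x1, x2), t in zip(X, y):
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        if pred != t:  # misclassified point: apply the perceptron update
            w[0] += (t - pred) * x1
            w[1] += (t - pred) * x2
            b += t - pred
            updates += 1
    updates_per_epoch.append(updates)

# Every epoch still triggers updates: no separating line exists, so the
# zero-misclassification stopping condition is never reached.
print(updates_per_epoch[-5:])
```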
52. What is the primary motivation for using Teacher Forcing during the training of recurrent neural networks for sequence generation tasks, and what is its main drawback?
RNN
Hard
A.Motivation: To enable the model to learn long-range dependencies. Drawback: It exacerbates the vanishing gradient problem.
B.Motivation: To speed up convergence by providing the network with ground-truth inputs at each timestep. Drawback: It can lead to a discrepancy between training and inference, causing instability when the model generates long sequences on its own.
C.Motivation: To reduce the memory footprint of the model during training. Drawback: It requires significantly more computation per training step.
D.Motivation: To prevent overfitting by introducing noise into the training process. Drawback: It slows down the training process significantly.
Correct Answer: Motivation: To speed up convergence by providing the network with ground-truth inputs at each timestep. Drawback: It can lead to a discrepancy between training and inference, causing instability when the model generates long sequences on its own.
Explanation:
Teacher forcing is a training technique where the model's own output from the previous timestep is replaced with the ground-truth value from the training data as input for the current timestep. This provides a stable, correct signal at each step, preventing the model from propagating its own errors, which makes training much faster and more stable. However, at inference time, there is no ground truth to feed the model; it must use its own (potentially imperfect) predictions as input for the next step. This mismatch between training and inference (exposure bias) can cause the model to perform poorly, as it was never trained to recover from its own mistakes.
Incorrect! Try again.
53. In the Transformer architecture, positional encodings are added to the input embeddings. Why is this step strictly necessary for the model to process sequences, unlike in an RNN?
Transformer Architecture and Applications
Hard
A.To allow the model to handle sequences of variable lengths by encoding the absolute position of each token.
B.Because the self-attention mechanism is permutation-invariant; without positional information, the model would treat a sentence as an unordered bag of words.
C.To normalize the input embeddings before they are processed by the attention layers, improving training stability.
D.To provide a unique signal for the start and end of a sequence, which the attention mechanism cannot otherwise determine.
Correct Answer: Because the self-attention mechanism is permutation-invariant; without positional information, the model would treat a sentence as an unordered bag of words.
Explanation:
An RNN inherently processes a sequence token by token, so the order is naturally encoded in its recurrent state. The self-attention mechanism in a Transformer, however, calculates attention scores between all pairs of tokens in parallel. If you were to shuffle the input tokens, the resulting set of attention scores would be identical, just reordered. The model has no inherent sense of position. Positional encodings are fixed or learned vectors that are added to the token embeddings to give the model explicit information about the position of each token in the sequence, thus breaking the permutation invariance and allowing it to understand word order.
Incorrect! Try again.
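A small numpy sketch (random data, illustrative only) demonstrates the permutation property directly — shuffling the input tokens merely shuffles the outputs, so without positional encodings the model cannot see word order:

```python
import numpy as np

def self_attention(X):
    # Plain scaled dot-product self-attention, no positional information.
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))        # 5 tokens, embedding dim 8
perm = np.array([3, 0, 4, 1, 2])       # an arbitrary reordering of the tokens

out = self_attention(X)
out_perm = self_attention(X[perm])
print(np.allclose(out[perm], out_perm))  # True: shuffled input -> shuffled output
```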
54. BERT's pre-training involves two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later analysis (e.g., by the RoBERTa paper) suggested that NSP might be an ineffective pre-training task. What was the reasoning behind this conclusion?
language models (BERT, GPT)
Hard
A.The binary classification nature of NSP was found to be detrimental to the model's ability to generate nuanced text representations.
B.The MLM task already implicitly taught the model sentence relationships, making the NSP task redundant.
C.The model learned to focus on topic similarity between sentences rather than coherence and logical flow, as the negative examples were too easy to distinguish (randomly sampled sentences).
D.The NSP task was found to be computationally too expensive, providing marginal benefits for the high training cost.
Correct Answer: The model learned to focus on topic similarity between sentences rather than coherence and logical flow, as the negative examples were too easy to distinguish (randomly sampled sentences).
Explanation:
The original NSP task involved predicting if sentence B was the actual sentence following sentence A. Negative examples were created by pairing sentence A with a random sentence from a different document. Researchers found that this task was too simple. The model could often solve it just by checking if the two sentences shared the same topic (e.g., by looking for keyword overlap), without learning about the deeper logical and cohesive relationships between consecutive sentences. The RoBERTa model, for instance, removed the NSP task and showed improved performance on downstream tasks, suggesting NSP's limited utility.
Incorrect! Try again.
55. In text summarization, what is the fundamental difference between an 'extractive' and an 'abstractive' approach, and what kind of neural network architecture is typically required for a purely abstractive model?
NLP use cases (sentiment analysis, translation, summarization)
Hard
A.Extractive summarization creates a summary that is shorter than the source text, while abstractive summarization can create a longer, more detailed summary. Both can be implemented with a simple classifier.
B.Extractive summarization is a form of supervised learning, while abstractive summarization is unsupervised. Abstractive models typically rely on Transformer-based encoders like BERT.
C.Extractive summarization selects important sentences from the source text, while abstractive summarization generates new sentences that capture the meaning. Abstractive models typically require an encoder-decoder architecture (e.g., Sequence-to-Sequence).
D.Extractive summarization uses rule-based systems to identify keywords, while abstractive summarization uses deep learning. Abstractive models require a CNN-based architecture.
Correct Answer: Extractive summarization selects important sentences from the source text, while abstractive summarization generates new sentences that capture the meaning. Abstractive models typically require an encoder-decoder architecture (e.g., Sequence-to-Sequence).
Explanation:
Extractive summarization is akin to highlighting. It's a classification task where the model decides which sentences or phrases from the original document are important enough to be included in the summary. The summary contains only verbatim excerpts. Abstractive summarization is more human-like; it involves 'understanding' the source text and generating a new summary in its own words, potentially using words and phrases not present in the original. This is a sequence generation task, which necessitates an encoder-decoder architecture (like an RNN-based Seq2Seq model or a Transformer) to first encode the meaning of the source text and then decode it into a new sequence of words.
Incorrect! Try again.
56. When designing a task-oriented chatbot (e.g., for booking flights), what is the distinct role of 'Dialogue State Tracking' (DST) and why is it a more complex problem than simple 'Intent Recognition'?
Building chatbots and digital assistants
Hard
A.DST is responsible for generating the chatbot's response, while Intent Recognition decides which knowledge base to query. DST is harder because natural language generation is a complex task.
B.DST is the process of training the chatbot's language model, while Intent Recognition is the process of fine-tuning it for a specific task. DST is harder because it requires more data.
C.DST maintains a representation of the user's goal and collected information (slots) throughout a multi-turn conversation, while Intent Recognition is a single-turn classification of the user's immediate goal. DST is harder because it must handle context, ambiguity, and coreference over time.
D.Intent Recognition maps user input to a predefined action, while DST tracks the emotional state of the user to adjust the chatbot's tone. DST is harder due to the subjectivity of emotion.
Correct Answer: DST maintains a representation of the user's goal and collected information (slots) throughout a multi-turn conversation, while Intent Recognition is a single-turn classification of the user's immediate goal. DST is harder because it must handle context, ambiguity, and coreference over time.
Explanation:
Intent Recognition is a classification task for a single user utterance (e.g., 'I want to fly to Boston' -> intent: book_flight, entity: destination=Boston). Dialogue State Tracking, however, must manage the state of the conversation across multiple turns. If the user then says 'I want to go tomorrow', the DST component must update the dialogue state with date=tomorrow while remembering the destination from the previous turn. It has to accumulate information, handle corrections ('Actually, I meant Boston'), and resolve ambiguities, making it a much more complex and stateful problem than single-turn intent recognition.
Incorrect! Try again.
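A toy sketch (the slot names are hypothetical) of how a tracked dialogue state accumulates and corrects itself across turns, in contrast to single-turn classification:

```python
# Toy dialogue-state tracker (hypothetical slot names, illustrative only):
# slots accumulate and can be corrected across turns, unlike a single-turn
# intent classifier which sees each utterance in isolation.
def update_state(state, turn_slots):
    new_state = dict(state)
    new_state.update(turn_slots)
    return new_state

state = {}
state = update_state(state, {"intent": "book_flight", "destination": "Boston"})
state = update_state(state, {"date": "tomorrow"})       # destination persists
state = update_state(state, {"destination": "Austin"})  # user correction
print(state)
```

A real DST component must also resolve coreference and ambiguity, which this dictionary merge deliberately glosses over.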
57. Vector-space analogies like vec('king') - vec('man') + vec('woman') ≈ vec('queen') are a famous property of Word2Vec embeddings. This property suggests that semantic relationships are encoded as linear substructures in the embedding space. What is a known major limitation or failure mode of this analogical reasoning capability?
Embeddings
Hard
A.The geometric relationships are highly sensitive to the specific training corpus and hyperparameters, and often do not generalize well to relationships beyond simple gender or capital-city analogies.
B.This property only works for single words and fails completely when trying to perform analogies with phrases or sentences.
C.The vector arithmetic is not commutative, meaning vec('woman') - vec('man') + vec('king') would produce a completely different result.
D.The resulting vector is often not the closest vector to the target word (e.g., 'queen') and requires a separate classification step to identify the correct analogy.
Correct Answer: The geometric relationships are highly sensitive to the specific training corpus and hyperparameters, and often do not generalize well to relationships beyond simple gender or capital-city analogies.
Explanation:
While the king-queen analogy is a powerful demonstration, research has shown that this capability is quite brittle. The neat geometric parallels are often an artifact of the statistical patterns of specific, frequent relationships (like country-capital) present in the training data (e.g., Wikipedia). The method often fails on more nuanced or less frequently stated relationships. The success of these analogies is not a universal property of the embedding space but rather a localized phenomenon, making it an unreliable tool for general-purpose analogical reasoning.
Incorrect! Try again.
58. In semantic segmentation tasks, a common architectural pattern is an 'encoder-decoder' structure (like U-Net) where the encoder uses strided convolutions or pooling, and the decoder uses upsampling or transposed convolutions. What is the critical role of 'skip connections' between the encoder and decoder in such architectures?
CNN
Hard
A.To facilitate gradient flow through the deep network, mitigating the vanishing gradient problem common in deep architectures.
B.To reduce the number of parameters in the decoder by reusing the weights from the corresponding encoder layers.
C.To enforce a bottleneck in the information flow, forcing the encoder to learn a compressed, salient representation of the input.
D.To combine low-level, high-resolution spatial information from the encoder with high-level, semantic information from the decoder, enabling precise localization.
Correct Answer: To combine low-level, high-resolution spatial information from the encoder with high-level, semantic information from the decoder, enabling precise localization.
Explanation:
As the input passes through the encoder (downsampling path), the network gains semantic information (what is in the image) but loses spatial information (where it is). The decoder (upsampling path) tries to recover this spatial information to produce a high-resolution segmentation map. Skip connections feed feature maps from the encoder directly to corresponding layers in the decoder. This provides the decoder with the fine-grained spatial details from earlier layers that would otherwise be lost, allowing it to produce much more precise and accurate segmentation boundaries.
Incorrect! Try again.
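A shape-level numpy sketch (feature-map sizes are hypothetical) of a U-Net style skip connection, where high-resolution encoder features are concatenated with upsampled decoder features:

```python
import numpy as np

# U-Net style skip connection (hypothetical feature-map sizes): concatenate
# high-resolution encoder features with upsampled decoder features.
encoder_feat = np.zeros((64, 64, 32))  # fine spatial detail, low-level features
decoder_feat = np.zeros((32, 32, 64))  # coarse, high-level semantics

# Nearest-neighbour upsample the decoder features back to 64x64 ...
upsampled = decoder_feat.repeat(2, axis=0).repeat(2, axis=1)
# ... then stack both along the channel axis for the next decoder conv.
skip_merged = np.concatenate([encoder_feat, upsampled], axis=-1)
print(skip_merged.shape)  # (64, 64, 96): both kinds of information available
```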
59. What is the primary advantage of Multi-Head Self-Attention (MHSA) over using a single, large self-attention mechanism with the same total number of dimensions?
Attention
Hard
A.Each head can process a different segment of the input sequence, allowing for parallel processing of very long documents.
B.It breaks the quadratic complexity of self-attention with respect to sequence length, making it linear.
C.It is significantly more computationally efficient than a single large attention head, reducing the overall training time of the Transformer model.
D.It allows the model to jointly attend to information from different representation subspaces at different positions, effectively learning diverse types of relationships (e.g., syntactic, positional).
Correct Answer: It allows the model to jointly attend to information from different representation subspaces at different positions, effectively learning diverse types of relationships (e.g., syntactic, positional).
Explanation:
Instead of performing a single attention function, MHSA projects the queries, keys, and values into multiple, lower-dimensional subspaces and runs the attention mechanism in parallel in each subspace. The outputs are then concatenated and projected again. This allows each 'head' to specialize and learn different kinds of relationships between tokens. For example, one head might learn to track syntactic dependencies, while another might focus on which words refer to the same entity. A single large attention head would average all these different signals, potentially washing out a specific, useful relationship. MHSA allows the model to capture a richer set of features.
Incorrect! Try again.
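A shape-level sketch (dimensions are hypothetical) of the projection-and-split step: the model width d_model is divided into per-head subspaces of size d_head, so the heads attend in parallel over separate views of the same tokens:

```python
import numpy as np

# Split d_model into per-head subspaces: same total width, separate views.
rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 64, 8
d_head = d_model // num_heads            # 8 dims per head

Q = rng.standard_normal((n, d_model))
Q_heads = Q.reshape(n, num_heads, d_head).transpose(1, 0, 2)
print(Q_heads.shape)  # (8, 6, 8): each head attends within its own 8-dim subspace
```

After attention runs independently in each subspace, the head outputs are concatenated back to width d_model and linearly projected.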
60. In the pipeline of NLP phases, consider the relationship between Syntactic Analysis (Parsing) and Semantic Analysis. Which statement best describes a scenario where a failure in syntactic analysis directly leads to an incorrect semantic interpretation?
NLP phases
Hard
A.In the sentence "The old man the boats," a parser failing to identify "man" as a verb (meaning to operate) would lead to a nonsensical semantic interpretation.
B.In the sentence "Colorless green ideas sleep furiously," the sentence is syntactically correct but semantically meaningless, showing the independence of the two phases.
C.In the sentence "The bank is on the river bank," a system failing to disambiguate the two meanings of "bank" is a failure of semantic analysis, independent of syntax.
D.A system that correctly identifies the subject, verb, and object in "The dog chased the cat" has completed syntactic analysis, but semantic analysis is still required to understand what 'chasing' means.
Correct Answer: In the sentence "The old man the boats," a parser failing to identify "man" as a verb (meaning to operate) would lead to a nonsensical semantic interpretation.
Explanation:
This is a classic example of a garden-path sentence. Correct syntactic parsing is crucial for correct semantic interpretation. A naive parser might identify 'The old man' as a noun phrase. However, the correct parse is that 'The old' is a noun phrase (referring to old people) and 'man' is the main verb. The syntax determines the relationship between words: (Subject: The old) (Verb: man) (Object: the boats). If the parser fails to identify this structure, the semantic analysis stage will be unable to derive the correct meaning (Old people operate the boats) and will instead be left with a nonsensical fragment.