Unit 4 - Practice Quiz

INT428 60 Questions

1 What is the fundamental building block of a neural network, inspired by the human brain?

Introduction to Neural Networks Easy
A. Neuron (or Node)
B. Pixel
C. Transistor
D. Algorithm

2 What kind of problems can a single-layer Perceptron solve?

Perceptron Easy
A. All classification problems
B. Image recognition problems
C. Non-linearly separable problems
D. Linearly separable problems

3 What does MLP stand for in the context of neural networks?

MLP Easy
A. Multi-Layer Perceptron
B. Multiple Linear Progression
C. Main Logic Processor
D. Maximum Likelihood Program

4 Which type of Deep Neural Network is primarily designed for processing grid-like data, such as images?

CNN Easy
A. Transformer
B. Multi-Layer Perceptron (MLP)
C. Recurrent Neural Network (RNN)
D. Convolutional Neural Network (CNN)

5 Recurrent Neural Networks (RNNs) are best suited for what type of data?

RNN Easy
A. Static, independent data points
B. Image data
C. Sequential data (e.g., time series, text)
D. Tabular data

6 What is the key innovation in the Transformer architecture that allows it to process entire sequences at once and handle long-range dependencies effectively?

Transformer Architecture and Applications Easy
A. The Attention Mechanism
B. Convolutional Layers
C. The Sigmoid Function
D. Recurrent Loops

7 What is the main goal of Natural Language Processing (NLP)?

Modern NLP: Introduction to NLP Easy
A. To enable computers to understand, interpret, and generate human language
B. To create realistic computer graphics
C. To optimize database queries
D. To build faster computer hardware

8 Which phase of NLP involves analyzing the grammatical structure of a sentence and the relationships between words?

NLP phases Easy
A. Pragmatic Analysis
B. Lexical Analysis (Tokenization)
C. Syntactic Analysis (Parsing)
D. Semantic Analysis

9 In NLP, what is the process of breaking down a text into smaller units like words or sentences called?

Tokenization Easy
A. Embedding
B. Summarization
C. Classification
D. Tokenization

10 What is the purpose of a word embedding in NLP?

Embeddings Easy
A. To correct spelling mistakes
B. To translate words into another language
C. To represent words as dense numerical vectors
D. To count the frequency of each word

11 In the context of deep learning models like Transformers, what does the 'attention' mechanism help the model to do?

Attention Easy
A. Focus on the most relevant parts of the input sequence
B. Increase the speed of model training
C. Reduce the number of layers in the network
D. Convert text to speech

12 What are BERT and GPT well-known examples of?

language models (BERT, GPT) Easy
A. Large Language Models (LLMs)
B. Database Management Systems
C. Image Classification Models
D. Speech Recognition APIs

13 What is the primary function of a chatbot?

Building chatbots and digital assistants Easy
A. To perform complex mathematical simulations
B. To simulate conversation with human users
C. To analyze and visualize data
D. To manage computer hardware resources

14 Determining if a customer review is positive, negative, or neutral is an example of which NLP task?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Text Summarization
B. Sentiment Analysis
C. Machine Translation
D. Named Entity Recognition

15 Why is an activation function, such as ReLU or Sigmoid, necessary in a Multi-Layer Perceptron (MLP)?

MLP Easy
A. To only work with positive numbers
B. To reduce the number of neurons
C. To introduce non-linearity into the model
D. To make the model run faster

16 In a CNN, what is the primary purpose of a 'pooling' layer (e.g., MaxPooling)?

CNN Easy
A. To classify the image
B. To reduce the spatial dimensions (width and height) of the input volume
C. To apply a non-linear transformation
D. To increase the number of features

17 The task of automatically converting text from one language to another, like from English to Spanish, is called:

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Language Detection
B. Text Generation
C. Sentiment Analysis
D. Machine Translation

18 The 'P' in GPT stands for 'Pre-trained'. What does this mean?

language models (BERT, GPT) Easy
A. The model is trained on a massive dataset before being fine-tuned for specific tasks
B. The model can only predict one word at a time
C. The model requires a person to train it manually
D. The model's parameters are permanently fixed

19 Which NLP task is focused on creating a shorter version of a long document while retaining its most important information?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Text Summarization
B. Part-of-Speech Tagging
C. Machine Translation
D. Question Answering

20 The 'vanishing gradient problem' is a common issue that can make it difficult to train which type of neural network on long sequences?

RNN Easy
A. Multi-Layer Perceptron (MLP)
B. Autoencoder
C. Recurrent Neural Network (RNN)
D. Convolutional Neural Network (CNN)

21 A single-layer perceptron is a linear classifier. Which of the following problems can it not solve, and why?

Perceptron Medium
A. The OR problem, because the decision boundary is diagonal.
B. The NOT problem, because it involves inverting the input.
C. The XOR (exclusive OR) problem, because the data points are not linearly separable.
D. The AND problem, because it involves multiple true conditions.

22 In a Multi-Layer Perceptron (MLP), what is the primary consequence of removing all non-linear activation functions (like ReLU or sigmoid) from the hidden layers?

MLP Medium
A. The number of trainable parameters in the network is significantly reduced.
B. The network collapses into a single linear transformation, making it no more powerful than a single-layer network.
C. The network will train much faster but lose all accuracy.
D. The network becomes unable to perform regression tasks.

23 You are training a very deep MLP and observe that the gradients for the earliest layers are almost zero, causing training to stall. What is this phenomenon called, and which activation function is known to help mitigate it?

MLP Medium
A. Exploding Gradient Problem; Tanh
B. Overfitting; Sigmoid
C. Vanishing Gradient Problem; ReLU
D. Saddle Point Problem; Softmax
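A quick way to see why deep sigmoid networks stall (question 23): the sigmoid's derivative never exceeds 0.25, so backpropagated gradients shrink geometrically with depth, while ReLU's derivative is exactly 1 for positive inputs. A minimal sketch (the 20-layer depth is an illustrative assumption):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0 with value 0.25.
peak = sigmoid_derivative(0.0)

# Chaining even this best-case factor across 20 layers collapses the gradient,
# which is the vanishing gradient problem in miniature.
chained = peak ** 20

# ReLU's derivative is 1 for any positive input, so repeated multiplication
# does not shrink the gradient the same way.
print(peak, chained)  # 0.25, roughly 9.1e-13
```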

24 A 2D convolutional layer is applied to a 64x64 pixel grayscale image. The layer uses a 5x5 kernel, a stride of 2, and no padding. What will be the spatial dimensions (height x width) of the output feature map?

CNN Medium
A. 59x59
B. 30x30
C. 60x60
D. 32x32
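The output size in question 24 follows from the standard convolution arithmetic, out = floor((in + 2·padding − kernel) / stride) + 1; a quick check:

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# 64x64 input, 5x5 kernel, stride 2, no padding -> 30x30
print(conv_output_size(64, kernel=5, stride=2, padding=0))  # 30
```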

25 Beyond reducing computational complexity, what is a key benefit of using a max-pooling layer in a Convolutional Neural Network (CNN)?

CNN Medium
A. It increases the receptive field of subsequent layers.
B. It provides a degree of translation invariance.
C. It introduces non-linearity into the network.
D. It normalizes the feature map activations.

26 Why is a Long Short-Term Memory (LSTM) network often preferred over a standard Recurrent Neural Network (RNN) for tasks involving long sequences, such as paragraph-level text analysis?

RNN Medium
A. LSTMs use gating mechanisms to control information flow, mitigating the vanishing gradient problem.
B. LSTMs use a more complex activation function that captures more features.
C. LSTMs can be parallelized during training, unlike standard RNNs.
D. LSTMs have fewer parameters, making them faster to train on long sequences.

27 In a many-to-one RNN architecture used for text classification, what is the typical role of the final hidden state?

RNN Medium
A. It is averaged with the initial hidden state to normalize the network's memory.
B. It is used to generate the first word of the output sequence.
C. It is discarded, as only the outputs from each time step are relevant.
D. It serves as a summary vector of the entire input sequence, which is then fed into a final classification layer.

28 What is the primary advantage of the self-attention mechanism in Transformers over the sequential processing of RNNs in terms of computational efficiency?

Transformer Architecture and Applications Medium
A. It has a constant-length path for information to travel between any two positions, preventing vanishing gradients.
B. It uses a simpler update rule than the gating mechanisms in LSTMs.
C. It allows for parallel computation across all tokens in a sequence, as the relationship between any two tokens is calculated independently of their distance.
D. It requires fewer matrix multiplications per layer.

29 In the Transformer architecture, what is the specific purpose of the Positional Encoding step?

Transformer Architecture and Applications Medium
A. To convert the input tokens into a continuous vector representation.
B. To normalize the word embeddings before they enter the attention layers.
C. To inject information about the relative or absolute position of tokens, since the self-attention mechanism itself is permutation-invariant.
D. To reduce the dimensionality of the input embeddings to save computation.

30 You are developing an NLP model and must choose a tokenization strategy. Why might a subword tokenization algorithm like Byte-Pair Encoding (BPE) be superior to simple word-based tokenization with a fixed vocabulary?

Tokenization Medium
A. It always results in a shorter sequence of tokens, reducing computation time.
B. It is a lossless compression algorithm that reduces model size.
C. It can handle out-of-vocabulary (OOV) words by breaking them into known subword units.
D. It guarantees that every word is broken into its morphological root and affixes.
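The merge loop behind BPE (question 30) can be sketched in a few lines. The toy corpus and its frequencies below are illustrative assumptions; the point is that frequent adjacent pairs fuse into subword units, so an unseen word can still be segmented from known pieces:

```python
from collections import Counter

# Toy corpus (word -> frequency); values are illustrative assumptions.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Represent each word as a tuple of symbols, with an end-of-word marker.
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

def most_frequent_pair(vocab):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Fuse every occurrence of `pair` into a single new symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):
    vocab = merge_pair(vocab, most_frequent_pair(vocab))

# After three merges the frequent suffix "est</w>" has become a single token,
# so an out-of-vocabulary word like "lowest" could be built from known units.
print(sorted({symbol for word in vocab for symbol in word}))
```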

31 If a word embedding model learns vectors such that vector('Paris') - vector('France') + vector('Italy') results in a vector very close to vector('Rome'), what does this demonstrate about the learned embedding space?

Embeddings Medium
A. The model has simply memorized geographical facts from the training data.
B. The model has captured semantic relationships (like capital city of a country) as geometric relationships in the vector space.
C. The vectors for all countries are parallel to each other.
D. The model is only effective for proper nouns and geographical locations.

32 What is the primary motivation for using pre-trained embeddings (like GloVe or Word2Vec) when building an NLP model for a task with a relatively small dataset?

Embeddings Medium
A. To reduce the number of layers required in the neural network.
B. To leverage the rich semantic knowledge learned from a massive text corpus, which provides a better model initialization and improves generalization.
C. To completely eliminate the need for any task-specific training (fine-tuning).
D. To ensure the model's vocabulary is limited to only the most common words in the English language.

33 In a sequence-to-sequence model with attention for machine translation, if a high attention weight is placed on an input word, what does it signify for the current output step?

Attention Medium
A. That the input word has the highest frequency in the training corpus.
B. That the input word is a grammatical stop word that should be ignored.
C. That the input word is highly relevant for predicting the current output word.
D. That the input word is the last word of the source sentence.

34 Which statement best describes a key architectural difference between BERT and GPT that influences their primary applications?

language models (BERT, GPT) Medium
A. GPT uses an attention mechanism while BERT relies on recurrent layers, making GPT better for longer sequences.
B. BERT is an encoder-only model that processes text bidirectionally, making it ideal for language understanding tasks, while GPT is a decoder-only model that processes text auto-regressively, making it ideal for language generation.
C. BERT must be fine-tuned for specific tasks, whereas GPT can be used directly for any task without fine-tuning.
D. GPT is trained on a larger vocabulary than BERT, making it more knowledgeable.

35 What is the core idea behind the Masked Language Model (MLM) pre-training objective used for BERT?

language models (BERT, GPT) Medium
A. To mask all nouns in a sentence and have the model predict them based on the verbs and adjectives.
B. To predict the next sentence in a document, teaching the model about discourse coherence.
C. To predict randomly masked tokens in a sequence by using both left and right context, forcing the model to learn a deep bidirectional understanding of language.
D. To predict the next token in a sequence using only the previous tokens, which is a unidirectional approach.

36 An NLP system is designed to analyze a legal contract to identify the parties involved, their obligations, and the effective dates. This task goes beyond just parsing grammar. Which NLP phase is most central to this goal?

Modern NLP: Introduction to NLP, NLP phases Medium
A. Lexical Analysis
B. Syntactic Analysis (Parsing)
C. Semantic Analysis
D. Morphological Analysis

37 In the architecture of a task-oriented chatbot, what is the primary responsibility of the Dialogue Management (DM) component?

Building chatbots and digital assistants Medium
A. To extract the user's intent and entities from their message.
B. To convert the user's spoken words into text (Speech-to-Text).
C. To convert the chatbot's planned response into natural language (Text-to-Speech or NLG).
D. To maintain the state of the conversation and decide the chatbot's next action.

38 You are building a system to generate a short, one-paragraph summary of a long news article. This is an example of which NLP use case, and what is a primary challenge?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Machine Translation; converting the article to another language.
B. Named Entity Recognition; identifying all the people and organizations mentioned.
C. Sentiment Analysis; determining if the article's tone is positive or negative.
D. Text Summarization; ensuring the summary is coherent and factually consistent with the source text.

39 A movie review website wants to automatically assign a 'thumbs up' or 'thumbs down' rating to user-submitted reviews based on the text. Which NLP task is most appropriate for this problem?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Sentiment Analysis
B. Text Summarization
C. Topic Modeling
D. Question Answering

40 A neuron in a neural network has an input vector x = [2.0, 3.0], a weight vector w = [0.5, -1.5], and a bias b = 1.0. What is the output of this neuron if it uses a ReLU (Rectified Linear Unit) activation function?

Introduction to Neural Networks Medium
A. -3.5
B. -2.5
C. 0.0
D. 1.0
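Question 40's arithmetic can be traced directly: z = 2.0·0.5 + 3.0·(−1.5) + 1.0 = −2.5, and ReLU clips negative values to zero. A minimal sketch:

```python
def relu(z):
    return max(0.0, z)

def neuron(x, w, b):
    # Weighted sum of inputs plus bias, then the activation function.
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return relu(z)

print(neuron([2.0, 3.0], [0.5, -1.5], 1.0))  # 0.0 (z = -2.5, clipped by ReLU)
```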

41 A Multi-Layer Perceptron (MLP) is constructed with 5 hidden layers, but exclusively uses the linear activation function in all layers, including the output layer. The network is trained on a complex, non-linear classification task. What is the effective representational power of this network?

MLP Hard
A. It will behave like a deep autoencoder, compressing and decompressing the input linearly.
B. It can approximate any continuous function, as per the Universal Approximation Theorem.
C. It is equivalent to a single-layer perceptron and can only model linearly separable data.
D. It can model complex non-linear functions, but training will be unstable due to the depth.

42 In a Convolutional Neural Network, what is the primary purpose of using a 1x1 convolution (also known as a pointwise convolution), and how does it achieve this without altering the spatial dimensions (height and width) of the feature map?

CNN Hard
A. To act as a spatial pooling layer, reducing the height and width of the feature maps.
B. To increase the receptive field of subsequent layers by combining information from a 1x1 spatial area.
C. To introduce non-linearity by applying an activation function to each pixel independently.
D. To perform dimensionality reduction or expansion across the channel dimension while preserving spatial information.

43 Both LSTMs and GRUs are designed to mitigate the vanishing gradient problem in RNNs. Which statement accurately describes a key architectural difference and its performance implication?

RNN Hard
A. LSTMs have a separate cell state and hidden state, while GRUs combine them. This makes GRUs computationally more efficient but potentially less expressive for complex sequences.
B. GRUs have three gates (input, forget, output) while LSTMs have only two (reset, update), making LSTMs simpler and faster to train.
C. The cell state in a GRU acts as a long-term memory conveyor belt, a feature that is absent in LSTMs.
D. LSTMs use a reset gate to discard irrelevant past information, whereas GRUs use a forget gate, which is a less effective mechanism.

44 The self-attention mechanism in the original Transformer model has a computational complexity of O(n² · d), where n is the sequence length and d is the model dimension. This makes it challenging for very long sequences. Which of the following is NOT a valid and commonly researched approach to mitigate this quadratic complexity?

Transformer Architecture and Applications Hard
A. Using a sliding window attention mechanism where each token only attends to a fixed number of neighboring tokens (e.g., Longformer).
B. Applying a fixed, sparse attention pattern (e.g., strided or dilated patterns) to reduce the number of attended-to tokens (e.g., Sparse Transformers).
C. Replacing the Softmax function with a linear kernel, allowing the order of matrix multiplication to be rearranged and computed in O(n) time (e.g., Linear Transformers).
D. Drastically increasing the number of attention heads while reducing the dimension per head, such that the total computation becomes linear with respect to sequence length.

45 What is the primary architectural reason that a model like BERT is considered an 'encoder-only' architecture, while a model like GPT-3 is considered a 'decoder-only' architecture, and how does this influence their ideal use cases?

language models (BERT, GPT) Hard
A. BERT uses bidirectional self-attention (seeing the whole sentence at once), making it an encoder ideal for understanding context (e.g., NLU tasks). GPT uses masked, unidirectional self-attention (seeing only past tokens), making it a decoder ideal for generating text (e.g., NLG tasks).
B. BERT processes tokens in parallel, which is characteristic of encoders, while GPT processes tokens sequentially, which is characteristic of decoders.
C. BERT is pre-trained with a Masked Language Model objective, which is an encoding task, while GPT is pre-trained with a Causal Language Model objective, a decoding task.
D. BERT uses absolute positional embeddings, suitable for encoding, while GPT uses relative positional embeddings, which are better for decoding sequences of varying lengths.

46 Static word embeddings like Word2Vec or GloVe suffer from the problem of polysemy (a word having multiple meanings). How do contextualized embedding models like ELMo and BERT fundamentally address this limitation?

Embeddings Hard
A. By training on a much larger corpus, they learn a single, more robust vector that averages all meanings of a word.
B. They generate a different embedding vector for a word each time it appears, based on its specific context in the sentence.
C. They maintain a predefined dictionary of vectors for each possible meaning of a word and use a classifier to select the correct one.
D. They use character-level convolutions to build word embeddings, which helps differentiate meanings based on morphology.

47 In the standard scaled dot-product attention formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, what is the critical purpose of the scaling factor 1/√d_k, where d_k is the dimension of the key vectors?

Attention Hard
A. It ensures that the dot product values are positive before being passed to the softmax function.
B. It acts as a regularization term to prevent overfitting by penalizing large dot product values.
C. It is a temperature parameter that controls the sharpness of the attention distribution, with larger values making the distribution softer.
D. It normalizes the variance of the dot products to prevent the softmax function from saturating into regions with extremely small gradients.
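The saturation argument in question 47 can be simulated: if query and key components are drawn i.i.d. from N(0, 1), the dot product q·k has variance d_k, and dividing by √d_k restores roughly unit variance before the softmax. A sketch using only the standard library (the sample count and d_k = 64 are arbitrary illustrative choices):

```python
import math
import random

random.seed(0)
d_k = 64
samples = 5000

dots = []
for _ in range(samples):
    q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    dots.append(sum(qi * ki for qi, ki in zip(q, k)))

mean = sum(dots) / samples
var = sum((x - mean) ** 2 for x in dots) / samples

print(var)        # close to d_k = 64: softmax inputs this spread out saturate
print(var / d_k)  # close to 1: scaling by 1/sqrt(d_k) divides the variance by d_k
```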

48 Consider Byte-Pair Encoding (BPE) and WordPiece tokenization strategies. A key difference lies in how they select the next pair of tokens to merge during vocabulary creation. Which statement accurately describes this difference and its implication?

Tokenization Hard
A. BPE merges the most frequently occurring pair of adjacent tokens, which can sometimes lead to suboptimal segmentation of common words. WordPiece merges the pair that maximizes the likelihood of the training data, often resulting in more intuitive subwords.
B. BPE merges based on raw frequency counts, while WordPiece uses a complex scoring system based on mutual information between token pairs.
C. WordPiece is a character-based tokenizer, while BPE is subword-based, making WordPiece immune to out-of-vocabulary issues.
D. BPE always splits words at the rarest character pair, while WordPiece splits based on a predefined vocabulary of common prefixes and suffixes.

49 A convolutional layer has an input volume of 64x64x16, uses 32 filters of size 5x5, a stride of 2, and padding of 1. What is the total number of learnable parameters (weights and biases) in this layer?

CNN Hard
A. 12,832
B. 40,992
C. 12,800
D. 1,310,720
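As a check on the parameter-count formula, here is a sketch using a 5x5 kernel, 16 input channels, and 32 filters (illustrative values consistent with option A; note that spatial size, stride, and padding do not affect the count):

```python
def conv_params(kernel_h, kernel_w, in_channels, num_filters):
    """Learnable parameters of a conv layer: one weight tensor per filter
    (kernel_h * kernel_w * in_channels) plus one bias per filter.
    Stride, padding, and input spatial size play no role here."""
    weights = kernel_h * kernel_w * in_channels * num_filters
    biases = num_filters
    return weights + biases

# 5x5 kernels over 16 input channels, with 32 filters.
print(conv_params(5, 5, 16, 32))  # 12832
```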

50 When evaluating machine translation systems, the BLEU score is a common metric. However, it can be misleadingly high for a translation that is grammatically correct but semantically nonsensical or inaccurate. What fundamental limitation of the BLEU score causes this discrepancy?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. It is computationally expensive and cannot be used during the training process as a loss function.
B. It primarily measures recall, checking if all words from the reference translation are present, but ignores precision.
C. It is based on n-gram precision and a brevity penalty, rewarding lexical overlap with reference translations but failing to capture semantic meaning or sentence structure.
D. It requires multiple human-generated reference translations, which are often unavailable or inconsistent.

51 A single perceptron is trained on a 2D dataset using the standard perceptron learning rule. The dataset is NOT linearly separable. What will be the behavior of the learning algorithm during training?

Perceptron Hard
A. The algorithm will raise a mathematical error because the loss function cannot be calculated for non-separable data.
B. The algorithm will never converge, and the weights of the perceptron will continue to be updated indefinitely.
C. The algorithm will converge to a decision boundary that minimizes the number of misclassified points.
D. The algorithm will quickly converge to a random decision boundary and stop updating.

52 What is the primary motivation for using Teacher Forcing during the training of recurrent neural networks for sequence generation tasks, and what is its main drawback?

RNN Hard
A. Motivation: To speed up convergence by providing the network with ground-truth inputs at each timestep. Drawback: It can lead to a discrepancy between training and inference, causing instability when the model generates long sequences on its own.
B. Motivation: To reduce the memory footprint of the model during training. Drawback: It requires significantly more computation per training step.
C. Motivation: To enable the model to learn long-range dependencies. Drawback: It exacerbates the vanishing gradient problem.
D. Motivation: To prevent overfitting by introducing noise into the training process. Drawback: It slows down the training process significantly.

53 In the Transformer architecture, positional encodings are added to the input embeddings. Why is this step strictly necessary for the model to process sequences, unlike in an RNN?

Transformer Architecture and Applications Hard
A. To normalize the input embeddings before they are processed by the attention layers, improving training stability.
B. To allow the model to handle sequences of variable lengths by encoding the absolute position of each token.
C. Because the self-attention mechanism is permutation-invariant; without positional information, the model would treat a sentence as an unordered bag of words.
D. To provide a unique signal for the start and end of a sequence, which the attention mechanism cannot otherwise determine.
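The fixed sinusoidal encoding from the original Transformer paper makes the permutation-invariance point in question 53 concrete: each position receives a unique vector that is added to its token embedding. A minimal sketch (d_model = 8 is an arbitrary illustration):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding: even indices use sine, odd indices
    use cosine, with wavelengths increasing geometrically across dimensions."""
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Position 0 encodes as alternating 0/1; other positions produce distinct
# vectors, giving self-attention a signal for token order.
print(positional_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(positional_encoding(3, 8)[:2])
```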

54 BERT's pre-training involves two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later analysis (e.g., by the RoBERTa paper) suggested that NSP might be an ineffective pre-training task. What was the reasoning behind this conclusion?

language models (BERT, GPT) Hard
A. The model learned to focus on topic similarity between sentences rather than coherence and logical flow, as the negative examples were too easy to distinguish (randomly sampled sentences).
B. The NSP task was found to be computationally too expensive, providing marginal benefits for the high training cost.
C. The binary classification nature of NSP was found to be detrimental to the model's ability to generate nuanced text representations.
D. The MLM task already implicitly taught the model sentence relationships, making the NSP task redundant.

55 In text summarization, what is the fundamental difference between an 'extractive' and an 'abstractive' approach, and what kind of neural network architecture is typically required for a purely abstractive model?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. Extractive summarization is a form of supervised learning, while abstractive summarization is unsupervised. Abstractive models typically rely on Transformer-based encoders like BERT.
B. Extractive summarization uses rule-based systems to identify keywords, while abstractive summarization uses deep learning. Abstractive models require a CNN-based architecture.
C. Extractive summarization selects important sentences from the source text, while abstractive summarization generates new sentences that capture the meaning. Abstractive models typically require an encoder-decoder architecture (e.g., Sequence-to-Sequence).
D. Extractive summarization creates a summary that is shorter than the source text, while abstractive summarization can create a longer, more detailed summary. Both can be implemented with a simple classifier.

56 When designing a task-oriented chatbot (e.g., for booking flights), what is the distinct role of 'Dialogue State Tracking' (DST) and why is it a more complex problem than simple 'Intent Recognition'?

Building chatbots and digital assistants Hard
A. Intent Recognition maps user input to a predefined action, while DST tracks the emotional state of the user to adjust the chatbot's tone. DST is harder due to the subjectivity of emotion.
B. DST is responsible for generating the chatbot's response, while Intent Recognition decides which knowledge base to query. DST is harder because natural language generation is a complex task.
C. DST is the process of training the chatbot's language model, while Intent Recognition is the process of fine-tuning it for a specific task. DST is harder because it requires more data.
D. DST maintains a representation of the user's goal and collected information (slots) throughout a multi-turn conversation, while Intent Recognition is a single-turn classification of the user's immediate goal. DST is harder because it must handle context, ambiguity, and coreference over time.

57 Vector-space analogies like vec('king') - vec('man') + vec('woman') ≈ vec('queen') are a famous property of Word2Vec embeddings. This property suggests that semantic relationships are encoded as linear substructures in the embedding space. What is a known major limitation or failure mode of this analogical reasoning capability?

Embeddings Hard
A. This property only works for single words and fails completely when trying to perform analogies with phrases or sentences.
B. The resulting vector is often not the closest vector to the target word (e.g., 'queen') and requires a separate classification step to identify the correct analogy.
C. The vector arithmetic is not commutative, meaning vec('woman') - vec('man') + vec('king') would produce a completely different result.
D. The geometric relationships are highly sensitive to the specific training corpus and hyperparameters, and often do not generalize well to relationships beyond simple gender or capital-city analogies.

58 In semantic segmentation tasks, a common architectural pattern is an 'encoder-decoder' structure (like U-Net) where the encoder uses strided convolutions or pooling, and the decoder uses upsampling or transposed convolutions. What is the critical role of 'skip connections' between the encoder and decoder in such architectures?

CNN Hard
A. To combine low-level, high-resolution spatial information from the encoder with high-level, semantic information from the decoder, enabling precise localization.
B. To reduce the number of parameters in the decoder by reusing the weights from the corresponding encoder layers.
C. To facilitate gradient flow through the deep network, mitigating the vanishing gradient problem common in deep architectures.
D. To enforce a bottleneck in the information flow, forcing the encoder to learn a compressed, salient representation of the input.

59 What is the primary advantage of Multi-Head Self-Attention (MHSA) over using a single, large self-attention mechanism with the same total number of dimensions?

Attention Hard
A. It breaks the quadratic complexity of self-attention with respect to sequence length, making it linear.
B. Each head can process a different segment of the input sequence, allowing for parallel processing of very long documents.
C. It allows the model to jointly attend to information from different representation subspaces at different positions, effectively learning diverse types of relationships (e.g., syntactic, positional).
D. It is significantly more computationally efficient than a single large attention head, reducing the overall training time of the Transformer model.

60 In the pipeline of NLP phases, consider the relationship between Syntactic Analysis (Parsing) and Semantic Analysis. Which statement best describes a scenario where a failure in syntactic analysis directly leads to an incorrect semantic interpretation?

NLP phases Hard
A. In the sentence "Colorless green ideas sleep furiously," the sentence is syntactically correct but semantically meaningless, showing the independence of the two phases.
B. In the sentence "The old man the boats," a parser failing to identify "man" as a verb (meaning to operate) would lead to a nonsensical semantic interpretation.
C. In the sentence "The bank is on the river bank," a system failing to disambiguate the two meanings of "bank" is a failure of semantic analysis, independent of syntax.
D. A system that correctly identifies the subject, verb, and object in "The dog chased the cat" has completed syntactic analysis, but semantic analysis is still required to understand what 'chasing' means.