Unit 4 - Practice Quiz

INT428 60 Questions

1 What is the fundamental building block of a neural network, inspired by the human brain?

Introduction to Neural Networks Easy
A. Pixel
B. Algorithm
C. Transistor
D. Neuron (or Node)

2 What kind of problems can a single-layer Perceptron solve?

Perceptron Easy
A. Linearly separable problems
B. Non-linearly separable problems
C. All classification problems
D. Image recognition problems

3 What does MLP stand for in the context of neural networks?

MLP Easy
A. Maximum Likelihood Program
B. Main Logic Processor
C. Multiple Linear Progression
D. Multi-Layer Perceptron

4 Which type of Deep Neural Network is primarily designed for processing grid-like data, such as images?

CNN Easy
A. Convolutional Neural Network (CNN)
B. Transformer
C. Recurrent Neural Network (RNN)
D. Multi-Layer Perceptron (MLP)

5 Recurrent Neural Networks (RNNs) are best suited for what type of data?

RNN Easy
A. Sequential data (e.g., time series, text)
B. Image data
C. Static, independent data points
D. Tabular data

6 What is the key innovation in the Transformer architecture that allows it to process entire sequences at once and handle long-range dependencies effectively?

Transformer Architecture and Applications Easy
A. Recurrent Loops
B. The Attention Mechanism
C. The Sigmoid Function
D. Convolutional Layers

7 What is the main goal of Natural Language Processing (NLP)?

Modern NLP: Introduction to NLP Easy
A. To enable computers to understand, interpret, and generate human language
B. To build faster computer hardware
C. To create realistic computer graphics
D. To optimize database queries

8 Which phase of NLP involves analyzing the grammatical structure of a sentence and the relationships between words?

NLP phases Easy
A. Semantic Analysis
B. Syntactic Analysis (Parsing)
C. Pragmatic Analysis
D. Lexical Analysis (Tokenization)

9 In NLP, what is the process of breaking down a text into smaller units like words or sentences called?

Tokenization Easy
A. Classification
B. Summarization
C. Embedding
D. Tokenization

10 What is the purpose of a word embedding in NLP?

Embeddings Easy
A. To represent words as dense numerical vectors
B. To correct spelling mistakes
C. To translate words into another language
D. To count the frequency of each word

11 In the context of deep learning models like Transformers, what does the 'attention' mechanism help the model to do?

Attention Easy
A. Reduce the number of layers in the network
B. Focus on the most relevant parts of the input sequence
C. Increase the speed of model training
D. Convert text to speech

12 What are BERT and GPT well-known examples of?

language models (BERT, GPT) Easy
A. Image Classification Models
B. Large Language Models (LLMs)
C. Speech Recognition APIs
D. Database Management Systems

13 What is the primary function of a chatbot?

Building chatbots and digital assistants Easy
A. To perform complex mathematical simulations
B. To simulate conversation with human users
C. To manage computer hardware resources
D. To analyze and visualize data

14 Determining if a customer review is positive, negative, or neutral is an example of which NLP task?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Sentiment Analysis
B. Named Entity Recognition
C. Text Summarization
D. Machine Translation

15 Why is an activation function, such as ReLU or Sigmoid, necessary in a Multi-Layer Perceptron (MLP)?

MLP Easy
A. To reduce the number of neurons
B. To make the model run faster
C. To introduce non-linearity into the model
D. To only work with positive numbers

16 In a CNN, what is the primary purpose of a 'pooling' layer (e.g., MaxPooling)?

CNN Easy
A. To classify the image
B. To reduce the spatial dimensions (width and height) of the input volume
C. To increase the number of features
D. To apply a non-linear transformation

17 The task of automatically converting text from one language to another, like from English to Spanish, is called:

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Language Detection
B. Machine Translation
C. Sentiment Analysis
D. Text Generation

18 The 'P' in GPT stands for 'Pre-trained'. What does this mean?

language models (BERT, GPT) Easy
A. The model requires a person to train it manually
B. The model can only predict one word at a time
C. The model's parameters are permanently fixed
D. The model is trained on a massive dataset before being fine-tuned for specific tasks

19 Which NLP task is focused on creating a shorter version of a long document while retaining its most important information?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Text Summarization
B. Question Answering
C. Machine Translation
D. Part-of-Speech Tagging

20 The 'vanishing gradient problem' is a common issue that can make it difficult to train which type of neural network on long sequences?

RNN Easy
A. Autoencoder
B. Multi-Layer Perceptron (MLP)
C. Convolutional Neural Network (CNN)
D. Recurrent Neural Network (RNN)

21 A single-layer perceptron is a linear classifier. Which of the following problems can it not solve, and why?

Perceptron Medium
A. The AND problem, because it involves multiple true conditions.
B. The XOR (exclusive OR) problem, because the data points are not linearly separable.
C. The OR problem, because the decision boundary is diagonal.
D. The NOT problem, because it involves inverting the input.
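A quick way to see why XOR defeats a single perceptron is to brute-force a small grid of candidate decision lines w1*x1 + w2*x2 + b > 0: AND admits a separating line, XOR does not. This is only an illustrative sketch over a coarse weight grid, not a proof.

```python
# Minimal sketch: search a small weight grid for a separating line
# w1*x1 + w2*x2 + b > 0. AND admits a solution; XOR does not.

def separable(truth_table, grid):
    """Return True if some (w1, w2, b) classifies every row correctly."""
    for w1 in grid:
        for w2 in grid:
            for b in grid:
                ok = all(((w1 * x1 + w2 * x2 + b > 0) == y)
                         for (x1, x2), y in truth_table)
                if ok:
                    return True
    return False

grid = [x / 2 for x in range(-6, 7)]  # -3.0 .. 3.0 in steps of 0.5
AND = [((0, 0), False), ((0, 1), False), ((1, 0), False), ((1, 1), True)]
XOR = [((0, 0), False), ((0, 1), True), ((1, 0), True), ((1, 1), False)]

print(separable(AND, grid))  # True  -- e.g. w1 = w2 = 1, b = -1.5
print(separable(XOR, grid))  # False -- no single line separates XOR
```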

22 In a Multi-Layer Perceptron (MLP), what is the primary consequence of removing all non-linear activation functions (like ReLU or sigmoid) from the hidden layers?

MLP Medium
A. The network becomes unable to perform regression tasks.
B. The number of trainable parameters in the network is significantly reduced.
C. The network will train much faster but lose all accuracy.
D. The network collapses into a single linear transformation, making it no more powerful than a single-layer network.
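Answer D can be verified numerically: composing two affine layers without an activation is itself a single affine map, W = W2·W1 and b = W2·b1 + b2. A minimal NumPy sketch:

```python
import numpy as np

# Sketch: stacking linear (affine) layers without activations collapses
# to one linear map, so depth adds no representational power.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

deep = W2 @ (W1 @ x + b1) + b2   # two "layers", no non-linearity
W, b = W2 @ W1, W2 @ b1 + b2     # the equivalent single layer
flat = W @ x + b

print(np.allclose(deep, flat))  # True
```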

23 You are training a very deep MLP and observe that the gradients for the earliest layers are almost zero, causing training to stall. What is this phenomenon called, and which activation function is known to help mitigate it?

MLP Medium
A. Exploding Gradient Problem; Tanh
B. Vanishing Gradient Problem; ReLU
C. Overfitting; Sigmoid
D. Saddle Point Problem; Softmax

24 A 2D convolutional layer is applied to a 64x64 pixel grayscale image. The layer uses a 5x5 kernel, a stride of 2, and no padding. What will be the spatial dimensions (height x width) of the output feature map?

CNN Medium
A. 30x30
B. 59x59
C. 60x60
D. 32x32
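The standard output-size formula is floor((n + 2p − k) / s) + 1 per spatial dimension; plugging in the question's numbers confirms answer A:

```python
# Conv output size per spatial dimension: floor((n + 2p - k) / s) + 1.
def conv_out(n, k, s, p=0):
    return (n + 2 * p - k) // s + 1

print(conv_out(64, k=5, s=2, p=0))  # 30 -> the feature map is 30x30
```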

25 Beyond reducing computational complexity, what is a key benefit of using a max-pooling layer in a Convolutional Neural Network (CNN)?

CNN Medium
A. It provides a degree of translation invariance.
B. It introduces non-linearity into the network.
C. It increases the receptive field of subsequent layers.
D. It normalizes the feature map activations.

26 Why is a Long Short-Term Memory (LSTM) network often preferred over a standard Recurrent Neural Network (RNN) for tasks involving long sequences, such as paragraph-level text analysis?

RNN Medium
A. LSTMs can be parallelized during training, unlike standard RNNs.
B. LSTMs use a more complex activation function that captures more features.
C. LSTMs use gating mechanisms to control information flow, mitigating the vanishing gradient problem.
D. LSTMs have fewer parameters, making them faster to train on long sequences.

27 In a many-to-one RNN architecture used for text classification, what is the typical role of the final hidden state?

RNN Medium
A. It is used to generate the first word of the output sequence.
B. It serves as a summary vector of the entire input sequence, which is then fed into a final classification layer.
C. It is discarded, as only the outputs from each time step are relevant.
D. It is averaged with the initial hidden state to normalize the network's memory.

28 What is the primary advantage of the self-attention mechanism in Transformers over the sequential processing of RNNs in terms of computational efficiency?

Transformer Architecture and Applications Medium
A. It uses a simpler update rule than the gating mechanisms in LSTMs.
B. It requires fewer matrix multiplications per layer.
C. It has a constant-length path for information to travel between any two positions, preventing vanishing gradients.
D. It allows for parallel computation across all tokens in a sequence, as the relationship between any two tokens is calculated independently of their distance.

29 In the Transformer architecture, what is the specific purpose of the Positional Encoding step?

Transformer Architecture and Applications Medium
A. To normalize the word embeddings before they enter the attention layers.
B. To reduce the dimensionality of the input embeddings to save computation.
C. To inject information about the relative or absolute position of tokens, since the self-attention mechanism itself is permutation-invariant.
D. To convert the input tokens into a continuous vector representation.

30 You are developing an NLP model and must choose a tokenization strategy. Why might a subword tokenization algorithm like Byte-Pair Encoding (BPE) be superior to simple word-based tokenization with a fixed vocabulary?

Tokenization Medium
A. It is a lossless compression algorithm that reduces model size.
B. It can handle out-of-vocabulary (OOV) words by breaking them into known subword units.
C. It always results in a shorter sequence of tokens, reducing computation time.
D. It guarantees that every word is broken into its morphological root and affixes.
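One BPE training step can be sketched in a few lines: count every adjacent symbol pair across the corpus (weighted by word frequency) and merge the most frequent pair everywhere. The toy corpus below is illustrative, not from any particular tokenizer's training data.

```python
from collections import Counter

# Toy sketch of one BPE merge step: find the most frequent adjacent
# symbol pair in a corpus of space-separated symbols and merge it.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(pair, words):
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq
            for word, freq in words.items()}

words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(words)  # ('e', 's'), tied at 9 with ('s', 't')
words = merge(pair, words)
print(pair, words)
```

Repeating this loop grows the vocabulary one merge at a time; unseen words at inference are then segmented into whichever learned subword units cover them.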

31 If a word embedding model learns vectors such that vector('Paris') - vector('France') + vector('Italy') results in a vector very close to vector('Rome'), what does this demonstrate about the learned embedding space?

Embeddings Medium
A. The model has captured semantic relationships (like capital city of a country) as geometric relationships in the vector space.
B. The model is only effective for proper nouns and geographical locations.
C. The vectors for all countries are parallel to each other.
D. The model has simply memorized geographical facts from the training data.

32 What is the primary motivation for using pre-trained embeddings (like GloVe or Word2Vec) when building an NLP model for a task with a relatively small dataset?

Embeddings Medium
A. To ensure the model's vocabulary is limited to only the most common words in the English language.
B. To leverage the rich semantic knowledge learned from a massive text corpus, which provides a better model initialization and improves generalization.
C. To reduce the number of layers required in the neural network.
D. To completely eliminate the need for any task-specific training (fine-tuning).

33 In a sequence-to-sequence model with attention for machine translation, if a high attention weight is placed on an input word, what does it signify for the current output step?

Attention Medium
A. That the input word is a grammatical stop word that should be ignored.
B. That the input word has the highest frequency in the training corpus.
C. That the input word is highly relevant for predicting the current output word.
D. That the input word is the last word of the source sentence.

34 Which statement best describes a key architectural difference between BERT and GPT that influences their primary applications?

language models (BERT, GPT) Medium
A. GPT uses an attention mechanism while BERT relies on recurrent layers, making GPT better for longer sequences.
B. GPT is trained on a larger vocabulary than BERT, making it more knowledgeable.
C. BERT is an encoder-only model that processes text bidirectionally, making it ideal for language understanding tasks, while GPT is a decoder-only model that processes text auto-regressively, making it ideal for language generation.
D. BERT must be fine-tuned for specific tasks, whereas GPT can be used directly for any task without fine-tuning.

35 What is the core idea behind the Masked Language Model (MLM) pre-training objective used for BERT?

language models (BERT, GPT) Medium
A. To mask all nouns in a sentence and have the model predict them based on the verbs and adjectives.
B. To predict the next token in a sequence using only the previous tokens, which is a unidirectional approach.
C. To predict randomly masked tokens in a sequence by using both left and right context, forcing the model to learn a deep bidirectional understanding of language.
D. To predict the next sentence in a document, teaching the model about discourse coherence.

36 An NLP system is designed to analyze a legal contract to identify the parties involved, their obligations, and the effective dates. This task goes beyond just parsing grammar. Which NLP phase is most central to this goal?

Modern NLP: Introduction to NLP, NLP phases Medium
A. Syntactic Analysis (Parsing)
B. Semantic Analysis
C. Lexical Analysis
D. Morphological Analysis

37 In the architecture of a task-oriented chatbot, what is the primary responsibility of the Dialogue Management (DM) component?

Building chatbots and digital assistants Medium
A. To convert the chatbot's planned response into natural language (Text-to-Speech or NLG).
B. To convert the user's spoken words into text (Speech-to-Text).
C. To maintain the state of the conversation and decide the chatbot's next action.
D. To extract the user's intent and entities from their message.

38 You are building a system to generate a short, one-paragraph summary of a long news article. This is an example of which NLP use case, and what is a primary challenge?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Named Entity Recognition; identifying all the people and organizations mentioned.
B. Sentiment Analysis; determining if the article's tone is positive or negative.
C. Text Summarization; ensuring the summary is coherent and factually consistent with the source text.
D. Machine Translation; converting the article to another language.

39 A movie review website wants to automatically assign a 'thumbs up' or 'thumbs down' rating to user-submitted reviews based on the text. Which NLP task is most appropriate for this problem?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Sentiment Analysis
B. Text Summarization
C. Topic Modeling
D. Question Answering

40 A neuron in a neural network has an input vector x = [2.0, 3.0], a weight vector w = [0.5, -1.5], and a bias b = 1.0. What is the output of this neuron if it uses a ReLU (Rectified Linear Unit) activation function?

Introduction to Neural Networks Medium
A. 0.0
B. 1.0
C. -3.5
D. -2.5
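The computation here is a single forward pass: the pre-activation is w·x + b = (2.0)(0.5) + (3.0)(−1.5) + 1.0 = −2.5, and ReLU clips negatives to zero, giving answer A.

```python
# Worked forward pass for this neuron: pre-activation w . x + b, then ReLU.
x = [2.0, 3.0]
w = [0.5, -1.5]
b = 1.0

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # 1.0 - 4.5 + 1.0 = -2.5
out = max(0.0, z)                             # ReLU clips negatives to zero
print(z, out)  # -2.5 0.0
```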

41 A Multi-Layer Perceptron (MLP) is constructed with 5 hidden layers, but exclusively uses the linear activation function in all layers, including the output layer. The network is trained on a complex, non-linear classification task. What is the effective representational power of this network?

MLP Hard
A. It will behave like a deep autoencoder, compressing and decompressing the input linearly.
B. It can approximate any continuous function, as per the Universal Approximation Theorem.
C. It can model complex non-linear functions, but training will be unstable due to the depth.
D. It is equivalent to a single-layer perceptron and can only model linearly separable data.

42 In a Convolutional Neural Network, what is the primary purpose of using a 1x1 convolution (also known as a pointwise convolution), and how does it achieve this without altering the spatial dimensions (height and width) of the feature map?

CNN Hard
A. To perform dimensionality reduction or expansion across the channel dimension while preserving spatial information.
B. To increase the receptive field of subsequent layers by combining information from a 1x1 spatial area.
C. To introduce non-linearity by applying an activation function to each pixel independently.
D. To act as a spatial pooling layer, reducing the height and width of the feature maps.

43 Both LSTMs and GRUs are designed to mitigate the vanishing gradient problem in RNNs. Which statement accurately describes a key architectural difference and its performance implication?

RNN Hard
A. GRUs have three gates (input, forget, output) while LSTMs have only two (reset, update), making LSTMs simpler and faster to train.
B. LSTMs have a separate cell state and hidden state, while GRUs combine them. This makes GRUs computationally more efficient but potentially less expressive for complex sequences.
C. The cell state in a GRU acts as a long-term memory conveyor belt, a feature that is absent in LSTMs.
D. LSTMs use a reset gate to discard irrelevant past information, whereas GRUs use a forget gate, which is a less effective mechanism.

44 The self-attention mechanism in the original Transformer model has a computational complexity of O(n² · d), where n is the sequence length and d is the model dimension. This makes it challenging for very long sequences. Which of the following is NOT a valid and commonly researched approach to mitigate this quadratic complexity?

Transformer Architecture and Applications Hard
A. Drastically increasing the number of attention heads while reducing the dimension per head, such that the total computation becomes linear with respect to sequence length.
B. Replacing the Softmax function with a linear kernel, allowing the order of matrix multiplication to be rearranged and computed in O(n) with respect to sequence length (e.g., Linear Transformers).
C. Using a sliding window attention mechanism where each token only attends to a fixed number of neighboring tokens (e.g., Longformer).
D. Applying a fixed, sparse attention pattern (e.g., strided or dilated patterns) to reduce the number of attended-to tokens (e.g., Sparse Transformers).

45 What is the primary architectural reason that a model like BERT is considered an 'encoder-only' architecture, while a model like GPT-3 is considered a 'decoder-only' architecture, and how does this influence their ideal use cases?

language models (BERT, GPT) Hard
A. BERT uses bidirectional self-attention (seeing the whole sentence at once), making it an encoder ideal for understanding context (e.g., NLU tasks). GPT uses masked, unidirectional self-attention (seeing only past tokens), making it a decoder ideal for generating text (e.g., NLG tasks).
B. BERT processes tokens in parallel, which is characteristic of encoders, while GPT processes tokens sequentially, which is characteristic of decoders.
C. BERT uses absolute positional embeddings, suitable for encoding, while GPT uses relative positional embeddings, which are better for decoding sequences of varying lengths.
D. BERT is pre-trained with a Masked Language Model objective, which is an encoding task, while GPT is pre-trained with a Causal Language Model objective, a decoding task.

46 Static word embeddings like Word2Vec or GloVe suffer from the problem of polysemy (a word having multiple meanings). How do contextualized embedding models like ELMo and BERT fundamentally address this limitation?

Embeddings Hard
A. By training on a much larger corpus, they learn a single, more robust vector that averages all meanings of a word.
B. They use character-level convolutions to build word embeddings, which helps differentiate meanings based on morphology.
C. They maintain a predefined dictionary of vectors for each possible meaning of a word and use a classifier to select the correct one.
D. They generate a different embedding vector for a word each time it appears, based on its specific context in the sentence.

47 In the standard scaled dot-product attention formula, Attention(Q, K, V) = softmax(QKᵀ / √d_k)V, what is the critical purpose of the scaling factor 1/√d_k, where d_k is the dimension of the key vectors?

Attention Hard
A. It normalizes the variance of the dot products to prevent the softmax function from saturating into regions with extremely small gradients.
B. It acts as a regularization term to prevent overfitting by penalizing large dot product values.
C. It is a temperature parameter that controls the sharpness of the attention distribution, with larger values making the distribution softer.
D. It ensures that the dot product values are positive before being passed to the softmax function.
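Scaled dot-product attention is compact enough to sketch directly; dividing the score matrix by sqrt(d_k) keeps the dot products' variance near 1 so the softmax stays in a region with usable gradients. A minimal NumPy version (random toy inputs, no masking or multiple heads):

```python
import numpy as np

# Sketch of scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # scaling keeps variance ~1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # (4, 8); each weight row sums to 1
```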

48 Consider Byte-Pair Encoding (BPE) and WordPiece tokenization strategies. A key difference lies in how they select the next pair of tokens to merge during vocabulary creation. Which statement accurately describes this difference and its implication?

Tokenization Hard
A. BPE always splits words at the rarest character pair, while WordPiece splits based on a predefined vocabulary of common prefixes and suffixes.
B. BPE merges the most frequently occurring pair of adjacent tokens, which can sometimes lead to suboptimal segmentation of common words. WordPiece merges the pair that maximizes the likelihood of the training data, often resulting in more intuitive subwords.
C. WordPiece is a character-based tokenizer, while BPE is subword-based, making WordPiece immune to out-of-vocabulary issues.
D. BPE merges based on raw frequency counts, while WordPiece uses a complex scoring system based on mutual information between token pairs.

49 A convolutional layer has an input volume of 32x32x16, uses 32 filters of size 5x5, a stride of 2, and padding of 1. What is the total number of learnable parameters (weights and biases) in this layer?

CNN Hard
A. 12,832
B. 40,992
C. 1,310,720
D. 12,800
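Stride and padding do not change the parameter count; only kernel size, input channels, and filter count matter. Assuming 5x5 filters over 16 input channels (the configuration implied by answer A), the count works out as:

```python
# Conv layer parameters: each filter has k*k*c_in weights plus one bias.
# Assumed configuration (implied by answer A): 5x5 filters, 16 input channels.
def conv_params(k, c_in, n_filters):
    return n_filters * (k * k * c_in) + n_filters

print(conv_params(k=5, c_in=16, n_filters=32))  # 12832
```

Note that answer D (12,800) is the same count with the 32 biases forgotten, a classic distractor.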

50 When evaluating machine translation systems, the BLEU score is a common metric. However, it can be misleadingly high for a translation that is grammatically correct but semantically nonsensical or inaccurate. What fundamental limitation of the BLEU score causes this discrepancy?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. It requires multiple human-generated reference translations, which are often unavailable or inconsistent.
B. It is based on n-gram precision and a brevity penalty, rewarding lexical overlap with reference translations but failing to capture semantic meaning or sentence structure.
C. It primarily measures recall, checking if all words from the reference translation are present, but ignores precision.
D. It is computationally expensive and cannot be used during the training process as a loss function.

51 A single perceptron is trained on a 2D dataset using the standard perceptron learning rule. The dataset is NOT linearly separable. What will be the behavior of the learning algorithm during training?

Perceptron Hard
A. The algorithm will quickly converge to a random decision boundary and stop updating.
B. The algorithm will never converge, and the weights of the perceptron will continue to be updated indefinitely.
C. The algorithm will raise a mathematical error because the loss function cannot be calculated for non-separable data.
D. The algorithm will converge to a decision boundary that minimizes the number of misclassified points.

52 What is the primary motivation for using Teacher Forcing during the training of recurrent neural networks for sequence generation tasks, and what is its main drawback?

RNN Hard
A. Motivation: To enable the model to learn long-range dependencies. Drawback: It exacerbates the vanishing gradient problem.
B. Motivation: To speed up convergence by providing the network with ground-truth inputs at each timestep. Drawback: It can lead to a discrepancy between training and inference, causing instability when the model generates long sequences on its own.
C. Motivation: To reduce the memory footprint of the model during training. Drawback: It requires significantly more computation per training step.
D. Motivation: To prevent overfitting by introducing noise into the training process. Drawback: It slows down the training process significantly.

53 In the Transformer architecture, positional encodings are added to the input embeddings. Why is this step strictly necessary for the model to process sequences, unlike in an RNN?

Transformer Architecture and Applications Hard
A. To allow the model to handle sequences of variable lengths by encoding the absolute position of each token.
B. Because the self-attention mechanism is permutation-invariant; without positional information, the model would treat a sentence as an unordered bag of words.
C. To normalize the input embeddings before they are processed by the attention layers, improving training stability.
D. To provide a unique signal for the start and end of a sequence, which the attention mechanism cannot otherwise determine.
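The sinusoidal scheme from the original Transformer paper is one concrete way to inject this positional information: PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)). A small sketch:

```python
import numpy as np

# Sketch of sinusoidal positional encodings (original Transformer scheme):
# even dimensions get sin(pos / 10000**(2i/d)), odd dimensions get cos.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
print(pe[0])     # position 0: all sin terms are 0, all cos terms are 1
```

These vectors are simply added to the token embeddings, giving each position a distinct, smoothly varying signature.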

54 BERT's pre-training involves two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later analysis (e.g., by the RoBERTa paper) suggested that NSP might be an ineffective pre-training task. What was the reasoning behind this conclusion?

language models (BERT, GPT) Hard
A. The binary classification nature of NSP was found to be detrimental to the model's ability to generate nuanced text representations.
B. The MLM task already implicitly taught the model sentence relationships, making the NSP task redundant.
C. The model learned to focus on topic similarity between sentences rather than coherence and logical flow, as the negative examples were too easy to distinguish (randomly sampled sentences).
D. The NSP task was found to be computationally too expensive, providing marginal benefits for the high training cost.

55 In text summarization, what is the fundamental difference between an 'extractive' and an 'abstractive' approach, and what kind of neural network architecture is typically required for a purely abstractive model?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. Extractive summarization creates a summary that is shorter than the source text, while abstractive summarization can create a longer, more detailed summary. Both can be implemented with a simple classifier.
B. Extractive summarization is a form of supervised learning, while abstractive summarization is unsupervised. Abstractive models typically rely on Transformer-based encoders like BERT.
C. Extractive summarization selects important sentences from the source text, while abstractive summarization generates new sentences that capture the meaning. Abstractive models typically require an encoder-decoder architecture (e.g., Sequence-to-Sequence).
D. Extractive summarization uses rule-based systems to identify keywords, while abstractive summarization uses deep learning. Abstractive models require a CNN-based architecture.

56 When designing a task-oriented chatbot (e.g., for booking flights), what is the distinct role of 'Dialogue State Tracking' (DST) and why is it a more complex problem than simple 'Intent Recognition'?

Building chatbots and digital assistants Hard
A. DST is responsible for generating the chatbot's response, while Intent Recognition decides which knowledge base to query. DST is harder because natural language generation is a complex task.
B. DST is the process of training the chatbot's language model, while Intent Recognition is the process of fine-tuning it for a specific task. DST is harder because it requires more data.
C. DST maintains a representation of the user's goal and collected information (slots) throughout a multi-turn conversation, while Intent Recognition is a single-turn classification of the user's immediate goal. DST is harder because it must handle context, ambiguity, and coreference over time.
D. Intent Recognition maps user input to a predefined action, while DST tracks the emotional state of the user to adjust the chatbot's tone. DST is harder due to the subjectivity of emotion.

57 Vector-space analogies like vec('king') - vec('man') + vec('woman') ≈ vec('queen') are a famous property of Word2Vec embeddings. This property suggests that semantic relationships are encoded as linear substructures in the embedding space. What is a known major limitation or failure mode of this analogical reasoning capability?

Embeddings Hard
A. The geometric relationships are highly sensitive to the specific training corpus and hyperparameters, and often do not generalize well to relationships beyond simple gender or capital-city analogies.
B. This property only works for single words and fails completely when trying to perform analogies with phrases or sentences.
C. The vector arithmetic is not commutative, meaning vec('woman') - vec('man') + vec('king') would produce a completely different result.
D. The resulting vector is often not the closest vector to the target word (e.g., 'queen') and requires a separate classification step to identify the correct analogy.

58 In semantic segmentation tasks, a common architectural pattern is an 'encoder-decoder' structure (like U-Net) where the encoder uses strided convolutions or pooling, and the decoder uses upsampling or transposed convolutions. What is the critical role of 'skip connections' between the encoder and decoder in such architectures?

CNN Hard
A. To facilitate gradient flow through the deep network, mitigating the vanishing gradient problem common in deep architectures.
B. To reduce the number of parameters in the decoder by reusing the weights from the corresponding encoder layers.
C. To enforce a bottleneck in the information flow, forcing the encoder to learn a compressed, salient representation of the input.
D. To combine low-level, high-resolution spatial information from the encoder with high-level, semantic information from the decoder, enabling precise localization.

59 What is the primary advantage of Multi-Head Self-Attention (MHSA) over using a single, large self-attention mechanism with the same total number of dimensions?

Attention Hard
A. Each head can process a different segment of the input sequence, allowing for parallel processing of very long documents.
B. It breaks the quadratic complexity of self-attention with respect to sequence length, making it linear.
C. It is significantly more computationally efficient than a single large attention head, reducing the overall training time of the Transformer model.
D. It allows the model to jointly attend to information from different representation subspaces at different positions, effectively learning diverse types of relationships (e.g., syntactic, positional).

60 In the pipeline of NLP phases, consider the relationship between Syntactic Analysis (Parsing) and Semantic Analysis. Which statement best describes a scenario where a failure in syntactic analysis directly leads to an incorrect semantic interpretation?

NLP phases Hard
A. In the sentence "The old man the boats," a parser failing to identify "man" as a verb (meaning to operate) would lead to a nonsensical semantic interpretation.
B. In the sentence "Colorless green ideas sleep furiously," the sentence is syntactically correct but semantically meaningless, showing the independence of the two phases.
C. In the sentence "The bank is on the river bank," a system failing to disambiguate the two meanings of "bank" is a failure of semantic analysis, independent of syntax.
D. A system that correctly identifies the subject, verb, and object in "The dog chased the cat" has completed syntactic analysis, but semantic analysis is still required to understand what 'chasing' means.