Unit 4 - Practice Quiz

INT428 60 Questions

1 What is the fundamental building block of a neural network, inspired by the human brain?

Introduction to Neural Networks Easy
A. Transistor
B. Neuron (or Node)
C. Algorithm
D. Pixel

2 What kind of problems can a single-layer Perceptron solve?

Perceptron Easy
A. Linearly separable problems
B. Image recognition problems
C. Non-linearly separable problems
D. All classification problems

3 What does MLP stand for in the context of neural networks?

MLP Easy
A. Multi-Layer Perceptron
B. Main Logic Processor
C. Maximum Likelihood Program
D. Multiple Linear Progression

4 Which type of Deep Neural Network is primarily designed for processing grid-like data, such as images?

CNN Easy
A. Convolutional Neural Network (CNN)
B. Recurrent Neural Network (RNN)
C. Transformer
D. Multi-Layer Perceptron (MLP)

5 Recurrent Neural Networks (RNNs) are best suited for what type of data?

RNN Easy
A. Sequential data (e.g., time series, text)
B. Tabular data
C. Image data
D. Static, independent data points

6 What is the key innovation in the Transformer architecture that allows it to process entire sequences at once and handle long-range dependencies effectively?

Transformer Architecture and Applications Easy
A. Recurrent Loops
B. The Sigmoid Function
C. The Attention Mechanism
D. Convolutional Layers

7 What is the main goal of Natural Language Processing (NLP)?

Modern NLP: Introduction to NLP Easy
A. To build faster computer hardware
B. To create realistic computer graphics
C. To enable computers to understand, interpret, and generate human language
D. To optimize database queries

8 Which phase of NLP involves analyzing the grammatical structure of a sentence and the relationships between words?

NLP phases Easy
A. Semantic Analysis
B. Pragmatic Analysis
C. Lexical Analysis (Tokenization)
D. Syntactic Analysis (Parsing)

9 In NLP, what is the process of breaking down a text into smaller units like words or sentences called?

Tokenization Easy
A. Embedding
B. Classification
C. Tokenization
D. Summarization

10 What is the purpose of a word embedding in NLP?

Embeddings Easy
A. To translate words into another language
B. To correct spelling mistakes
C. To count the frequency of each word
D. To represent words as dense numerical vectors

11 In the context of deep learning models like Transformers, what does the 'attention' mechanism help the model to do?

Attention Easy
A. Convert text to speech
B. Increase the speed of model training
C. Focus on the most relevant parts of the input sequence
D. Reduce the number of layers in the network

12 What are BERT and GPT well-known examples of?

language models (BERT, GPT) Easy
A. Speech Recognition APIs
B. Database Management Systems
C. Image Classification Models
D. Large Language Models (LLMs)

13 What is the primary function of a chatbot?

Building chatbots and digital assistants Easy
A. To manage computer hardware resources
B. To simulate conversation with human users
C. To analyze and visualize data
D. To perform complex mathematical simulations

14 Determining if a customer review is positive, negative, or neutral is an example of which NLP task?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Text Summarization
B. Named Entity Recognition
C. Sentiment Analysis
D. Machine Translation

15 Why is an activation function, such as ReLU or Sigmoid, necessary in a Multi-Layer Perceptron (MLP)?

MLP Easy
A. To only work with positive numbers
B. To reduce the number of neurons
C. To introduce non-linearity into the model
D. To make the model run faster

16 In a CNN, what is the primary purpose of a 'pooling' layer (e.g., MaxPooling)?

CNN Easy
A. To reduce the spatial dimensions (width and height) of the input volume
B. To increase the number of features
C. To classify the image
D. To apply a non-linear transformation

17 The task of automatically converting text from one language to another, like from English to Spanish, is called:

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Language Detection
B. Sentiment Analysis
C. Machine Translation
D. Text Generation

18 The 'P' in GPT stands for 'Pre-trained'. What does this mean?

language models (BERT, GPT) Easy
A. The model is trained on a massive dataset before being fine-tuned for specific tasks
B. The model can only predict one word at a time
C. The model requires a person to train it manually
D. The model's parameters are permanently fixed

19 Which NLP task is focused on creating a shorter version of a long document while retaining its most important information?

NLP use cases (sentiment analysis, translation, summarization) Easy
A. Part-of-Speech Tagging
B. Text Summarization
C. Question Answering
D. Machine Translation

20 The 'vanishing gradient problem' is a common issue that can make it difficult to train which type of neural network on long sequences?

RNN Easy
A. Recurrent Neural Network (RNN)
B. Multi-Layer Perceptron (MLP)
C. Autoencoder
D. Convolutional Neural Network (CNN)

21 A single-layer perceptron is a linear classifier. Which of the following problems can it not solve, and why?

Perceptron Medium
A. The OR problem, because the decision boundary is diagonal.
B. The XOR (exclusive OR) problem, because the data points are not linearly separable.
C. The AND problem, because it involves multiple true conditions.
D. The NOT problem, because it involves inverting the input.

22 In a Multi-Layer Perceptron (MLP), what is the primary consequence of removing all non-linear activation functions (like ReLU or sigmoid) from the hidden layers?

MLP Medium
A. The network becomes unable to perform regression tasks.
B. The number of trainable parameters in the network is significantly reduced.
C. The network collapses into a single linear transformation, making it no more powerful than a single-layer network.
D. The network will train much faster but lose all accuracy.

23 You are training a very deep MLP and observe that the gradients for the earliest layers are almost zero, causing training to stall. What is this phenomenon called, and which activation function is known to help mitigate it?

MLP Medium
A. Vanishing Gradient Problem; ReLU
B. Overfitting; Sigmoid
C. Exploding Gradient Problem; Tanh
D. Saddle Point Problem; Softmax

24 A 2D convolutional layer is applied to a 64x64 pixel grayscale image. The layer uses a 5x5 kernel, a stride of 2, and no padding. What will be the spatial dimensions (height x width) of the output feature map?

CNN Medium
A. 30x30
B. 32x32
C. 60x60
D. 59x59
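
For reference, the output size of a convolution along one spatial axis is usually computed as floor((W - K + 2P) / S) + 1, where W is the input size, K the kernel size, P the padding, and S the stride. The sketch below is a minimal illustration, assuming square inputs and kernels; the function name `conv_output_size` is hypothetical, and the example values are deliberately different from those in the question.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# Illustrative values only (not the ones from the question):
# a 28x28 input with a 3x3 kernel and no padding.
print(conv_output_size(28, 3))            # 26 (stride 1)
print(conv_output_size(28, 3, stride=2))  # 13 (stride 2)
```

The same function applies to height and width separately when they differ.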

25 Beyond reducing computational complexity, what is a key benefit of using a max-pooling layer in a Convolutional Neural Network (CNN)?

CNN Medium
A. It introduces non-linearity into the network.
B. It increases the receptive field of subsequent layers.
C. It normalizes the feature map activations.
D. It provides a degree of translation invariance.

26 Why is a Long Short-Term Memory (LSTM) network often preferred over a standard Recurrent Neural Network (RNN) for tasks involving long sequences, such as paragraph-level text analysis?

RNN Medium
A. LSTMs use gating mechanisms to control information flow, mitigating the vanishing gradient problem.
B. LSTMs can be parallelized during training, unlike standard RNNs.
C. LSTMs have fewer parameters, making them faster to train on long sequences.
D. LSTMs use a more complex activation function that captures more features.

27 In a many-to-one RNN architecture used for text classification, what is the typical role of the final hidden state?

RNN Medium
A. It is used to generate the first word of the output sequence.
B. It is averaged with the initial hidden state to normalize the network's memory.
C. It serves as a summary vector of the entire input sequence, which is then fed into a final classification layer.
D. It is discarded, as only the outputs from each time step are relevant.

28 What is the primary advantage of the self-attention mechanism in Transformers over the sequential processing of RNNs in terms of computational efficiency?

Transformer Architecture and Applications Medium
A. It allows for parallel computation across all tokens in a sequence, as the relationship between any two tokens is calculated independently of their distance.
B. It uses a simpler update rule than the gating mechanisms in LSTMs.
C. It requires fewer matrix multiplications per layer.
D. It has a constant-length path for information to travel between any two positions, preventing vanishing gradients.

29 In the Transformer architecture, what is the specific purpose of the Positional Encoding step?

Transformer Architecture and Applications Medium
A. To reduce the dimensionality of the input embeddings to save computation.
B. To normalize the word embeddings before they enter the attention layers.
C. To convert the input tokens into a continuous vector representation.
D. To inject information about the relative or absolute position of tokens, since the self-attention mechanism itself is permutation-invariant.
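
As background, the original Transformer paper used fixed sinusoidal positional encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below is a minimal pure-Python version under that assumption; the function name is hypothetical, and learned positional embeddings are a common alternative that it does not cover.

```python
import math


def sinusoidal_positional_encoding(num_positions: int, d_model: int):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=4, d_model=8)
print(pe[0])  # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```

Each row would be added element-wise to the token embedding at that position.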

30 You are developing an NLP model and must choose a tokenization strategy. Why might a subword tokenization algorithm like Byte-Pair Encoding (BPE) be superior to simple word-based tokenization with a fixed vocabulary?

Tokenization Medium
A. It guarantees that every word is broken into its morphological root and affixes.
B. It can handle out-of-vocabulary (OOV) words by breaking them into known subword units.
C. It is a lossless compression algorithm that reduces model size.
D. It always results in a shorter sequence of tokens, reducing computation time.
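
To make the BPE idea concrete, here is a rough sketch of a single merge step during vocabulary construction: count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair everywhere. The toy corpus and the function names `most_frequent_pair` and `merge_pair` are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical toy corpus: each word split into characters, with a count.
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair = most_frequent_pair(corpus)
print(pair)                      # ('u', 'g'), seen 20 times
print(merge_pair(corpus, pair))  # 'u' + 'g' now appears as the symbol 'ug'
```

Repeating this step a fixed number of times yields the subword vocabulary; a word not seen in training can then still be segmented into known pieces.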

31 If a word embedding model learns vectors such that vector('Paris') - vector('France') + vector('Italy') results in a vector very close to vector('Rome'), what does this demonstrate about the learned embedding space?

Embeddings Medium
A. The model has simply memorized geographical facts from the training data.
B. The vectors for all countries are parallel to each other.
C. The model is only effective for proper nouns and geographical locations.
D. The model has captured semantic relationships (like capital city of a country) as geometric relationships in the vector space.
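
The vector arithmetic in this question can be sketched with hand-built toy vectors. Everything below is an assumption made for illustration: the 4-dimensional vectors are constructed by hand, not learned, and the helper names `cosine` and `analogy` are hypothetical. Real embeddings have hundreds of dimensions and only approximately satisfy such relations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy vectors (NOT learned embeddings).
# Dimensions: [France-ness, Italy-ness, Germany-ness, capital-ness]
toy_vectors = {
    "France":  [1.0, 0.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 0.0, 1.0],
    "Italy":   [0.0, 1.0, 0.0, 0.0],
    "Rome":    [0.0, 1.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [0.0, 0.0, 1.0, 1.0],
}


def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("Paris", "France", "Italy", toy_vectors))  # Rome
```

Excluding the three query words from the candidates mirrors how such analogy evaluations are commonly run.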

32 What is the primary motivation for using pre-trained embeddings (like GloVe or Word2Vec) when building an NLP model for a task with a relatively small dataset?

Embeddings Medium
A. To ensure the model's vocabulary is limited to only the most common words in the English language.
B. To reduce the number of layers required in the neural network.
C. To leverage the rich semantic knowledge learned from a massive text corpus, which provides a better model initialization and improves generalization.
D. To completely eliminate the need for any task-specific training (fine-tuning).

33 In a sequence-to-sequence model with attention for machine translation, if a high attention weight is placed on an input word, what does it signify for the current output step?

Attention Medium
A. That the input word is a grammatical stop word that should be ignored.
B. That the input word has the highest frequency in the training corpus.
C. That the input word is the last word of the source sentence.
D. That the input word is highly relevant for predicting the current output word.

34 Which statement best describes a key architectural difference between BERT and GPT that influences their primary applications?

language models (BERT, GPT) Medium
A. BERT must be fine-tuned for specific tasks, whereas GPT can be used directly for any task without fine-tuning.
B. GPT is trained on a larger vocabulary than BERT, making it more knowledgeable.
C. GPT uses an attention mechanism while BERT relies on recurrent layers, making GPT better for longer sequences.
D. BERT is an encoder-only model that processes text bidirectionally, making it ideal for language understanding tasks, while GPT is a decoder-only model that processes text auto-regressively, making it ideal for language generation.

35 What is the core idea behind the Masked Language Model (MLM) pre-training objective used for BERT?

language models (BERT, GPT) Medium
A. To predict the next token in a sequence using only the previous tokens, which is a unidirectional approach.
B. To mask all nouns in a sentence and have the model predict them based on the verbs and adjectives.
C. To predict the next sentence in a document, teaching the model about discourse coherence.
D. To predict randomly masked tokens in a sequence by using both left and right context, forcing the model to learn a deep bidirectional understanding of language.

36 An NLP system is designed to analyze a legal contract to identify the parties involved, their obligations, and the effective dates. This task goes beyond just parsing grammar. Which NLP phase is most central to this goal?

Modern NLP: Introduction to NLP, NLP phases Medium
A. Syntactic Analysis (Parsing)
B. Morphological Analysis
C. Semantic Analysis
D. Lexical Analysis

37 In the architecture of a task-oriented chatbot, what is the primary responsibility of the Dialogue Management (DM) component?

Building chatbots and digital assistants Medium
A. To convert the user's spoken words into text (Speech-to-Text).
B. To extract the user's intent and entities from their message.
C. To maintain the state of the conversation and decide the chatbot's next action.
D. To convert the chatbot's planned response into natural language (Text-to-Speech or NLG).

38 You are building a system to generate a short, one-paragraph summary of a long news article. This is an example of which NLP use case, and what is a primary challenge?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Machine Translation; converting the article to another language.
B. Named Entity Recognition; identifying all the people and organizations mentioned.
C. Text Summarization; ensuring the summary is coherent and factually consistent with the source text.
D. Sentiment Analysis; determining if the article's tone is positive or negative.

39 A movie review website wants to automatically assign a 'thumbs up' or 'thumbs down' rating to user-submitted reviews based on the text. Which NLP task is most appropriate for this problem?

NLP use cases (sentiment analysis, translation, summarization) Medium
A. Question Answering
B. Text Summarization
C. Sentiment Analysis
D. Topic Modeling

40 A neuron in a neural network has an input vector x = [2.0, 3.0], a weight vector w = [0.5, -1.5], and a bias b = 1.0. What is the output of this neuron if it uses a ReLU (Rectified Linear Unit) activation function?

Introduction to Neural Networks Medium
A. -3.5
B. -2.5
C. 1.0
D. 0.0
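
The computation behind this kind of question is a weighted sum plus bias, followed by the activation: output = ReLU(w · x + b), with ReLU(z) = max(0, z). A minimal sketch follows; the function names are hypothetical and the inputs are deliberately different from those in the question.

```python
def relu(z: float) -> float:
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)


def neuron_output(x, w, b):
    """Single neuron: weighted sum of inputs plus bias, then ReLU."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return relu(z)


# Illustrative values only (not the ones from the question):
# z = 1.0*0.5 + 2.0*0.25 - 0.5 = 0.5
print(neuron_output([1.0, 2.0], [0.5, 0.25], -0.5))  # 0.5
```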

41 A Multi-Layer Perceptron (MLP) is constructed with 5 hidden layers, but exclusively uses the linear activation function in all layers, including the output layer. The network is trained on a complex, non-linear classification task. What is the effective representational power of this network?

MLP Hard
A. It will behave like a deep autoencoder, compressing and decompressing the input linearly.
B. It is equivalent to a single-layer perceptron and can only model linearly separable data.
C. It can approximate any continuous function, as per the Universal Approximation Theorem.
D. It can model complex non-linear functions, but training will be unstable due to the depth.

42 In a Convolutional Neural Network, what is the primary purpose of using a 1x1 convolution (also known as a pointwise convolution), and how does it achieve this without altering the spatial dimensions (height and width) of the feature map?

CNN Hard
A. To act as a spatial pooling layer, reducing the height and width of the feature maps.
B. To increase the receptive field of subsequent layers by combining information from a 1x1 spatial area.
C. To introduce non-linearity by applying an activation function to each pixel independently.
D. To perform dimensionality reduction or expansion across the channel dimension while preserving spatial information.

43 Both LSTMs and GRUs are designed to mitigate the vanishing gradient problem in RNNs. Which statement accurately describes a key architectural difference and its performance implication?

RNN Hard
A. GRUs have three gates (input, forget, output) while LSTMs have only two (reset, update), making LSTMs simpler and faster to train.
B. LSTMs have a separate cell state and hidden state, while GRUs combine them. This makes GRUs computationally more efficient but potentially less expressive for complex sequences.
C. LSTMs use a reset gate to discard irrelevant past information, whereas GRUs use a forget gate, which is a less effective mechanism.
D. The cell state in a GRU acts as a long-term memory conveyor belt, a feature that is absent in LSTMs.

44 The self-attention mechanism in the original Transformer model has a computational complexity of $O(n^2 \cdot d)$, where $n$ is the sequence length and $d$ is the model dimension. This makes it challenging for very long sequences. Which of the following is NOT a valid and commonly researched approach to mitigate this quadratic complexity?

Transformer Architecture and Applications Hard
A. Drastically increasing the number of attention heads while reducing the dimension per head, such that the total computation becomes linear with respect to sequence length.
B. Applying a fixed, sparse attention pattern (e.g., strided or dilated patterns) to reduce the number of attended-to tokens (e.g., Sparse Transformers).
C. Using a sliding window attention mechanism where each token only attends to a fixed number of neighboring tokens (e.g., Longformer).
D. Replacing the Softmax function with a linear kernel, allowing the order of matrix multiplication to be rearranged and computed in $O(n \cdot d^2)$ (e.g., Linear Transformers).

45 What is the primary architectural reason that a model like BERT is considered an 'encoder-only' architecture, while a model like GPT-3 is considered a 'decoder-only' architecture, and how does this influence their ideal use cases?

language models (BERT, GPT) Hard
A. BERT is pre-trained with a Masked Language Model objective, which is an encoding task, while GPT is pre-trained with a Causal Language Model objective, a decoding task.
B. BERT uses bidirectional self-attention (seeing the whole sentence at once), making it an encoder ideal for understanding context (e.g., NLU tasks). GPT uses masked, unidirectional self-attention (seeing only past tokens), making it a decoder ideal for generating text (e.g., NLG tasks).
C. BERT uses absolute positional embeddings, suitable for encoding, while GPT uses relative positional embeddings, which are better for decoding sequences of varying lengths.
D. BERT processes tokens in parallel, which is characteristic of encoders, while GPT processes tokens sequentially, which is characteristic of decoders.

46 Static word embeddings like Word2Vec or GloVe suffer from the problem of polysemy (a word having multiple meanings). How do contextualized embedding models like ELMo and BERT fundamentally address this limitation?

Embeddings Hard
A. They maintain a predefined dictionary of vectors for each possible meaning of a word and use a classifier to select the correct one.
B. By training on a much larger corpus, they learn a single, more robust vector that averages all meanings of a word.
C. They use character-level convolutions to build word embeddings, which helps differentiate meanings based on morphology.
D. They generate a different embedding vector for a word each time it appears, based on its specific context in the sentence.

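47 In the standard scaled dot-product attention formula, $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, what is the critical purpose of the scaling factor $\frac{1}{\sqrt{d_k}}$, where $d_k$ is the dimension of the key vectors?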

Attention Hard
A. It ensures that the dot product values are positive before being passed to the softmax function.
B. It acts as a regularization term to prevent overfitting by penalizing large dot product values.
C. It normalizes the variance of the dot products to prevent the softmax function from saturating into regions with extremely small gradients.
D. It is a temperature parameter that controls the sharpness of the attention distribution, with larger values making the distribution softer.
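
Scaled dot-product attention is commonly written as softmax(QK^T / sqrt(d_k)) V. The sketch below is a minimal pure-Python version for small lists of vectors, assuming no masking and a single head; the function names are hypothetical, and a real implementation would use batched matrix operations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of vectors; returns one output vector per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# Toy example: one query that matches the first key more than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Because V is the identity here, the output row equals the attention weights, so its first entry is the larger one.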

48 Consider Byte-Pair Encoding (BPE) and WordPiece tokenization strategies. A key difference lies in how they select the next pair of tokens to merge during vocabulary creation. Which statement accurately describes this difference and its implication?

Tokenization Hard
A. BPE merges the most frequently occurring pair of adjacent tokens, which can sometimes lead to suboptimal segmentation of common words. WordPiece merges the pair that maximizes the likelihood of the training data, often resulting in more intuitive subwords.
B. BPE always splits words at the rarest character pair, while WordPiece splits based on a predefined vocabulary of common prefixes and suffixes.
C. BPE merges based on raw frequency counts, while WordPiece uses a complex scoring system based on mutual information between token pairs.
D. WordPiece is a character-based tokenizer, while BPE is subword-based, making WordPiece immune to out-of-vocabulary issues.

49 A convolutional layer has an input volume of , uses 32 filters of size , a stride of 2, and padding of 1. What is the total number of learnable parameters (weights and biases) in this layer?

CNN Hard
A. 1,310,720
B. 12,832
C. 12,800
D. 40,992
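
As a general reference, the learnable parameters of a standard 2D convolutional layer are usually counted as filters × (kernel_height × kernel_width × input_channels) weights plus one bias per filter. A minimal sketch, with a hypothetical function name and example values that are not taken from the question:

```python
def conv2d_param_count(in_channels: int, out_channels: int,
                       kernel_h: int, kernel_w: int, bias: bool = True) -> int:
    """Weights plus optional biases of a standard 2D convolutional layer."""
    weights = out_channels * in_channels * kernel_h * kernel_w
    biases = out_channels if bias else 0
    return weights + biases


# Illustrative values only: 8 filters of size 3x3 over a 3-channel input.
print(conv2d_param_count(in_channels=3, out_channels=8,
                         kernel_h=3, kernel_w=3))  # 224
```

Note that the function takes no stride, padding, or input height/width arguments: those determine the output size, not the number of parameters.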

50 When evaluating machine translation systems, the BLEU score is a common metric. However, it can be misleadingly high for a translation that is grammatically correct but semantically nonsensical or inaccurate. What fundamental limitation of the BLEU score causes this discrepancy?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. It primarily measures recall, checking if all words from the reference translation are present, but ignores precision.
B. It is based on n-gram precision and a brevity penalty, rewarding lexical overlap with reference translations but failing to capture semantic meaning or sentence structure.
C. It requires multiple human-generated reference translations, which are often unavailable or inconsistent.
D. It is computationally expensive and cannot be used during the training process as a loss function.

51 A single perceptron is trained on a 2D dataset using the standard perceptron learning rule. The dataset is NOT linearly separable. What will be the behavior of the learning algorithm during training?

Perceptron Hard
A. The algorithm will raise a mathematical error because the loss function cannot be calculated for non-separable data.
B. The algorithm will converge to a decision boundary that minimizes the number of misclassified points.
C. The algorithm will quickly converge to a random decision boundary and stop updating.
D. The algorithm will never converge, and the weights of the perceptron will continue to be updated indefinitely.
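
For readers who want to experiment, the standard perceptron learning rule can be sketched as follows: for each sample, predict with a step function and, on a mistake, update w ← w + lr · (y − ŷ) · x and b ← b + lr · (y − ŷ). This is a minimal sketch; the function names, the `max_epochs` cap, and the OR-gate data are assumptions added for illustration.

```python
def predict(w, b, x):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(samples, labels, lr=1.0, max_epochs=100):
    """Perceptron learning rule; stops early after a mistake-free epoch."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(w, b, x)
            if error != 0:
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


# OR gate: a small, linearly separable dataset (illustrative assumption).
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # [0, 1, 1, 1]
```

Swapping in other labels for `y` lets you observe how the loop behaves on different datasets.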

52 What is the primary motivation for using Teacher Forcing during the training of recurrent neural networks for sequence generation tasks, and what is its main drawback?

RNN Hard
A. Motivation: To enable the model to learn long-range dependencies. Drawback: It exacerbates the vanishing gradient problem.
B. Motivation: To speed up convergence by providing the network with ground-truth inputs at each timestep. Drawback: It can lead to a discrepancy between training and inference, causing instability when the model generates long sequences on its own.
C. Motivation: To prevent overfitting by introducing noise into the training process. Drawback: It slows down the training process significantly.
D. Motivation: To reduce the memory footprint of the model during training. Drawback: It requires significantly more computation per training step.

53 In the Transformer architecture, positional encodings are added to the input embeddings. Why is this step strictly necessary for the model to process sequences, unlike in an RNN?

Transformer Architecture and Applications Hard
A. To provide a unique signal for the start and end of a sequence, which the attention mechanism cannot otherwise determine.
B. To normalize the input embeddings before they are processed by the attention layers, improving training stability.
C. Because the self-attention mechanism is permutation-invariant; without positional information, the model would treat a sentence as an unordered bag of words.
D. To allow the model to handle sequences of variable lengths by encoding the absolute position of each token.

54 BERT's pre-training involves two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later analysis (e.g., by the RoBERTa paper) suggested that NSP might be an ineffective pre-training task. What was the reasoning behind this conclusion?

language models (BERT, GPT) Hard
A. The model learned to focus on topic similarity between sentences rather than coherence and logical flow, as the negative examples were too easy to distinguish (randomly sampled sentences).
B. The NSP task was found to be computationally too expensive, providing marginal benefits for the high training cost.
C. The MLM task already implicitly taught the model sentence relationships, making the NSP task redundant.
D. The binary classification nature of NSP was found to be detrimental to the model's ability to generate nuanced text representations.

55 In text summarization, what is the fundamental difference between an 'extractive' and an 'abstractive' approach, and what kind of neural network architecture is typically required for a purely abstractive model?

NLP use cases (sentiment analysis, translation, summarization) Hard
A. Extractive summarization is a form of supervised learning, while abstractive summarization is unsupervised. Abstractive models typically rely on Transformer-based encoders like BERT.
B. Extractive summarization uses rule-based systems to identify keywords, while abstractive summarization uses deep learning. Abstractive models require a CNN-based architecture.
C. Extractive summarization selects important sentences from the source text, while abstractive summarization generates new sentences that capture the meaning. Abstractive models typically require an encoder-decoder architecture (e.g., Sequence-to-Sequence).
D. Extractive summarization creates a summary that is shorter than the source text, while abstractive summarization can create a longer, more detailed summary. Both can be implemented with a simple classifier.

56 When designing a task-oriented chatbot (e.g., for booking flights), what is the distinct role of 'Dialogue State Tracking' (DST) and why is it a more complex problem than simple 'Intent Recognition'?

Building chatbots and digital assistants Hard
A. DST maintains a representation of the user's goal and collected information (slots) throughout a multi-turn conversation, while Intent Recognition is a single-turn classification of the user's immediate goal. DST is harder because it must handle context, ambiguity, and coreference over time.
B. DST is the process of training the chatbot's language model, while Intent Recognition is the process of fine-tuning it for a specific task. DST is harder because it requires more data.
C. Intent Recognition maps user input to a predefined action, while DST tracks the emotional state of the user to adjust the chatbot's tone. DST is harder due to the subjectivity of emotion.
D. DST is responsible for generating the chatbot's response, while Intent Recognition decides which knowledge base to query. DST is harder because natural language generation is a complex task.

57 Vector-space analogies like vec('king') - vec('man') + vec('woman') ≈ vec('queen') are a famous property of Word2Vec embeddings. This property suggests that semantic relationships are encoded as linear substructures in the embedding space. What is a known major limitation or failure mode of this analogical reasoning capability?

Embeddings Hard
A. The resulting vector is often not the closest vector to the target word (e.g., 'queen') and requires a separate classification step to identify the correct analogy.
B. This property only works for single words and fails completely when trying to perform analogies with phrases or sentences.
C. The geometric relationships are highly sensitive to the specific training corpus and hyperparameters, and often do not generalize well to relationships beyond simple gender or capital-city analogies.
D. The vector arithmetic is not commutative, meaning vec('woman') - vec('man') + vec('king') would produce a completely different result.

58 In semantic segmentation tasks, a common architectural pattern is an 'encoder-decoder' structure (like U-Net) where the encoder uses strided convolutions or pooling, and the decoder uses upsampling or transposed convolutions. What is the critical role of 'skip connections' between the encoder and decoder in such architectures?

CNN Hard
A. To enforce a bottleneck in the information flow, forcing the encoder to learn a compressed, salient representation of the input.
B. To reduce the number of parameters in the decoder by reusing the weights from the corresponding encoder layers.
C. To facilitate gradient flow through the deep network, mitigating the vanishing gradient problem common in deep architectures.
D. To combine low-level, high-resolution spatial information from the encoder with high-level, semantic information from the decoder, enabling precise localization.

59 What is the primary advantage of Multi-Head Self-Attention (MHSA) over using a single, large self-attention mechanism with the same total number of dimensions?

Attention Hard
A. It is significantly more computationally efficient than a single large attention head, reducing the overall training time of the Transformer model.
B. It breaks the quadratic complexity of self-attention with respect to sequence length, making it linear.
C. Each head can process a different segment of the input sequence, allowing for parallel processing of very long documents.
D. It allows the model to jointly attend to information from different representation subspaces at different positions, effectively learning diverse types of relationships (e.g., syntactic, positional).

60 In the pipeline of NLP phases, consider the relationship between Syntactic Analysis (Parsing) and Semantic Analysis. Which statement best describes a scenario where a failure in syntactic analysis directly leads to an incorrect semantic interpretation?

NLP phases Hard
A. In the sentence "The old man the boats," a parser failing to identify "man" as a verb (meaning to operate) would lead to a nonsensical semantic interpretation.
B. In the sentence "The bank is on the river bank," a system failing to disambiguate the two meanings of "bank" is a failure of semantic analysis, independent of syntax.
C. A system that correctly identifies the subject, verb, and object in "The dog chased the cat" has completed syntactic analysis, but semantic analysis is still required to understand what 'chasing' means.
D. In the sentence "Colorless green ideas sleep furiously," the sentence is syntactically correct but semantically meaningless, showing the independence of the two phases.