Unit1 - Subjective Questions
CSE472 • Practice Questions with Detailed Answers
Trace the origin and historical evolution of Natural Language Processing (NLP).
Origin and Evolution of NLP:
Natural Language Processing (NLP) has evolved over several decades, generally categorized into three main phases:
- Symbolic/Rule-Based Era (1950s - 1980s): NLP started in the 1950s with the Georgetown-IBM experiment (1954), which demonstrated fully automatic translation of more than sixty Russian sentences into English. This era relied heavily on complex, hand-written grammatical rules and dictionaries (e.g., ELIZA, SHRDLU).
- Statistical Era (1990s - 2010s): Due to the steady increase in computational power and the shift to machine learning algorithms, NLP moved from rule-based to statistical models. Algorithms like Hidden Markov Models (HMMs) and Support Vector Machines (SVMs) were utilized for text processing, relying on probabilistic decisions.
- Neural/Deep Learning Era (2010s - Present): The advent of deep neural networks and word embeddings (like Word2Vec) revolutionized NLP. This era is characterized by models that can automatically learn hierarchical feature representations from raw text, culminating in modern Large Language Models (LLMs) based on Transformer architectures.
Explain the concepts of 'Language' and 'Grammar' in the context of Natural Language Processing.
Language and Grammar in NLP:
- Language: In NLP, language is treated as a complex system of communication consisting of a vocabulary (words/tokens) and a set of rules used to convey meaning. It is highly unstructured, context-dependent, and inherently ambiguous, making it challenging for machines to process natively.
- Grammar: Grammar refers to the structural rules that govern the composition of clauses, phrases, and words in any given natural language. In NLP, formal grammars (like Context-Free Grammars or CFGs) are often used to parse text and understand its syntactic structure. Grammar helps NLP algorithms determine if a sequence of words forms a valid sentence, serving as a foundational step for deeper semantic analysis.
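To make the idea of "checking whether a word sequence forms a valid sentence" concrete, here is a minimal sketch of a CFG recognizer using the CYK algorithm. The toy grammar, the lexicon, and the function name `is_grammatical` are assumptions made up for illustration; a real grammar for English would be far larger.

```python
from itertools import product

# Toy grammar in Chomsky Normal Form (hypothetical, for illustration only).
# Key (B, C) maps to the set of parents A such that A -> B C.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "PP"): {"VP"},
    ("P", "NP"): {"PP"},
}
# Lexical rules: word -> set of part-of-speech categories.
LEXICON = {
    "the": {"Det"},
    "cat": {"N"},
    "mat": {"N"},
    "sat": {"V"},
    "on": {"P"},
}

def is_grammatical(sentence: str) -> bool:
    """Return True if the toy grammar can derive the sentence (CYK recognition)."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every category that can span words[i : j + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, set()))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span - 1
            for split in range(start, end):
                left, right = table[start][split], table[split + 1][end]
                for pair in product(left, right):
                    table[start][end] |= BINARY_RULES.get(pair, set())
    return "S" in table[0][n - 1]

print(is_grammatical("The cat sat on the mat"))   # True
print(is_grammatical("Sat mat on the cat the"))   # False
```

The recognizer only answers yes or no; a parser would additionally keep back-pointers in the table to recover the parse tree.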
Discuss the three linguistic essentials: Morphology, Syntax, and Semantics, providing suitable examples for each.
Linguistic Essentials in NLP:
- Morphology: The study of the internal structure of words and how they are formed from smaller units called morphemes.
- Example: The word "unhappiness" consists of three morphemes: "un-" (prefix meaning not), "happy" (root), and "-ness" (suffix indicating a state).
- Syntax: The study of the rules governing how words are combined to form valid phrases and sentences. It focuses on the structural arrangement rather than the meaning.
- Example: "The cat sat on the mat" follows standard English word order (subject, then verb, then a prepositional phrase). "Sat mat on the cat the" is syntactically incorrect.
- Semantics: The study of meaning in language. It focuses on interpreting the literal meaning of words, phrases, and sentences, independent of the wider situational context (which is the concern of pragmatics).
- Example: "The bank is closed" could mean a financial institution or a river bank. Semantics (often combined with pragmatics) helps derive the actual intended meaning.
Define Morphology and explain the difference between derivational and inflectional morphology.
Morphology is the branch of linguistics that studies word structures and the rules by which words are formed from smaller meaning-bearing units called morphemes.
- Inflectional Morphology: Modifies a word to express different grammatical categories such as tense, case, voice, aspect, person, number, and gender, without changing its core meaning or part of speech.
- Example: Walk → Walks, Walked, Walking. (The core meaning "to walk" remains; only the tense/number changes.)
- Derivational Morphology: Involves the creation of a new word from an existing word, often changing its part of speech or its fundamental meaning.
- Example: Happy (Adjective) → Happiness (Noun); Govern (Verb) → Government (Noun).
Distinguish between Syntax and Semantics in NLP.
Syntax vs. Semantics:
- Syntax deals with the structural rules of a language. It ensures that a sentence is grammatically correct and that the words are ordered properly according to the language's grammar rules.
- Role in NLP: Handled by parsers (like dependency parsing or constituency parsing) to build a parse tree.
- Example: "Colorless green ideas sleep furiously." (Syntactically correct, but logically meaningless).
- Semantics deals with the meaning of the language. It ensures that the sentence makes logical sense and interprets what the words actually convey.
- Role in NLP: Handled by tasks like Named Entity Recognition (NER), Word Sense Disambiguation (WSD), and semantic role labeling.
- Example: Extracting the fact that in "Apple is launching a new product," "Apple" refers to the company, not the fruit.
What are the major challenges faced in Natural Language Processing? Discuss at least four challenges.
Major Challenges in NLP:
- Ambiguity: Words or sentences can have multiple meanings depending on the context.
- Lexical Ambiguity: e.g., "bat" (animal vs. sports equipment).
- Syntactic Ambiguity: e.g., "I saw the man with the telescope" (Who has the telescope?).
- Context and World Knowledge: NLP models often lack common sense or world knowledge. Understanding sarcastic remarks or idioms (e.g., "piece of cake") requires contextual awareness beyond literal definitions.
- Out-of-Vocabulary (OOV) Words: Slang, misspellings, new jargon, and domain-specific terms constantly emerge. Models trained on static datasets struggle when encountering these unseen tokens.
- Synonymy: Different words can express the same meaning (e.g., "buy" and "purchase"). Models need to map these varied inputs to a single semantic concept.
- Data Sparsity: While language is vast, specific structural patterns or word combinations may appear rarely in training data, making it hard for statistical models to learn their correct usage.
List and describe five major applications of Natural Language Processing.
Major Applications of NLP:
- Machine Translation: Automatically translating text or speech from one language to another (e.g., Google Translate).
- Sentiment Analysis: Identifying and extracting subjective information from source materials to determine the sentiment (positive, negative, neutral) of a text, heavily used in social media monitoring and product reviews.
- Chatbots and Virtual Assistants: Systems designed to converse with humans via text or voice, capable of understanding intents and providing relevant responses (e.g., Siri, Alexa, Customer Support bots).
- Text Summarization: Reducing a long text document to a concise summary while retaining the core meaning and important information. Can be extractive (pulling key sentences) or abstractive (generating new sentences).
- Information Extraction: Automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (e.g., extracting names, dates, and locations using Named Entity Recognition).
Define Tokenization. Explain the difference between word tokenization and sentence tokenization.
Tokenization:
Tokenization is the foundational step in text preprocessing where a stream of text is broken down into smaller, meaningful units called tokens (e.g., words, subwords, or sentences).
- Word Tokenization: The process of splitting a piece of text into individual words.
- Example: "NLP is fun!" → ["NLP", "is", "fun", "!"].
- Challenge: Handling contractions (e.g., "don't") or hyphenated words.
- Sentence Tokenization (Sentence Segmentation): The process of splitting a large text corpus into individual sentences.
- Example: "Hello world. How are you?" → ["Hello world.", "How are you?"].
- Challenge: Periods are used for abbreviations (e.g., "Dr.", "U.S.A.") as well as sentence endings, requiring intelligent splitting algorithms.
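Both kinds of tokenization can be sketched with the standard library alone. The function names, the regular expression, and the small abbreviation list below are illustrative assumptions, not a production tokenizer; real systems use trained or much more extensive rule sets.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split text into word tokens, keeping each punctuation mark as its own token."""
    # \w+(?:['-]\w+)* keeps contractions ("don't") and hyphenated words together;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

# Illustrative, deliberately incomplete list of abbreviations ending in a period.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e."}

def sentence_tokenize(text: str) -> list[str]:
    """Split after ., ! or ? unless the chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(word_tokenize("NLP is fun!"))                     # ['NLP', 'is', 'fun', '!']
print(sentence_tokenize("Hello world. How are you?"))   # ['Hello world.', 'How are you?']
print(sentence_tokenize("Dr. Smith arrived. He sat down."))
```

The abbreviation check handles "Dr." but would still split wrongly after an unlisted abbreviation, which is exactly the challenge noted above.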
Compare and contrast Stemming and Lemmatization. When should one be preferred over the other?
Stemming vs. Lemmatization:
- Stemming: A heuristic process that chops off the ends of words to reduce them to their root form. It does not consider the linguistic context or part of speech.
- Algorithm: Porter Stemmer, Snowball Stemmer.
- Result: Often produces non-dictionary words (e.g., "running" → "run", "universities" → "univers").
- Speed: Fast and computationally inexpensive.
- Lemmatization: A more advanced, dictionary-based approach that uses morphological analysis to reduce a word to its base or dictionary form (lemma). It requires knowing the Part-of-Speech (POS) tag.
- Result: Always produces a valid dictionary word (e.g., "better" → "good", "running" (verb) → "run").
- Speed: Slower and computationally more expensive due to dictionary lookups.
Preference:
- Use Stemming in large-scale Information Retrieval systems (like search engines) where speed is prioritized over perfect linguistic accuracy.
- Use Lemmatization in applications requiring precise language understanding, such as Chatbots, Machine Translation, or Question Answering systems.
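The contrast can be illustrated with a toy sketch. The suffix list and the lemma table below are hypothetical stand-ins: a real stemmer (such as Porter) has many more rules, and a real lemmatizer consults a full lexicon rather than a three-entry dictionary.

```python
# Crude suffix-stripping stemmer (NOT the Porter algorithm, just an illustration).
SUFFIXES = ("ities", "ing", "ed", "ly", "es", "s")

def stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling left behind by the suffix: "runn" -> "run".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

# Dictionary-based lemmatizer keyed by (word, part of speech); entries are hypothetical.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("universities", "NOUN"): "university",
}

def lemmatize(word: str, pos: str) -> str:
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(stem("universities"), lemmatize("universities", "NOUN"))  # univers university
print(stem("better"), lemmatize("better", "ADJ"))                # better good
print(stem("running"), lemmatize("running", "NOUN"))             # run running
```

The last line shows why the POS tag matters: as a noun, "running" is already its own lemma, while the stemmer strips the suffix regardless of context.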
Why is stop-word removal and punctuation handling important in NLP preprocessing?
Stop-word Removal and Punctuation Handling:
- Stop-word Removal: Stop words are highly common words (e.g., "the", "is", "in", "and") that generally carry very little semantic weight or unique information about the text.
- Importance: Removing them significantly reduces the vocabulary size and dimensionality of the data, saving computational resources and helping statistical models focus on the keywords that actually convey the core meaning.
- Punctuation Handling: Punctuation marks (like commas, periods, exclamation marks) are often treated as separate tokens or attached to words.
- Importance: If not handled (removed or standardized), the token "hello!" might be treated as different from "hello". Removing punctuation normalizes the text. However, in some tasks (such as sentiment analysis with modern deep learning models), punctuation may be retained because it can signal emotion (e.g., "!!!").
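A minimal sketch of both steps follows. The stop-word set is a tiny illustrative sample, and the `keep_punctuation` flag is a hypothetical parameter added here to show the trade-off between normalizing text and retaining emotional cues.

```python
import string

# Tiny illustrative stop-word list; real lists contain a few hundred entries.
STOP_WORDS = {"the", "is", "in", "and", "a", "of", "to"}

def strip_punctuation(token: str) -> str:
    """Remove leading and trailing ASCII punctuation from a token."""
    return token.strip(string.punctuation)

def preprocess(text: str, keep_punctuation: bool = False) -> list[str]:
    tokens = []
    for raw in text.lower().split():
        token = raw if keep_punctuation else strip_punctuation(raw)
        # Compare the punctuation-free form against the stop-word list.
        if token and strip_punctuation(token) not in STOP_WORDS:
            tokens.append(token)
    return tokens

print(preprocess("Hello! The movie is great."))
# ['hello', 'movie', 'great']
print(preprocess("Hello! The movie is great.", keep_punctuation=True))
# ['hello!', 'movie', 'great.']
```

With punctuation stripped, "hello!" and "hello" collapse into one vocabulary entry; with it kept, they remain distinct features.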
Discuss strategies for handling Out-of-Vocabulary (OOV) words in text processing.
Handling Out-of-Vocabulary (OOV) Words:
OOV words are words not present in the model's vocabulary during training but encountered during testing or real-world application.
Strategies:
- UNK Token Replacement: The most common traditional method is to replace all OOV words with a special <UNK> (unknown) token. The model learns to handle generic unknown entities.
- Subword Tokenization: Modern approaches (like Byte-Pair Encoding (BPE), or WordPiece as used in BERT) break words down into subword units or characters. If a word is unseen, it is constructed from known subword tokens (e.g., "unhappiness" → "un" + "##happi" + "##ness").
- Character-level embeddings: Instead of relying on word tokens, the model interprets text character by character, completely eliminating the OOV problem at the word level.
- Spell Checking and Normalization: Before processing, applying a spell checker can correct misspelled words, turning an OOV word back into an in-vocabulary word.
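The first two strategies can be sketched as follows. The vocabularies are hypothetical, and `wordpiece` is a simplified greedy longest-match-first splitter in the style of WordPiece inference, not the tokenizer shipped with any particular library.

```python
UNK = "<UNK>"

def replace_oov(tokens: list[str], vocab: set[str]) -> list[str]:
    """Map every token missing from the vocabulary to the <UNK> token."""
    return [tok if tok in vocab else UNK for tok in tokens]

def wordpiece(word: str, vocab: set[str]) -> list[str]:
    """Greedily split a word into the longest known subwords.

    Non-initial pieces carry the '##' continuation prefix.
    """
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:      # no known piece fits: give up on the whole word
            return [UNK]
        pieces.append(match)
        start = end
    return pieces

word_vocab = {"the", "movie", "was"}                 # hypothetical word-level vocabulary
subword_vocab = {"un", "##happi", "##ness", "happy"} # hypothetical subword vocabulary

print(replace_oov(["the", "movie", "was", "zorblax"], word_vocab))
# ['the', 'movie', 'was', '<UNK>']
print(wordpiece("unhappiness", subword_vocab))
# ['un', '##happi', '##ness']
```

The word-level model loses all information about "zorblax", whereas the subword model can still represent an unseen word whenever its pieces are known.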
What is Text Normalization? Outline the common steps involved in normalizing text.
Text Normalization:
Text normalization is the process of transforming text into a single, canonical, standard form. It ensures that variations of the same word or phrase are treated identically by the NLP model.
Common Steps:
- Lowercasing: Converting all characters to lowercase so that "Apple" and "apple" are treated as the same token (though this may impact Named Entity Recognition).
- Expanding Contractions: Converting shortened versions of words into their full forms (e.g., "don't" → "do not", "I'll" → "I will").
- Removing Special Characters and Numbers: Filtering out emojis, URLs, HTML tags, and numbers if they do not contribute to the task's semantics.
- Stemming/Lemmatization: Reducing words to their base form.
- Standardizing Spellings: Converting regional spelling variations to a common standard (e.g., British "colour" to American "color").
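Several of these steps can be chained in a few lines. The contraction and spelling tables below are small illustrative samples, the function name `normalize` is an assumption, and stemming/lemmatization is left out to keep the sketch short.

```python
import re

# Illustrative subsets; real tables are much larger.
CONTRACTIONS = {"don't": "do not", "i'll": "i will", "isn't": "is not"}
SPELLING = {"colour": "color", "favourite": "favorite"}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)

def normalize(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)             # strip URLs
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop digits, punctuation, emojis
    words = [SPELLING.get(w, w) for w in text.split()]    # standardize spellings
    return " ".join(words)

print(normalize("I'll pick the <b>Colour</b> at https://example.com, don't worry 123!"))
# i will pick the color at do not worry
```

Contractions are expanded before non-letter characters are dropped; reversing the order would turn "don't" into "don t" and the lookup would no longer match. Note that the sketch only recognizes the straight apostrophe.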
Explain the Bag-of-Words (BoW) model. What are its main advantages and limitations?
Bag-of-Words (BoW) Model:
The Bag-of-Words model is a simplifying representation used in NLP where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.
How it works:
- Build a vocabulary of all unique words in the corpus.
- Create a vector for each document where each element represents the frequency (or presence/absence) of a specific word from the vocabulary in that document.
Advantages:
- Very simple to understand and implement.
- Highly effective for simple document classification and spam filtering tasks.
Limitations:
- Loss of Semantic Meaning and Context: Since word order is ignored, "John loves Mary" and "Mary loves John" have the exact same BoW representation.
- Sparsity: The vocabulary can easily reach tens of thousands of words, resulting in highly sparse vectors consisting mostly of zeros.
- Equal Weighting: It treats all words equally, meaning frequent but uninformative words might dominate the representation.
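The two construction steps, and the word-order limitation, can be seen in a short sketch. The helper names `build_vocabulary` and `bow_vector` are assumptions for illustration, and tokenization is simplified to lowercased whitespace splitting.

```python
from collections import Counter

def build_vocabulary(docs: list[str]) -> list[str]:
    """Step 1: collect all unique words in the corpus (sorted for a stable order)."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def bow_vector(doc: str, vocab: list[str]) -> list[int]:
    """Step 2: count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

docs = ["John loves Mary", "Mary loves John", "John loves John"]
vocab = build_vocabulary(docs)
print(vocab)                       # ['john', 'loves', 'mary']
for doc in docs:
    print(doc, bow_vector(doc, vocab))
# John loves Mary [1, 1, 1]
# Mary loves John [1, 1, 1]
# John loves John [2, 1, 0]
```

The first two sentences receive identical vectors even though they mean different things, while the third shows that multiplicity is preserved.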
Define N-grams. Give examples of a unigram, bigram, and trigram for the sentence: 'Deep learning is fascinating'.
N-grams:
An n-gram is a contiguous sequence of n items (usually words or characters) from a given sample of text. N-grams are used to capture the local context and word order that simple Bag-of-Words models lose.
For the sentence: "Deep learning is fascinating"
- Unigram (n=1): Individual words.
["Deep", "learning", "is", "fascinating"]
- Bigram (n=2): Sequences of two adjacent words. Helps in capturing simple context like "New York" or "not good".
["Deep learning", "learning is", "is fascinating"]
- Trigram (n=3): Sequences of three adjacent words.
["Deep learning is", "learning is fascinating"]
Use Case: N-grams form the basis of traditional statistical language models, predicting the next word based on the previous words.
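A minimal sketch of n-gram extraction, plus the bigram counting that underlies a simple next-word predictor. The function names and the small corpus are illustrative assumptions; a real language model would also normalize counts into probabilities and apply smoothing.

```python
from collections import Counter, defaultdict

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def next_word_counts(tokens: list[str]) -> dict[str, Counter]:
    """For each word, count which words follow it (the basis of a bigram model)."""
    model: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

tokens = "Deep learning is fascinating".split()
print(ngrams(tokens, 1))  # ['Deep', 'learning', 'is', 'fascinating']
print(ngrams(tokens, 2))  # ['Deep learning', 'learning is', 'is fascinating']
print(ngrams(tokens, 3))  # ['Deep learning is', 'learning is fascinating']

corpus = "deep learning is fascinating and deep learning is useful".split()
model = next_word_counts(corpus)
print(model["deep"].most_common(1))  # [('learning', 2)]
```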
Derive and explain the mathematical formulation of TF-IDF. Why is the logarithm used in IDF?
TF-IDF (Term Frequency - Inverse Document Frequency):
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents (corpus).
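1. Term Frequency (TF): Measures how frequently a term $t$ appears in a document $d$. It is commonly defined as the count of $t$ in $d$ normalized by the document length:
$$\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}$$
2. Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus of documents. It penalizes highly frequent words (like "the").
$$\mathrm{IDF}(t) = \log\left(\frac{N}{\mathrm{df}(t)}\right)$$
Where $N$ is the total number of documents in the corpus and $\mathrm{df}(t)$ is the number of documents containing term $t$.
3. TF-IDF Score:
$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$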
Why Logarithm in IDF?
The logarithm is used to dampen the effect of the IDF. If we have 1,000,000 documents and a word appears in 10, the ratio is 100,000. Without the log, this word would have an astronomical weight compared to a word appearing in 1,000 documents. The log scales down this explosive growth, ensuring that weights remain manageable and preventing rare words from disproportionately dominating the model.
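The computation and the damping effect can be checked numerically. This sketch assumes TF is the term count divided by the document length and IDF is the logarithm of (total documents / documents containing the term); the natural logarithm is used here, and a different base would only rescale all weights. Returning 0.0 for a term that appears in no document is a choice made for this sketch, not a standard.

```python
import math

def term_frequency(term: str, doc: list[str]) -> float:
    return doc.count(term) / len(doc)

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # sketch-specific choice for terms absent from the corpus
    return math.log(len(corpus) / df)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
print(tf_idf("the", corpus[1], corpus))  # 0.0, because "the" occurs in every document
print(tf_idf("dog", corpus[1], corpus))  # (1/3) * ln(3), the rarest term scores highest
print(tf_idf("cat", corpus[0], corpus))  # (1/3) * ln(3/2)

# Damping: ratios of 100,000 and 1,000 differ by a factor of 100,
# but their base-10 logarithms are just 5 and 3.
rare_ratio, common_ratio = 1_000_000 / 10, 1_000_000 / 1_000
print(math.log10(rare_ratio), math.log10(common_ratio))
```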
Compare TF-IDF with the standard Bag-of-Words (BoW) representation.
TF-IDF vs. Bag-of-Words (BoW):
- Weighting Mechanism:
- BoW: Uses raw word counts (or binary presence). All words are treated with equal importance based purely on their frequency in the document.
- TF-IDF: Multiplies the local importance (TF) by the global rarity (IDF). It reduces the weight of terms that occur very frequently across the entire corpus and increases the weight of terms that are unique to a specific document.
- Handling Stop Words:
- BoW: Common stop words will have the highest counts and dominate the feature vector unless explicitly removed during preprocessing.
- TF-IDF: Naturally mitigates the effect of stop words. Since they appear in almost all documents, their IDF score approaches zero (e.g., a term found in every document has IDF $= \log(1) = 0$), minimizing their TF-IDF weight.
- Dimensionality and Sparsity: Both techniques suffer from high dimensionality and sparse vector representations since the vector size is equal to the vocabulary size.
Explain the concept of ambiguity in NLP and describe Lexical, Syntactic, and Semantic Ambiguity with examples.
Ambiguity in NLP:
Ambiguity occurs when a linguistic expression can be interpreted in more than one way. It is one of the hardest challenges in NLP.
- Lexical Ambiguity: Occurs when a single word has multiple meanings (polysemy or homonymy).
- Example: "I am going to the bank." (Financial institution vs. River bank). Resolved via Word Sense Disambiguation (WSD).
- Syntactic (Structural) Ambiguity: Occurs when a sentence can be parsed in multiple grammatical structures, leading to different meanings.
- Example: "The chicken is ready to eat." (Is the chicken fully cooked and ready to be eaten, or is the live chicken hungry and ready to feed?).
- Semantic Ambiguity: Occurs when the sentence structure is clear, but the meaning of the overall sentence is still open to interpretation, often involving scope or referential issues.
- Example: "Every man loves a woman." (Does every man love the same woman, or does every man have some woman that he loves?).
How do n-grams solve the context limitation of the Bag-of-Words model?
N-grams Addressing BoW Limitations:
The standard Bag-of-Words (BoW) model operates on unigrams (single words) and completely ignores word order. Consequently, "not good" and "good" might be mapped to similar vectors if the word "not" is removed or ignored, completely losing the contextual negation.
Role of N-grams:
By using n-grams (where $n \geq 2$, such as bigrams or trigrams), the model captures contiguous sequences of words.
- A bigram model creates tokens like "not good", "very bad", or "San Francisco".
- This inherently preserves local word order and context.
- It allows statistical models to recognize phrases and negations, improving tasks like sentiment analysis where the order of words drastically changes the sentiment (e.g., "happy" vs. "not happy").
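A small sketch makes the difference visible: two reviews with opposite sentiment share the same unigram bag but have different bigram bags. The helper name `bag_of_ngrams` and the example sentences are assumptions chosen for illustration.

```python
from collections import Counter

def bag_of_ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of contiguous n-token tuples."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))

positive = "good not bad".split()
negative = "bad not good".split()

# Unigram bags are identical, so a plain BoW model cannot tell the reviews apart.
print(bag_of_ngrams(positive, 1) == bag_of_ngrams(negative, 1))  # True

# Bigram bags differ: ("not", "bad") versus ("not", "good").
print(bag_of_ngrams(positive, 2) == bag_of_ngrams(negative, 2))  # False
print(sorted(bag_of_ngrams(negative, 2)))  # [('bad', 'not'), ('not', 'good')]
```

The cost is a larger and sparser feature space, since the number of distinct bigrams grows much faster than the number of distinct words.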
What are Over-stemming and Under-stemming? Explain with examples.
Over-stemming and Under-stemming:
These are two common errors produced by heuristic stemming algorithms (like the Porter Stemmer).
- Over-stemming: Occurs when the stemmer chops off too much of a word, causing two words with distinct meanings to be reduced to the exact same stem (False Positive).
- Example: "University" and "Universe" might both be stemmed to "Univers". This implies they have the same root meaning, which is incorrect.
- Under-stemming: Occurs when the stemmer fails to remove a valid suffix, resulting in words that share the same root meaning being reduced to different stems (False Negative).
- Example: "Data" and "Datum" might remain as "data" and "datum" instead of being reduced to a common root, causing the model to treat them as completely independent concepts.
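Both errors show up when words are grouped by their stem. The three-rule stemmer below is a deliberately naive assumption built for this illustration (it is not the Porter Stemmer), but it reproduces the two failure modes described above.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Naive stemmer: strip the first matching suffix if at least 4 letters remain."""
    word = word.lower()
    for suffix in ("ity", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def group_by_stem(words: list[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)

print(group_by_stem(["University", "Universe", "Data", "Datum"]))
# {'univers': ['University', 'Universe'], 'data': ['Data'], 'datum': ['Datum']}
```

"University" and "Universe" are wrongly merged into one group (over-stemming), while "Data" and "Datum" stay in separate groups even though they are forms of the same word (under-stemming).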
Design a complete text preprocessing pipeline for a sentiment analysis task, outlining each step from raw text to TF-IDF vectors.
Text Preprocessing Pipeline for Sentiment Analysis:
- Raw Text Collection: Gather the corpus of reviews or social media posts.
- Text Normalization & Cleaning:
- Convert all text to lowercase to maintain consistency.
- Expand contractions (e.g., "isn't" → "is not"; crucial for sentiment).
- Remove special characters, HTML tags, and URLs.
- Tokenization: Split the cleaned text into individual word tokens.
- Stop-word Removal: Remove common words (like "the", "is", "at"). Note: In sentiment analysis, negation stop words like "not" or "nor" must be preserved.
- Lemmatization: Convert words to their base dictionary form (e.g., "better" → "good") using POS tagging to group variations of a word together.
- N-gram Generation: Generate unigrams and bigrams to capture contextual phrases (e.g., "not good").
- TF-IDF Vectorization:
- Calculate the Term Frequency (TF) for each token in the document.
- Calculate the Inverse Document Frequency (IDF) across the entire corpus.
- Multiply TF and IDF to produce numerical feature vectors that are ready to be fed into a machine learning classifier.
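The pipeline above can be sketched end to end with the standard library. Everything here is a simplified assumption for illustration: the contraction, stop-word, and lemma tables are tiny samples, the lemma lookup stands in for a real POS-aware lemmatizer, and TF-IDF uses length-normalized counts with a natural-log IDF.

```python
import math
import re
from collections import Counter

CONTRACTIONS = {"isn't": "is not", "don't": "do not", "wasn't": "was not"}
# "not" and "nor" are deliberately absent so that negation survives.
STOP_WORDS = {"the", "is", "at", "a", "an", "this", "was", "it", "do"}
LEMMAS = {"better": "good", "movies": "movie", "loved": "love"}  # stand-in lemmatizer

def preprocess(text: str) -> list[str]:
    text = text.lower()                                        # normalization
    text = re.sub(r"<[^>]+>|https?://\S+", " ", text)          # HTML tags and URLs
    for short, full in CONTRACTIONS.items():                   # expand contractions
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)                       # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]        # stop-word removal
    tokens = [LEMMAS.get(t, t) for t in tokens]                # lemmatization (lookup)
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])] # n-gram generation
    return tokens + bigrams

def tfidf_vectors(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    features = [preprocess(doc) for doc in docs]
    vocab = sorted({f for doc in features for f in doc})
    n_docs = len(features)
    doc_freq = Counter(f for doc in features for f in set(doc))
    vectors = []
    for doc in features:
        counts, total = Counter(doc), len(doc)
        vectors.append([
            (counts[f] / total) * math.log(n_docs / doc_freq[f]) if total else 0.0
            for f in vocab
        ])
    return vocab, vectors

docs = ["This movie isn't good.", "The movie is good!"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
# ['good', 'movie', 'movie good', 'movie not', 'not', 'not good']
for vector in vectors:
    print([round(v, 3) for v in vector])
```

In this two-document corpus "good" and "movie" occur in both reviews and receive a weight of zero, so the classifier would see the negative review almost entirely through "not" and "not good".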