Unit 1 - Notes
INT344
Unit 1: Introduction to NLP and Text Processing
1. Introduction to Natural Language Processing (NLP)
Definition
Natural Language Processing (NLP) is an interdisciplinary subfield of computer science, artificial intelligence, and linguistics. It focuses on the interaction between computers and humans using natural language. The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable.
Key Components:
- NLU (Natural Language Understanding): The process of reading and interpreting language to extract meaning (e.g., deciding if a review is positive or negative).
- NLG (Natural Language Generation): The process of generating meaningful phrases and sentences in the form of natural language (e.g., a chatbot response).
The Origin of NLP
The history of NLP describes the evolution from rule-based systems to statistical models and finally to neural networks.
- 1950s (Foundations):
- Alan Turing (1950): Proposed the "Turing Test" to determine if a machine can exhibit intelligent behavior indistinguishable from a human.
- Georgetown Experiment (1954): Successfully translated over 60 Russian sentences into English automatically, sparking heavy investment in Machine Translation.
- 1960s–1980s (Symbolic/Rule-Based Era):
- Focused on complex sets of hand-written rules (Chomskyan linguistics).
- ELIZA (1966): A simulation of a Rogerian psychotherapist (early chatbot).
- SHRDLU (1970): A program that understood natural language within a "blocks world" environment.
- 1990s–2000s (Statistical Revolution):
- Shift from hard-coded rules to machine learning models (Hidden Markov Models, N-grams) trained on large text corpora.
- Introduction of probabilistic parsing.
- 2010s–Present (Neural NLP):
- Deep Learning dominance. Word embeddings (Word2Vec), RNNs, LSTMs, and the Transformer architecture (BERT, GPT).
Language and Knowledge
NLP relies on the intersection of three types of knowledge:
- Linguistic Knowledge: Understanding the grammar, syntax, and rules of a specific language (e.g., English Subject-Verb-Object order).
- Domain Knowledge: Specialized terminology and facts regarding the specific topic (e.g., medical NLP requires knowledge of anatomy and disease names).
- World Knowledge: Common sense and general facts about how the world works, which helps resolve ambiguity (e.g., knowing that "The bat flew out of the cave" refers to an animal, not a baseball bat).
2. The Challenges of NLP
Human language is inherently unstructured and ambiguous, making it difficult for computers to process.
Types of Ambiguity
- Lexical Ambiguity: A single word has multiple meanings.
- Example: "I went to the bank." (River bank or financial institution?)
- Syntactic Ambiguity (Structural): A sentence can be parsed in multiple ways.
- Example: "I saw the man with the telescope." (Did I use a telescope to see him, or did the man possess a telescope?)
- Semantic Ambiguity: The meaning is unclear even if the structure is known.
- Example: "The car hit the pole and it broke." (Did the car break or the pole?)
- Pragmatic Ambiguity: The intent differs from the literal meaning.
- Example: "Can you pass the salt?" (Literal: Are you physically able? Pragmatic: Please give me the salt.)
Other Challenges
- Slang and Idioms: "It's raining cats and dogs" is not literal.
- Data Sparsity: Rare words or new words (neologisms) appearing in test data that were not in training data.
- Co-reference Resolution: Determining which entity pronouns (he, she, it) refer to.
3. Language and Grammar
In NLP, grammar refers to the set of rules that dictate how valid sentences are constructed.
- Prescriptive Grammar: Rules regarding how language should be used (what is taught in school).
- Descriptive Grammar: Observation of how language is actually used by speakers (focus of modern NLP).
- Context-Free Grammar (CFG): A type of formal grammar used heavily in computer science to model natural language syntax. It consists of a set of rules (productions) used to generate patterns of strings.
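Below is a minimal sketch of a CFG in code, using NLTK's grammar and chart-parser utilities (NLTK and the toy grammar are illustrative choices, not something prescribed in these notes):

```python
# A toy context-free grammar parsed with NLTK (illustrative; any CFG toolkit works).
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N  -> 'dog' | 'ball'
V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N ball))))
```

Each production rule (e.g., S -> NP VP) rewrites a non-terminal into a sequence of symbols; the parser searches for a derivation of the input sentence under those rules.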
4. NLP Applications
- Machine Translation: Automatically translating text from one language to another (e.g., Google Translate).
- Sentiment Analysis: Determining the attitude (positive, negative, neutral) of a text (e.g., Product reviews).
- Information Retrieval (IR): Finding relevant documents from a large set (e.g., Google Search).
- Information Extraction (IE): Pulling structured data from unstructured text (e.g., extracting dates and names from emails).
- Chatbots and QA Systems: Interactive agents that answer user queries (e.g., Siri, Alexa, ChatGPT).
- Text Summarization: Creating a short, accurate summary of longer text documents.
- Spell and Grammar Checking: Automated correction tools (e.g., Grammarly).
5. Linguistic Essentials
To process text, one must understand the linguistic levels of language analysis.
A. Morphology
The study of the internal structure of words.
- Morpheme: The smallest meaningful unit of language.
- Stem/Root: The core meaning of the word.
- Affixes: Prefixes (un-) and Suffixes (-ed, -ing).
- Example: The word "Unhappiness" = Un- (Prefix) + happy (Stem) + -ness (Suffix).
B. Syntax
The study of the structural relationships between words; how words are arranged to form sentences.
- Governs word order (e.g., English is SVO: Subject-Verb-Object).
- Parsing: The process of analyzing a string of symbols according to the rules of a formal grammar.
C. Semantics
The study of meaning.
- Lexical Semantics: Meaning of individual words.
- Compositional Semantics: How meanings of individual words combine to form the meaning of a phrase or sentence.
6. Basic Text Processing (Preprocessing)
Before a machine learning model can use text, it must be cleaned and structured. This is the preprocessing pipeline.
Tokenization
The process of breaking a stream of text into smaller units called tokens (words, characters, or subwords).
- Sentence Tokenization: Splitting a paragraph into sentences.
- Word Tokenization: Splitting a sentence into words.
Example:
Input: "It's raining."
Tokens: ["It", "'s", "raining", "."]
Challenges: Handling contractions ("don't" → "do" + "n't"), hyphens, and punctuation.
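A minimal sketch of both tokenization levels, assuming NLTK is installed (the library choice is an assumption; the notes do not prescribe a tool):

```python
# Sentence and word tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # newer NLTK releases may ask for "punkt_tab" instead
from nltk.tokenize import sent_tokenize, word_tokenize

text = "It's raining. Don't forget an umbrella."
print(sent_tokenize(text))  # ["It's raining.", "Don't forget an umbrella."]
print(word_tokenize(text))  # ['It', "'s", 'raining', '.', 'Do', "n't", 'forget', 'an', 'umbrella', '.']
```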
Stop Words
Words that are extremely common but carry very little unique information.
- Examples: "the", "is", "at", "which", "on".
- Process: In traditional NLP (like search engines or bag-of-words models), these are often removed to reduce dataset size and noise.
- Note: In modern Deep Learning (like BERT), stop words are usually kept because they provide sentence structure and context.
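A sketch of stop-word removal using NLTK's built-in English list (an assumed choice; any stop-word list works the same way):

```python
# Removing common English stop words before building a bag-of-words model.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_set = set(stopwords.words("english"))
tokens = "the cat is sitting on the mat".split()
content_words = [t for t in tokens if t not in stop_set]
print(content_words)  # ['cat', 'sitting', 'mat']
```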
Stemming
A heuristic process that chops off the ends of words to reduce them to their base form. It is fast but crude and may result in non-words.
- Algorithm: Porter Stemmer (most common).
- Goal: To group variants of words together.
Examples:
- "connection" "connect"
- "ponies" "poni" (Note: "poni" is not a real word, but Stemming accepts this).
Lemmatization
A process that removes inflectional endings only and returns the Lemma (dictionary form) of the word. It requires a vocabulary and full morphological analysis of words.
- Accuracy: Higher than stemming, but computationally more expensive.
Comparison:
| Word | Stemming Result | Lemmatization Result |
|---|---|---|
| Running | Run | Run |
| Better | Better | Good |
| Stripes | Strip | Stripe |
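A sketch with NLTK's WordNet lemmatizer (an illustrative choice); note that a part-of-speech hint must be supplied for the mappings in the table above to appear:

```python
# WordNet-based lemmatization: needs a vocabulary (WordNet) and a POS hint.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
```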
7. Capturing Word Dependency using TF-IDF
To feed text into algorithms, we must convert words into numbers (Vectorization). While simple counting (Bag of Words) is useful, it doesn't account for the importance of a word.
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a statistical measure used to evaluate how important a word is to a document in a collection (corpus).
1. Term Frequency (TF)
Measures how frequently a term occurs in a document.
Logic: If a word appears many times in a document, it is likely significant to that document.
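One common convention (there are several variants) normalizes the raw count by the document length:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
For example, if "cell" appears 3 times in a 100-word document, TF = 3/100 = 0.03.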
2. Inverse Document Frequency (IDF)
Measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear a lot but have little importance. Thus, we need to weigh down the frequent terms while scaling up the rare ones.
IDF(t) = log(N / df(t)), where:
- N: Total number of documents in the corpus.
- df(t): Number of documents containing the term t.
Logic: If a word appears in every document (like "the"), the denominator is high, making the IDF close to 0. If a word appears in only one document (like "Quantum"), the IDF is high.
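Illustrative numbers (base-10 logarithm, assuming a corpus of N = 1,000 documents): a term found in 990 documents gets IDF = log(1000/990) ≈ 0.004, while a term found in only 10 documents gets IDF = log(1000/10) = 2.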
3. The TF-IDF Score
The final score is the product of the two measures:
TF-IDF(t, d) = TF(t, d) × IDF(t)
A term receives a high score when it is frequent within a given document but rare across the corpus.
Capturing Dependency/Relevance
While TF-IDF does not capture syntactic dependency (grammar structure), it captures statistical dependency:
- Relevance: It highlights words that define the specific nature of a document relative to the wider dataset.
- Signature: It creates a weighted vector where generic words have low weights and domain-specific words have high weights.
Example:
In a database of biology papers, the word "cell" might have low IDF (it's in every paper). However, in a database of general news, "cell" might have high IDF (appearing only in science or prison news), making it a high-dependency keyword for categorization.
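A minimal sketch of TF-IDF vectorization using scikit-learn's TfidfVectorizer (the library and the three-document toy corpus are illustrative assumptions, not part of the notes):

```python
# TF-IDF vectorization: generic words get low weights, rare topic words get high weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cell divides during mitosis",
    "the economy grew this quarter",
    "the prison cell was searched",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)      # shape: (3 documents, vocabulary size)
vocab = list(vectorizer.get_feature_names_out())

row = matrix[0].toarray()[0]                 # weights for the first (biology) document
for term in ("the", "cell", "mitosis"):
    print(term, round(row[vocab.index(term)], 3))
# "the" (in every document) scores lowest; "mitosis" (in one document) scores highest.
```

scikit-learn uses a smoothed IDF and L2-normalizes each document vector, so the absolute numbers differ from the textbook formula, but the ranking is the same: rare, document-specific words outweigh generic ones.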