Unit 1 - Notes

CSE472

Unit 1: Foundations of NLP and Text Processing

1. Introduction to Natural Language Processing (NLP)

Origin of NLP

The field of Natural Language Processing (NLP) sits at the intersection of computer science, artificial intelligence, and linguistics. Its evolution can be divided into several distinct eras:

1950s - The Turing Test & Early Machine Translation: The origins trace back to Alan Turing's 1950 paper proposing the "Turing Test" as a measure of machine intelligence. In 1954, the Georgetown-IBM experiment automatically translated more than sixty Russian sentences into English, sparking widespread interest in machine translation.
1960s - Symbolic and Rule-Based NLP: Programs like ELIZA (1966) by Joseph Weizenbaum simulated a Rogerian psychotherapist using pattern matching and substitution rules, demonstrating early human-computer interaction, albeit without actual understanding.
1980s to 1990s - The Statistical Revolution: The introduction of machine learning algorithms for language processing shifted the focus from hand-written rules to statistical models (e.g., Hidden Markov Models) driven by large text corpora.
2010s to Present - The Deep Learning Era: The rise of deep neural networks (RNNs, CNNs, and later Transformers) revolutionized NLP, enabling models to learn complex linguistic patterns and semantics without heavy manual feature engineering.

Applications of NLP

Machine Translation: Automatically translating text from one language to another (e.g., Google Translate).
Sentiment Analysis: Determining the underlying sentiment (positive, negative, neutral) of a text, heavily used in social media monitoring and product reviews.
Information Extraction & Named Entity Recognition (NER): Pulling specific entities (names, dates, locations) and relationships from unstructured text.
Question Answering (QA): Systems capable of answering questions posed in natural language (e.g., Siri, Alexa, ChatGPT).
Text Summarization: Producing a shorter version of a document, either by extracting its most important sentences (extractive) or by generating new text that conveys the key points (abstractive).

Challenges of NLP

Understanding human language computationally is difficult due to its inherent complexity:

Ambiguity: Words or sentences can have multiple meanings.
- Lexical ambiguity: "Bank" (river bank vs. financial institution).
- Syntactic ambiguity: "I saw the man with the telescope" (Who has the telescope?).
Context and Pragmatics: Meaning often depends on the situation or cultural context.
Sarcasm and Irony: Literal meaning contradicts the intended meaning, which statistical models struggle to capture without deep contextual cues.
Lack of Standardization: Slang, dialects, typos, and informal language (e.g., Twitter/X data) violate standard grammar rules.
Data Sparsity: No matter how large a dataset is, there will always be valid word combinations that have never been seen before.

2. Linguistic Essentials: Language and Grammar

To process language, we must understand its structural rules. Language is a structured system of communication, and grammar is the set of rules governing the assembly of sentences, phrases, and words.

Morphology

Morphology is the study of the internal structure of words.

Morphemes: The smallest meaning-bearing units in a language.
- Free morphemes can stand alone as words (e.g., "play").
- Bound morphemes must be attached to other morphemes (e.g., "-er" in "player", "un-" in "unhappy").
Affixation: The process of adding prefixes, suffixes, or infixes to a root word to alter its meaning or syntactic category (e.g., friend $\rightarrow$ friendly).
Understanding morphology is crucial for tasks like stemming and lemmatization.

Syntax

Syntax is the study of how words are arranged to form grammatical sentences.

Parts of Speech (POS): Categories of words that share similar grammatical properties (nouns, verbs, adjectives, etc.).
Phrase Structure: Words group together into phrases (e.g., Noun Phrases, Verb Phrases), which hierarchically combine to form sentences.
Parsing: The computational process of analyzing a string of symbols to determine its grammatical structure (producing syntactic trees).

Semantics

Semantics deals with the meaning of text.

Lexical Semantics: The meaning of individual words and the relationships between them (synonyms, antonyms, hypernyms).
Compositional Semantics: How the meanings of individual words combine to form the meaning of sentences. "The dog chased the cat" means something entirely different from "The cat chased the dog," despite containing the exact same words.
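
The point about word order can be checked with a short sketch. It treats each sentence as an unordered collection of words (the helper name word_multiset is made up for illustration) and shows that the two sentences only differ once order is taken into account:

```python
from collections import Counter

def word_multiset(sentence: str) -> Counter:
    """Lowercase the sentence and count its whitespace-separated words."""
    return Counter(sentence.lower().split())

a = "The dog chased the cat"
b = "The cat chased the dog"

# Same words with the same counts, so an order-free view cannot tell them apart.
same_words = word_multiset(a) == word_multiset(b)
# The word sequences differ, and that order carries who chased whom.
same_order = a.lower().split() == b.lower().split()

print(same_words, same_order)
```

This prints True False: the vocabulary is identical, while the sequences are not.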

3. Text Preprocessing Techniques

Before feeding text into machine learning or deep learning models, it must be cleaned and transformed into a standardized format.

Normalization

Text normalization converts text into a more uniform standard.

Lowercasing: Converting all text to lower case to ensure that "Apple" and "apple" are treated as the same word.
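Unicode/Accent Handling: Applying a Unicode normalization form (e.g., NFC or NFKD) so that equivalent characters share a single representation, and optionally stripping accents (e.g., changing "café" to "cafe") to reduce vocabulary variations.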
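
A minimal sketch of both normalization steps using only Python's standard library. The function name normalize_text and its strip_accents flag are illustrative choices, not part of any particular NLP toolkit:

```python
import unicodedata

def normalize_text(text: str, strip_accents: bool = True) -> str:
    """Lowercase, apply Unicode normalization, and optionally drop accents."""
    text = text.lower()
    if strip_accents:
        # NFKD splits an accented letter into a base letter plus a combining mark.
        decomposed = unicodedata.normalize("NFKD", text)
        # Drop the combining marks, keeping only the base characters.
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so equivalent strings share one canonical form.
    return unicodedata.normalize("NFC", text)

print(normalize_text("Café APPLE"))
```

This prints cafe apple. Whether accents should be stripped depends on the language and task, since in some languages the accent distinguishes different words.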

Tokenization

Tokenization is the process of breaking down a stream of text into smaller units called tokens (words, subwords, or sentences).

Word Tokenization: Splitting text by spaces and punctuation.
Sentence Tokenization: Splitting text into separate sentences at terminal punctuation (periods, question marks, exclamation marks). Abbreviations such as "Dr." make naive splitting error-prone.
Example (word tokenization): "Deep learning is great!" $\rightarrow$ ["Deep", "learning", "is", "great", "!"]
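
Both kinds of tokenization can be approximated with regular expressions. The sketch below is deliberately naive (the function names are illustrative, and production tokenizers handle abbreviations, contractions, and URLs far more carefully):

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text: str) -> list[str]:
    """Naively split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(word_tokenize("Deep learning is great!"))
print(sentence_tokenize("NLP is fun. Is it hard? Sometimes!"))
```

The first call reproduces the token list from the example; the second returns three sentences. A string like "Dr. Smith arrived." would be split incorrectly after "Dr.", which is why real sentence splitters use more than punctuation.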

Punctuation Handling

Punctuation often acts as noise in text classification with order-free representations (like bag-of-words models), though it carries meaning in sequence models.

Removal: Stripping out commas, periods, etc., using regular expressions.
Retention: Keeping punctuation as separate tokens if syntactic structure is required (e.g., for POS tagging).
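
The two options can be sketched side by side. This assumes ASCII punctuation only (Python's string.punctuation); the function names are illustrative:

```python
import re
import string

# Character class matching any ASCII punctuation mark.
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def remove_punctuation(text: str) -> str:
    """Removal: replace punctuation with spaces, then collapse whitespace."""
    return " ".join(PUNCT_RE.sub(" ", text).split())

def separate_punctuation(text: str) -> list[str]:
    """Retention: keep each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Wait, is this good?"
print(remove_punctuation(sample))
print(separate_punctuation(sample))
```

Replacing punctuation with a space (rather than deleting it) is a design choice: it turns "state-of-the-art" into four words instead of one fused string, at the cost of splitting contractions such as "don't".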

Stop-word Removal

Stop-words are extremely common words (e.g., "the", "is", "in", "and") that often do not carry significant meaning for tasks like topic modeling or document classification.

Removing them reduces the vocabulary size and computational load.
Note: In modern Deep Learning (e.g., Transformers), stop-words are usually kept because they provide crucial syntactic context.
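
A small sketch of stop-word filtering. The stop list here is a tiny hand-written set chosen for illustration; real lists are much longer and vary by library and language:

```python
# A tiny illustrative stop list, not a standard one.
STOP_WORDS = {"the", "is", "in", "and", "a", "of", "on"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = "The cat is on the mat and the dog is in the yard".split()
print(remove_stop_words(tokens))
```

Thirteen tokens shrink to four content words: cat, mat, dog, yard. Care is needed with words such as "not", which appear on some stop lists but flip the meaning in sentiment tasks.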

Stemming

Stemming is a crude heuristic process that chops off the ends of words to reduce them to a base or "root" form.

It operates without knowledge of the context.
Example: running, runs $\rightarrow$ run. universities $\rightarrow$ univers (not a real word). Results depend on the rule set: the Porter stemmer leaves runner unchanged.
Algorithms: Porter Stemmer, Snowball Stemmer.
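
To make the "chop off the ends" idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter or Snowball algorithm: the four suffix rules and the function name toy_stem are invented for illustration, so its outputs differ from Porter's on many words.

```python
# Hypothetical suffix rules for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runs", "jumped", "meeting", "universities"]:
    print(w, "->", toy_stem(w))
```

Two outputs show the crudeness: "meeting" becomes "meet" even when it is used as a noun, because the rules ignore context, and "universities" becomes the non-word "universiti".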

Lemmatization

Lemmatization is a more sophisticated approach that reduces a word to its proper dictionary form (the lemma).

It uses vocabulary and morphological analysis, considering the POS of the word.
Example: better $\rightarrow$ good (Stemming would fail here). running (verb) $\rightarrow$ run, but meeting (noun) remains meeting.
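
The role of the POS tag can be sketched with a lookup table. The mini-lexicon below is hypothetical and covers only a handful of words; real lemmatizers combine a full dictionary with morphological rules:

```python
# Hypothetical mini-lexicon keyed by (word, POS), for illustration only.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("mice", "NOUN"): "mouse",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_lemmatize("meeting", "VERB"))
```

The same surface form "meeting" maps to different lemmas depending on its tag, which is exactly the information a context-free stemmer lacks.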

Handling Out-Of-Vocabulary (OOV) Words

OOV words are words that appear in the testing/production data but were not present in the training vocabulary.

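UNK Token: Replacing rare words with a special <UNK> (unknown) token during training so the model learns a representation for unseen words; at inference time, any OOV word is mapped to the same token.
Subword Tokenization: Used heavily in Deep Learning. Algorithms like Byte-Pair Encoding (BPE), WordPiece, or the unigram language model (the latter available in the SentencePiece toolkit, which also implements BPE) break OOV words down into known sub-components.
- Example: If "unhappiness" is OOV, it might be tokenized into ["un", "##happi", "##ness"] (the "##" prefix is the WordPiece convention for a piece that continues a word).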
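
Both strategies can be sketched in a few lines. The subword part shows only the segmentation step, using greedy longest-match-first lookup in the style of WordPiece; learning the subword vocabulary from data is not shown, and both vocabularies below are hand-picked for illustration:

```python
UNK = "<UNK>"

def map_to_unk(tokens: list[str], vocab: set[str]) -> list[str]:
    """Replace every token missing from the vocabulary with <UNK>."""
    return [tok if tok in vocab else UNK for tok in tokens]

def greedy_subword_tokenize(word: str, subwords: set[str]) -> list[str]:
    """Greedy longest-match-first split; continuation pieces start with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in subwords:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no known piece covers this position
        pieces.append(match)
        start = end
    return pieces

word_vocab = {"i", "feel", "happy"}                           # assumed word vocabulary
subword_vocab = {"un", "happy", "##happi", "##ness", "##s"}   # assumed subword vocabulary

print(map_to_unk(["i", "feel", "unhappiness"], word_vocab))
print(greedy_subword_tokenize("unhappiness", subword_vocab))
```

With the word-level vocabulary, "unhappiness" collapses to <UNK> and its meaning is lost; with the subword vocabulary it is split into un, ##happi, ##ness, matching the example above.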

4. Feature Extraction and Text Representation

Machine learning models require numerical input. Text representation techniques convert tokens into numerical vectors.

Bag-of-Words (BoW)

The Bag-of-Words model represents a document as a sparse vector indicating the frequency of vocabulary words within the document, completely disregarding grammar and word order.

Process:
1. Create a vocabulary from all documents.
2. Count the occurrences of each vocabulary word in the current document.
Pros: Simple, easy to implement.
Cons: Discards word order (and with it much of the meaning), produces highly sparse vectors, and gives undue weight to frequent but uninformative words.
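
The two-step process can be sketched in plain Python. The documents are assumed to be already tokenized and lowercased, and the helper names are illustrative:

```python
from collections import Counter

def build_vocabulary(docs: list[list[str]]) -> list[str]:
    """Step 1: collect the sorted set of all tokens across documents."""
    return sorted({tok for doc in docs for tok in doc})

def bow_vector(doc: list[str], vocab: list[str]) -> list[int]:
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

docs = [
    "i love nlp".split(),
    "i love love deep learning".split(),
]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

The vocabulary is ['deep', 'i', 'learning', 'love', 'nlp'], and the documents become [0, 1, 0, 1, 1] and [1, 1, 1, 2, 0]. Even in this tiny case several entries are zero, which hints at how sparse the vectors become with a realistic vocabulary.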

N-grams

N-grams extend BoW to capture local word order by using contiguous sequences of n items (typically words) from a text as features.

Unigram (1-gram): "I", "love", "NLP"
Bigram (2-gram): "I love", "love NLP"
Trigram (3-gram): "I love NLP"
Pros: Captures some context and negation (e.g., "not good").
Cons: The number of possible n-grams grows exponentially with n (up to $|V|^n$ for a vocabulary of size $|V|$), leading to severe data sparsity.
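
A minimal sketch of n-gram extraction over a token list (the function name ngrams is an illustrative choice):

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-token windows, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "NLP"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

A sequence of k tokens yields k - n + 1 n-grams, so the three calls return three unigrams, two bigrams, and one trigram. Counting these tuples instead of single words gives an n-gram version of the BoW vector.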

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF improves upon simple frequency counting by penalizing words that appear in almost every document (like "the") and giving higher weight to terms that are frequent in a specific document but rare across the entire corpus.

1. Term Frequency (TF): Measures how frequently a term occurs in a document.
$TF(t, d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total number of words in document } d}$

2. Inverse Document Frequency (IDF): Measures how rare, and therefore how informative, a term is across the whole corpus.
$IDF(t, D) = \log\left(\frac{\text{Total number of documents } (N)}{\text{Number of documents containing term } t}\right)$

3. TF-IDF Score:
$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$
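
The three formulas can be followed term by term in plain Python. This sketch assumes the natural logarithm (the formula does not fix a base) and a small made-up corpus of pre-tokenized documents:

```python
import math

def tf(term: str, doc: list[str]) -> float:
    """Count of the term in the document divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list[list[str]]) -> float:
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Product of the two scores above."""
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]

print(tf_idf("the", corpus[0], corpus))
print(round(tf_idf("dog", corpus[1], corpus), 4))
```

"the" occurs in all three documents, so its IDF is log(3/3) = 0 and its score is 0.0. "dog" occurs in one document, giving (1/3) x ln(3), about 0.3662.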

Example Implementation in Python (using Scikit-Learn):

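```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Machine learning is fascinating.",
    "Deep learning for natural language processing.",
    "Natural language processing is a subset of artificial intelligence."
]

# Defaults: lowercasing, smoothed IDF, and L2-normalized rows, so the values
# differ from the plain TF x IDF formulas above.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Vocabulary terms, in column order
print(vectorizer.get_feature_names_out())
# IDF weight learned for each term
print(vectorizer.idf_)
# TF-IDF matrix: one row per document, one column per term
print(tfidf_matrix.toarray())
```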
