Unit 6 - Notes
Unit 6: Languages, Automata, Grammars
1. Introduction
In discrete mathematics and theoretical computer science, the study of languages, automata, and grammars forms the foundation of computation theory. This unit explores how strings of characters are formed, how they can be grouped into meaningful sets (languages), how machines can be designed to recognize these sets (automata), and the rule-based systems used to generate them (grammars). These concepts are fundamental in compiler design, natural language processing, pattern matching, and understanding the limits of computation.
2. Alphabet, Words, and Free Semigroup
Alphabet
An alphabet (usually denoted by ) is a finite, non-empty set of indivisible symbols or characters.
- Examples:
- Binary alphabet:
- English alphabet:
- Alphanumeric alphabet:
Words (Strings)
A word (or string) over an alphabet is a finite sequence of symbols drawn from .
- Length: The length of a word , denoted as , is the number of symbols it contains. For example, if , then .
- Empty String: The empty string, denoted by (epsilon) or (lambda), is the string containing zero symbols. Its length is 0 ().
Concatenation
The primary operation on words is concatenation. If and are words, their concatenation is formed by appending the symbols of to the end of .
- Properties of Concatenation:
- Associative:
- Identity: (The empty string acts as the identity element).
Free Semigroup and Free Monoid
- Free Semigroup (): The set of all possible non-empty words that can be formed using the symbols in . It is a semigroup because it is closed under the associative operation of concatenation.
- Free Monoid (): Also known as the Kleene closure of , this is the set of all possible words over , including the empty string .
- It is a monoid because it possesses an identity element () for the concatenation operation.
3. Languages and Regular Languages
Formal Definition of a Language
A language over an alphabet is any subset of .
- A language can be finite or infinite.
- Examples:
- The empty language:
- The language containing only the empty string:
- The language of all binary strings ending in 0:
Operations on Languages
Given two languages and over , we can perform set operations:
- Union (): The set of strings that are in or .
- Concatenation (): The set of strings formed by concatenating a string from with a string from .
- Kleene Star (): The set of strings formed by concatenating zero or more strings from . , where .
Regular Languages
A regular language over an alphabet is defined recursively as follows:
- Base Cases:
- The empty language is a regular language.
- The language is a regular language.
- For every symbol , the language is a regular language.
- Recursive Step: If and are regular languages, then the following are also regular languages:
- (Union)
- (Concatenation)
- (Kleene Star)
- Closure: No other languages are regular unless they can be formed by a finite number of applications of the rules above.
4. Regular Expressions
A regular expression (regex) is an algebraic formula used to represent a regular language. It provides a declarative way to specify patterns of strings.
Formal Definition
Given an alphabet :
- is a regular expression denoting the empty language.
- is a regular expression denoting the language .
- is a regular expression denoting the language .
- If and are regular expressions denoting languages and , then:
- or represents .
- represents .
- represents .
Examples of Regular Expressions
Let
0*10*: The set of all binary strings containing exactly one '1'.(0+1)*00(0+1)*: The set of all binary strings containing the substring '00'.(0+1)*: Represents (all possible binary strings).
5. Finite State Automata
A Finite State Automaton (FSA) is a theoretical mathematical model of computation consisting of a finite number of states. It is used to recognize regular languages.
Deterministic Finite Automata (DFA)
A DFA is a 5-tuple , where:
- : A finite set of states.
- : A finite alphabet.
- : The transition function, . For every state and every input symbol, there is exactly one next state.
- : The start state ().
- : The set of accept/final states ().
How a DFA works:
The automaton begins in , reads an input string symbol by symbol, and transitions between states according to . If the automaton is in a state belonging to after reading the entire string, the string is accepted. The language of the machine, , is the set of all accepted strings.
Nondeterministic Finite Automata (NFA)
An NFA is similar to a DFA, but the transition function allows for zero, one, or multiple transitions from a state for a given input symbol, and may include -transitions (transitions without consuming an input symbol).
- (Power set of ).
- An NFA accepts a string if there exists at least one path leading to an accept state.
Equivalence of DFA and NFA
Despite NFAs appearing more powerful due to branching paths, every NFA can be converted into an equivalent DFA that recognizes the exact same language. This is proven using the Subset Construction Algorithm, where states of the new DFA represent sets of states from the NFA.
Kleene's Theorem
Kleene's Theorem establishes a fundamental equivalence in computer science:
- A language is regular if and only if it can be recognized by a Finite State Automaton.
- Therefore: Regular Expressions Regular Languages Finite State Automata.
6. Grammars and Types of Grammars
While regular expressions generate strings from the bottom up, grammars generate languages using a set of substitution rules.
Formal Definition of a Grammar
A grammar is a 4-tuple , where:
- : A finite set of variables (or non-terminal symbols). Usually denoted by uppercase letters (e.g., ).
- : A finite set of terminal symbols (the alphabet of the language). Usually denoted by lowercase letters (e.g., ). .
- : The start symbol ().
- : A finite set of production rules, where each rule replaces a sequence of symbols with another sequence.
Derivation: The process of generating a string by starting with and repeatedly applying production rules until only terminal symbols remain. The language generated by a grammar is denoted .
The Chomsky Hierarchy
Noam Chomsky classified grammars into four distinct types, forming a strict hierarchy where each level is a superset of the level below it.
Type 0: Recursively Enumerable (Unrestricted) Grammars
- Rules: , where contains at least one variable, and .
- Recognizing Machine: Turing Machine.
- Description: No restrictions on production rules. They generate all languages computable by a Turing machine.
Type 1: Context-Sensitive Grammars
- Rules: , with the restriction that (the length of the right side must be greater than or equal to the left side). The only exception allowed is if does not appear on the right side of any rule.
- Recognizing Machine: Linear Bounded Automaton (LBA).
- Description: Replacements depend on the surrounding context of the variable.
Type 2: Context-Free Grammars (CFG)
- Rules: , where is a single variable () and .
- Recognizing Machine: Pushdown Automaton (PDA).
- Description: Variables can be replaced regardless of their context. CFGs are heavily used in defining the syntax of programming languages (e.g., Backus-Naur Form/BNF).
Type 3: Regular Grammars
- Rules: Highly restricted. Rules must be of the form:
- Right-linear: or
- Left-linear: or
(where and )
- Recognizing Machine: Finite State Automaton (DFA/NFA).
- Description: The most restricted type of grammar. They generate exactly the regular languages (equivalent to regular expressions and FSAs).