MTH265

Unit 6: Languages, Automata, Grammars

1. Introduction

In discrete mathematics and theoretical computer science, the study of languages, automata, and grammars forms the foundation of computation theory. This unit explores how strings of characters are formed, how they can be grouped into meaningful sets (languages), how machines can be designed to recognize these sets (automata), and the rule-based systems used to generate them (grammars). These concepts are fundamental in compiler design, natural language processing, pattern matching, and understanding the limits of computation.

2. Alphabet, Words, and Free Semigroup

Alphabet

An alphabet (usually denoted by $\Sigma$ ) is a finite, non-empty set of indivisible symbols or characters.

Examples:
- Binary alphabet: $\Sigma = \{0, 1\}$
- English alphabet: $\Sigma = \{a, b, c, \dots, z\}$
- Alphanumeric alphabet: $\Sigma = \{a, \dots, z, A, \dots, Z, 0, \dots, 9\}$

Words (Strings)

A word (or string) over an alphabet $\Sigma$ is a finite sequence of symbols drawn from $\Sigma$ .

Length: The length of a word $w$ , denoted as $|w|$ , is the number of symbols it contains. For example, if $w = 0110$ , then $|w| = 4$ .
Empty String: The empty string, denoted by $\epsilon$ (epsilon) or $\lambda$ (lambda), is the string containing zero symbols. Its length is 0 ( $|\epsilon| = 0$ ).

Concatenation

The primary operation on words is concatenation. If $x$ and $y$ are words, their concatenation $xy$ is formed by appending the symbols of $y$ to the end of $x$ .

Properties of Concatenation:
- Associative: $(xy)z = x(yz)$
- Identity: $w\epsilon = \epsilon w = w$ (The empty string acts as the identity element).

Free Semigroup and Free Monoid

Free Semigroup ( $\Sigma^+$ ): The set of all possible non-empty words that can be formed using the symbols in $\Sigma$ . It is a semigroup because it is closed under the associative operation of concatenation.
- $\Sigma^+ = \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \dots$
Free Monoid ( $\Sigma^*$ ): Also known as the Kleene closure of $\Sigma$ , this is the set of all possible words over $\Sigma$ , including the empty string $\epsilon$ .
- $\Sigma^* = \Sigma^+ \cup \{\epsilon\}$
- It is a monoid because it possesses an identity element ( $\epsilon$ ) for the concatenation operation.
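
Since $\Sigma^*$ is infinite, only a finite slice of it can be listed. The sketch below enumerates all words up to a chosen length for the binary alphabet. The function name `words_up_to` and its parameters are illustrative choices, not part of the notes.

```python
from itertools import product

def words_up_to(alphabet, max_len, include_empty=True):
    """Return all words over `alphabet` of length <= max_len, shortest first."""
    start = 0 if include_empty else 1
    result = []
    for n in range(start, max_len + 1):
        # product(..., repeat=n) yields every n-tuple of symbols, i.e. Sigma^n.
        # For n = 0 it yields one empty tuple, which joins to the empty string.
        for symbols in product(alphabet, repeat=n):
            result.append("".join(symbols))
    return result

sigma = ["0", "1"]
star_slice = words_up_to(sigma, 2)                       # part of Sigma^*
plus_slice = words_up_to(sigma, 2, include_empty=False)  # part of Sigma^+
print(star_slice)  # ['', '0', '1', '00', '01', '10', '11']
print(plus_slice)  # ['0', '1', '00', '01', '10', '11']
```

The only difference between the two slices is the empty string, which mirrors $\Sigma^* = \Sigma^+ \cup \{\epsilon\}$ .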

3. Languages and Regular Languages

Formal Definition of a Language

A language $L$ over an alphabet $\Sigma$ is any subset of $\Sigma^*$ .

$L \subseteq \Sigma^*$
A language can be finite or infinite.
Examples:
- The empty language: $\emptyset$
- The language containing only the empty string: $\{\epsilon\}$
- The language of all binary strings ending in 0: $L = \{0, 10, 00, 110, \dots\}$

Operations on Languages

Given two languages $L_1$ and $L_2$ over $\Sigma$ (or a single language $L$ ), we can build new languages with the following operations:

Union ( $L_1 \cup L_2$ ): The set of strings that are in $L_1$ or $L_2$ .
Concatenation ( $L_1L_2$ ): The set of strings formed by concatenating a string from $L_1$ with a string from $L_2$ . $L_1L_2 = \{xy \mid x \in L_1, y \in L_2\}$
Kleene Star ( $L^*$ ): The set of strings formed by concatenating zero or more strings from $L$ . $L^* = \bigcup_{i=0}^{\infty} L^i$ , where $L^0 = \{\epsilon\}$ .
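
For finite languages, these operations can be tried out directly with Python sets of strings. This is a minimal sketch: the helper names `concat`, `power`, and `star_up_to` are assumptions made for illustration, and because $L^*$ is usually infinite, the star is cut off at a chosen maximum power.

```python
def concat(l1, l2):
    """L1L2 = { xy | x in L1, y in L2 }."""
    return {x + y for x in l1 for y in l2}

def power(lang, i):
    """L^i, with L^0 = {epsilon} (the empty string)."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result

def star_up_to(lang, max_power):
    """Finite approximation of L^*: the union of L^0 .. L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(lang, i)
    return result

l1 = {"a", "ab"}
l2 = {"b", ""}
print(sorted(l1 | l2))                 # ['', 'a', 'ab', 'b']
print(sorted(concat(l1, l2)))          # ['a', 'ab', 'abb']
print(sorted(star_up_to({"ab"}, 2)))   # ['', 'ab', 'abab']
```

Note that `concat(l1, l2)` has only three elements even though there are four pairs: both `"a" + "b"` and `"ab" + ""` give `"ab"`.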

Regular Languages

A regular language over an alphabet $\Sigma$ is defined recursively as follows:

Base Cases:
- The empty language $\emptyset$ is a regular language.
- The language $\{\epsilon\}$ is a regular language.
- For every symbol $a \in \Sigma$ , the language $\{a\}$ is a regular language.
Recursive Step: If $L_1$ and $L_2$ are regular languages, then the following are also regular languages:
- $L_1 \cup L_2$ (Union)
- $L_1L_2$ (Concatenation)
- $L_1^*$ (Kleene Star)
Closure: A language is regular only if it can be obtained from the base cases by a finite number of applications of the recursive step.

4. Regular Expressions

A regular expression (regex) is an algebraic formula used to represent a regular language. It provides a declarative way to specify patterns of strings.

Formal Definition

Given an alphabet $\Sigma$ :

$\emptyset$ is a regular expression denoting the empty language.
$\epsilon$ is a regular expression denoting the language $\{\epsilon\}$ .
$a \in \Sigma$ is a regular expression denoting the language $\{a\}$ .
If $R_1$ and $R_2$ are regular expressions denoting languages $L_1$ and $L_2$ , then:
- $(R_1 + R_2)$ or $(R_1 \mid R_2)$ represents $L_1 \cup L_2$ .
- $(R_1R_2)$ represents $L_1L_2$ .
- $(R_1^*)$ represents $L_1^*$ .

Examples of Regular Expressions

Let $\Sigma = \{0, 1\}$

$0^*10^*$ : The set of all binary strings containing exactly one '1'.
$(0+1)^*00(0+1)^*$ : The set of all binary strings containing the substring '00'.
$(0+1)^*$ : Represents $\Sigma^*$ (all possible binary strings).
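
The three examples above can be checked with Python's `re` module. This is a sketch under one translation assumption: the textbook union operator `+` is written as `|`, because in Python's regex syntax `+` means "one or more". The dictionary keys and the helper `in_language` are illustrative names.

```python
import re

# Textbook expressions rewritten in Python syntax ('+' for union becomes '|').
PATTERNS = {
    "exactly one 1": r"0*10*",
    "contains 00": r"(0|1)*00(0|1)*",
    "any binary string": r"(0|1)*",
}

def in_language(pattern, word):
    """True if the WHOLE word matches the pattern (not just a prefix)."""
    return re.fullmatch(pattern, word) is not None

print(in_language(PATTERNS["exactly one 1"], "00100"))   # True
print(in_language(PATTERNS["exactly one 1"], "0110"))    # False
print(in_language(PATTERNS["contains 00"], "1001"))      # True
print(in_language(PATTERNS["contains 00"], "0101"))      # False
print(in_language(PATTERNS["any binary string"], ""))    # True
```

`re.fullmatch` is used rather than `re.match` or `re.search` because language membership requires the entire string to match.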

5. Finite State Automata

A Finite State Automaton (FSA) is a theoretical mathematical model of computation consisting of a finite number of states. It is used to recognize regular languages.

Deterministic Finite Automata (DFA)

A DFA is a 5-tuple $M = (Q, \Sigma, \delta, q_0, F)$ , where:

$Q$ : A finite set of states.
$\Sigma$ : A finite alphabet.
$\delta$ : The transition function, $\delta: Q \times \Sigma \rightarrow Q$ . For every state and every input symbol, there is exactly one next state.
$q_0$ : The start state ( $q_0 \in Q$ ).
$F$ : The set of accept/final states ( $F \subseteq Q$ ).

How a DFA works:
The automaton begins in $q_0$ , reads an input string symbol by symbol, and transitions between states according to $\delta$ . If the automaton is in a state belonging to $F$ after reading the entire string, the string is accepted. The language of the machine, $L(M)$ , is the set of all accepted strings.
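
The procedure just described can be sketched in a few lines. The example machine below is an assumed illustration (it is not given in the notes): a two-state DFA for the language of binary strings ending in 0, with the transition function stored as a dictionary keyed by (state, symbol).

```python
def dfa_accepts(delta, start, accept, word):
    """Run a DFA on `word`; delta maps (state, symbol) -> next state."""
    state = start
    for symbol in word:
        # Exactly one next state exists for every (state, symbol) pair.
        state = delta[(state, symbol)]
    return state in accept

# Illustrative DFA: q1 means "the last symbol read was 0".
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
start, accept = "q0", {"q1"}

for w in ["0", "10", "110", "01", ""]:
    print(repr(w), dfa_accepts(delta, start, accept, w))
# '0' True, '10' True, '110' True, '01' False, '' False
```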

Nondeterministic Finite Automata (NFA)

An NFA is similar to a DFA, but the transition function $\delta$ allows for zero, one, or multiple transitions from a state for a given input symbol, and may include $\epsilon$ -transitions (transitions without consuming an input symbol).

$\delta: Q \times (\Sigma \cup \{\epsilon\}) \rightarrow \mathcal{P}(Q)$ (Power set of $Q$ ).
An NFA accepts a string if there exists at least one path leading to an accept state.
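
One way to simulate an NFA is to track the set of all states it could currently be in, following $\epsilon$ -transitions whenever possible. The sketch below makes two assumptions for illustration: $\epsilon$ is encoded as the empty string `""` in the transition table, and the example machine (recognizing $0^*1^*$ ) is not taken from the notes.

```python
def epsilon_closure(delta, states):
    """All states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def nfa_accepts(delta, start, accept, word):
    """delta maps (state, symbol) -> set of states; "" encodes epsilon."""
    current = epsilon_closure(delta, {start})
    for symbol in word:
        moved = set()
        for q in current:
            # A missing entry means "no transition", i.e. that path dies.
            moved |= delta.get((q, symbol), set())
        current = epsilon_closure(delta, moved)
    # Accept if at least one possible path ends in an accept state.
    return bool(current & accept)

# Illustrative NFA for 0*1*: read 0s in q0, jump to q1 for free, read 1s.
delta = {
    ("q0", "0"): {"q0"},
    ("q0", ""): {"q1"},
    ("q1", "1"): {"q1"},
}
print(nfa_accepts(delta, "q0", {"q1"}, "0011"))  # True
print(nfa_accepts(delta, "q0", {"q1"}, ""))      # True
print(nfa_accepts(delta, "q0", {"q1"}, "10"))    # False
```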

Equivalence of DFA and NFA

Despite NFAs appearing more powerful due to branching paths, every NFA can be converted into an equivalent DFA that recognizes the exact same language. This is proven using the Subset Construction Algorithm, where states of the new DFA represent sets of states from the NFA.
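
A minimal sketch of the subset construction follows. It builds only the DFA states reachable from the start state, encodes $\epsilon$ as `""`, and represents each DFA state as a `frozenset` of NFA states. The function names and the example NFA (binary strings ending in 01) are assumptions for illustration.

```python
from collections import deque

def epsilon_closure(delta, states):
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def subset_construction(delta, start, accept, alphabet):
    """Convert an NFA into an equivalent DFA over sets of NFA states."""
    start_set = frozenset(epsilon_closure(delta, {start}))
    dfa_delta, dfa_accept = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & accept:  # contains at least one NFA accept state
            dfa_accept.add(current)
        for symbol in alphabet:
            moved = set()
            for q in current:
                moved |= delta.get((q, symbol), set())
            target = frozenset(epsilon_closure(delta, moved))
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_accept

def dfa_accepts(dfa_delta, start, accept, word):
    state = start
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in accept

# Illustrative NFA: guess when the final "01" begins.
nfa_delta = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
dfa_delta, dfa_start, dfa_accept = subset_construction(
    nfa_delta, "q0", {"q2"}, ["0", "1"])
dfa_states = {state for (state, _) in dfa_delta}
print(len(dfa_states))                                      # 3
print(dfa_accepts(dfa_delta, dfa_start, dfa_accept, "1101"))  # True
print(dfa_accepts(dfa_delta, dfa_start, dfa_accept, "10"))    # False
```

Here the three reachable DFA states are $\{q_0\}$ , $\{q_0, q_1\}$ , and $\{q_0, q_2\}$ . In general an NFA with $n$ states can require up to $2^n$ DFA states.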

Kleene's Theorem

Kleene's Theorem establishes a fundamental equivalence in computer science:

A language is regular if and only if it can be recognized by a Finite State Automaton.
Therefore: Regular Expressions $\equiv$ Regular Languages $\equiv$ Finite State Automata.

6. Grammars and Types of Grammars

While regular expressions describe a language by combining smaller patterns with union, concatenation, and star, grammars generate a language by starting from a single symbol and repeatedly applying substitution rules.

Formal Definition of a Grammar

A grammar $G$ is a 4-tuple $G = (V, T, S, P)$ , where:

$V$ : A finite set of variables (or non-terminal symbols). Usually denoted by uppercase letters (e.g., $A, B, S$ ).
$T$ : A finite set of terminal symbols (the alphabet $\Sigma$ of the language). Usually denoted by lowercase letters (e.g., $a, b, 0, 1$ ). $V \cap T = \emptyset$ .
$S$ : The start symbol ( $S \in V$ ).
$P$ : A finite set of production rules, where each rule replaces a sequence of symbols with another sequence.

Derivation: The process of generating a string by starting with $S$ and repeatedly applying production rules until only terminal symbols remain. The language generated by a grammar $G$ is denoted $L(G)$ .
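
To make derivations concrete, the sketch below enumerates the strings generated by a small grammar by always rewriting the leftmost variable. It rests on several assumptions that are not in the notes: every rule has a single variable on its left side, variables are exactly the dictionary keys, and the example grammar $S \rightarrow aSb \mid \epsilon$ is chosen for illustration. The pruning step is enough for this example, but a grammar with rules such as $S \rightarrow SS$ would need a stronger bound to guarantee termination.

```python
from collections import deque

def generate(productions, start, max_len):
    """Terminal strings of length <= max_len derivable from `start`.

    productions maps each variable (one character) to a list of
    right-hand sides; "" stands for epsilon.
    """
    results = set()
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if only terminals remain.
        idx = next((i for i, c in enumerate(form) if c in productions), None)
        if idx is None:
            results.add(form)
            continue
        for rhs in productions[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # Terminals never disappear, so too many terminals means too long.
            terminals = sum(1 for c in new_form if c not in productions)
            if terminals <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return results

productions = {"S": ["aSb", ""]}
print(sorted(generate(productions, "S", 4), key=len))  # ['', 'ab', 'aabb']
```

For instance, the string $aabb$ arises from the derivation $S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb$ .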

The Chomsky Hierarchy

Noam Chomsky classified grammars into four types, forming a strict hierarchy in which each type generates a proper superset of the languages of the next, more restricted type: Type 3 $\subset$ Type 2 $\subset$ Type 1 $\subset$ Type 0.

Type 0: Recursively Enumerable (Unrestricted) Grammars

Rules: $\alpha \rightarrow \beta$ , where $\alpha$ contains at least one variable, and $\alpha, \beta \in (V \cup T)^*$ .
Recognizing Machine: Turing Machine.
Description: No restrictions on production rules beyond the one above. They generate exactly the recursively enumerable languages, i.e., the languages that can be recognized (accepted) by a Turing machine.

Type 1: Context-Sensitive Grammars

Rules: $\alpha \rightarrow \beta$ , with the restriction that $|\alpha| \le |\beta|$ (the length of the right side must be greater than or equal to the left side). The only exception allowed is $S \rightarrow \epsilon$ if $S$ does not appear on the right side of any rule.
Recognizing Machine: Linear Bounded Automaton (LBA).
Description: Replacements depend on the surrounding context of the variable.

Type 2: Context-Free Grammars (CFG)

Rules: $A \rightarrow \gamma$ , where $A$ is a single variable ( $A \in V$ ) and $\gamma \in (V \cup T)^*$ .
Recognizing Machine: (Nondeterministic) Pushdown Automaton (PDA).
Description: Variables can be replaced regardless of their context. CFGs are heavily used in defining the syntax of programming languages (e.g., Backus-Naur Form/BNF).

Type 3: Regular Grammars

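Rules: Highly restricted. All rules of the grammar must be of one of the following two forms (the two forms must not be mixed within one grammar, since mixing them can generate non-regular languages):
- Right-linear: $A \rightarrow aB$ or $A \rightarrow a$
- Left-linear: $A \rightarrow Ba$ or $A \rightarrow a$
  (where $A, B \in V$ and $a \in T^*$ , i.e., $a$ is a possibly empty string of terminals)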
Recognizing Machine: Finite State Automaton (DFA/NFA).
Description: The most restricted type of grammar. They generate exactly the regular languages (equivalent to regular expressions and FSAs).
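
The link between right-linear grammars and finite automata can be sketched as a direct translation: each variable becomes a state, a rule $A \rightarrow aB$ becomes a transition from $A$ to $B$ on $a$ , and a rule $A \rightarrow a$ becomes a transition into an extra accept state. The code below is a simplified illustration that assumes each rule uses a single terminal symbol (or $\epsilon$ ); the function names, the state label `#final`, and the example grammar are assumptions, not part of the notes.

```python
def grammar_to_nfa(productions, start):
    """Translate a right-linear grammar into an NFA (delta, start, accept).

    productions maps each variable to right-hand sides of the form
    "aB" (terminal then variable), "a" (one terminal), or "" (epsilon).
    """
    final = "#final"  # extra accept state, assumed not to be a variable name
    delta, accept = {}, {final}
    for var, rhss in productions.items():
        for rhs in rhss:
            if rhs == "":
                accept.add(var)  # A -> epsilon: A itself is accepting
            elif len(rhs) == 1 and rhs not in productions:
                delta.setdefault((var, rhs), set()).add(final)
            elif len(rhs) == 2 and rhs[0] not in productions and rhs[1] in productions:
                delta.setdefault((var, rhs[0]), set()).add(rhs[1])
            else:
                raise ValueError(f"not a supported right-linear rule: {var} -> {rhs}")
    return delta, start, accept

def nfa_accepts(delta, start, accept, word):
    current = {start}
    for symbol in word:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, symbol), set())
        current = nxt
    return bool(current & accept)

# Illustrative grammar for binary strings ending in 0: S -> 0S | 1S | 0
productions = {"S": ["0S", "1S", "0"]}
delta, start, accept = grammar_to_nfa(productions, "S")
print(nfa_accepts(delta, start, accept, "10"))  # True
print(nfa_accepts(delta, start, accept, "01"))  # False
```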
