Unit 3 - Notes

CSE322

Unit 3: FORMAL LANGUAGES

1. Definition of a Grammar

A Grammar is a mathematical model used to define the structure of a language. It is a set of rules (productions) used to generate valid strings in that language.

Formally, a grammar $G$ is defined as a 4-tuple:
$G = (V, T, P, S)$

Where:

$V$ (Variables / Non-terminals): A finite set of symbols that denote syntactic categories. These are typically represented by uppercase letters (e.g., $A, B, S$ ).
$T$ (Terminals): A finite set of symbols that form the strings of the language. These are typically represented by lowercase letters, numbers, or special symbols (e.g., $a, b, 0, 1$ ). Note that $V \cap T = \emptyset$ .
$P$ (Production Rules): A finite set of rules of the form $\alpha \rightarrow \beta$ , where $\alpha \in (V \cup T)^+$ is a string of variables and terminals containing at least one variable, and $\beta \in (V \cup T)^*$ is a string of variables and terminals (possibly empty).
$S$ (Start Symbol): A special variable ( $S \in V$ ) from which the generation of strings begins.
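
As a minimal sketch, the 4-tuple can be written down directly as a small Python data structure. The class name `Grammar`, its field names, and the sample grammar for $\{a^n b^n \mid n \geq 1\}$ are illustrative assumptions, not part of the formal definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    variables: frozenset   # V
    terminals: frozenset   # T
    productions: tuple     # P, as (lhs, rhs) pairs of symbol tuples
    start: str             # S

    def validate(self):
        """Check the conditions from the definition; raise ValueError if one fails."""
        if self.variables & self.terminals:
            raise ValueError("V and T must be disjoint")
        if self.start not in self.variables:
            raise ValueError("start symbol must be in V")
        symbols = self.variables | self.terminals
        for lhs, rhs in self.productions:
            if not all(s in symbols for s in lhs + rhs):
                raise ValueError(f"unknown symbol in {lhs} -> {rhs}")
            if not any(s in self.variables for s in lhs):
                raise ValueError(f"LHS {lhs} needs at least one variable")
        return True


# Illustrative grammar: S -> aSb | ab
G = Grammar(
    variables=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    productions=((("S",), ("a", "S", "b")), (("S",), ("a", "b"))),
    start="S",
)
print(G.validate())
```

Symbols are stored as tuple elements rather than characters so that multi-character names such as `q0` would also work.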

2. Derivations and the Language Generated by a Grammar

Derivations

A derivation is the process of generating a string from the start symbol $S$ by applying production rules.

Direct Derivation ( $\Rightarrow$ ): If $\alpha \rightarrow \beta$ is a production rule, and we have a string $\gamma\alpha\delta$ , we can replace $\alpha$ with $\beta$ to get $\gamma\beta\delta$ . We write this as:
$\gamma\alpha\delta \Rightarrow \gamma\beta\delta$
Derivation in zero or more steps ( $\Rightarrow^*$ ): If a string $w$ can be derived from $u$ in zero or more steps, we write $u \Rightarrow^* w$ .

Types of Derivations:

Leftmost Derivation (LMD): At every step, the leftmost variable (non-terminal) in the string is replaced.
Rightmost Derivation (RMD): At every step, the rightmost variable (non-terminal) in the string is replaced.
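
The difference between the two derivation orders can be sketched in a few lines of Python. The grammar $S \rightarrow AB$, $A \rightarrow aA \mid a$, $B \rightarrow bB \mid b$ and the helper `derive` are assumptions chosen for illustration; symbols are single characters and the keys of `PRODUCTIONS` are the variables.

```python
PRODUCTIONS = {"S": ["AB"], "A": ["aA", "a"], "B": ["bB", "b"]}


def derive(start, choices, productions, leftmost=True):
    """Apply the chosen right-hand sides in order, always rewriting the
    leftmost (or rightmost) variable. Returns the list of sentential forms."""
    form = start
    steps = [form]
    for rhs in choices:
        positions = [i for i, s in enumerate(form) if s in productions]
        if not positions:
            raise ValueError("no variable left to rewrite")
        i = positions[0] if leftmost else positions[-1]
        if rhs not in productions[form[i]]:
            raise ValueError(f"{form[i]} -> {rhs} is not a production")
        form = form[:i] + rhs + form[i + 1:]
        steps.append(form)
    return steps


lmd = derive("S", ["AB", "aA", "a", "b"], PRODUCTIONS, leftmost=True)
rmd = derive("S", ["AB", "b", "aA", "a"], PRODUCTIONS, leftmost=False)
print(" => ".join(lmd))  # S => AB => aAB => aaB => aab
print(" => ".join(rmd))  # S => AB => Ab => aAb => aab
```

Both derivations produce the same string `aab`; only the order in which the variables are expanded differs.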

Language Generated by a Grammar $L(G)$

The language generated by a grammar $G$ , denoted by $L(G)$ , is the set of all strings consisting only of terminals that can be derived from the start symbol $S$ .

$L(G) = \{ w \in T^* \mid S \Rightarrow^* w \}$

Note:

A string $w$ is in $L(G)$ if and only if it consists solely of terminals and can be derived from $S$ .
Strings containing variables are sentential forms, not sentences of the language.
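
To make the definition of $L(G)$ concrete, the following sketch enumerates every sentence up to a given length by searching over sentential forms. It assumes a context-free grammar with one-character symbols and no $\epsilon$-productions (so forms never shrink and can be pruned by length); the function name `language_up_to` and the sample grammar $S \rightarrow aSb \mid ab$ are hypothetical.

```python
from collections import deque


def language_up_to(productions, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start.

    productions: dict mapping each variable to a list of right-hand sides.
    Only the leftmost variable is expanded, which is enough for a CFG.
    """
    seen = {start}
    queue = deque([start])
    sentences = set()
    while queue:
        form = queue.popleft()
        idx = next((i for i, s in enumerate(form) if s in productions), None)
        if idx is None:
            sentences.add(form)  # only terminals left: a sentence of L(G)
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sentences


print(sorted(language_up_to({"S": ["aSb", "ab"]}, "S", 6), key=len))
# ['ab', 'aabb', 'aaabbb']
```

Forms such as `aSb` are visited during the search but never returned, because they still contain a variable.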

3. Chomsky Classification of Languages (Chomsky Hierarchy)

Noam Chomsky classified grammars into four types based on the restrictions applied to their production rules ( $\alpha \rightarrow \beta$ ).

Type	Grammar Name	Language Accepted	Automaton	Production Restrictions ( $\alpha \rightarrow \beta$ )
Type 0	Unrestricted Grammar	Recursively Enumerable	Turing Machine	No restrictions. $\alpha \in (V \cup T)^+$ containing at least one variable.
Type 1	Context-Sensitive Grammar (CSG)	Context-Sensitive	Linear Bounded Automaton	$|\alpha| \leq |\beta|$ (Length of LHS $\leq$ Length of RHS). Exception: $S \rightarrow \epsilon$ is allowed if $S$ doesn't appear on RHS.
Type 2	Context-Free Grammar (CFG)	Context-Free	Pushdown Automaton	$A \rightarrow \gamma$ , where $A \in V$ and $\gamma \in (V \cup T)^*$ . (LHS must be a single variable).
Type 3	Regular Grammar (RG)	Regular Language	Finite Automaton	$A \rightarrow aB$ or $A \rightarrow a$ (Right Linear) OR $A \rightarrow Ba$ or $A \rightarrow a$ (Left Linear).
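
The restrictions in the table can be turned into a rough classifier. This is a sketch under stated assumptions: productions are `(lhs, rhs)` string pairs, uppercase letters are variables, `""` stands for $\epsilon$, and every LHS already contains a variable. The Type 3 check uses the general linear forms $A \rightarrow xB$ / $A \rightarrow x$ with $x \in T^*$, of which the table's forms are special cases. The function name `chomsky_type` is hypothetical.

```python
def chomsky_type(productions, start="S"):
    """Return the highest-numbered type (3, 2, 1 or 0) whose production
    restrictions are met by every rule, checking from Type 3 downward."""
    is_var = str.isupper

    def right_linear(rhs):
        return all(not is_var(c) for c in rhs[:-1])  # variable only allowed last

    def left_linear(rhs):
        return all(not is_var(c) for c in rhs[1:])   # variable only allowed first

    if all(len(lhs) == 1 and is_var(lhs) for lhs, _ in productions):
        if (all(right_linear(r) for _, r in productions)
                or all(left_linear(r) for _, r in productions)):
            return 3
        return 2

    start_on_rhs = any(start in r for _, r in productions)

    def noncontracting(lhs, rhs):
        if rhs == "":
            return lhs == start and not start_on_rhs  # the S -> epsilon exception
        return len(lhs) <= len(rhs)

    if all(noncontracting(lhs, rhs) for lhs, rhs in productions):
        return 1
    return 0


print(chomsky_type([("S", "aS"), ("S", "b")]))                       # 3
print(chomsky_type([("S", "aSb"), ("S", "ab")]))                     # 2
print(chomsky_type([("S", "aSBc"), ("S", "abc"),
                    ("cB", "Bc"), ("bB", "bb")]))                    # 1
print(chomsky_type([("S", "aAb"), ("aA", "")]))                      # 0
```

The result describes the grammar, not the language: a language may still be regular even if the particular grammar given for it is only Type 2.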

Languages and their Relation

The languages form a strict hierarchy (subset relationship):
$\text{Regular} \subset \text{Context-Free} \subset \text{Context-Sensitive} \subset \text{Recursively Enumerable}$

Every Regular language is Context-Free.
Every Context-Free language is Context-Sensitive.
Every Context-Sensitive language is Recursively Enumerable.

4. Recursive and Recursively Enumerable Sets

These concepts relate to Type 0 languages and Turing Machines.

Recursively Enumerable (R.E.) Sets

A language $L$ is Recursively Enumerable if there exists a Turing Machine $M$ that recognizes $L$ .

If input $w \in L$ , $M$ halts and accepts.
If input $w \notin L$ , $M$ may halt and reject OR loop forever.
Key Concept: We can enumerate (list) the elements of the set, but for a string outside the set we may never get an answer, because $M$ can loop forever instead of rejecting.

Recursive Sets (Decidable)

A language $L$ is Recursive if there exists a Turing Machine $M$ that decides $L$ .

If input $w \in L$ , $M$ halts and accepts.
If input $w \notin L$ , $M$ halts and rejects.
Key Concept: The machine is guaranteed to halt (Total Turing Machine).

Relationship:
$\text{Recursive Sets} \subset \text{Recursively Enumerable Sets}$

All Recursive sets are R.E., but not all R.E. sets are Recursive. The standard witness is the Halting Problem: the set of (machine, input) pairs on which the machine halts is R.E. but not Recursive.
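
The recognizer/decider distinction can be illustrated with a breadth-first search over derivations of a grammar, used here as a stand-in for a Turing Machine. This is only a sketch: the name `semi_decide`, the `max_steps` cut-off, and the tiny grammar are assumptions. The search answers "yes" whenever the string is derivable, but for a string outside the language it may run forever, so with a step limit it can only report "no answer yet" (`None`).

```python
from collections import deque


def semi_decide(productions, start, w, max_steps=None):
    """Search derivations breadth-first; return True once w is derived.

    productions: list of (lhs, rhs) string pairs with non-empty lhs.
    Returns None if max_steps forms were examined without finding w, and
    False only if the whole (finite) set of reachable forms was exhausted.
    """
    seen = {start}
    queue = deque([start])
    steps = 0
    while queue:
        if max_steps is not None and steps >= max_steps:
            return None
        form = queue.popleft()
        steps += 1
        if form == w:
            return True
        for lhs, rhs in productions:
            i = form.find(lhs)
            while i != -1:  # rewrite every occurrence of lhs
                new = form[:i] + rhs + form[i + len(lhs):]
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False


P = [("S", "aS"), ("S", "b")]
print(semi_decide(P, "S", "aab"))                # True
print(semi_decide(P, "S", "ba", max_steps=50))   # None
```

For this particular grammar a smarter procedure could reject `ba` outright, which is why its language is recursive; the point of the sketch is only that plain enumeration by itself gives acceptance, not rejection.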

5. Languages and Automata

There is a direct one-to-one correspondence between classes of languages (generated by grammars) and classes of automata (machines that recognize them).

Regular Languages: Recognized by Finite Automata (DFA/NFA). Memory is limited to states; no external stack.
Context-Free Languages: Recognized by Pushdown Automata (PDA). Finite Automaton + one unbounded stack (Last-In-First-Out memory).
Context-Sensitive Languages: Recognized by Linear Bounded Automata (LBA). A nondeterministic Turing Machine whose usable tape is bounded by a linear function of the input length.
Recursively Enumerable Languages: Recognized by Turing Machines (TM). Unbounded tape, read and written sequentially by a head that moves one cell left or right per step (not random access).

6. Regular Sets and Regular Grammars

Regular Sets

A Regular Set is a language that can be described by a Regular Expression or accepted by a Finite Automaton. Examples include:

Strings ending in '00'.
Strings with an even number of 'a's.
Finite sets of strings.
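
The first two examples can be checked with a small DFA simulator. The state names and the transition tables below are one possible construction, given as an illustrative sketch: the first machine works over the alphabet $\{0, 1\}$, the second over $\{a, b\}$.

```python
def run_dfa(transitions, start, accepting, w):
    """Follow one transition per input symbol; accept if the final state is accepting."""
    state = start
    for ch in w:
        state = transitions[(state, ch)]
    return state in accepting


# Strings ending in "00": the state records how many trailing 0s were seen (0, 1, or 2+).
ENDS_00 = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q2", ("q1", "1"): "q0",
    ("q2", "0"): "q2", ("q2", "1"): "q0",
}

# Even number of 'a's: the state records the parity of the a-count.
EVEN_A = {
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}

print(run_dfa(ENDS_00, "q0", {"q2"}, "1100"))     # True
print(run_dfa(ENDS_00, "q0", {"q2"}, "1001"))     # False
print(run_dfa(EVEN_A, "even", {"even"}, "abab"))  # True
print(run_dfa(EVEN_A, "even", {"even"}, "ab"))    # False
```

The only memory the simulator carries between symbols is the current state, which is what makes these languages regular.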

Regular Grammars

A Regular Grammar is a Type 3 grammar. It must be either Left Linear or Right Linear.

Right Linear Regular Grammar (RLRG)

All productions are of the form:
$A \rightarrow xB \quad \text{or} \quad A \rightarrow x$
Where $A, B \in V$ and $x \in T^*$ .

The variable (if present) is always the rightmost symbol on the RHS.

Left Linear Regular Grammar (LLRG)

All productions are of the form:
$A \rightarrow Bx \quad \text{or} \quad A \rightarrow x$
Where $A, B \in V$ and $x \in T^*$ .

The variable (if present) is always the leftmost symbol on the RHS.

Important: A grammar cannot mix Left Linear and Right Linear rules and remain a Regular Grammar.
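
A quick way to see which case a grammar falls into is to test every right-hand side against both forms. The sketch below assumes single-variable left-hand sides, uppercase letters for variables, and a hypothetical helper name `linear_kind`. The last test grammar, $S \rightarrow aA$, $A \rightarrow Sb \mid b$, mixes one right-linear rule with a left-linear one and generates $\{a^n b^n \mid n \geq 1\}$, which is not a regular language.

```python
def linear_kind(productions):
    """Classify a grammar given as a dict: variable -> list of RHS strings."""
    rhss = [rhs for alternatives in productions.values() for rhs in alternatives]
    # Right linear: a variable may appear only as the last symbol of a RHS.
    right = all(not c.isupper() for rhs in rhss for c in rhs[:-1])
    # Left linear: a variable may appear only as the first symbol of a RHS.
    left = all(not c.isupper() for rhs in rhss for c in rhs[1:])
    if right and left:
        return "both"
    if right:
        return "right-linear"
    if left:
        return "left-linear"
    return "neither"


print(linear_kind({"S": ["aS", "b"]}))               # right-linear
print(linear_kind({"S": ["Sa", "b"]}))               # left-linear
print(linear_kind({"S": ["aA"], "A": ["Sb", "b"]}))  # neither
```

A result of `"both"` occurs when no rule places a variable next to a terminal, for example a grammar containing only rules of the form $A \rightarrow x$.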

7. Converting Regular Expressions to Regular Grammars

To convert a Regular Expression (RE) to a Regular Grammar, it is often easiest to construct a Finite Automaton (DFA/NFA) first, and then convert that FA to a grammar.

Method (via DFA):

Construct a DFA/NFA for the given Regular Expression.
Let states of the DFA be the variables ( $V$ ) of the grammar.
Let the start state $q_0$ be the start symbol $S$ .
For every transition $\delta(A, a) = B$ (from state $A$ to state $B$ on input $a$ ):
- Add production: $A \rightarrow aB$
If $B$ is a final state:
- Add production: $A \rightarrow a$ (in addition to $A \rightarrow aB$ )
- Alternatively, standard practice often adds $B \rightarrow \epsilon$ for every final state $B$ instead. This variant also covers a final start state, which accepts the empty string.

Example: RE = $a^*b$

DFA States: $q_0$ (loops on $a$ , goes to $q_1$ on $b$ ), $q_1$ (final state).
Transitions:
- $\delta(q_0, a) = q_0 \implies q_0 \rightarrow aq_0$
- $\delta(q_0, b) = q_1 \implies q_0 \rightarrow bq_1$
Final State Handling ( $q_1$ is final):
- $q_1 \rightarrow \epsilon$
Resulting Grammar (Right Linear), renaming $q_0$ to $S$ and $q_1$ to $A$ :
- $S \rightarrow aS \mid bA$
- $A \rightarrow \epsilon$
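
The method can be sketched in Python using the $B \rightarrow \epsilon$ variant for final states. The function names `dfa_to_grammar` and `format_grammar`, and the dictionary encoding of the transition function, are assumptions made for this illustration.

```python
def dfa_to_grammar(transitions, accepting):
    """Build right-linear rules from a transition table.

    transitions: dict mapping (state, symbol) -> next state.
    Returns a dict: state -> list of alternatives, where each alternative is
    a (symbol, next_state) pair, or () for an epsilon rule.
    """
    grammar = {}
    for (src, sym), dst in transitions.items():
        grammar.setdefault(src, []).append((sym, dst))  # src -> sym dst
    for q in accepting:
        grammar.setdefault(q, []).append(())            # q -> epsilon
    return grammar


def format_grammar(grammar, names):
    """Render the rules as text, renaming states to grammar variables."""
    lines = []
    for state, alternatives in grammar.items():
        parts = [alt[0] + names[alt[1]] if alt else "ε" for alt in alternatives]
        lines.append(f"{names[state]} -> {' | '.join(parts)}")
    return lines


# DFA for a*b from the example: q0 loops on a, moves to the final state q1 on b.
rules = dfa_to_grammar({("q0", "a"): "q0", ("q0", "b"): "q1"}, {"q1"})
for line in format_grammar(rules, {"q0": "S", "q1": "A"}):
    print(line)
# S -> aS | bA
# A -> ε
```

The `names` mapping is where the start state is tied to the start symbol $S$; any other renaming of states would give an equivalent grammar.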

8. Converting Regular Grammars to Regular Expressions

To convert a Regular Grammar (specifically Right Linear) to a Regular Expression, we can solve the system of linear language equations, often using Arden's Theorem.

Arden's Theorem

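If $P$ and $Q$ are two regular expressions over $\Sigma$ , and $P$ does not contain $\epsilon$ , then the equation:
$R = Q + RP$
has the unique solution:
$R = QP^*$
Symmetrically, the equation $R = Q + PR$ has the unique solution $R = P^*Q$ ; this is the form that arises from right linear grammars.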

Conversion Steps:

Write the equations for the grammar. For every variable $A$ , write an equation where $A$ equals the sum (union) of its derivation options.
- If $A \rightarrow aB$ , term is $aB$ .
- If $A \rightarrow a$ , term is $a$ .
Express the equations in the form $A = \alpha B + \beta C + \dots + \text{terminals}$ .
Substitute variables into one another to eliminate them until you have an equation for the Start Symbol $S$ strictly in terms of terminals.
Apply Arden's Theorem ( $X = \alpha X + \beta \Rightarrow X = \alpha^*\beta$ ) or ( $X = X\alpha + \beta \Rightarrow X = \beta\alpha^*$ ) to resolve self-loops (recursion).
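
Both forms of Arden's Theorem can be sanity-checked by brute force on languages truncated to a maximum string length. The sketch below represents a language as a Python set of strings; the helpers `concat` and `star` and the sample choices $\alpha = a$, $\beta = b$ (and $\alpha = 0$, $\beta = 1$) are assumptions for illustration. Agreement up to a length bound is evidence for the identity, not a proof of it.

```python
def concat(X, Y, n):
    """Concatenation of two languages, keeping only strings of length <= n."""
    return {x + y for x in X for y in Y if len(x + y) <= n}


def star(X, n):
    """Kleene star of X, truncated to strings of length <= n."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, X, n) - result
        result |= frontier
    return result


n = 6

# Form X = alpha X + beta, claimed solution X = alpha* beta.
alpha, beta = {"a"}, {"b"}
X = concat(star(alpha, n), beta, n)
print(X == concat(alpha, X, n) | beta)  # True

# Form Y = Y alpha + beta, claimed solution Y = beta alpha*.
alpha2, beta2 = {"0"}, {"1"}
Y = concat(beta2, star(alpha2, n), n)
print(Y == concat(Y, alpha2, n) | beta2)  # True
```

Truncation is safe here because concatenation only makes strings longer, so cutting off at length $n$ before or after concatenating gives the same set.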

Example:
Grammar:

(1) $S \rightarrow 0S \mid 1A$
(2) $A \rightarrow 0A \mid 1$

Solution:
From (2): $A = 0A + 1$ .
This has the form $X = \alpha X + \beta$ with $\alpha = 0$ and $\beta = 1$ , so Arden's theorem gives:
$A = 0^*1$

Substitute $A$ into (1):
$S = 0S + 1(0^*1)$
$S = 0S + 10^*1$

This again has the form $X = \alpha X + \beta$ , now with $\alpha = 0$ and $\beta = 10^*1$ , so:
$S = 0^*(10^*1) = 0^*10^*1$

Check: every derivation produces some $0$s, a $1$ , some more $0$s, and a final $1$ (for example $S \Rightarrow 0S \Rightarrow 01A \Rightarrow 010A \Rightarrow 0101$ ).

Regular Expression: $0^*10^*1$
