Unit 4 - Notes

CSE322

Unit 4: CONTEXT-FREE LANGUAGES

1. Introduction to Context-Free Grammars (CFG)

A Context-Free Grammar (CFG) is a formal system that describes a language by recursive rewriting rules. It is a 4-tuple $G = (V, T, P, S)$ , where:

V (Variables): Finite set of non-terminal symbols (e.g., $A, B, S$ ).
T (Terminals): Finite set of terminal symbols (e.g., $a, b, 0, 1$ ).
P (Productions): Finite set of rules of the form $A \to \alpha$ , where $A \in V$ and $\alpha \in (V \cup T)^*$ .
S (Start Symbol): A distinguished element of $V$ .

Language of a CFG

The language generated by a grammar $G$ , denoted as $L(G)$ , is the set of all strings composed of terminal symbols that can be derived from the start symbol $S$ .
$L(G) = \{ w \in T^* \mid S \Rightarrow^* w \}$
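The definition of $L(G)$ can be made concrete with a small sketch. The dictionary encoding, the function name `generate`, and the example grammar $S \to aSb \mid \epsilon$ are illustrative assumptions, not part of the notes. The search prunes forms with too many terminals, which is enough to terminate for this grammar but not for every grammar.

```python
from collections import deque

# Assumed encoding: each variable maps to a list of bodies, each body is a
# tuple of symbols, and the empty tuple () stands for epsilon.
GRAMMAR = {
    "S": [("a", "S", "b"), ()],   # S -> aSb | epsilon
}
START = "S"

def generate(grammar, start, max_len):
    """Return all terminal strings of length <= max_len derivable from start."""
    variables = set(grammar)
    seen = {(start,)}
    queue = deque([(start,)])
    words = set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost variable of the sentential form, if any.
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:
            words.add("".join(form))      # only terminals left: a sentence
            continue
        for body in grammar[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            # Terminals never disappear, so too many terminals means too long.
            if sum(1 for s in new if s not in variables) > max_len:
                continue
            if new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(words, key=lambda w: (len(w), w))

print(generate(GRAMMAR, START, 4))
```

For this grammar the strings of length at most 4 are the empty string, `ab`, and `aabb`.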

2. Derivations and Sentential Forms

Derivations Generated by a Grammar

A derivation is a sequence of production rule applications that transforms the start symbol into a string of terminals.

Step ( $\Rightarrow$ ): Replacing a variable with the right-hand side of a production rule.
Multi-step ( $\Rightarrow^*$ ): Zero or more derivation steps.

Sentential Forms

If $S \Rightarrow^* \alpha$ , then $\alpha$ is called a Sentential Form.

$\alpha$ may contain a mix of terminals and non-terminals.
If $\alpha$ contains only terminals, it is called a Sentence of the language.

Leftmost and Rightmost Derivations

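A string in a Context-Free Language (CFL) can usually be derived in several ways that differ only in the order in which variables are replaced. Leftmost and rightmost derivations fix that order.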

1. Leftmost Derivation (LMD)

In every step of the derivation, the leftmost non-terminal in the sentential form is replaced.

Example: $S \to AB, A \to a, B \to b$
Derivation of "ab": $S \Rightarrow \mathbf{A}B \Rightarrow \mathbf{a}B \Rightarrow ab$

2. Rightmost Derivation (RMD)

In every step of the derivation, the rightmost non-terminal in the sentential form is replaced.

Example: $S \to AB, A \to a, B \to b$
Derivation of "ab": $S \Rightarrow A\mathbf{B} \Rightarrow A\mathbf{b} \Rightarrow ab$
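The two derivation orders can be traced mechanically. The following is a minimal sketch for the example grammar above; the function name `derive` and the dictionary encoding are assumptions, and the function always picks the first listed body, so it only terminates for grammars where that choice leads to terminals.

```python
# Example grammar from the notes: S -> AB, A -> a, B -> b
GRAMMAR = {"S": [("A", "B")], "A": [("a",)], "B": [("b",)]}

def derive(grammar, start, leftmost=True):
    """Return the sentential forms of a derivation that always expands the
    leftmost (or rightmost) variable, using the first body of that variable."""
    form = (start,)
    steps = ["".join(form)]
    while True:
        # Positions of all variables in the current sentential form.
        positions = [i for i, s in enumerate(form) if s in grammar]
        if not positions:
            return steps                  # only terminals remain
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + grammar[form[i]][0] + form[i + 1:]
        steps.append("".join(form))

print(" => ".join(derive(GRAMMAR, "S", leftmost=True)))   # S => AB => aB => ab
print(" => ".join(derive(GRAMMAR, "S", leftmost=False)))  # S => AB => Ab => ab
```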

3. Ambiguity in CFG

Definition

A Context-Free Grammar is said to be ambiguous if there exists at least one string $w \in L(G)$ that has any of the following (the three conditions are equivalent):

More than one Leftmost Derivation, OR
More than one Rightmost Derivation, OR
More than one Parse Tree (Derivation Tree).

Example of Ambiguity

Consider the grammar for simple arithmetic expressions:
$E \to E + E \mid E * E \mid id$
String: $id + id * id$

Parse Tree 1 ( $+$ at the root, so $*$ is applied first: $id + (id * id)$ ):

    E
   /|\
  E + E
  |  /|\
 id E * E
    |   |
   id   id

Parse Tree 2 ( $*$ at the root, so $+$ is applied first: $(id + id) * id$ ):

      E
     /|\
    E * E
   /|\  |
  E + E id
  |   |
 id   id

Since two distinct parse trees exist for the same string, the grammar is ambiguous.
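The number of parse trees can also be counted by program. The sketch below is specific to the grammar $E \to E + E \mid E * E \mid id$; the function name `count_parse_trees` and the token-list input format are assumptions made for illustration.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E can derive tokens[i:j].
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try every operator inside the span as the root of the tree.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))  # 2
```

A count greater than 1 for any string is enough to show that the grammar is ambiguous.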

Inherently Ambiguous Languages

A Context-Free Language is inherently ambiguous if every grammar that generates it is ambiguous, i.e., it is impossible to construct an unambiguous grammar for that language. A standard example is $\{ a^i b^j c^k \mid i = j \text{ or } j = k \}$ .

4. Simplification of Context-Free Grammars

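Simplification transforms a grammar into an equivalent one (generating the same language) with a simpler set of productions. It consists of three transformations:

Construction of Reduced Grammars (Removal of Useless Symbols).
Elimination of Null Productions ( $\epsilon$ -productions).
Elimination of Unit Productions.

When all three are applied together, the safe order is null productions first, then unit productions, then useless symbols, because removing null or unit productions can create new useless symbols.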

A. Construction of Reduced Grammars

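A symbol is useless if it is either:

Non-generating: It cannot derive a terminal string.
- Algorithm: Mark every variable that has a production whose body consists only of terminals and already-marked variables; repeat until nothing changes. Remove the unmarked variables and every production that mentions them.
Non-reachable: It cannot be reached from the start symbol $S$ .
- Algorithm: Draw a dependency graph starting from $S$ . Remove any symbol not visited.

Remove non-generating symbols first, then non-reachable ones; the reverse order can leave useless symbols behind.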
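The two passes can be sketched as follows. The function name `remove_useless`, the dictionary encoding, and the example grammar are assumptions for illustration; any symbol that is not a key of the dictionary is treated as a terminal.

```python
def remove_useless(grammar, start):
    """grammar: dict variable -> list of bodies (tuples of symbols)."""
    def usable(symbol, generating):
        # Terminals are always fine; variables must be generating.
        return symbol not in grammar or symbol in generating

    # Pass 1: find generating variables by fixed-point iteration.
    generating = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in grammar.items():
            if var in generating:
                continue
            if any(all(usable(s, generating) for s in b) for b in bodies):
                generating.add(var)
                changed = True

    # Drop non-generating variables and every production mentioning them.
    pruned = {
        v: [b for b in bodies if all(usable(s, generating) for s in b)]
        for v, bodies in grammar.items() if v in generating
    }

    # Pass 2: keep only variables reachable from the start symbol.
    reachable, stack = {start}, [start]
    while stack:
        for body in pruned.get(stack.pop(), []):
            for s in body:
                if s in pruned and s not in reachable:
                    reachable.add(s)
                    stack.append(s)
    return {v: bodies for v, bodies in pruned.items() if v in reachable}

# S -> AB | a, A -> a, B -> Bb  (B never terminates, so A becomes unreachable)
EXAMPLE = {"S": [("A", "B"), ("a",)], "A": [("a",)], "B": [("B", "b")]}
print(remove_useless(EXAMPLE, "S"))  # {'S': [('a',)]}
```

In this example $A$ is generating and initially reachable, but it becomes unreachable once $S \to AB$ is dropped, which is why the reachability pass runs second.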

B. Elimination of Null Productions

A Null Production is a rule of the form $A \to \epsilon$ .
A variable $A$ is nullable if $A \Rightarrow^* \epsilon$ .

Algorithm:

Identify all nullable variables.
For every production $X \to \alpha$ , add new productions for every way of deleting some of the nullable variables from $\alpha$ , but never add $X \to \epsilon$ .
Remove the original $A \to \epsilon$ rules. If $\epsilon \in L(G)$ , keep a single rule $S \to \epsilon$ and make sure $S$ does not appear on any RHS.

Example:

$S \to AB$
$A \to aA \mid \epsilon$
$B \to b$
Nullable: $A$ .
New Grammar:
- $S \to AB \mid B$ (Since A is nullable)
- $A \to aA \mid a$ (Since inner A is nullable)
- $B \to b$
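The example above can be reproduced with a short sketch. The name `eliminate_epsilon` and the dictionary encoding (the empty tuple stands for $\epsilon$) are assumptions; the sketch does not re-add $S \to \epsilon$, so it yields a grammar for $L(G) - \{\epsilon\}$.

```python
from itertools import product

def eliminate_epsilon(grammar):
    """Return (new_grammar, nullable) with all epsilon-productions removed."""
    # Step 1: compute nullable variables by fixed-point iteration.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for var, bodies in grammar.items():
            if var not in nullable and any(
                all(s in nullable for s in b) for b in bodies
            ):
                nullable.add(var)
                changed = True

    # Step 2: for each body, keep or drop every nullable occurrence.
    new_grammar = {}
    for var, bodies in grammar.items():
        new_bodies = []
        for body in bodies:
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                candidate = tuple(s for part in choice for s in part)
                # Skip the empty body: that would be an epsilon-production.
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        new_grammar[var] = new_bodies
    return new_grammar, nullable

# S -> AB, A -> aA | epsilon, B -> b
EXAMPLE = {"S": [("A", "B")], "A": [("a", "A"), ()], "B": [("b",)]}
new_grammar, nullable = eliminate_epsilon(EXAMPLE)
print(nullable)      # {'A'}
print(new_grammar)
```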

C. Elimination of Unit Productions

A Unit Production is a rule of the form $A \to B$ (where both are single variables).

Algorithm:

Find all "unit pairs" $(A, B)$ such that $A \Rightarrow^* B$ using only unit productions.
For every pair $(A, B)$ , if $B \to \alpha$ is a non-unit production, add $A \to \alpha$ to the grammar.
Remove the original unit productions.
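A minimal sketch of the unit-pair algorithm follows. The function name `eliminate_unit`, the dictionary encoding, and the example grammar $S \to A$, $A \to B \mid a$, $B \to b$ are assumptions for illustration.

```python
def eliminate_unit(grammar):
    """Replace unit productions A -> B by the non-unit bodies reachable via units."""
    def is_unit(body):
        return len(body) == 1 and body[0] in grammar

    new_grammar = {}
    for a in grammar:
        # Collect every B with (a, B) a unit pair; (a, a) is always one.
        reach, stack = {a}, [a]
        while stack:
            for body in grammar[stack.pop()]:
                if is_unit(body) and body[0] not in reach:
                    reach.add(body[0])
                    stack.append(body[0])
        # Copy the non-unit bodies of every B in the unit closure of a.
        bodies = []
        for b in grammar:                 # grammar order keeps output stable
            if b in reach:
                for body in grammar[b]:
                    if not is_unit(body) and body not in bodies:
                        bodies.append(body)
        new_grammar[a] = bodies
    return new_grammar

EXAMPLE = {"S": [("A",)], "A": [("B",), ("a",)], "B": [("b",)]}
print(eliminate_unit(EXAMPLE))
```

Here $S$ and $A$ both end up with the bodies $a$ and $b$, while $B$ keeps only $b$.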

5. Normal Forms for CFG

Normal forms restrict the structure of production rules to facilitate parsing algorithms and proofs.

A. Chomsky Normal Form (CNF)

A CFG is in CNF if every production is in one of the following two forms:

$A \to BC$ (Non-terminal $\to$ Two Non-terminals)
$A \to a$ (Non-terminal $\to$ Single Terminal)

(Note: $S \to \epsilon$ is allowed if $S$ is not on the RHS of any rule).

Conversion to CNF:

Simplify the grammar (remove $\epsilon$ -productions, unit productions, and useless symbols).
Isolate terminals: Replace terminals in bodies of length 2 or more. E.g., if $A \to aB$ , create a new variable $X_a \to a$ and change the rule to $A \to X_aB$ .
Break long bodies: If a body has three or more variables (e.g., $A \to BCD$ ), introduce fresh variables in a cascade: $A \to BZ_1$ , $Z_1 \to CD$ .
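The last two conversion steps can be sketched in code. This assumes the grammar has already been simplified, and that the generated names `X_<terminal>` and `Z<number>` do not clash with existing variables; the function name `to_cnf` and the encoding are likewise illustrative assumptions.

```python
def to_cnf(grammar):
    """Convert a simplified grammar (no epsilon, unit, or useless productions)
    to CNF by isolating terminals and breaking long bodies."""
    result = {}
    term_var = {}          # terminal -> assumed fresh variable name
    counter = 0

    def var_for(terminal):
        if terminal not in term_var:
            term_var[terminal] = f"X_{terminal}"
        return term_var[terminal]

    for head, bodies in grammar.items():
        result.setdefault(head, [])
        for body in bodies:
            if len(body) == 1:
                result[head].append(body)        # A -> a is already in CNF
                continue
            # Isolate terminals: replace each terminal by its variable.
            symbols = [s if s in grammar else var_for(s) for s in body]
            # Break long bodies: A -> B Z1, Z1 -> C Z2, ...
            current = head
            while len(symbols) > 2:
                counter += 1
                fresh = f"Z{counter}"
                result.setdefault(current, []).append((symbols[0], fresh))
                current = fresh
                symbols = symbols[1:]
            result.setdefault(current, []).append(tuple(symbols))

    for terminal, var in term_var.items():
        result[var] = [(terminal,)]
    return result

# S -> aSb | ab
print(to_cnf({"S": [("a", "S", "b"), ("a", "b")]}))
```

For $S \to aSb \mid ab$ this yields $S \to X_aZ_1 \mid X_aX_b$ , $Z_1 \to SX_b$ , $X_a \to a$ , $X_b \to b$ .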

B. Greibach Normal Form (GNF)

A CFG is in GNF if every production is of the form:
$A \to a\alpha$
Where:

$a$ is a single terminal.
$\alpha \in V^*$ is a string of zero or more variables.

Characteristics:

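Every derivation step produces exactly one terminal symbol, so a string of length $n$ is derived in exactly $n$ steps.
Convenient for constructing an equivalent Pushdown Automaton: each production $A \to a\alpha$ reads $a$ , pops $A$ , and pushes $\alpha$ .
Requires the elimination of Left Recursion ( $A \to A\alpha$ ) before conversion.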
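Immediate left recursion $A \to A\alpha \mid \beta$ can be removed without introducing $\epsilon$ -productions by rewriting it as $A \to \beta \mid \beta A'$ , $A' \to \alpha \mid \alpha A'$ . The sketch below applies this rewrite to one variable; the function name, the primed variable name, and the example grammar are assumptions, and it assumes every $\alpha$ is non-empty.

```python
def remove_immediate_left_recursion(head, bodies):
    """Rewrite A -> A alpha | beta as A -> beta | beta A', A' -> alpha | alpha A'."""
    # alphas: what follows the leading A in each left-recursive body.
    alphas = [b[1:] for b in bodies if b and b[0] == head]
    # betas: bodies that do not start with A.
    betas = [b for b in bodies if not b or b[0] != head]
    if not alphas:
        return {head: list(bodies)}       # nothing to do
    tail = head + "'"                     # assumed unused variable name
    return {
        head: betas + [b + (tail,) for b in betas],
        tail: alphas + [a + (tail,) for a in alphas],
    }

# E -> E + T | T
print(remove_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```

The result is $E \to T \mid TE'$ and $E' \to +T \mid +TE'$ , which generates the same strings without a left-recursive rule.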

6. Pumping Lemma for Context-Free Languages

The Pumping Lemma is a tool used to prove that a specific language is NOT Context-Free.

Statement

If $L$ is a Context-Free Language, there exists a pumping length $n$ such that any string $z \in L$ with $|z| \ge n$ can be decomposed into five parts $z = uvwxy$ satisfying:

$|vx| \ge 1$ (v and x are not both empty).
$|vwx| \le n$ (The "middle" part is not too long).
For all $i \ge 0$ , the string $uv^iwx^iy \in L$ .

Application Example

To prove $L = \{ a^k b^k c^k \mid k \ge 1 \}$ is not a CFL:

Assume $L$ is a CFL and let $n$ be its pumping length.
Choose $z = a^n b^n c^n$ , so $|z| = 3n \ge n$ .
Since $|vwx| \le n$ , the substring $vwx$ cannot contain both $a$ 's and $c$ 's (they are separated by $n$ $b$ 's), so $v$ and $x$ together contain at most two of the three symbols.
Since $|vx| \ge 1$ , pumping with $i = 2$ increases the count of at least one symbol while the count of at least one other symbol stays at $n$ .
The resulting string $uv^2wx^2y$ therefore does not have equal numbers of $a$ , $b$ , and $c$ , so it is not in $L$ .
Contradiction $\implies L$ is not a CFL.
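The case analysis can be checked by brute force for a small value of $n$. The sketch below tries every decomposition allowed by the lemma; the function names and the choice $n = 3$ are assumptions, and a finite check like this illustrates the argument but does not replace the proof.

```python
def in_anbncn(s):
    """Membership in { a^k b^k c^k | k >= 1 }."""
    k = len(s) // 3
    return k >= 1 and s == "a" * k + "b" * k + "c" * k

def in_anbn(s):
    """Membership in { a^k b^k | k >= 1 }, a context-free language."""
    k = len(s) // 2
    return k >= 1 and s == "a" * k + "b" * k

def survives_pumping(z, n, member, max_i=3):
    """True if some split z = uvwxy with |vx| >= 1 and |vwx| <= n keeps
    u v^i w x^i y in the language for every i in 0..max_i."""
    for start in range(len(z) + 1):                       # vwx = z[start:end]
        for end in range(start + 1, min(start + n, len(z)) + 1):
            for v_end in range(start, end + 1):
                for x_start in range(v_end, end + 1):
                    u, v, w = z[:start], z[start:v_end], z[v_end:x_start]
                    x, y = z[x_start:end], z[end:]
                    if not v and not x:
                        continue                          # violates |vx| >= 1
                    if all(member(u + v * i + w + x * i + y)
                           for i in range(max_i + 1)):
                        return True
    return False

print(survives_pumping("aaabbbccc", 3, in_anbncn))  # False
print(survives_pumping("aabb", 2, in_anbn))         # True
```

No decomposition of $a^3b^3c^3$ survives pumping, while $a^2b^2$ does (take $v = a$ , $x = b$ ), which matches the fact that $\{a^kb^k\}$ is context-free.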

7. Applications of Context-Free Grammars

Compiler Construction:
- CFGs are used to define the syntax of programming languages.
- Parser generators (like YACC or Bison) take a CFG as input and produce a parser that checks whether code is syntactically correct and builds Parse Trees.
Document Type Definition (DTD):
- Used in XML and SGML to describe the structure of documents (nesting of tags), which is inherently a context-free structure.
Arithmetic Expressions:
- Defining precedence and associativity rules in calculators and interpreters.
Natural Language Processing (NLP):
- Used in early linguistic models to describe sentence structures (Subject-Verb-Object).
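As an illustration of the arithmetic-expression case, the following sketch parses with the standard unambiguous grammar $E \to E + T \mid T$ , $T \to T * F \mid F$ , $F \to id$ , which is not given in these notes and is used here as an assumption. The left-recursive rules are implemented as loops, and the function name `parse` and the nested-tuple tree format are illustrative.

```python
def parse(tokens):
    """Parse a token list and return a nested-tuple tree (op, left, right)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                      # F -> id
        nonlocal pos
        if peek() != "id":
            raise SyntaxError(f"expected id at position {pos}")
        pos += 1
        return "id"

    def term():                        # T -> T * F | F, as a loop
        nonlocal pos
        node = factor()
        while peek() == "*":
            pos += 1
            node = ("*", node, factor())
        return node

    def expr():                        # E -> E + T | T, as a loop
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1
            node = ("+", node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return tree

print(parse(["id", "+", "id", "*", "id"]))  # ('+', 'id', ('*', 'id', 'id'))
```

Because $*$ is introduced one level below $+$ in the grammar, every input has exactly one tree, with $*$ binding tighter and both operators associating to the left.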
