Context Free, Chomsky Greibach Normal Form

Chomsky Normal Form

A context free grammar is in chomsky normal form if each production yields a terminal or two nonterminals. Any context free grammar can be converted to chomsky normal form.

First remove all E productions. If x → E, then a production like y → AxBxC spins off the productions y → ABC | AxBC | ABxC. Perform similar substitutions for all nonterminals that lead to the empty string, then remove productions that yield E. If E is in the language, i.e. s → E, then this production must remain. This is the only nonstandard production in chomsky normal form; all other productions must yield a terminal or two nonterminals.

Next, remove any productions x → x, as they are pointless. Given x → y, let x derive everything that y derives, then remove the unit production x → y.

At this point the right side of each production has two or more symbols. Introduce new symbols to play out the right side. For instance, x → AyzBC might be replaced with the following.

x → q1q2
q1 → A
q2 → yq3
q3 → zq4
q4 → q5q6
q5 → B
q6 → C

Do this across the board and the resulting grammar is in chomsky normal form.

Prove that a word of length n is derived in 2n-1 steps. Hint, each terminal implies one step.

Rechaining

A process called rechaining breaks left self-references. For instance, let the grammar contain the rules x → x3 | x8. Assume the grammar is grounded, hence x leads to other things as well. Manufacture nonterminals q and r for the occasion and replace these two productions with the following.

x → q3 | q8 | q2r | q7r
r → q3r | q8r
q → (everything x produces, except x3 and x8)
r → (everything x produces, except x3 and x8)

Start with a word and its derivation under the original grammar. Whenever x is replaced with a string of x's, look ahead to see if there are other such substitutions, and perform them now. The derivation will no longer be canonical, but that's all right. Expand x into x3 or x8 wherever we can, and if the new instances of x are so replaced, do that now, and so on. finally there is no instance of x in the intermediate string that is replaced with x3 or x8 under the given derivation. A block of x has increased in length, thanks to these productions. Use the new productions, based on q and r, to create a block of the same length, consisting of the nonterminals x, q, and r. Since we can replace q and r as we would x, replace each of these as dictated by the original derivation. Then proceed along the original path, until you run into another x → x3 | x8, whence you can expand the blocks of x as before, and repeat the process. The end result is a new derivation for the given word, using the new grammar. In other words, the language has not changed.

Perform rechaining across the entire grammar, so that a symbol never leads to a string of itself. Notice that the manufactured symbols never reference themselves on the left.

Verify that the new grammar is grounded and connected.

Once this is accomplished, a similar form of rechaining eliminates all instances of self-reference on the left. Suppose x produces xa, xb, c, and d, where a and b are not powers of x's, and c and d do not start with x. Note that something besides xa and xb must exist, else the nonterminal x would not be grounded. Let q be a new nonterminal, manufactured for the occasion. Replace the aforementioned productions of x with the following.

x → c | d | cq | dq
q → a | aq | b | bq

Once again, q does not reference itself on the left. Since there are no E productions, we can expand any of the symbols in a and b, as we like, and q never references itself on the left.

However, x might still reference itself, if a or b starts with x. If b starts with x5, the left reference is now x5 instead of x6. Rechain again and the left reference becomes x4. Repeat until all left references go away. Do this for all nonterminals and there are no left self-references in sight.

Substitution

The following two productions imply the third; this is a form of substitution.

x → uvwC
u → 123

x → 123vwC

In the first production, x leads to u, but after substitution, x leads to 1.

Substitutions can occur anywhere, but in what follows, we will only replace the initial symbol of the right hand side.

Greibach Intermediate Form

A grammar is in greibach intermediate form if each rule leads to a terminal, a terminal followed by some nonterminals, or a string of nonterminals.

Notice that chomsky normal form satisfies greibach intermediate form. Also, any of the aformentioned rechaining operations leaves a greibach intermediate grammar in greibach intermediate form. Finally, when one production is substituted into another, replacing the lead symbol of the right hand side, the new production, which is implied by the first two, is in greibach intermediate form. We can substitute and rechain to our heart's content.

Greibach Normal Form

A grammar is in greibach normal form if each production yields a terminal followed by a (possibly empty) sequence of nonterminals. Again, s → E is a special case.

Assume a context free grammar is grounded and connected, and convert it to chomsky normal form, which is also greibach intermediate.

Order the nonterminals in any way you like, but let s be the first nonterminal, also known as x1. The terminals are in a group by themselves, beyond the nonterminals.

Start by rechaining x1, so there are no self-references on the left. Place any manufactured symbols at the end of the ordered list of nonterminals.

Suppose a production carries xj to x1, for j > 1. In other words, x1 appears first in the right side of the production. Substitute for x1, using all the rules of x1. Now xj produces strings that begin with x2 or higher, and we can drop the original rule that carried xj to x1. Perform similar replacements for all the rules of xj, for all j beyond 1.

If x2 references itself on the left, perform rechaining as described above. Now x2 leads to x3 or higher.

If j exceeds 2, and a rule carries xj to x2, substitute for x2, so that xj leads to x3 or higher.

Do the same for x3, x4, x5, and so on, until every production carries a symbol to a terminal or a higher nonterminal.

What about the rules that start with a manufactured symbol such as q? Suppose, at the time of its creation, q yields x1. Throughout the course of our substitutions, x1 might become x3, which could become x7, which could become x129, the last nonterminal in the original grammar. However, the grammar is grounded, and terminals come after nonterminals, so there must be a substitution that turns x129 into a terminal. This carries q to something higher, and as mentioned earlier, q never references itself. The process runs through all the nonterminals, including the manufactured symbols, and it terminates.

Suppose a rule replaces x3 with a string of nonterminals that begins with x7. Replace x7 with all its right hand sides, and x3 leads to strings that begin with terminals, or nonterminals beyond x7. Do this again and again, until each rule leads to a terminal. The language is the same, and the grammar is in greibach normal form.

Since each step introduces one new terminal, a word of length n is derived in n steps.