A grammar is a a set of rules for recursively describing the structure of string. They exist as a method to represent languages.

Where Regular languages may be expressed via a Regular expressions, there are also some languages that may not be expressed in that form, such as \(a^nb^n\). Context free grammar (CFG) enables the expression of these.

Context-free languages are a superset of regular languages.

A context free grammar is defined by the 4-tuple \[ \langle V, \Sigma, R, S \rangle \]where \[ \begin{align} V &: \text{variables - finite set of symbols} \\ \Sigma &: \text{terminals - finite set of letters; disjoint from } V \\ R &: \text{rules - finite set of mappings} \\ S &: \text{start variable - member of } V \end{align} \]

For example, given the following grammar \[ \begin{align} S &\rightarrow bSa \\ S &\rightarrow ba \end{align} \] This contains \(V=\{S\}\), \(\Sigma=\{a,b\}\), two rules as shown and a start variable \(S\).

Symbol | Common Use |
---|---|

\(a\), \(b\), … (low order lowercase letters) | Terminals |

\(A\), \(B\), … (low order uppercase letters) | Variables |

\(w\), \(z\), … (high order lowercase letters) | Strings of terminals |

\(X\), \(Y\), … (high order uppercase letters) | Either terminals or variables |

\(\alpha\), \(\beta\), … (lower case Greek letters) | Strings of terminals and/or variables |

- Begin with the start symbol and seed the output with it’s rule.
- Find a variable and replace it based on it’s associated rule.
- Repeat step 2 until no variables are left.

A *derivation* is a sequence of substitutions that leads to a generated string.

If there is a derivation from \(u\) to \(v\) this is stated as \(u\) derives \(v\) within some grammar \(G\), symbolically this may appear as \[ u \underset{G}{\overset{*}{\Rightarrow }} v \]

If generation occurs by always replacing the leftmost variable in a string, the resulting derivation is termed a *leftmost derivation*. This is mirrored for a rightmost replacement. Every string in in a CFG has at least on leftmost and at least one rightmost derivation.

Given a grammar \(G = \langle V, \Sigma, R, S \rangle\) then we may express the language is defines as \[ \mathcal L (G) = \{ w \in \Sigma^* | S \underset{G}{\overset{*}{\Rightarrow }} w \} \] That is, the language \(\mathcal L(G)\) is the set of all strings that may be formed by the terminals where each may be derived from the grammar’s start variable.

Any step in a derivation will yield a combination of terminals and/or variables. Each of these are termed a *sentential form*.

A *parse tree* is a tree that shows the structure of a derivation. Internal notes are labeled by variables, and leaves hold terminals or \(\epsilon\). Each internal node represents a production rule where the head of the rule is the label of the node and the labels of it’s children (ordered left to right) for the body of the rule.

If a terminal string may be reached by more than one parse tree, this represents an *ambiguous grammar*. This will also appear as there being more than one leftmost or rightmost derivation. Some languages may support the restructuring of the grammar to remove ambiguity, whereas others are inherently ambiguous.