Thursday, October 7, 2010

Context-free grammar (Part 1 of 2)

What are context-free grammars? An abstract definition might say that they're a set of rules allowing for the construction of complex structures consisting of logical parts that can be arbitrarily nested, but I've found that nobody really likes my abstract definitions, so I'll try using an analogy instead.

Consider a freight train hauling three containers (containers A, B, and C):

Let's say that container A can only contain livestock, container B transports liquids, and container C contains manufactured goods. It doesn't matter what type of livestock, liquids, and manufactured goods are transported so long as they fit into their respective categories.

We have just defined a very simple grammar. Within this structure, it is possible to have 'phrases' such as "pigs milk computers" or "cows mercury needles".

A cow-mercury-needle train and pig-milk-computer train collision would be unfortunate.

Both the cow-mercury-needle train and the pig-milk-computer train are examples of the same model of train — an A-B-C train — that happen to be carrying different items in their identical containers. The model of the train can be thought of as one rule (or phrase structure) in the grammar, but it is certainly possible to have other rules.

The analogy gets a bit weird here (yes, weirder than the above image) because context free grammars also allow their containers to contain other trains.

Let's imagine a different model of train, the B-A-Weird train that has a B-container (which must, as before, contain a liquid), an A-container (holding livestock) and a weird container which holds either an A-B-C train or a B-A-Weird train. Now instead of a pig-milk-computer train, it would be possible to have a milk-pig-cow-mercury-needle train where the milk is in compartment B, the pigs are in compartment A and the cow-mercury-needle train from above is all in the weird compartment.

Okay, that's pretty weird, but that's not nearly as strange as being able to put a B-A-Weird train inside the Weird compartment. This could allow us to take the entire train we just discussed and stick it in the last compartment of another B-A-Weird train. Then we could take that train and do the same thing, over and over, as much as we would like. However, at this point the train analogy has lost almost all meaning and is likely confusing matters substantially.

Fortunately, there is a familiar application for context-free grammars that should make more sense now, namely language. By specifying containers in terms of parts of speech instead of cargo, we can construct sentences instead of trains. This is sort of reminiscent of Mad Libs: "the {noun} {past tense verb} the {noun}" would be one example of a sentence structure from a context-free grammar. Amusingly, this is also how Matlab's "why" function manages to produce output such as "The bald and not excessively bald and not excessively smart hamster obeyed a terrified and not excessively terrified hamster."

As before though, things get interesting when sentence structures get put inside of other sentence structures (just as trains... get... er, put inside of other trains). We can expand the definition of nouns to include phrases in the structure "{noun} the {noun} {past tense verb}". This allows phrases such as "man the hamster ate" to be used in place of the simple noun "man". And why not? Anywhere you can refer to a man, you can grammatically refer to a specific man who happened to be eaten by a hamster.

There is a slight problem with this setup in that, as before, things get complicated when structures are continually placed inside other structures. The phrase
"The dog the boy the man the hamster ate raised owned bit the boy the man the hamster ate raised."
is in fact valid using the two rules defined above and can be parsed as:
The (dog the (boy the (man the (hamster) ate) raised) owned) bit the (boy the (man the (hamster) ate) raised).
...meaning the hamster ate the man who raised the boy who owned the dog which bit the boy raised by the man eaten by the hamster.

So... yeah, people don't generally speak like that because of cognitive limits to how many nested levels of information we can keep track of at once. But maybe, you say, this idea of context-free grammar could still be useful for creating different kinds of structure. Some sort of really complex structure that can have parts embedded in other parts in ways that are hard for people to imagine. Perhaps, you hazard, some sort of biological structure.

In this last set of thoughts, you've outlined the general idea behind GenoCAD, a context-free grammar for synthetic biology. It seems to be a good fit at first but, as Andre from the UW iGEM team points out, there are quite a few properties of DNA sequences that it fails to capture. More on this later.


  1. This made me lol (actually) for multiple reasons, including, but not limited to, your train collision pic and the poor hamster-eaten man (and the dog of the boy of this man).

    Also: this post is full of awesome. (Why is grammar so intrinsically pleasing??)

    Also: holy crap, the wiki is out of date. D:

  2. I was going to respond to this, but my response turned into Part 2 of the post. I think that the pleasingness of grammar is largely due to its ability to hugely (over-)simplify things to the point where we can understand some aspects of language.

    I expect that if we ever do learn the underlying mechanisms behind language (if such mechanisms even exist) they will make all these constructed grammars look incredibly weak.

    Also: thank you, you are most kind. :D