A Parable in Computational Biology
Reuben Settergren

      Suppose that one day, while you are poking around a medieval castle in Germany, in an out-of-the-way chamber you find evidence of pre-Gutenberg mechanical printing! There are stacks and stacks of what look like large rolls of ticker tape about a centimeter wide. Examining a few of the rolls, you can see that on each roll is printed a continuous line of text (in German), that if completely unraveled, would be perhaps a kilometer or more in length! Also, as far as you can tell from the first few feet, the text on each roll is identical.

      As you are crouching in the corner behind a huge stack of rolls to pull out another sample, somebody throws a grenade into the room, and BOOM! it explodes. You survive only because all of the paper in the room bore the brunt of the blast for you. The demolition of the tapes was so complete that the longest intact fragments are only a few meters in length. So the task before you is to reconstruct the mysterious text that was on all of these rolls. It will take a long time, but fortunately, you know a monk in a nearby abbey, who is insane . . .

      All Helmut the insane German monk does every waking hour is examine any kind of printed material, and snip out every occurrence of the word "UND" to add to his collection. So you gather all of the largest fragments you can find, and (after xeroxing and carefully archiving the copies), you spend the next few months in a cell with Helmut, feeding him strips of text, and collecting in a ziploc baggie the snippets that fall every time he snips an UND --- one baggie to hold the snippets from each large fragment.

      Once every large fragment has been reduced to snippets, you retire to your lab to begin the real work. You open each baggie, and measure the length of each snippet. This data can then tell you, for instance, how long each large fragment was. More importantly, it also tells you the lengths of the spaces between all of the adjacent UNDs in that portion of the text. So, for instance, if two large fragments share a common portion of the text which contains 6 UNDs, then the baggies for those two fragments should have 5 pairs of snippets that are the same length. From this kind of information, you hope to build a map of how the fragments overlap each other, and reconstruct the original text!

      To explain the parable: the text on the very long strip of paper is our DNA, 3 billion nucleotides (A,C,T, or G) long. The Human Genome Project is the national effort to sequence (i.e. "read") the entire string. It is not possible to directly "read" long fragments of DNA (more than about 1000 nucleotides), so the DNA must be fractured and reassembled. The large fragments created by some random process (BOOM!) are called clones. The role of Helmut is played by restriction enzymes, whose sole biological function is to search a strand of DNA for a particular small substring of nucleotides, and cleave the DNA at that point (yielding restriction fragments (snippets)). Biological techniques exist for measuring the lengths of DNA fragments, and from that data, it becomes a largely algorithmic issue to build a map of the clones.

      This summer I was working at the Genome Center at MIT's Whitehead Institute, considering the following problem: if the clones are very large, they will produce very many restriction fragments --- so many that any two clones could have quite a few fragments which are the same length (within measurement error), even if they don't truly overlap. How can you tell the difference between two clones that really overlap and two clones that have a lot of common fragment sizes out of coincidence? For the length of clones (about 150,000 nucleotides), and the average length of restriction fragments (about 4096 nucleotides) I was working with, it turned out that it was basically not possible to reliably tell the difference --- the noise from the coincidentally shared fragments drowned out the signal from the true overlaps. But from that failure springs a new question to be investigated: for what parameters of clone length and fragment size is the overlap signal detectable? And so the mathematical journey rolls on.