A Parable in Computational Biology
Reuben Settergren
Suppose that one day, while you are poking around a medieval castle in Germany, in an
out-of-the-way chamber you find evidence of pre-Gutenberg mechanical printing! There are
stacks and stacks of what look like large rolls of ticker tape about a centimeter wide.
Examining a few of the rolls, you can see that on each roll is printed a continuous line of
text (in German), that if completely unraveled, would be perhaps a kilometer or more in length!
Also, as far as you can tell from the first few feet, the text on each roll is identical.
As you are crouching in the corner behind a huge stack of rolls to pull out another sample,
somebody throws a grenade into the room, and BOOM! it explodes. You survive only because all
of the paper in the room bore the brunt of the blast for you. The demolition of the tapes was
so complete that the longest intact fragments are only a few meters in length. So the task
before you is to reconstruct the mysterious text that was on all of these rolls. It will take
a long time, but fortunately, you know a monk in a nearby abbey, who is insane . . .
All Helmut the insane German monk does every waking hour is examine any kind of printed material,
and snip out every occurrence of the word "UND" to add to his collection. So you gather all of
the largest fragments you can find, and (after xeroxing and carefully archiving the copies),
you spend the next few months in a cell with Helmut, feeding him strips of text, and collecting
in a ziploc baggie the snippets that fall every time he snips an UND --- one baggie to hold the
snippets from each large fragment.
Once every large fragment has been reduced to snippets, you retire to your lab to begin the real
work. You open each baggie, and measure the length of each snippet. This data can then tell you,
for instance, how long each large fragment was. More importantly, it also tells you the lengths
of the spaces between all of the adjacent UNDs in that portion of the text. So, for instance, if
two large fragments share a common portion of the text which contains 6 UNDs, then the baggies for
those two fragments should have 5 pairs of snippets that are the same length. From this kind of
information, you hope to build a map of how the fragments overlap each other, and reconstruct the
original text!
To explain the parable: the text on the very long strip of paper is our DNA, 3 billion nucleotides
(A,C,T, or G) long. The Human Genome Project is the national effort to sequence (i.e. "read") the
entire string. It is not possible to directly "read" long fragments of DNA (more than about 1000
nucleotides), so the DNA must be fractured and reassembled. The large fragments created by some
random process (BOOM!) are called clones. The role of Helmut is played by restriction enzymes,
whose sole biological function is to search a strand of DNA for a particular small substring of
nucleotides, and cleave the DNA at that point (yielding restriction fragments (snippets)).
Biological techniques exist for measuring the lengths of DNA fragments, and from that data, it
becomes a largely algorithmic issue to build a map of the clones.
This summer I was working at the Genome Center at MIT's Whitehead Institute, considering the
following problem: if the clones are very large, they will produce very many restriction
fragments --- so many that any two clones could have quite a few fragments which are the same
length (within measurement error), even if they don't truly overlap. How can you tell the
difference between two clones that really overlap and two clones that have a lot of common
fragment sizes out of coincidence? For the length of clones (about 150,000 nucleotides), and
the average length of restriction fragments (about 4096 nucleotides) I was working with, it
turned out that it was basically not possible to reliably tell the difference --- the noise from
the coincidentally shared fragments drowned out the signal from the true overlaps. But from
that failure springs a new question to be investigated: for what parameters of clone length
and fragment size is the overlap signal detectable? And so the mathematical journey rolls on.
![]() |
|