Dr. Will Gilbert

Feats of Bioinfomagic

Genome Output

William Gilbert used BLAST to search the human genome — which, if printed, would fill 1,000 one-thousand-page phone books.

Piecing the Genetic Puzzle

To identify the genetic puzzle pieces unique to humans, Gilbert’s group first looked at research conducted by the University of California, Santa Cruz. Scientists there had found 192 places along the human genome where hereditary instructions might be stable — “stable over the last 100,000 years or so,” Gilbert says. “If our thinking is correct, these areas should contain genes that have pretty much settled down.”

But there was a problem.

There was no way to tell from the Santa Cruz study which genes the 192 areas (or “bins,” in biotech language) contained. At the other end of the country, in Washington, D.C., researchers at NCBI had annotated the genome, which is estimated to contain 30,000 to 40,000 human genes that determine everything from gender to disease susceptibility. But NCBI couldn’t designate which genes populated the 192 regions of genetic stability.


What’s more, the two databases used different coordinate systems. Gilbert knew that if he could line up the two copies of the genome using their sequences, he’d have his answer. “They were the same genome,” he explains, “so the matches should be identical. Once we matched regions, we could look up the genes and be back in business.”
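The idea of matching by sequence rather than by coordinates can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names `locate_bin` and `genes_in_region`, the tiny example strings, and the use of exact string matching, which stands in for the BLAST searches Gilbert actually ran.

```python
def locate_bin(bin_sequence: str, genome: str) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans where bin_sequence occurs in genome.

    Hypothetical helper: exact matching is a stand-in for a real alignment search.
    """
    spans = []
    start = genome.find(bin_sequence)
    while start != -1:
        spans.append((start, start + len(bin_sequence)))
        start = genome.find(bin_sequence, start + 1)
    return spans


def genes_in_region(region: tuple[int, int],
                    annotations: dict[str, tuple[int, int]]) -> list[str]:
    """Return names of annotated genes whose span overlaps the given region."""
    start, end = region
    return [
        name
        for name, (gene_start, gene_end) in annotations.items()
        if gene_start < end and start < gene_end  # half-open interval overlap
    ]


# Made-up data: a short "annotated genome" and one bin from the other database.
genome = "TTACGGATCCATGCA"
annotations = {"geneA": (0, 5), "geneB": (8, 12), "geneC": (12, 15)}

region = locate_bin("GGATCC", genome)[0]
overlapping = genes_in_region(region, annotations)
```

Once the bin's sequence is found in the annotated copy, its position there is known in that copy's own coordinates, and the gene lookup becomes a simple interval-overlap test.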

Slow Going

Gilbert first used NCBI BLAST to compare the 192 bins against the entire human genome, which, if printed, would fill 1,000 one-thousand-page telephone books. It took NCBI BLAST 16 hours to match just one of the bins to the genome. “Actual DNA code,” Gilbert explains, “has just four letters: A, G, C and T. When you’re doing comparative genomics, you don’t compare one genetic letter at a time, a C or an A. You compare words composed of 20 genetic letters or 50 genetic letters (TACCTAGAC and so on), rarely more than 50, because conventional thinking is that you lose sensitivity when you use longer words.”
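Comparing by words instead of single letters can be sketched in a few lines. This is a toy illustration of word-based seeding in general, not the internals of NCBI BLAST or A/G BLAST; the function names `build_word_index` and `find_seed_hits` and the short example sequences are assumptions.

```python
from collections import defaultdict


def build_word_index(genome: str, word_size: int) -> dict[str, list[int]]:
    """Map every word of length word_size in the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - word_size + 1):
        index[genome[pos:pos + word_size]].append(pos)
    return index


def find_seed_hits(query: str, index: dict[str, list[int]],
                   word_size: int) -> list[tuple[int, int]]:
    """Return (query_pos, genome_pos) pairs where a query word matches exactly."""
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for g_pos in index.get(word, []):
            hits.append((q_pos, g_pos))
    return hits


# Made-up data: the query is embedded in the genome starting at position 8.
genome = "ACGTACGTTACCTAGAC"
query = "TACCTAGAC"

short_hits = find_seed_hits(query, build_word_index(genome, 4), 4)
long_hits = find_seed_hits(query, build_word_index(genome, 9), 9)
```

In this sketch a longer word produces fewer lookups and fewer hits to process, but it requires a longer run of exact agreement, so a single mismatch inside the word would hide the match. That trade-off is the sensitivity concern in the conventional thinking Gilbert describes.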

Still, making comparisons 50 genetic letters at a time was slow going. “And that’s just doing it once,” Gilbert says. “You’d like to do it more than once because you want to tweak things and ask what-if questions. It quickly became apparent that it would take more than a month to complete all our bins with NCBI BLAST.”
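A back-of-the-envelope check shows why a month was an optimistic floor. The sketch assumes, purely for illustration, that every bin takes about as long as the first one (16 hours) and that the bins run one after another on a single machine.

```python
HOURS_PER_BIN = 16   # measured time for one bin, from the article
BIN_COUNT = 192      # number of bins from the Santa Cruz study

# Assumption: each bin costs roughly the same, and runs are sequential.
total_hours = HOURS_PER_BIN * BIN_COUNT
total_days = total_hours / 24

print(f"{total_hours} hours, or about {total_days:.0f} days for a single pass")
```

Under those assumptions a single pass comes to roughly four months, before any of the repeat runs Gilbert wanted for what-if questions.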

Apple/Genentech BLAST

That’s when Gilbert glanced at a chart he had taped to his wall. “I looked at the plot; it compared the time it took NCBI BLAST and Apple/Genentech BLAST to execute comparisons. The plot for regular BLAST started out and leveled off; the plot for A/G BLAST was a straight line that went up at a 45-degree angle.”


“I got to thinking,” Gilbert says, “I wonder if that linearity continues. What if I cranked this thing up to word sizes of 200? That would certainly save the day. So I hopped on my Mac, pulled down the A/G BLAST, spent about an hour indexing the genome a different way. And I said, ‘This is either going to work or not going to work.’ I tested a word size of 250 for my first shot. The test was done in two minutes, much to my disbelief. So I said, ‘Well, that must not have worked.’ Yet when I examined the output, A/G BLAST had indeed found the right hunk of DNA. Things started getting very exciting at that point.”

Gilbert then slowly brought the word size down to make sure he wasn’t losing any sensitivity. “A/G BLAST found the same gene region. Whether I used a smaller word size or a larger word size, I would find the same piece of DNA. That was very encouraging. And just for giggles, I cranked it way up, too. I think at one point I got A/G BLAST to run a test in 19 seconds, which is just beyond belief.”
