Craig Benham

New Detective in Bioinformatics

Scientists have mapped the human genome. Now they’re working to understand the process by which a cell uses the information in its genome to produce proteins — the biochemical workers that enable an organism to survive and function.

How, for example, does the DNA in one cell, a single fertilized egg, give rise to the 10 trillion cells that make up a fully formed human being? How do cells in a fully formed person sustain themselves?

On this new biological frontier, Craig Benham, a mathematical biologist and founding associate director of the UC Davis Genome Center, is working to uncover one of the secrets. His tool is a 38-node cluster of Apple Xserve G5 systems running Mac OS X and operating as a single virtual computing resource.

New Era in Molecular Biology

“Biology is entering a new era,” Benham says, “where we hope to have complete information about the components of a system — genes, proteins, organization at the cellular level and many other types of information.”

Understanding how a gene’s coded information is converted into cellular structures promises to foster the development of new medical applications that combat disease and improve human health.

The task is just beginning — and it’s daunting. Thanks to new tools in biomathematics, however, scientists can investigate total systems and consider how they interact. “That’s the big challenge of biology in this century,” Benham claims. “And it’s a very complicated task involving the correlation of huge amounts of information, which is only possible to gather computationally.”

Exploring Gene Expression

Benham and his team of eight researchers are mathematically exploring one of the earliest tasks in gene expression — a physical change in DNA strands — in families of genes.

“DNA is made up of two strands held together by weak bonds,” Benham explains, “and it looks and acts like a spring. In fact, you can stress it by twisting it like rope. And, like rope, when sufficiently stressed, the strands can pull apart so the strength of their attachment gets weaker. In order for gene expression to start, you have to control when and where strand openings occur.”

Benham’s team mathematically imposes different levels of stress on the double helix, then analyzes the sites where its stability is degraded. “The levels of stress on DNA are quite carefully regulated,” Benham says, “and different genes get expressed, in part, according to the stress levels that are imposed. This is where patterns of gene expression could be regulated.”

The Efficiencies of Clustering

To handle the long calculations efficiently, an algorithm divides and assigns them to nodes on the Xserve cluster. “We’re using somewhere between 30 and all the nodes,” says Benham, “and we can calculate at a rate of about a million base pairs an hour. We’re able to easily calculate destabilization patterns for long DNA sequences, whole chromosomes and whole genomes.”

“Solutions that involve computers with large shared memory,” Benham adds, “are much more expensive and very complicated to program. A shared memory machine with a small number of processors can be much more expensive than 30 or 40 nodes of an Xserve system, and you just don’t get the processing throughput.”

Before Benham’s team started using the Xserve cluster, he would give a whole calculation to one processor “and let it grind and grind and grind. It took days to weeks of a CPU’s time just to analyze a small bacterium’s genome that way.”
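The division of labor described in this section, splitting one long calculation into pieces and handing them to many workers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the team's actual algorithm: the function names are hypothetical, a local thread pool stands in for the cluster nodes, and the per-window score (the fraction of A and T bases) is a simple placeholder for the real destabilization calculation, which the article does not describe. The only figure taken from the article is the rate of about a million base pairs an hour.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_windows(sequence, window, step):
    """Return (start, subsequence) pairs that together cover the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(sequence) - window, 0)
    starts = list(range(0, last_start + 1, step))
    # Add a final window so the tail of the sequence is always covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, sequence[s:s + window]) for s in starts]


def at_fraction(subsequence):
    """Placeholder score (an assumption): fraction of A and T bases."""
    if not subsequence:
        return 0.0
    s = subsequence.upper()
    return (s.count("A") + s.count("T")) / len(s)


def score_sequence(sequence, window=6, step=2, workers=4):
    """Score every window, spreading the work over a pool of workers.

    The thread pool is a stand-in for cluster nodes (an assumption).
    """
    windows = split_into_windows(sequence, window, step)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda item: at_fraction(item[1]), windows))
    return [(start, score) for (start, _), score in zip(windows, scores)]


def estimated_hours(base_pairs, rate_per_hour=1_000_000):
    """Rough run time at the article's quoted rate of ~1 million bp/hour."""
    return base_pairs / rate_per_hour


if __name__ == "__main__":
    demo = "GCGCGCATATATATGCGCGC"  # made-up sequence for illustration
    for start, score in score_sequence(demo):
        print(start, round(score, 2))
```

Because each window is scored independently, adding workers shortens the wall-clock time without changing the result, which is the property that makes a cluster of inexpensive nodes attractive for this kind of job.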

Pitfalls of Shared Memory

Even using a shared-memory system with four processors, he says, proved slow going. (Shared-memory computers use multiple processors that share memory, usually through a shared bus or network.) “We would give the processors independent tasks, and basically have them running until they were done. That was not a very good solution. Things happen when you’re calculating for a week.

“Just because you give a calculation to each processor,” Benham explains, “doesn’t mean the processor doesn’t stop and do something else in the middle. Also, of course, a very long time passes between the time you ask the question and when you get the answer.

“You could have the thing grinding for three or four days and then find out that someone has resequenced the genome you’re calculating, and it’s different from the version you’re using. Or a more interesting question comes up, and you just have to sit and drum your fingers until the previous calculation is complete.”
