Craig Benham
New Detective in Bioinformatics
Scientists have mapped the human genome. Now they're working to understand the process by which a cell uses the information in its genome to produce proteins, the biochemical workers that enable an organism to survive and function.
How, for example, does the DNA in one cell (a single fertilized egg) give rise to the 10 trillion cells that create a fully formed human being? How do cells in a fully formed person sustain themselves?
On this new biological frontier, Craig Benham, a mathematical biologist and founding associate director of the UC Davis Genome Center, is working to uncover one of the secrets using a 38-node cluster of Apple Xserve G5 systems running Mac OS X as a single virtual computing resource.
New Era in Molecular Biology
"Biology is entering a new era," Benham says, "where we hope to have complete information about the components of a system: genes, proteins, organization at the cellular level and many other types of information."
Understanding how a gene's coded information is converted into cellular structures promises to foster the development of new medical applications that combat disease and improve human health.
The task is just beginning, and it's daunting. Thanks to new tools in biomathematics, however, scientists can investigate total systems and consider how they interact. "That's the big challenge of biology in this century," Benham claims. "And it's a very complicated task involving the correlation of huge amounts of information, which is only possible to gather computationally."
Exploring Gene Expression
Benham and his team of eight researchers are mathematically exploring one of the earliest tasks in gene expression: a physical change in DNA strands in families of genes.
"DNA is made up of two strands held together by weak bonds," Benham explains, "and it looks and acts like a spring. In fact, you can stress it by twisting it like rope. And, like rope, when sufficiently stressed, the strands can pull apart so the strength of their attachment gets weaker. In order for gene expressions to start, you have to control when and where strand openings occur."
Benham's team mathematically imposes different levels of stress on the double helix, then analyzes the sites where its stability is degraded. "The levels of stress on DNA are quite carefully regulated," Benham says, "and different genes get expressed, in part, according to the stress levels that are imposed. This is where patterns of gene expression could be regulated."
The Efficiencies of Clustering
To handle the long calculations efficiently, an algorithm divides and assigns them to nodes on the Xserve cluster. "We're using somewhere between 30 and all the nodes," says Benham, "and we can calculate at a rate of about a million base pairs an hour. We're able to easily calculate destabilization patterns for long DNA sequences, whole chromosomes and whole genomes."
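The article does not describe how the team's algorithm actually divides the work, so the following is only a minimal sketch of one common way to do it, under stated assumptions: the sequence is cut into overlapping windows, and the windows are dealt out to nodes in round-robin order. The function names, the window size, the overlap and the round-robin policy are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in base pairs


def split_into_windows(seq_len: int, window: int, overlap: int) -> List[Window]:
    """Cut a sequence of seq_len base pairs into overlapping windows.

    Hypothetical scheme: consecutive windows share `overlap` base pairs so
    that a site near a window edge is seen with context on both sides.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    windows: List[Window] = []
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        windows.append((start, end))
        if end == seq_len:  # the last window reaches the end of the sequence
            break
        start += step
    return windows


def assign_round_robin(windows: List[Window], n_nodes: int) -> List[List[Window]]:
    """Deal windows out to n_nodes work lists, one window at a time."""
    if n_nodes <= 0:
        raise ValueError("need at least one node")
    assignments: List[List[Window]] = [[] for _ in range(n_nodes)]
    for i, w in enumerate(windows):
        assignments[i % n_nodes].append(w)
    return assignments


if __name__ == "__main__":
    # Illustrative numbers only: one million base pairs, 38 nodes.
    windows = split_into_windows(1_000_000, window=10_000, overlap=1_000)
    per_node = assign_round_robin(windows, n_nodes=38)
    print(len(windows), "windows;", len(per_node[0]), "on the first node")
```

Because each window is an independent task, a node that finishes early can simply take the next one, which is why this kind of workload suits a cluster without shared memory.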
"Solutions that involve computers with large shared memory," Benham adds, "are much more expensive and very complicated to program. A shared-memory machine with a small number of processors can be much more expensive than 30 or 40 nodes of an Xserve system, and you just don't get the processing throughput."
Before Benham's team started using the Xserve cluster, he would "give a whole calculation to one processor and let it grind and grind and grind. It took days to weeks of a CPU's time just to analyze a small bacterium's genome that way."
Pitfalls of Shared Memory
Even using a shared-memory system with four processors, he says, proved slow going. (Shared-memory computers use multiple processors that share memory, usually through a shared bus or network.) "We would give the processors independent tasks, and basically have them running until they were done. That was not a very good solution. Things happen when you're calculating for a week."
"Just because you give a calculation to each processor," Benham explains, "doesn't mean the processor doesn't stop and do something else in the middle. Also, of course, a very long time passes between the time you ask the question and when you get the answer."
"You could have the thing grinding for three or four days and then find out that someone has resequenced the genome you're calculating, and it's different from the version you're using. Or a more interesting question comes up, and you just have to sit and drum your fingers until the previous calculation is complete."
