Pardis Sabeti. Genetic Sleuth.
It’s one of the great mysteries of medical science—why are some people seemingly immune to infectious diseases, while others become critically ill or die? This puzzle so intrigued biologist Dr. Pardis Sabeti that she devised a unique method for detecting promising clues by sifting through the human genome.
Sabeti made her discovery in 2001 as a student at Harvard Medical School, where she wrote an algorithm that let her differentiate genetic mutations resulting from natural selection from those occurring at random. Her tool gave her the ability to spot and track beneficial evolutionary genomic changes—for instance, mutations that make some people inherently resistant to certain diseases. Explains Sabeti, “Everything I do is based on a simple principle: beneficial things will spread through populations very quickly.”
Massive Data, Intensive Analysis
In early 2008, backed by several major grants, Sabeti founded her own research lab at Harvard University to investigate the mechanisms of infectious diseases. To create a state-of-the-art data management and analysis system for her lab, Sabeti enlisted computer scientist Dr. Dan Yamins. “Genomic data is large, and a lot of manipulations need to be done to it to understand its content,” says Yamins. “Our main challenges include understanding the structure and the relationships between the data, and being able to easily keep track of it, access and use it.”
Yamins chose Mac Pro computers for the lab’s workstations, which let researchers maximize the work they can do from the desktop while easily sharing information with the multi-thousand-core UNIX cluster at the nearby Broad Institute of MIT and Harvard. “For high-performance scientific work, our eight-core Mac Pros are very powerful,” he explains. “They run UNIX natively, which means we can pull in the entire open-source world of free software. The Macs have better UNIX integration, because they offer a nice user interface and great graphics.” Yamins adds that, from time to time, some of the scientists need to use Windows applications, which they do by running Parallels Desktop for Mac, giving them the ability to run UNIX, Mac OS X and Windows from the same workstation.
‘An Extension of Your Brain’
To get a rough idea of the file sizes the lab works with, the genome of a mammal is about three billion letters long. If they’re working with disease association mapping in humans, each person has 500,000 to a million pieces of information to look at. Multiply this by thousands of people in each study and you’ve got an enormous amount of data to work with. “This is why our analysis has to be done on clusters—it can take weeks,” says postdoctoral researcher Dr. Elinor Karlsson.
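As a back-of-the-envelope illustration of that multiplication, here is a minimal Python sketch. The per-person marker counts come from the figures above; the cohort size of 3,000 is a hypothetical stand-in for "thousands of people", and the function and variable names are made up for this example rather than taken from the lab's own tools.

```python
# Rough estimate of how many genotype values one association study holds.
# Marker counts are the article's figures; PEOPLE is an assumed cohort size.

MARKERS_LOW = 500_000     # lower bound of data points per person
MARKERS_HIGH = 1_000_000  # upper bound of data points per person
PEOPLE = 3_000            # assumption: a study with "thousands" of participants


def genotype_count(markers_per_person: int, people: int) -> int:
    """Total number of individual data points across the whole cohort."""
    return markers_per_person * people


low = genotype_count(MARKERS_LOW, PEOPLE)
high = genotype_count(MARKERS_HIGH, PEOPLE)
print(f"{low:,} to {high:,} genotype values")
```

Under these assumptions, a single study holds between 1.5 and 3 billion data points, which is on the order of the three-billion-letter genome itself.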
“My Mac is part of this huge computing system,” Karlsson adds. “That’s what I like about it. I’ve got all the data on the UNIX cluster, and it’s all big text files and programs I’ve written, things that I can’t really look at or work with from there. So I can take those files and copy them onto my Mac, where I can still work with them in UNIX and run all my Perl scripts to manipulate them. Sometimes I do things on the cluster, sometimes I do things on my Mac Pro, and sometimes I have to figure out exactly which end I’m doing it on, which is interesting, because that means the process is so seamless. It’s like the Mac is an extension of my brain. It’s so nice to be able to move data back and forth so easily.”
Karlsson discovered additional benefits on her Mac when she used Pages, part of the iWork application suite, to compose her thesis. “Pages has an open, easy way of handling images, legends, numbering and styles that no other product could do without a lot more complexity, if at all,” she says.
“It doesn’t matter how good your science is, if you can’t come up with a way to communicate it, then nobody else is going to realize how wonderful it is,” says Karlsson.
Easy to Program in Different Languages
For data analysis, the Sabeti lab scientists typically write their own programs in Perl, Java, or C++, and they adapt existing open-source tools created by others. “The UNIX interface on the Mac makes it extraordinarily easy to program in different languages, and go back and forth between programming and seeing the results,” says Sabeti. Yamins agrees that Mac OS X is an ideal platform for script-based development. On his Mac he uses Python to develop a common scripting language for the lab.
