Michael Barmada

Apple SAN Solution Meets Massive Storage Needs

Barmada and his colleagues need a lot of computing power as well as storage capacity to chew through the heaps of data they acquire. Their department originally had a 48-processor Linux cluster with only one NFS-based file server. “The users learned that if they put more than 40 processes on the grid it would crash the cluster,” says Barmada. “There wasn’t enough capacity on the NFS server.” Barmada submitted a grant proposal to the National Institutes of Health for a new system with 250 G5 processors and 3.5 terabytes of Xserve RAID storage.

“I had always been on a Mac and when the Xserve G5 systems came out it became a viable option for the type of computing we needed,” says Barmada. “The large amount of RAM we were able to put into the machines, 64-bit computing and AltiVec processing were all expected to enhance the performance of our genetics algorithms.”

Storage was also a crucial part of the compute infrastructure Barmada was envisioning. Without a large server system to juggle multiple processing jobs at once, his cluster simply wouldn’t work. He knew that Apple could provide a relatively inexpensive server, storage and SAN solution when compared with the competition. The Apple solution was “half the price of the Dell/EMC solution and gave us four times the storage space,” says Barmada. “It was just an amazing benefit, cost-wise.”

The university got the grant and Barmada never looked back. The new cluster is light years ahead of the old one. “On an average day we have about 400 to 500 jobs sitting in the grid,” he says. “We run 200 to 250 on the grid and another 150 to 250 jobs are waiting in the queue.”

All those jobs have corresponding files. “For statistical genetics, we use flat text files,” he says. “Some can get really big when you’re dealing with 20,000 people and 500,000 bits of information. But generally, they’re on the order of a couple of hundred kilobytes or a few hundred megabytes, maybe a gigabyte at most. But to really manage the data properly we’ve had to break large files up into smaller sets. Any one analysis of a project will generate thousands of files. Several users on the system have two million files or more in their directories. There are several million files on the storage array in the grid.”
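The article does not say how the department breaks its large files into smaller sets. As a rough illustration only, the sketch below splits a line-oriented text file into fixed-size chunks and repeats the header row in each one. It assumes one record per line and a single header line; the function name, file names and chunk naming scheme are hypothetical, not taken from Barmada's setup.

```python
from pathlib import Path
from typing import List


def split_flat_file(src: Path, out_dir: Path, rows_per_chunk: int,
                    has_header: bool = True) -> List[Path]:
    """Split a line-oriented text file into numbered chunk files.

    Hypothetical helper: assumes one record per line and, optionally,
    a single header line that is copied into every chunk.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")

    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    try:
        with src.open("r", encoding="utf-8") as fh:
            header = fh.readline() if has_header else ""
            for i, line in enumerate(fh):
                # Start a new chunk file every rows_per_chunk records.
                if i % rows_per_chunk == 0:
                    if out is not None:
                        out.close()
                    name = f"{src.stem}.part{len(chunks):05d}{src.suffix}"
                    path = out_dir / name
                    out = path.open("w", encoding="utf-8")
                    out.write(header)
                    chunks.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunks
```

Because each chunk carries its own header, it can be handed to a separate grid job without reference to the original file. Chunk size trades off per-job run time against the number of files the storage array has to track.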

Advantages of a SAN Environment

Xsan consolidates all the Xserve RAID storage into one large pool. This has an obvious organizational advantage: instead of 20 or more volumes peppered with files, one volume contains all of the department’s research data. In an Xsan system, one Xserve G5 serves as the metadata controller, acting like a traffic cop that directs servers and workstations to the files in the storage pool. Xsan also distributes files across the storage devices in the pool more efficiently than distributed storage models, where individual users keep files on various network volumes or their local hard drives.

Easy Setup, Administration and Expansion

“I’m not a computer science person or an IT administrator,” says Barmada. “I’m a professor in the department.” Even so, he has no trouble managing the 282-processor Xserve G5 cluster and SAN, and he does it remotely: the system sits in a colocation facility six miles from campus. “We use the Xserve admin tools, Workgroup Manager, Apple Remote Desktop 2, Xsan admin tools and Xserve RAID admin through a VPN.

“It’s amazingly easy,” he continues. “Once I get onto the system I can open System Monitor and I’ve got a graphical view of all the machines on the cluster with lights to tell me if there are any hardware issues or buffer overflows. There are command-line tools to check the grid and see how many jobs are running. And we’re installing more sophisticated monitoring tools to tell what network operations are being done on the machines, how much disk I/O there is or how much CPU time is being taken up on a per-node basis.”
