Virginia Polytechnic Institute and State University
Cost-Conscious Supercomputing
Scheduling
Déjà vu: No-fault Performance
Virginia Tech's new supercomputer will showcase the industry's first solution to the problem of transparent fault tolerance, a decades-old challenge in parallel computing. A single component failure can derail the kinds of jobs that typically run for days, weeks, or even months on large-scale systems. One faulty node, and a project that has been running for two weeks might require a full restart. But thanks to Varadarajan and his new Déjà vu software, Virginia Tech and other supercomputer centers will waste precious processing time far less frequently.
Developed in partnership with the Pittsburgh Supercomputing Center (PSC) and with funding from the NSF, Déjà vu allows IT professionals to set parameters and options that determine where checkpoints are taken while a job is running. If any node in the system fails, the software will automatically find another node and restart the job from the last safe checkpointed state. If needed, the software can also migrate running jobs to another resource of similar architecture.
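The checkpoint-and-restart idea described above can be illustrated with a minimal Python sketch. This is not Déjà vu's interface or implementation, which the article does not detail; the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the `checkpoint_every` and `fail_at` parameters, and the use of a local file are all assumptions made for illustration. It shows only the general pattern: periodically save a job's state, and after a failure resume from the last saved state instead of starting over.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write leaves the previous
    checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, n_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every few steps.

    fail_at (hypothetical, for demonstration) simulates a node failure
    just before the given step is processed.
    """
    state = load_checkpoint(path)  # resume from the last safe state
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a job that fails partway through loses only the work done since its most recent checkpoint: calling `run_job` again with the same checkpoint path picks up from the saved step. A production system would additionally need to capture distributed state consistently across nodes, which this single-process example does not attempt.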
"All of this takes place within milliseconds," reveals Lockhart. "If a job fails, it will be restarted almost instantaneously. This is huge for large-scale computing: not only do we have this robust fault tolerance built in, we can actually move a job while it's running!" With Déjà vu, developers can integrate the technology directly into their applications at compile time and deliver uninterrupted processing.
