JUNE 18–22, 2017

Presentation Details

Name: Performance Portable Checkpoint/Restart with VeloC & UnifyCR
Time: Wednesday, June 21, 2017, 02:25 pm - 02:45 pm
Room: Panorama 1, Messe Frankfurt
Speaker: Kathryn Mohror, LLNL
Abstract: High-end supercomputing systems generally achieve higher computing speeds by increasing the number of computing cores in the system. While FLOP goals can be reached with this strategy, a larger number of system components brings a higher failure rate. Today, systems experience failures at intervals on the order of hours or days; on future exascale systems, failures could occur even more frequently. In this talk, I will give an overview of the work of two Exascale Computing Project (ECP) projects that address problems associated with fault tolerance on large systems. The VeloC project will produce production-quality multilevel checkpointing software that can significantly reduce the overhead of checkpointing by utilizing fast storage devices (e.g., burst buffers). The UnifyCR project will develop a user-level distributed file system for burst buffers, specialized to provide high performance for checkpoint/restart workloads. Together, VeloC and UnifyCR will provide performance portability, allowing applications to achieve low-overhead fault tolerance on emerging systems.