Authors: Islam, Tanzima Zerin | Mohror, Kathryn | Bagchi, Saurabh | Moody, Adam | de Supinski, Bronis R. | Eigenmann, Rudolf
Article Type: Research Article
Abstract:
High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
Keywords: Data-aware, checkpoint restart, distributed applications, distributed systems, fault tolerance, aggregation, bottleneck, multiple-processor systems, application-level checkpointing, rollback recovery, system reliability, distributed programming, fault tolerant computing, software reliability, system recovery
DOI: 10.3233/SPR-130371
Citation: Scientific Programming, vol. 21, no. 3-4, pp. 149-163, 2013