Issue title: Selected Papers from Super Computing 2012
Article type: Research Article
Authors: Islam, Tanzima Zerin | Mohror, Kathryn | Bagchi, Saurabh | Moody, Adam | de Supinski, Bronis R. | Eigenmann, Rudolf
Affiliations: School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA. E-mails: {tislam, sbagchi, eigenman}@purdue.edu | Lawrence Livermore National Laboratory, Livermore, CA, USA. E-mails: {kathryn, moody20, bronis}@llnl.gov
Note: Corresponding author. E-mail: [email protected]
Abstract: High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries such as HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
Keywords: Data-aware, checkpoint restart, distributed applications, distributed systems, fault tolerance, aggregation, bottleneck, multiple-processor systems, application-level checkpointing, rollback recovery, system reliability, distributed programming, fault tolerant computing, software reliability, system recovery
DOI: 10.3233/SPR-130371
Journal: Scientific Programming, vol. 21, no. 3-4, pp. 149-163, 2013
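The data-aware aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each process checkpoint is a plain dictionary mapping variable names to raw bytes (standing in for HDF5 or netCDF datasets), uses zlib as the compressor, and the names `pack` and `restore` are hypothetical. With `group_by_variable=False` the records are written process by process, as in simple concatenation; with `group_by_variable=True` same-named variables from all processes are placed next to each other before compression. Whether grouping actually compresses better depends on the data.

```python
import struct
import zlib

# Record header: process rank, variable-name length, payload length.
_HEADER = struct.Struct("<III")


def pack(checkpoints, group_by_variable):
    """Serialize and compress a list of per-process checkpoints.

    checkpoints: list of dicts {variable name: bytes}, indexed by rank
                 (an assumed stand-in for real HDF5/netCDF checkpoints).
    group_by_variable: False keeps process order (simple concatenation);
                       True places same-named variables side by side.
    """
    records = [
        (rank, name, data)
        for rank, ckpt in enumerate(checkpoints)
        for name, data in sorted(ckpt.items())
    ]
    if group_by_variable:
        # Reorder so all copies of one variable are adjacent.
        records.sort(key=lambda rec: (rec[1], rec[0]))

    parts = []
    for rank, name, data in records:
        encoded = name.encode("utf-8")
        parts.append(_HEADER.pack(rank, len(encoded), len(data)))
        parts.append(encoded)
        parts.append(data)
    return zlib.compress(b"".join(parts))


def restore(blob, num_procs):
    """Rebuild the per-process checkpoints from a blob written by pack()."""
    raw = zlib.decompress(blob)
    checkpoints = [{} for _ in range(num_procs)]
    offset = 0
    while offset < len(raw):
        rank, name_len, data_len = _HEADER.unpack_from(raw, offset)
        offset += _HEADER.size
        name = raw[offset:offset + name_len].decode("utf-8")
        offset += name_len
        checkpoints[rank][name] = raw[offset:offset + data_len]
        offset += data_len
    return checkpoints
```

Calling `len(pack(ckpts, False))` and `len(pack(ckpts, True))` on the same input gives a rough comparison of the two orderings, and `restore` works on either blob because every record carries its own rank and name.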