Skip to main content


eCommons@Cornell

eCommons@Cornell >
Faculty of Computing and Information Science >
Computing and Information Science >
Computing and Information Science Technical Reports >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/5607
Title: Automated Application-level Checkpointing of MPI Programs
Authors: Bronevetsky, Greg
Marques, Daniel
Pingali, Keshav
Stodghill, Paul
Keywords: computer science
technical report
Issue Date: 12-Feb-2003
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cis/TR2003-1891
Abstract: Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in teh literature are not suitable for implementing this approach. In this paper, we present a suitable protocol, and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.
URI: http://hdl.handle.net/1813/5607
Appears in Collections:Computing and Information Science Technical Reports

Files in This Item:

File Description SizeFormat
TR2003-1891.pdf99.55 kBAdobe PDFView/Open

Refworks Export

Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.

 

© 2014 Cornell University Library Contact Us