Skip to main content


eCommons@Cornell

eCommons@Cornell >
College of Engineering >
Computer Science >
Computer Science Technical Reports >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1813/7089
Title: Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs
Authors: Babaoglu, Ozalp
Keywords: computer science
technical report
Issue Date: Dec-1991
Publisher: Cornell University
Citation: http://techreports.library.cornell.edu:8081/Dienst/UI/1.0/Display/cul.cs/TR91-1249
Abstract: The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can be applied also to parallel computations. In this paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique - passive replication - is explored in detail as it forms the basis for fault tolerance in the Paralex parallel programming environment. Keywords: Parallel processing, reliability, transactions, checkpointing, recovery, replication, reliable broadcast, causal ordering, Paralex.
URI: http://hdl.handle.net/1813/7089
Appears in Collections:Computer Science Technical Reports

Files in This Item:

File Description SizeFormat
91-1249.pdf1.87 MBAdobe PDFView/Open
91-1249.ps374.95 kBPostscriptView/Open

Refworks Export

Items in eCommons are protected by copyright, with all rights reserved, unless otherwise indicated.

 

© 2014 Cornell University Library Contact Us