Impact of Communication Networks on Fault-Tolerant Distributed Computing
Abstract
When the desired reliability of a computing system exceeds that of its individual hardware components, the need for fault-tolerant systems arises. While distributed systems have the potential to achieve highly reliable computing, programming them is a challenging task. Several paradigms have been identified that can simplify the conceptual design of fault-tolerant distributed systems. The properties of a distributed system have profound implications for the solvability of these paradigms and for the efficiency of their implementations. In this thesis we study the effect that different communication models have on the efficiency of fault-tolerant computing. As an instance of a fundamental operation, we examine protocols for reliable broadcast in distributed systems. Our main contribution is the characterization of the time complexity of reliable broadcast with respect to communication models. A practical consequence of our results is the development of reliable broadcast protocols that are efficient with respect to these communication models. A variety of common networks are shown to support this style of communication. In fact, by parameterizing the minimum multicast size and diameter of these networks, we are able to characterize all known network architectures. Distributed systems in which processors perceive approximately the same time are much easier to program. Clock synchronization protocols implement this abstraction given only clocks that have bounded drift rates with respect to real time. We show how a primitive that is normally used only for communication in a distributed system can also be used for synchronizing clocks. If this primitive occurs naturally with sufficient frequency, clock synchronization can be achieved at no additional message cost. Our results reveal hardware/software tradeoffs among performance, resiliency, and network cost. Thus, they offer many new alternatives not previously considered in the design of fault-tolerant systems.