James Bruce 15-712 Paper Review Monday November 12th "Implementing Fault-Tolerant Services Using the State Machine Approach" Important Points: - Fault tolerant systems can be built in a generic way to support arbitrary state machines. Thus implementing a particular service on top of this is easy since almost any deterministic distributed task can be implemented cleanly in terms of a state machine. - Knowledge of the commands in a specific service can allow further improvements to be made. One example given is for read only requests, which need one reply or t+1 identical replies in the common case for fail stop and Byzantine failures, respectively. The other example is commutative operations, which can be executed under relaxed ordering because they are commutative. - Agreement can be implemented in synchronous systems through logical clocks, or in a very ugly way using approximately synchronized real time clocks. - Online removal of faulty nodes and adding of repaired nodes can allow a system to persist indefinitely with many faults, as long as the rate of faults and replacements satisfies the combining condition (p313) Deficiencies: They gloss over a lot of details of corner cases that are likely to be the source of failure in a Byzantine failure model. For example, a Byzantine failure with real time clocks requires explicitly rejecting messages with "impossible" timestamps since they must have come from a faulty node, but they don't mention that detail. Another example is that the replica-generated ID test does not fail if a faulty node inserts an arbitrarily large CUID (p309), but it could easily exhaust the representation space of IDs (i.e. sent MAXINT-1). They don't say how to deal with that (the other paper does however). Conclusions: This is an overview of how to deal with fail stop failures and to some extent Byzantine faults in distributed systems with bounded message delays. The state machine approach is a powerful one, and allows a generic framework to be built up to support fault tolerance, with relatively little glue code required beyond implementing the service.