James Bruce 15-712 Paper Review Monday November 12th "Practical Byzantine Fault Tolerance" Important Points: - Previous solutions dealing with Byzantine fault tolerance in a replicated state machine proved feasibility but did not provide practical solutions. The system presented in this paper is practical, and can be used to implement server systems competitive with even existing non-fault tolerant systems such as NFS. - A correct system can rely on bounded message delay for safety or liveness, but not both (since distributed agreement in an asynchronous system is impossible). Many previous systems relied on bounded message delay for safety, which is dangerous on real networking systems. The algorithm presented in this paper instead relies on message delay for liveness, but does not require that property for safety. - Byzantine failure is such a weak assumption that a solution can replace other mechanisms for guaranteeing strong semantics on the part of a server. For example, the BFS implementation did not need to do synchronous writes like a standard NFS server, because the Byzantine fault tolerance could already survive crashes on n/3-1 of the nodes without losing data. Deficiencies: I was very impressed with the paper overall, but I would have liked data on the performance of replication groups that could tolerate two or three node faults, which more clearly demonstrate the scalability of the system. Once implemented, it would also be useful to know how long a view change would take in a practical server system. Conclusions: This paper demonstrated that Byzantine fault tolerance can be practical in a real system, and using it as a powerful building block for servers, can be competitive with non-fault-tolerant implementations of the same service. A fault tolerant system was built employing many clever optimizations such as MACs, direct client broadcasts for read only requests, and hash function only replies in the common case to decrease network load. It was really nice to see how the BFS implementation could take advantage of its robustness to avoid the syncing required by the normal NFS daemon to achieve competitive performance - Sometimes a more complicated, less efficient base allows other optimizations to be made at other levels due to the superior semantics it provides.