18-749: Fault-Tolerant Distributed Systems

Units: 12

This course provides an in-depth, hands-on overview of designing and developing fault-tolerant distributed systems. It covers both fundamental and advanced concepts of dependability, including replication, multicast, group communication, consistency, checkpointing, transaction processing, fault monitoring, failure diagnosis, fault injection, and software upgrades. It also covers real-world commercial practices for achieving high availability and fault tolerance, along with case studies of failures and downtime incidents. The course includes a hands-on team project that involves the design, implementation, and empirical evaluation of a fault-tolerant, high-performance distributed system based on middleware and cloud-computing platforms. Each team is also expected to produce a conference-style research paper at the end of the semester that documents the design, lessons learned, and empirical evaluation of its project. Students can expect to learn (i) the individual and combined aspects of high-performance computing and fault tolerance, (ii) the infrastructural aspects of middleware and cloud computing, (iii) tools and techniques for implementing and evaluating fault tolerance, and (iv) the strengths and weaknesses of current distributed technologies and cloud-computing platforms from the viewpoints of high performance, fault tolerance, and scalability.

Prerequisites: Graduate standing or instructor permission is required.


Computer Software, Software Systems and Computer Networking

Last modified on 2006-03-27



Past semesters:

F15, F10, S06, S05, S04, S03, S02, F99, F98, S97