18-749: Fault-Tolerant Distributed Systems
Spring 2006

Prof. Priya Narasimhan


Course Home

The course provides an in-depth, hands-on overview of designing and developing fault-tolerant distributed systems. The course covers both the fundamental and advanced concepts of dependability, including replication, atomic multicast, group communication, consistency, checkpointing, transaction processing and fault injection, along with industrial standards and real-world practices for achieving high availability and fault-tolerance. Additional topics include the practical trade-offs and inter-relationships between fault-tolerance and other properties, such as real-time and performance. The lecture concepts are complemented by a semester-long hands-on project that involves the design, implementation and empirical evaluation of a fault-tolerant, high-performance distributed system. To introduce students to state-of-the-art technologies, the project emphasizes the use of object-oriented middleware, such as CORBA and EJB.

Since the only real way to appreciate dependability issues is to experience them first-hand, a substantial portion of the course will involve a cooperative, team-based software implementation project. The project requires the design, implementation, empirical evaluation and end-to-end analysis of a real-time, fault-tolerant, high-performance distributed middleware application. The lectures, along with regular project meetings with the instructor, will allow students to design and implement realistic middleware applications, to develop working infrastructures to make these applications dependable, and to analyze the effectiveness of their techniques. From this course, students can expect to learn (i) the individual and combined aspects of performance and fault tolerance, (ii) the basics of middleware, (iii) tools and techniques for analyzing dependability, and (iv) the strengths and weaknesses of current distributed technologies from the respective viewpoints of real-time, fault-tolerance and scalability.

12 units (3 hours lecture + 1 hour project meeting per week)

(1) Solid knowledge of C++ and/or Java (if you know Java, you will need to pick a Java-based project in this course; if you know C++, you will need to pick a C++-based project).
(2) Understanding of basic operating systems concepts

WEDNESDAY and FRIDAY, 10:30am - 12:20pm, PH A18A
PLUS: project meeting times TBD

18-749 in Spring 2005
18-846/17-654 in Spring 2004
18-846/17-654 in Spring 2003
18-841/17-654 in Spring 2002


Prof. Priya Narasimhan, Assistant Professor of ECE and CS, has 10 years of experience and over 50 publications in the field of fault-tolerant distributed systems. Apart from her significant contributions to the Fault-Tolerant CORBA standard, she has real-world experience as the CTO and Vice-President of Engineering of a start-up company building embedded fault-tolerance products. Her current research focuses on fault-tolerant and survivable distributed middleware systems, in both the enterprise and embedded domains.
Office: CIC 2202
Tel: 412-268-8801
Email: priya@cs.cmu.edu