15-712: Advanced Operating Systems & Distributed Systems

Papers

For each class meeting, readings are assigned. Usually, the readings will consist of two or three computer systems papers. The papers selected for this course are either classic papers or papers from recent top conferences. You are expected to read these papers thoroughly and summarize them BEFORE arriving at class. For each class meeting, we identify the topic and papers below; for each, we also try to identify good sources for background reading and for further investigation.

Electronic versions are linked where available (access will be denied for IP addresses outside of CMU); for paper copies, visit the 712 drawer of the course file cabinet (on the D level of Hamerschlag Hall, just outside the elevator). We will try to provide paper copies of the assigned readings at least a week in advance.

(NOTE: This schedule is not set in stone. Some changes may be made to this schedule during the term. As well, guest lectures are still being determined.)

September 12 : Welcome to 712

Because of the late start to 712, we will start quickly. This first meeting will be more than just organizational in nature. In it, we will discuss how the class is going to work and what will (and won't) be covered. In addition, we will dive into the material. This will include very rapidly recapping stuff you should already know (e.g., stuff covered in 15-412), discussing what defines operating systems and distributed systems and what makes them continue to be interesting after all these years, and overviewing how the various topics in the course fit together.

The papers listed above are fun and insight-filled, talking generally about the construction and history of computer systems. We strongly encourage you to read them, but are not requiring the standard written summary for them. The Lampson paper, in particular, is something that you should read now, read again in a few weeks, and then put into the pile of papers that you re-read every year or so. A book that should go in this same category is The Mythical Man-Month: Essays on Software Engineering (by F. Brooks, Addison-Wesley Publishers).

September 17 : Maximizing Disk Performance

The next four class meetings will use file systems as a concrete example for the various general problems introduced on day one (and covered throughout the course). This first one will focus on performance enhancement in local file systems, particularly related to disk drives. To set the stage, we'll look at the levels of control and translation between application interfaces and disk media. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe two storage system implementation techniques specifically focused on maximizing performance given specific disk features. Each describes some of the trends leading to their designs, although both have a "back to the future" feel given that systems used to do such things in "the good old days" when disk drives did not have high-level interfaces like SCSI or IDE. The first identifies the disk track as a sweet spot for mid-sized requests and addresses the difficulties involved with exploiting this insight. The second describes how small, synchronous writes can be completed with minimal mechanical positioning (the dominant cause of disk delays) and quantifies the benefits.

This class topic will move rapidly through a variety of file system and disk management issues, since you should have a solid base already (from your undergraduate OS course). A good source for more background on how disks work is An Introduction to Disk Drive Modeling (by C. Ruemmler and J. Wilkes, in IEEE Computer magazine, March 1994). For more background in file systems, we suggest Practical File Systems Design with the Be File System (by D. Giampaolo, Morgan Kaufmann Publishers), and chapters 6, 7 and 8 of The Design and Implementation of the 4.4BSD Operating System (by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman, Addison-Wesley publishers).

September 19 : File System Integrity

The second class in the file systems series will focus on file system integrity, which is a requirement that post-crash file systems be in a recoverable state. The importance of this topic cannot be over-stated, as this is what ensures that storage persistence is actually valuable. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe two file system implementation techniques that offer alternate views on this problem. Both assume that a small window of vulnerability for new data is acceptible to users. The first assumes that large caches will capture all reads and thus optimizes for writes; note that, like most big ideas, the merits of LFS have been the source of heated debate. The second eliminates the historical expense of synchronous I/O operations by enabling integrity-maintaining write caching for metadata.

Suggested readings for this class meeting are the same as for the last.

September 24 : Distributed File Systems

The third class in the file systems series will focus on distributed file systems, which are file systems whose functionality is split between client machines and file servers that are connected via some kind of network. Topics touched on will include client/server organization, how it works (RPC), how it differs from the local file system case, cache coherence, concurrency control, data consistency, performance enhancement, scalability of clients supported. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe distributed file system designs. The first is a landmark paper that describes AFS and how its design enhances scalability. The second exploits the "on the Internet, nobody knows you're a dog" nature of file server interfaces -- specifically, it specializes the server implementation to get higher performance, easier management, and lower complexity.

For additional background on distributed file systems, a couple of good sources are Chapter 5 of Distributed Operating Systems (by A. Tanenbaum, Prentice Hall Publishers) and Chapter 9 of The Design and Implementation of the 4.4BSD Operating System (by M.K. McKusick, K. Bostic, M. Karels and J. Quarterman, Addison-Wesley publishers). For further reading, we suggest many of the papers referenced in the papers you will read for the next class meeting (NASD, xFS).

September 26 : Distributed Storage Services

The fourth class in the file systems series will focus on distributed storage services, which include file systems whose server-side functionality is distributed among several file servers. Topics touched on will include partitioning and coordination, naming and resource discovery, reliability/availability, security, data consistency, dealing with scale. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe distributed storage service designs. The first describes Network-Attached Secure Disks (NASD), which is an architecture designed to cost-effectively deliver scalable storage bandwidth via a particular partitioning of functionality among clients, servers, and network-attached storage devices. The second describes an ambitious cluster storage system design (xFS) that is meant to avoid having any central point of failure or performance bottleneck.

There isn't a particularly good source of background reading on this topic, but the assigned readings do identify a number of other system designs. We suggest following up on papers referenced therein as a start toward digging deeper. To gain an appreciation for the benefits and difficulties of "multi-device" storage subsystems (even when there is a centralized controller), we suggest RAID: High-Performance, Reliable Secondary Storage (by P. Chen, et. al, in ACM Computing Surveys, June 1994).

October 1 : Communication Models

Having plowed through a concrete set of advanced OS and basic distributed systems examples, we now start looking at aspects in a more general light. Our first such aspect is the communication mechanism among components of a distributed system. The focus is not on the particular network protocol, but on how the communication is viewed by the parts and how it all hangs together. Some particular topics for today include remote procedure call (RPC), implementing RPC, data streams, load balancing, and work stealing. As always, along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today highlight particular communication models. The first is a classic paper discussing implementation details involved with RPCs, which are the foundation of most distributed systems. The second is a more recent paper that describes a (possibly) more appropriate mechanism for data-intensive cluster applications.

For both background reading and further reading, we suggest looking at general distributed systems books, such as Tanenbaum's Distributed Operating Systems, Coulouris, et al.,'s Distributed Systems, or Mullender's Distributed Systems.

October 3 : Function Placement

An important topic in distributed systems in function placement, including both deciding what to run where and instantiating these decisions. Issues involved with deciding what should run where include inter-function communication, parallelism, load balancing, and security. Instantiating these decisions involves communication issues discussed in previous class meetings. In some cases, it also includes moving functionality from one place to another, which introduces the topic of mobile code and such issues as execution environment heterogeneity, virtual machine environments, re-binding, and encapsulation/isolation for protection. In examining these issues, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe systems with particular approaches to making and instantiating function placement decisions. The first describes Emerald, a system that realizes fine-grained mobility by encapsulating data and threads into mobile objects. The second describes Abacus, a system that dynamically adjusts function plaement decisions based on black-box monitoring of runtime performance.

There is not a lot of background literature about function placement (check out the papers' Related Work sections to dig further), but there is a huge amount of literature on mobile code systems. One interesting place to look is at UMBC's Agent Resources website. Another is the Object Management Group's website.

October 8 : Security Dilemmas

Computer system security is a topic of growing importance (and great confusion). This class meeting and the next will explore the basics of computer system security as it relates to operating systems and distributed systems. Topics for this meeting will include what security is really all about and the dilemmas faced by system designers, implementers, and administrators. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today discuss interesting aspects of computer system security. The first is a brief lecture given by Ken Thompson when accepting his Turing Award; it makes an interesting point about trust and security. The second discusses why cryptography does not solve the security problem, by drawing on real-world examples, and some consequences of this fact. The third describes the notorious Internet Worm and some lessons regarding the real problems in computer security; sadly the lessons haven't changed much 12 years later.

There is a huge and growing literature on security. Two excellent overview books are Secrets and Lies: Digital Security in a Networked World by Bruce Schneier (published in 2000 by John Wiley and Sons, Inc.) and Security Engineering: A Guide to Building Dependable Distributed Systems by Ross Anderson (published in 2001 by John Wiley and Sons, Inc.). The former is a light read, and the latter is a serious textbook.

October 10 : Security Mechanisms

Computer system security is a topic of growing importance (and great confusion). This class meeting complements the previous one by discussing mechanisms available to mitigate security dilemmas, such as the role of cryptography, assurrance, trust, firewalls, intrusion diagnosis, and blame. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today discuss interesting aspects of computer system security. The first is the classic description of how to build a secure computing system. It is a bit dated (and therefore doesn't include some recent problems/solutions), but is quite comprehensive in the fundamentals. The second (much shorter) discusses some wild ideas regarding a future form of distributed security.

Suggested readings for this class meeting are the same as for the last.

October 15 : Writing Systems Papers

Great computer systems researchers must be great writers. For better or worse, presentation quality determines the destiny of a research paper nearly as much as the technical content. If your readers cannot understand (or cannot maintain consciousness while reading) what you have written, how can they be expected to appreciate your brilliance?? Today's class meeting will focus on what makes a good systems paper, from outline to final polish. Along the way, we will discuss several types of papers.

The one paper assigned for today hammers the systems community by telling them what the papers submitted to SOSP in 198X should have looked like. It is focused mainly on issues of content.

There are various web pages dedicated to giving advice about written and spoken communication, including some dedicated to computer systems researchers.

October 17 : Exam #1

In-class exam.

October 29 : Concurrency, Threads, Transactions

One of the most basic (and yet persistently complex) aspects of both operating systems and distributed systems is dealing with concurrent threads of control. While your undergraduate OS course undoubtedly spent a lot of time talking about monitors and semaphores, there is a lot more to concurrency than that. This class meeting will discuss a number of such topics, including locking, avoiding deadlock, optimistic concurrency control, avoiding livelock, leases, and transactions. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today differ in what they offer. The first describes experiences with thread programming, offering a number of insights from serious usage. The second describes an optimistic approach to concurrency control, which allows locks to be completely avoided.

For additional background on thread programming, we suggest An Introduction to Programming with Threads (A. Birrell, DEC Technical Report TR-35, DEC/SRC, January 1989; the instructors will be happy to provide you with a copy), Principles of Concurrent and Distributed Programming (by M. Ben-Ari, Prentice Hall Publishers) or Programming with POSIX Threads (by D. Butenhof, Addison-Wesley Publishers).

October 31 : Event Ordering and Multi-Party Consensus

The nature of distributed systems is such that autonomous systems don't see the same things, don't see things at the same time, and can't even easily agree on what time it is. These problems make it difficult for each to reason about what the others are up to and makes it difficult for systems to come to consensus. This class meeting will discuss various issues of event ordering and coming to consensus. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today discuss different aspects of this general area of problem. The first discusses the problems of event ordering and the distributed clock synchronization. The second discusses the problem of coming to consensus when some systems misbehave.

Much of the theoretical side of distributed systems is more or less focuses on these two problems. Thus, many good distributed systems books address them thoroughly, including those on the reserved list.

November 12 : Group Communication and State Machines

As we've discussed, one reason for replacing centralized services with distributed systems is fault tolerance. Dealing with both clean (fail-stop) failures and misbehaving systems (e.g., compromised systems) can be done by having multiple systems doing exactly the same thing in lockstep. This class meeting will look at approaches and technologies for accomplishing such function replication. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe this method in general and in a particular application, distributed file service. The first provides a tutorial-like overview of replicated state machines. The second describes a fault-tolerant NFS replacement that employs some well-tuned algorithms.

The first paper actually provides some pretty good background discussion, and we encourage those looking for further reading to follow up the Related Work sections.

November 14 : Authentication and Authorization

Central to enforcing any form of security is verifying a user's digital identity (not necessarily a one-to-one correspondence with their real-world identity) and deciding whether or not an activity is authorized. Both of these become even more difficult in today's distributed computing environments. In this class, we'll discuss the main steps (identification, authentication, and authorization) and approaches to making them in a local system and across the network.

The papers assigned for today describe cool ideas in authentication and authorization. The first describes an approach to combining biometrics with passwords to improve authentication of claimed digital identities. The second describes a refinement to the classic "Logic of Authentication" that has been the soul of formalized, distributed authentication systems for years.

The security books listed earlier provide some background on this sub-topic as well, as do the bibliographies of the readings.

November 16 : Middleware and Publish/Subscribe

November 19 : Privacy and Censorship-Resistence

One of the most heated topics for the future will center on issues of privacy, anonymity, and censorship-resistence. Although there are many questions of ethics and law, there are also difficult technical issues related to making these possible at all. Most technological advances (faster computing, more storage, better face recognition, smarter data mining ) march against these historical givens, as we are rapidly closing in on the complete set of technologies needed to create Big Brother. During this class period, we will talk about a few of the ideas being explored to help hold back the tide. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe insights from two specific projects. The first describes Publius, which is a decentralized storage service that distributes document storage responsibilities among many independent parties. The second describes experiences with operating an e-mail pseudonym server, providing an interesting look at the practical difficulties faced.

The security books listed earlier provide some background on this sub-topic as well, as are the bibliographies of the papers. One interesting piece of background is Ross Anderson's "The Eternity Service", which focused many people's attention on the value of not losing the anonymous, censureship-resistent publication capability that has historically existed (and been so significant).

November 26 : Fault Tolerance and System Structure

Fault-tolerance is a broad and important topic in operating systems and distributed systems. This class meeting will discuss a range of issues and approaches, complementing previous discussions. Some topics to be discussed include data redundancy, failover, checkpointing, state machines, N-versioning, fault containment, isolation, and failstop versus byzantine failures. Along the way, we will look at several case studies and discuss (yes, you too) their pros and cons.

The papers assigned for today describe fault tolerance strategies for particular systems. The first describes fault management in Quicksilver, which relied heavily on transactions and their recoverability properties. The second describes Hive, a cellular operating system architecture for large multiprocessor systems.

Fault-tolerance is another of the large areas of research and practice. For example, see the IEEE Technical Committee on Fault-Tolerant Computing. For some interesting and scary stories of insufficiently fault tolerant systems and their consequences, see Neumann's Computer Related Risks or Peterson's Fatal Defect.

November 28 : Project Status Presentations (Groups 1-6)

In class presentations, by the students doing them, of projects and their midpoint status.

December 3: Project Status Presentations (Groups 7-11)

In class presentations, by the students doing them, of projects and their midpoint status.

December 5 : Exam #2

In-class exam.

December 10: Futures

As with so many aspects of computer systems, changes in technology and applications changes the issues faced in operating systems and distributed systems. In this class meeting we'll discuss some of these cool new developments and some of their consequences.

The papers assigned for today describe visions for future computing. The first describes ubiquitous computing.

For additional material on ubiquitous computing, see Mark Weiser's Ubiquitous Computing web pages.

December 12 : Project Poster Session

The final day of classes will be a poster session, with one poster for each group's class project. This will allow you all to show off your great work to each other and other interested CMU people.

Last modified: Thu Nov 29 15:13:12 Eastern Standard Time 2001