Teaching Cache Coherence Protocols
with Model Checking

Brian T. Gold
bgold at cmu.edu
January, 2009

About this Lab

Designing your own cache coherence protocol is a great way to learn about this challenging topic in computer architecture. This lab has you design and verify your protocol using Murphi, a model checking tool from David Dill's group at Stanford. The material on this page is meant to supplement lecture notes explaining the basics of cache coherence.

History and Usage

This lab assignment grew out of the Multiprocessor Computer Architecture course taught at Carnegie Mellon in the ECE department (course: 18-742). I believe Prof. Andreas Nowatzyk wrote the original version of this when he taught the course in Spring 2004, when I was the TA. Originally, we asked students to implement a baseline 3-hop protocol along with an optimization (e.g., speculative reads) of their choosing. We gave students the documentation for Murphi but didn't give them any other starting point for writing their code. The assignment typically ran two weeks.

I found that many students were stumbling over the model checker itself, especially given that we need some of Murphi's more advanced features (e.g., symmetry reduction) to get tractable state spaces. In the version on this page, I'm giving students a minimal framework that handles the messaging aspects of the lab. All they have to do is implement the actual coherence protocol.

We've used this assignment in courses with great success, but I've also given it to new graduate students to help them learn the coherence protocols we use in our research. Eric Chung took this quite far: he learned the protocols using this method and within a semester had implemented an industrial-strength coherence protocol and controller in an FPGA. This led to a WARFP workshop paper.

Why Model Checking?

The real question is, "What is so hard about cache coherence protocols?" Answer: the corner cases. It is relatively easy to think through the common-case operations, but thinking through all possible races is very tough, especially for aggressive protocols that attempt to minimize the number of messages (hops) per transaction.

Model checking aids protocol designers by forcing them to handle all possible races. If you forget to handle a race, or handle it incorrectly, the checker will show you a counter-example from which you can work out how to fix your design. The complexity of coherence-protocol races does not come across well in lecture notes; instead, we've found that model checking is a good way to get students designing their own protocols and learning about common mistakes along the way.

Comments, complaints?

Please send me email (bgold at cmu.edu). Thanks!


Cache Coherence Protocol Lab

Overview

In this assignment, you will design and verify a cache coherence protocol for a multiprocessor system. To help get you started, we'll give you a basic framework for implementing your protocol, but you will have to work out the interesting parts yourself. You will be working in the Murphi model-checking framework, which will allow you to specify your protocol and verify its correctness. We'll provide some notes on Murphi for this assignment and give you pointers to the complete documentation.

Protocol Design

You will design and verify an invalidation-based cache coherence protocol. The protocol you develop must have the following characteristics:
  1. It uses an interconnect network that supports only point-to-point communication. All communication is done by sending and receiving messages. The interconnect network may reorder messages arbitrarily. It may delay messages, but it will always deliver every message eventually. Messages are never lost, corrupted, or replicated. Messages cannot be assumed to arrive in the order they were sent, even for the same sender and receiver pair.

  2. At the receiving side of the interconnect system, messages are delivered to a receive port. Once a message has been delivered to the receive port, it blocks all subsequent messages to that port until it is read. Consider this behavior equivalent to that of a mailbox with room for only one letter: you have to remove the letter from the mailbox before you can receive the next one. On the sending side there is no such restriction: you can always send messages, because the interconnect system has enough buffer space to queue them. (A minimal Murphi sketch of this interconnect model appears after this list.)

  3. For the purpose of this assignment, you may assume that there is no limit on the buffer space in the interconnect system. However, your protocol will be considered broken if there is a way to generate an unbounded number of undelivered messages. Moreover, you will not be able to verify your protocol in that case, because the state space becomes infinite.

  4. You may assume that the interconnect network supports multiple lanes (i.e., virtual channels). For each lane, each unit has a separate set of send and receive ports. Traffic on one lane is independent of traffic on the other lanes, and messages never switch lanes. Note that a protocol that needs fewer lanes is considered the better design.

  5. Each processor has a dedicated cache that is not shared with any other processor. All caches must be kept coherent by your cache coherence protocol. Processors may issue load and store operations only. Because this assignment deals only with cache coherence and not with consistency issues, you will be concerned with only one storage location (address). However, you do need to model cache conflicts. To do so, model a third operation besides load and store: the cache write-back. Write-backs normally arise from a cache conflict when the old line is dirty, and they may occur at any time between any pair of load/store operations. If the line is clean, you may simply set it to invalid or take whatever action your coherence protocol requires; replacements of dirty lines must obviously write the line back to memory.

  6. You should assume that the coherence unit is equal to one word and that all loads and stores read or write the entire word.

  7. Besides the processors with their caches, there is one memory unit in your system. The memory unit has a directory-based cache-coherence controller that ensures only one processor can write to the memory block at a time (an exclusive-ownership style protocol). The directory representation is unimportant for this assignment: you should assume a full directory (bit vector) that can track all sharers.

  8. The interconnect system can send messages from any unit to any other unit. It is OK if your protocol requires a cache controller to send a message directly to another cache controller.

For this assignment, your cache coherence protocol should not worry about consistency issues. Because of that, you may assume that the memory of this machine has only one word. Your protocol has to make sure that loads from up to three (3) processors always return the value of the most recent store. In this context, this means that loads and stores issued by one processor are seen by that processor in program order.
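
To make these requirements concrete, here is a minimal sketch (not the framework we distribute) of one way the interconnect could be modeled in Murphi: each destination gets an unordered multiset of in-flight messages, and a nondeterministic delivery rule captures arbitrary reordering and delay. All names in the sketch (ProcCount, NetMax, Node, GetS, Send, and so on) are hypothetical placeholders, not part of the assignment.

  -- A minimal sketch of an unordered, point-to-point network model in Murphi.
  -- All names are hypothetical; the framework we distribute differs in detail.

  const
    ProcCount: 3;    -- up to three processors (see the requirements above)
    NetMax:    12;   -- bound on undelivered messages; keeps the state space finite

  type
    Proc: scalarset(ProcCount);    -- scalarset enables Murphi's symmetry reduction
    Home: enum { HomeNode };       -- the single memory/directory unit
    Node: union { Home, Proc };

    MsgType: enum { GetS, GetM, PutM, Data, Inv, Ack };   -- placeholder message kinds

    Message: record
      mtype: MsgType;
      src:   Node;
    end;

  var
    -- One unordered buffer of in-flight messages per destination.  Delivery picks
    -- an arbitrary element, which models arbitrary reordering and delay; messages
    -- are never lost or duplicated.
    Net: array [Node] of multiset [NetMax] of Message;

  procedure Send(dst: Node; t: MsgType; s: Node);
  var m: Message;
  begin
    m.mtype := t;
    m.src   := s;
    MultisetAdd(m, Net[dst]);
  end;

  -- Delivery: nondeterministically pick any pending message for node n.  The
  -- one-slot receive port ("mailbox") of point 2 would be an extra per-node
  -- variable that this rule fills and that your protocol rules then drain.
  ruleset n: Node do
    choose idx: Net[n] do
      rule "deliver a message"
        true
      ==>
      begin
        -- protocol-specific handling of Net[n][idx] goes here
        MultisetRemove(idx, Net[n]);
      endrule;
    endchoose;
  endruleset;

Bounding the network with a constant like NetMax is what keeps the state space finite; a protocol that can exceed any such bound exhibits exactly the unbounded-messages failure described in point 3.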

You are supposed to write a plain, directory-based baseline cache-coherence protocol with no optimizations other than forwarding of invalidations and exclusive ownership. In other words, your baseline protocol should use no more than 3 hops for any transaction. For example, if processor 3 (P3) has exclusive ownership and processor 1 (P1) issues a load, then P3 is supposed to send the cache line both to memory and to P1 (forwarding). This minimizes latency.

The baseline protocol shall always deliver data in the state needed by the requesting processor. In other words, do not bother speculating about supplying data in an exclusive state for an ordinary load; exclusivity is always a consequence of a store. Therefore, you need only three cache states: I = invalid, S = shared (read-only), and M = modified (exclusive and dirty). The memory unit can be regarded as a home node without a processor, so it will never do anything on its own. For example, it will never issue an unsolicited recall request.
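
As a second hedged illustration, and again not the required solution, the sketch below shows how the three stable states and a spontaneous write-back might be declared, along with the kind of invariants Murphi checks on every reachable state. The variable names (CacheSt, CacheData, MemData) and the value type are hypothetical.

  -- A sketch of the stable cache states and example invariants.
  -- All names are hypothetical; adapt them to the framework you are given.

  const
    ProcCount:  3;
    ValueCount: 2;    -- two distinct values are enough to expose stale-data bugs

  type
    Proc:  scalarset(ProcCount);
    Value: 1..ValueCount;
    CacheState: enum { Invalid, Shared, Modified };   -- the three stable states I, S, M

  var
    CacheSt:   array [Proc] of CacheState;
    CacheData: array [Proc] of Value;
    MemData:   Value;

  -- A dirty line may be written back at any time (point 5).  In your protocol this
  -- must happen via messages to the memory unit; here the net effect is collapsed
  -- into one atomic step purely for illustration.
  ruleset p: Proc do
    rule "write back a dirty line"
      CacheSt[p] = Modified
    ==>
    begin
      MemData    := CacheData[p];
      CacheSt[p] := Invalid;
    endrule;
  endruleset;

  -- Single-writer property: no two caches hold the line in M at the same time.
  invariant "at most one modified copy"
    forall p: Proc do
      forall q: Proc do
        (p != q) -> !(CacheSt[p] = Modified & CacheSt[q] = Modified)
      endforall
    endforall;

  -- Shared copies should agree with memory.  Depending on how you model transient
  -- states and in-flight data, you may need to qualify when this is checked.
  invariant "shared copies match memory"
    forall p: Proc do
      CacheSt[p] = Shared -> (CacheData[p] = MemData)
    endforall;

The first invariant is the single-writer property; the second is the sort of value property whose violation produces the counter-example traces described earlier.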

Getting Started

To complete this lab, you will need to download Murphi 3.1. Instructions for building Murphi can be found in the tar-ball, along with a complete language and user manual. In addition to Murphi itself, we are also distributing a minimal framework that handles the messaging aspects of this assignment. Comments in the code explain what's there and where to insert your protocol code.

Deliverables and Deadlines

(redacted from web version... usually 2 weeks, turn in all code and a short writeup)


Sample Solutions

No peeking! But just in case, here are sample protocols.

Basic 3-hop protocol

The first example is a basic 3-hop protocol designed according to the description above.

Adding Home Node optimizations

The second example adds the following: the home node (directory) can now issue read and write requests, but as an optimization, its state is not kept in the directory. This allows the home node to write to the address without updating the coherence state, so long as the state is Invalid. For private data this can be a big improvement in latency, as we don't have to wait for the coherence controller just to access local data.

This code models the protocol used in Flexus, the simulator used in our research group. The protocol itself was originally designed by Nikos Hardavellas, with input from Andreas Nowatzyk and others. I took Nikos' design and modeled it in Murphi using a framework similar to the one in this lab. The protocol diagrams PDF illustrates just how complex the transitions get in these protocols.


last updated: January 2009