18-845 Group Project (GP)

Important dates

  1. Thu Feb 22 (1:30pm): GP abstracts due
    • Email your abstract to droh@cs.cmu.edu

  2. Thu Mar 22 (1:30pm): GP mid-term oral reports due
    • Make an appointment to meet with your TA.

  3. Thu Apr 26 (1:30pm) : GP reports due
    • Email your report to droh@cs.cmu.edu

  4. Tue May 1 (1:30pm) : GP reviews due
    • Email your reviews (one file per review) to droh@cs.cmu.edu

  5. Thu May 3 (1:30pm): GP poster session, Location HH 1112
    • Prepare eight or nine letter sized sheets

  6. Sun May 6 (11:59pm): Final camera-ready GP reports due
    • Email your report to droh@cs.cmu.edu

1. Instructions for Preparing Your GP abstracts

  • The abstract must contain the following parts:
    • Title
    • Authors
    • One or two paragraphs describing what you aim to do
    • A paragraph describing the expected result (What do you hope to learn? What conclusions do you hope to draw?)

2. Instructions for Delivering Your GP Mid-Term Oral Report

  • Your TA will meet with you March 22 or March 23 to learn about your progress on the project and to give you guidance and encouragement. This is an informal face-to-face meeting. You are not required to prepare any written material. Both group members must attend.

3. Instructions for Preparing and Submitting Your GP Reports

  • Reports are limited to 10 pages.
  • Reports must follow the official ACM Proceedings format. Use the Word or Latex templates provided by ACM.
  • Reports must include the following parts:
    • Abstract - A paragraph that summarizes the problem and the results.
    • Introduction - Sets the context, describes the problem, and describes your solution.
    • Description - One or more sections that describes the problem and your approach to the solution in detail.
    • Evaluation - A section that quantitatively evaluates your ideas.
    • Related work - Describes work related to your work.
    • Summary and Conclusions - Summarize what you did and what interesting things you learned from the project.
  • Send your reports to droh@cs.cmu.edu.

4. Instructions for Reviewing Your Classmates' Reports

  • Each project report will be reviewed (anonymously) by three of your classmates. Thus, every student will review three reports.
  • Your instructors will evaluate the quality of your reviews as part of your overall project score.
  • Use the same review template you used for your critiques.
  • Send each review as a separate file attachment to droh@cs.cmu.edu.

5. Instructions for Preparing Your Poster Presentations

  • I'll bring 32" x 40" foam core poster boards, easels, and push pins for you to attach your hardcopies to the poster board. Each group will have their own poster board and easel.
  • A poster board can hold 8 letter-size (8 1/2 x 11) pieces of paper in landscape format, or 9 pages in portrait format.
  • You should prepare an 8 or 9 page presentation and bring the hardcopies with you to class. One of these pages must be a title page with the title of your project and your names.
  • I'll order some good food and drinks for us.

Hints for Coming Up with a Topic

If you are currently working on a masters or Ph.D. thesis, you are encouraged to to pursue a topic that is directly related to your thesis research. It's OK (ideal in fact!) to use the group project as a way to make progress on your thesis.

There are two basic approaches you can use for your group research projects.

  • Develop a new idea or a new twist on an existing idea, and then do enough evaluation to serve as a proof of concept.
  • Do an extensive evaluation of an existing idea that gives you some insight into the advantages or disadvantages of that idea.

Here are examples of some projects from previous years:

Here are some possible ISR-related projects:

  • Exploring opportunistic state copying. Two basic techniques for dealing with low bandwidth are prefetching (copying parcel state from the server to the client before it is needed), and preputting (copying dirty parcel state from the client to the server before the parcel is checked in). The disadvantages of these approache are that they can consume already scarce network bandwidth. This project would develop a system for opportunistic prefetching and preputting that would limit these operations to those times when the network is quiescient.

  • Adding CAS everywhere.The current version of ISR stores disk state on both the content server and the local cache in a separate 2-level directory tree for each parcel, where the leaves of the tree are encrypted 128 KB chunks. Another approach is to replace the individual parcel trees on both the client and server with a single flat content addressable storage (CAS) where the file name for each chunk is its SHA1 hash. The goal is to reduce total storage requirements and to simplify parcel management. The resulting system would also be an ideal candidate for a p2p system based on consistent hashing.

  • Combining ISR with portable storage devices.. One can imagine a number of interesting ways to combine ISR and portable storage devices.
    • Pocket ISR. In this scenario, a complete client ISR system (Knoppix Linux+VMWare+ISR) runs on a bootable 20-40 GB USB hard drive that fits in a shirt pocket (Pocket ISR). You can turn any machine into your personal ISR machine by plugging a Pocket ISR drive into the USB port and then booting from USB. Parcels reside permanently on the content server, but partial or complete parcel state can be cached on the Pocket ISR disk.
    • CD ISR. A variant of the Pocket ISR idea is to remaster the Knoppix CD to include VMWare and the ISR client. When the CD boots, it mounts a partition on the host hard drive, and then uses this device to hold all of the local parcel state.
    • LKA cache. A USB dongle could be used as a lookaside (LKA) cache that the ISR client checks before going to the network. Similarly, at checkin time the client could write dirty state to the LKA instead of sending it over the network.

  • Dealing with dirty state and low bandwidth.. One problem with the current ISR design is figuring out what to do with dirty state that is created when there is little or no bandwidth back to the content server (such as at home with a 30 KB/s cable modem upstream bandwidth). Portable storage over sneaker net is one solution, but the ISR model needs to be changed to handle the possibility. In particular, one would need the ability to transport local client state to a new machine in the middle of an ISR session.

  • Dealing with dirty state that isn't really dirty.. One of the significant changes that the ISR model introduces is that some disk state is dirtier than other state. For example, if an application on the guest creates a temp file, writes to it, and then deletes it, then dirty state is created on the disk and in the current ISR model, must eventually be uploaded to the content server, even though to the guest operation system the creation of that dirty state was a no-op. An extreme example of this is defragmenting the disk! Similarly for data that has been cached on disk by applications or operating system on the guest. In this project, you will investigate the causes of false-dirty data and develop techniques for minimizing the amount of it that needs to be uploaded.

  • Building an ISR friendly guest file system. Another way to deal with false-dirty data is at the file system level. This project would modify an existing file system such as Linux's Ext2 fs to minimize the size of compressed ISR parcels when they are transferred between client and server. Techniques might include zeroing deleted files and blocking disk sectors so that related sectors are in the same chunk files.

  • Evaluating the effectivness of memory and disk ballooning. A recent paper (Optimizing the Migration of Virtual Computers, OSDI '02) describes an ISR-like system for migrating virtual machines. In the paper, they introduce a technique, called ballooning, for reducing the size of virtual machine memory images. The basic idea is to run a helper process on the guest that allocates and zeros memory blocks, relying on the page replacement mechanism of the guest OS to steal pages that are not part of the working sets of other processes. The result is a memory image with an increased number of highly compressible zero pages. ISR currently does not do ballooning. The evaluation in OSDI is a good start, but by no means convincing. The project would design, implement, and evaluate ballooning in ISR on representative workloads (designed and codified by the student team), concluding with a firm recommendation about whether to include the technique in a later ISR release.

  • Minimizing the size of downloaded memory images. Another way to deal with false-dirty data is at the file system level. This project would modify an existing file system such as Linux ext2 to minimize the size of compressed ISR parcels when they are transferred between client and server. Techniques might include zeroing deleted files and blocking disk sectors so that related sectors are in the same chunk files.

  • Software distributions in parcels. Unix machines in the CS and ECE environments have access to shared software distributions in the /usr/local directory. This mechanism allows each Unix machine in the environment to have access to a uniform up-to-date set of software programs that are updated nightly. Windows uses a different mechanism, storing the software installation scripts on a server, and relying on each user to download and install the software they want for their system. The goal of this project is to develop techniques for providing shared up-to-date software distributions on Linux and/or Windows parcels.

  • Writing a Windows version of ISR. The ISR client software currently runs on Linux 2.4 and 2.6 machines. This project would build a version of ISR that runs on Windows. The main task would be writing the equivalent of a Linux loadable kernel module (device driver) that interposes VMWare requests to a non-existant raw disk and passes those requests up to a user level process that interprets those requests.

Here are some other ideas for topics (in no particular order):

  • Internet host counting. Perform your own Internet Domain Survey and compare it to the published survey at www.isc.org.

  • Peer-to-peer systems. There are a number of interesting unresolved issues for peer-to-peer systems such as Freenet. What are the performance bottlenecks? How anonymous are Freenet objects really? How can Freenet peers be attacked and defended? How to delete and update documents? How to invalidate cached copies of updated documents? How to resolve the fundamental tension between privacy and accountability? How to name objects? How to search for objects in systems that are designed to make it impossible to identify the origin server for any particular document? Are there better approaches for caching and replicating objects through the network.

  • Peer-to-peer distributed keyword search. Develop a peer-to-peer distributed search engine. Issues: Tradeoffs between security and convenience, scaling outside the local area, searching on metadata as well as contents.

  • Peer-to-peer content publishing systems. There are a number of interesting unresolved issues for anonymous content publishing peer-to-peer systems such as Freenet. What are the performance bottlenecks? How anonymous are Freenet objects really? How can Freenet peers be attacked and defended? How to delete and update documents? How to invalidate cached copies of updated documents? How to resolve the fundamental tension between privacy and accountability? How to name objects? How to search for objects in systems that are designed to make it impossible to identify the origin server for any particular document? Are there better approaches for caching and replicating objects through the network.

  • Monitoring in a non-cooperative environment. Stefan Savage at UCSD has developed a powerful technique for estimating end-to-end bandwidths and packet-loss between hosts, where the remote host is not cooperative in the sense that it would be impossible to get an account on the machine (e.g., the Yahoo server). Savage's approach is to exploit the behavior of TCP (which all servers must implement to the specification) to gain information about the effective bandwidth from the server to the client. For this project, you might apply this general idea in some new context, or use Savage's method for estimating packet loss in the context of a larger application. For example, would it be possible to use Savage's technique to build a client-side performance monitoring system that, for a given HTTP transaction, would isolate the network transmission time from the server processing time and determine which is the bottleneck?

  • Attaching geographical locations to IP addresses in the context of a world-wide disaster monitoring system. When natural disasters such as earthquakes occur, it is very difficult to make accurate estimates of the geographical extent and severity of the damage because the communication infrastructure disappears. However, hosts that provide Internet services are always turned on, so the lack of response from those systems contains some information. The idea is to build a system that would sample hosts in earthquake prone regions on a continual basis. Each sample is a bit vector, one bit per host. Some interesting issues are assigning IP addresses to geographical locations, developing a hierarchical scheme to aggregate response bit vectors, and developing analysis techniques of the response bit vectors to distinguish transients (e.g. localized power failures or normal host downtime) from real damage.

  • Scalable search engines. Current search engines are not scalable because all of the work is done at the remote server site. As a result, the servers are not able to perform much computation when they satisfy a request, typically a quick lookup of an inverted index. As a result, single-word queries, which directly index the database on the server, typically work pretty well, but multiple word queries often give poor results. The idea here is to investigate the following question: Can we improve the performance of search engines such as Google by doing some additional work on the client?

  • Performance evaluation of content distribution networks. Content distribution networks such as Akamai claim to significantly reduce the latency of Web page downloads. The idea here is to evaluate and quantify the performance benefits of the Akamai service. When does it help? When does it not help?

  • Defending against dDOS attacks. Distributed denial of service attacks are somewhat scary and difficult to defend against. The idea here would be to survey the existing approaches, identify strengths and weaknesses, and propose and evaluate some improvement.

  • Evaluation of performance issues in high-speed non-blocking servers. The idea here is to build a high speed server that never blocks on I/O (such as the Flash server from Rice) and then do extensive micro-evaluation of its performance in order to understand the extent of the performance gain that is possible from such non-blocking servers.

Last modified: Wed Jan 31 13:14:40 EST 2007