Fault-Terminators
Fault-Tolerant Distributed Systems, Spring 2006
Team Members
Patty Pun - tpun@andrew.cmu.edu
Kevin Smith - kevinsmith@cmu.edu
Zhang Yi - zhangyi@cmu.edu
Felix Tze-Shun Yip - fty@andrew.cmu.edu
Project Info
Title
Mafia: Online Mobsters
Description
An implementation of the game 'Mafia' that requires instant messaging, character status maintenance, and a graphical user interface.
System Configuration
Middleware - EJB
Operating System - Linux
Programming Language - Java
Third-party Software
Ant (Code Compilation & Deployment)
CVS (Version Control)
Eclipse (Development Environment)
Javadoc (Source Code Documentation)
JBoss (EJB Implementation)
MySQL (Database Management)
Tomcat (Web Server)
XDoclet (Interface File Generation)
Baseline Design
Requirements
Allows a client to communicate with other clients
A client may only belong to one chat session
A server may support multiple chat sessions
Discussions take place using any web browser
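The session rules above (one session per client, many sessions per server) can be illustrated with a small in-memory registry. This is only a sketch: the class SessionRegistry and its methods join, leave, and sessionCount are hypothetical names chosen for illustration and are not taken from the project source.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one server-side registry holding many chat sessions,
// where each client may be a member of at most one session at a time.
class SessionRegistry {
    // client name -> session the client currently belongs to
    private final Map<String, String> sessionOfClient = new HashMap<>();
    // session name -> clients currently in that session
    private final Map<String, Set<String>> clientsOfSession = new HashMap<>();

    // Adds the client to the session. Returns false (and changes nothing)
    // if the client already belongs to some session.
    synchronized boolean join(String client, String session) {
        if (sessionOfClient.containsKey(client)) {
            return false;
        }
        sessionOfClient.put(client, session);
        clientsOfSession.computeIfAbsent(session, s -> new HashSet<>()).add(client);
        return true;
    }

    // Removes the client from its session, if it is in one.
    // A session with no remaining clients is dropped.
    synchronized void leave(String client) {
        String session = sessionOfClient.remove(client);
        if (session == null) {
            return;
        }
        Set<String> members = clientsOfSession.get(session);
        members.remove(client);
        if (members.isEmpty()) {
            clientsOfSession.remove(session);
        }
    }

    // Number of sessions that currently have at least one client.
    synchronized int sessionCount() {
        return clientsOfSession.size();
    }
}
```

A client that calls join while already in a session is refused until it calls leave, while the server keeps any number of sessions side by side.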
Interfaces
sendMessage();
receiveMessage();
submitVote();
murderVictim();
leaveGame();
enterGame();
gameMsgs();
youAreKilled();
Scenarios & Interactions
Message processing
Status update
Current Status
Creation of a single end-to-end case on the Windows operating system
Downloads
Binary Distribution
Documentation
Source Code Documentation
Fault-Tolerant Design
Scenarios & Interactions
Use of passive replication
Upon failure of the primary server, the replication manager begins redirecting clients to the backup server
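The redirect-on-failure behavior can be sketched roughly as follows. This is a simplified illustration, not the project's replication manager: the class SimpleReplicationManager, its methods, and the idea of keeping servers in an ordered queue are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a replication manager for passive replication:
// the head of the queue is the primary, the remaining entries are backups.
class SimpleReplicationManager {
    private final Deque<String> servers = new ArrayDeque<>();

    // serverNames is ordered: primary first, then backups in promotion order.
    SimpleReplicationManager(List<String> serverNames) {
        servers.addAll(serverNames);
    }

    // Name of the server clients should currently use, or null if none is left.
    synchronized String currentPrimary() {
        return servers.peekFirst();
    }

    // Called when a server has been detected as failed. The server is removed
    // from the queue; if it was the primary, the next backup becomes the new
    // primary. Returns the server clients should be redirected to, or null.
    synchronized String reportFailure(String failedServer) {
        servers.remove(failedServer);
        return servers.peekFirst();
    }
}
```

With servers listed as primary, then backups, reporting the primary as failed promotes the first backup, and clients asking for the current primary are directed there.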
Downloads
Binary Distribution
Documentation
Source Code Documentation
Fault-Tolerance Evaluation
Design Proposal (PDF)
Results & Analysis (PDF)
Client Invocations (PDF)
Raw Graph Data (158 Figures)
Raw Graph Data (tar.gz)
Raw Probe Data (tar.gz)
Real-Time, Fault-Tolerant Design
Real-Time Evaluation
Real-Time Evaluation Results (PDF)
High-Performance, Real-Time, Fault-Tolerant Design
High Performance Plan
Our real-time evaluation showed that failover from the primary server to the backup server accounted for over 90% of the recovery time for faulty invocations. Our goal in Phase IV is therefore to reduce failover time by optimizing the failover process.
One bottleneck in our failover mechanism is the replication manager, which takes a considerable amount of time to update its list, so the client spends too long waiting for the replication manager to provide a new server name. The other bottleneck is the creation of a new bean once the replication manager has provided a valid server name.
Both of these delays can be greatly reduced by always keeping a second bean ready on the client. When the client starts, it asks the replication manager for the name of the primary server and the name of the next backup server, then creates two beans, one pointing to each server. If the client cannot invoke a method on the primary server, it immediately begins using the secondary bean. Processing then continues as usual while, in the background, the client gets the name of the next backup server from the replication manager and creates a new secondary bean. The client therefore always has a secondary bean available if the primary server goes down. This assumes that the backup server does not go down before the primary; if it does, the delay is no worse than in our current setup. With two beans always available on the client, we can significantly reduce end-to-end latency in the presence of primary server failure.
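A minimal sketch of this two-bean approach is shown below. It is not the project's client code: Bean, BeanFactory, and ReplicationManager are hypothetical stand-ins for the EJB remote stub, the lookup and create step, and the project's replication manager. The sketch also assumes that a failed invocation is safe to retry on the backup.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: the client always keeps a standby bean prepared, so a
// failover only has to swap references instead of waiting on the replication
// manager and on bean creation.
class DualBeanClient {
    // Stand-in for an EJB remote stub.
    interface Bean { String send(String message) throws Exception; }
    // Stand-in for the lookup and create step for a given server.
    interface BeanFactory { Bean create(String serverName); }
    // Stand-in for the replication manager; returns null when no backup is left.
    interface ReplicationManager { String nextBackup(String currentServer); }

    private static final class Replica {
        final String name;
        final Bean bean;
        Replica(String name, Bean bean) { this.name = name; this.bean = bean; }
    }

    private final BeanFactory factory;
    private final ReplicationManager manager;
    private Replica active;
    private CompletableFuture<Replica> standby;

    DualBeanClient(String primaryName, BeanFactory factory, ReplicationManager manager) {
        this.factory = factory;
        this.manager = manager;
        this.active = new Replica(primaryName, factory.create(primaryName));
        this.standby = prepareStandby(primaryName);
    }

    // In the background: ask for the next backup name and create its bean.
    private CompletableFuture<Replica> prepareStandby(String activeName) {
        return CompletableFuture.supplyAsync(() -> {
            String name = manager.nextBackup(activeName);
            return name == null ? null : new Replica(name, factory.create(name));
        });
    }

    // Invokes the active bean; on failure, switches to the prepared standby,
    // starts preparing the next standby, and retries the invocation.
    synchronized String send(String message) {
        while (true) {
            try {
                return active.bean.send(message);
            } catch (Exception failure) {
                Replica next = standby.join(); // normally already completed
                if (next == null) {
                    throw new IllegalStateException("no backup server available", failure);
                }
                active = next;
                standby = prepareStandby(active.name);
            }
        }
    }
}
```

If the standby is not ready yet when the primary fails, join() simply waits for it, which corresponds to the "no worse than our current setup" case described above.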
Tips
JBoss and Java 5
For system evaluation, you may wish to make use of Java 5's System.nanoTime() method. Unfortunately, JBoss has difficulties working under Java 5. To get around this, you can delete the javax.management.* classes in your Java 5 installation. The following commands should accomplish this for you.
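# Work on a copy of rt.jar in a scratch directory
cd $JAVA_HOME/jre/lib
mkdir temp
cp rt.jar temp
cd temp
# Unpack the jar and drop the javax.management classes
jar xf rt.jar
rm -rf rt.jar javax/management/*
# Repack and overwrite the original rt.jar
jar cf rt.jar *
cp rt.jar ..
# Clean up
cd ..
rm -rf temp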
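Once JBoss runs under Java 5, System.nanoTime() can be used to time individual invocations. The helper below is a minimal sketch; the class InvocationTimer and its method timeNanos are hypothetical names, and the code uses lambda syntax from later Java versions for brevity.

```java
// Hypothetical timing helper built on System.nanoTime() (available since Java 5).
class InvocationTimer {
    // Runs the call once and returns the elapsed time in nanoseconds.
    // nanoTime() is only meaningful as a difference between two readings.
    static long timeNanos(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return System.nanoTime() - start;
    }
}
```

A measurement could then look like long elapsed = InvocationTimer.timeNanos(() -> client.sendMessage()), where client stands for whatever object issues the invocation being measured.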
SSH Environment Variables
SSH provides a convenient way to execute commands remotely, which is useful for 749 projects that need to start and stop servers and clients on other machines. To start the JBoss server on machine risk, for example, you could execute
ssh risk $JBOSS_HOME/bin/run.sh 2>&1 &
Unfortunately, a command run through ssh does not have access to all the environment variables you would normally have after logging into machine risk. You can, however, explicitly specify variable values by adding them to the file ~/.ssh/environment. This file is read by sshd on the machine where the command runs (risk in our example), and only if the PermitUserEnvironment option is enabled on that server. If you're running everything on ECE machines, your home directory is shared through AFS, so the same file is seen from every machine. Your file might look something like
JAVA_HOME=/usr/local/j2sdk1.4.2_02
JBOSS_HOME=/afs/ece/class/ece749/ejb/jboss-3.2.3
Documents & Downloads
Baseline Design
Binary Distribution
Source Code Documentation
FT Baseline Design
Binary Distribution
Source Code Documentation
FT Baseline Evaluation
Design Proposal (PDF)
Results & Analysis (PDF)
Client Invocations (PDF)
Raw Graph Data
Raw Graph Data (tar.gz)
Raw Probe Data (tar.gz)
Real-Time Evaluation
RT Results (PDF)
High-Performance Evaluation
HP Results (PDF)
Final Demo
Final Demo Binaries (tar.gz)
Other
Final Presentation (PDF)
Current Source Code Documentation