Team #4
18-749: Fault-Tolerant Distributed Systems
Spring 2006
Fault-Tolerant Design – Architecture
Replication Manager:
- This module is assumed never to fail and will not itself be replicated.
- It periodically heartbeats every server that should currently be running.
- If a heartbeat times out, the server will be killed and restarted on either the same or a different machine.
- The heartbeat period and the machines to be used for running servers will be placed in a configuration file that the manager loads.
- The replication manager will ensure that one replica is always ready or being started and that JNDI knows about this machine.
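The heartbeat rule above can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions, not the team's implementation: the class name `HeartbeatMonitor`, its methods, and the millisecond timeout are hypothetical, and timestamps are passed in explicitly so the logic can be followed without real sockets or timers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: tracks the last heartbeat reply per server replica. */
final class HeartbeatMonitor {
    private final long timeoutMillis;   // assumed to be read from the configuration file
    private final Map<String, Long> lastReply = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Register a server that should currently be running. */
    void watch(String server, long nowMillis) {
        lastReply.put(server, nowMillis);
    }

    /** Record that a watched server answered a heartbeat at the given time. */
    void recordReply(String server, long nowMillis) {
        if (lastReply.containsKey(server)) {
            lastReply.put(server, nowMillis);
        }
    }

    /** Servers whose last reply is older than the timeout; these would be killed and restarted. */
    List<String> timedOut(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastReply.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

A manager loop would call `timedOut` once per heartbeat period and restart each returned server on one of the configured machines.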
Fault Injector:
- The fault injector will kill processes by issuing a kill command to various systems.
- The injector will be able to operate in two modes:
  1) The user tells the fault injector which server to kill and how often.
  2) The fault injector randomly kills servers at some predefined interval.
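The two modes differ only in how the next victim is chosen, which the following sketch tries to show. It is a hypothetical illustration: `FaultInjectorPlan`, its factory methods, and the server names are assumptions, and the actual kill command and interval timer are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the two fault-injector modes; it only chooses targets. */
final class FaultInjectorPlan {
    enum Mode { MANUAL, RANDOM }

    private final Mode mode;
    private final List<String> servers;   // candidate servers (names are assumed)
    private final String manualTarget;    // used only in MANUAL mode
    private final Random rng;             // used only in RANDOM mode

    private FaultInjectorPlan(Mode mode, List<String> servers, String manualTarget, Random rng) {
        this.mode = mode;
        this.servers = servers;
        this.manualTarget = manualTarget;
        this.rng = rng;
    }

    /** Mode 1: the user names the server to kill. */
    static FaultInjectorPlan manual(String target) {
        List<String> one = new ArrayList<>();
        one.add(target);
        return new FaultInjectorPlan(Mode.MANUAL, one, target, null);
    }

    /** Mode 2: a server is picked at random each time the interval elapses. */
    static FaultInjectorPlan randomMode(List<String> servers, long seed) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("no servers to choose from");
        }
        return new FaultInjectorPlan(Mode.RANDOM, new ArrayList<>(servers), null, new Random(seed));
    }

    /** Server to kill on the next firing of the (omitted) interval timer. */
    String nextTarget() {
        if (mode == Mode.MANUAL) {
            return manualTarget;
        }
        return servers.get(rng.nextInt(servers.size()));
    }
}
```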
Server Beans:
- Server replicas will be stored under the same name ending in “-1” or “-2” to identify the different replicas.
- The server beans will be passively replicated by the replication manager.
- The server can be launched by the replication manager or manually by a user.
- The state stored in the beans will be kept to a minimum to ease replication issues.
Client:
- The client will talk to the JNDI server to get a reference to a server bean.
- If a server fails, the client will ask JNDI for a new server and retransmit the request.
- The client will maintain a transaction ID number that must be sent with every message and is incremented only after the server returns a response indicating that the request was processed.
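The client behaviour described above (look up a server, retransmit on failure, advance the transaction ID only after a processed reply) might look roughly like the sketch below. Everything named here is hypothetical: `RetryingClient`, the `Server` interface, and the `Supplier` that stands in for the JNDI lookup are illustrative assumptions, not the project's actual classes.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of the client's lookup, retransmit, and transaction-ID logic. */
final class RetryingClient {
    /** Stand-in for a remote server bean; a failed call throws. */
    interface Server {
        String handle(long transactionId, String request) throws Exception;
    }

    private final Supplier<Server> lookup;   // stand-in for asking JNDI for a server reference
    private final int maxAttempts;
    private long transactionId = 0;
    private Server current;

    RetryingClient(Supplier<Server> lookup, int maxAttempts) {
        this.lookup = lookup;
        this.maxAttempts = maxAttempts;
    }

    long transactionId() {
        return transactionId;
    }

    /** Sends one request, retransmitting with the SAME transaction ID after a failure. */
    String send(String request) {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (current == null) {
                    current = lookup.get();   // ask the naming service for a (new) server
                }
                String reply = current.handle(transactionId, request);
                transactionId++;              // incremented only after a processed reply
                return reply;
            } catch (Exception e) {
                last = e;
                current = null;               // force a fresh lookup before retransmitting
            }
        }
        throw new IllegalStateException("request failed after " + maxAttempts + " attempts", last);
    }
}
```

Because the ID is unchanged across retransmissions, a server that already processed the request can recognize the retry as a duplicate.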
Database:
- The database is assumed never to fail, so it is not replicated.
Fault-Tolerant Baseline
Summary: The fault-tolerant baseline currently uses a replication manager that automatically launches, heartbeats, and re-launches servers. This ensures that a certain number of servers are always running and that one of them is the primary. The replication manager and client are both able to handle a failure: the replication manager switches the primary and re-launches the failed server, and the client reconnects to the new primary. Currently, process crashes are handled. Node crashes are detected, but a re-launch may still be attempted on the crashed node. Duplicate detection is handled with transaction IDs for the non-idempotent functions.
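One common way to do duplicate detection with transaction IDs is for the server to remember the last ID it completed for each client and replay the stored reply when the same ID arrives again. The sketch below illustrates that idea only; `DuplicateFilter` and its in-memory table are hypothetical, and the source does not say where or how the team actually stored this information.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch: suppresses re-execution of a retransmitted non-idempotent request. */
final class DuplicateFilter {
    private static final class Entry {
        final long transactionId;
        final String reply;

        Entry(long transactionId, String reply) {
            this.transactionId = transactionId;
            this.reply = reply;
        }
    }

    // Last completed transaction per client (kept in memory here purely for illustration).
    private final Map<String, Entry> lastByClient = new HashMap<>();

    /** Runs the operation unless this client already completed the same transaction ID. */
    String execute(String clientId, long transactionId, Supplier<String> operation) {
        Entry last = lastByClient.get(clientId);
        if (last != null && last.transactionId == transactionId) {
            return last.reply;   // retransmission: replay the reply, do not run again
        }
        String reply = operation.get();
        lastByClient.put(clientId, new Entry(transactionId, reply));
        return reply;
    }
}
```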
Fault-Tolerant Baseline demo code available here
Fault-Tolerance Evaluation
Chief Experimenter: Jon Gray
Necessary Implementation Changes for Evaluation
- Create an automatic client written in Java that does the following:
  - Sends 10,000 requests, alternating between the Account View function and the Auction Create function, since these are representative of most actions in our system.
  - Implements all three of the client-side probes.
  - Allows for a constant and configurable inter-request time.
  - Accepts parameters describing the inter-request time and the expected reply size.
- Server changes:
  - Account View and Auction Create need to accept a parameter that tells them how big their response should be, in bytes.
  - These two functions also need to accept a parameter containing the client’s hostname so that it can be recorded in the server’s log files.
  - The four server-side probes need to be implemented so that the required information is recorded when these two functions are called.
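The request pattern of the automatic client can be sketched as a simple paced loop. This is a hypothetical outline: `AutoClientPlan`, the `sender` callback that would issue one request, and the choice to start with Account View are assumptions, and the probes are not shown.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of the automatic client's alternating, paced request loop. */
final class AutoClientPlan {
    enum Operation { ACCOUNT_VIEW, AUCTION_CREATE }

    static final int TOTAL_REQUESTS = 10_000;

    /** Even-numbered requests view an account, odd-numbered ones create an auction (assumed order). */
    static Operation operationFor(int requestIndex) {
        return (requestIndex % 2 == 0) ? Operation.ACCOUNT_VIEW : Operation.AUCTION_CREATE;
    }

    /** Issues the requests with a constant pause between them; returns how many were sent. */
    static int run(int totalRequests, long interRequestMillis, Consumer<Operation> sender) {
        int sent = 0;
        for (int i = 0; i < totalRequests; i++) {
            sender.accept(operationFor(i));
            sent++;
            if (interRequestMillis > 0 && i < totalRequests - 1) {
                try {
                    Thread.sleep(interRequestMillis);   // constant inter-request time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop early but keep the interrupt flag
                    break;
                }
            }
        }
        return sent;
    }
}
```

A full run would be `AutoClientPlan.run(AutoClientPlan.TOTAL_REQUESTS, interRequestMillis, sender)`, with the inter-request time taken from the client's parameters.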
Necessary Scripts
- Create a script that will run varying numbers of Java auto clients with different values for inter-request time and reply message size. Specifically:
  - Number of clients: 1, 4, 7, and 10
  - Size of reply: original, 256 bytes, 512 bytes, and 1024 bytes
  - Inter-request time: 0 ms, 20 ms, and 40 ms
- Create scripts in MATLAB to read and plot this data.
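Taken together, the three parameter lists give 4 × 4 × 3 = 48 test configurations. The sketch below enumerates them; it is written in Java only for illustration (the launch script itself is planned in Perl or shell), the class name `TestMatrix` is hypothetical, the reply sizes are read as bytes, and `-1` is an assumed sentinel for the "original" reply size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: enumerates every (clients, reply size, inter-request time) combination. */
final class TestMatrix {
    static final int[] CLIENT_COUNTS = {1, 4, 7, 10};
    static final int[] REPLY_SIZES_BYTES = {-1, 256, 512, 1024};   // -1 = original size (assumed sentinel)
    static final int[] INTER_REQUEST_MS = {0, 20, 40};

    /** Each element is {clients, replySizeBytes, interRequestMillis}. */
    static List<int[]> configurations() {
        List<int[]> out = new ArrayList<>();
        for (int clients : CLIENT_COUNTS) {
            for (int size : REPLY_SIZES_BYTES) {
                for (int gap : INTER_REQUEST_MS) {
                    out.add(new int[] {clients, size, gap});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] c : configurations()) {
            System.out.printf("clients=%d replyBytes=%d interRequestMs=%d%n", c[0], c[1], c[2]);
        }
    }
}
```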
Design
- The script for launching clients and running all of the tests will be written in either Perl or a shell scripting language. Since the automatic client will be written in Java, the script only has to launch it and does not have to send commands to it, which should simplify things.
- The output data will be analyzed using MATLAB scripts to ease plotting.
Completed Fault-Tolerance Evaluation Files
Data: team4data.tar.gz
Report: team4Analysis.pdf