Title: Determination of Replication Degree
Author: Chutika Udomsinn


Replication is a well-known technique for providing fault
tolerance. Application developers usually know the characteristics of
their replicas' resources (CPU, memory, and network bandwidth), the
amount of application state, the workload, and the fault interarrival
time. The well-known f+1 rule (f+1 replicas are needed to tolerate f
crash faults) is not practical on its own, because the maximum number
of failures f is hard to determine in advance. Developers therefore
resort to trial and error to find a replication configuration
(replication style, number of replicas, checkpoint period, and fault
detection frequency) that meets their availability goal. This process
is time-consuming and sometimes ends with a suboptimal configuration.

I plan to build the Advisor component of the MEAD middleware
(http://www.ece.cmu.edu/~mead/) developed at CMU. Together with a
resource monitoring agent, the Advisor will perform a test run to
determine the fail-over and recovery times of a given replica and
application. Given the workload and fault interarrival time supplied
by the user, the Advisor recommends a replication configuration for a
specified replication style.

The main test application will be an electronic voting system with an
adjustable amount of state. The Advisor will also be tested on eight
other CORBA or J2EE applications from groups in the 18-749 class,
Spring 2006. The evaluation will cover two cases: validation and
optimization. For the validation test, I will show that the suggested
configuration incurs no downtime within the user's mission time. For
the optimization test, I will show that the number of replicas
suggested by the Advisor is optimal: using fewer replicas than
suggested results in system failure, while using more provides no
additional benefit.
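
The two evaluation cases can be illustrated with a small simulation.
This is only a sketch under simplifying assumptions that are not part
of the proposal: faults arrive at a fixed interval, each fault takes
down one live replica, a failed replica returns after a fixed recovery
time, and fail-over time is ignored. The function names and the
numbers in the usage note are hypothetical.

```python
def has_downtime(n_replicas, fault_interval, recovery_time, mission_time):
    """Return True if all replicas are ever down at once during the mission.

    Assumption (hypothetical model): one fault every `fault_interval` time
    units, each failed replica is back after `recovery_time` time units.
    """
    recovering = []  # completion times of replicas that are still recovering
    k = 1
    while k * fault_interval <= mission_time:
        now = k * fault_interval
        # Replicas whose recovery finished by `now` are live again.
        recovering = [t for t in recovering if t > now]
        # The fault takes down one live replica.
        recovering.append(now + recovery_time)
        if len(recovering) == n_replicas:
            return True  # no live replica left: the system is down
        k += 1
    return False


def smallest_safe_degree(fault_interval, recovery_time, mission_time,
                         max_replicas=64):
    """Smallest replica count with no downtime, or None if none is found."""
    for n in range(1, max_replicas + 1):
        if not has_downtime(n, fault_interval, recovery_time, mission_time):
            return n
    return None
```

For example, with a fault every 10 time units, a recovery time of 25,
and a mission time of 1000, `smallest_safe_degree(10, 25, 1000)`
returns 4. In this model, validation corresponds to
`has_downtime(4, 10, 25, 1000)` being false, and optimization to
`has_downtime(3, 10, 25, 1000)` being true.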


Assumptions:    
1) constant workload
2) constant fault interarrival time
3) homogeneous replica nodes

Constants:      
1) replica resource availability
2) replication style
3) workload
4) fault interarrival time
5) application (determines the amount of state)

Variables:      
1) checkpoint period
2) fault detection frequency

Determine:      
1) recovery time: f(restart time, amount of state)
2) fail-over time: f(amount of state, resource, checkpoint period,
fault detection time)
3) replication degree
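
One possible way to turn these quantities into a replication degree is
sketched below. It is not the Advisor's actual method; it assumes the
constant fault interarrival time and homogeneous replicas listed above,
and it ignores fail-over time. Under those assumptions, at most
ceil(recovery time / fault interarrival time) replicas can be down at
the same moment, so that value plays the role of f in the f+1 rule. The
function names are hypothetical.

```python
import math


def max_simultaneous_failures(recovery_time, fault_interarrival_time):
    """Upper bound on replicas down at once (assumed constant-rate faults)."""
    if recovery_time < 0 or fault_interarrival_time <= 0:
        raise ValueError("recovery_time must be >= 0 and interarrival time > 0")
    # A replica that failed at time t is down during [t, t + recovery_time),
    # and a new fault arrives every fault_interarrival_time units.
    return math.ceil(recovery_time / fault_interarrival_time)


def replication_degree(recovery_time, fault_interarrival_time):
    """Apply the f+1 rule with f taken from the bound above."""
    return max_simultaneous_failures(recovery_time, fault_interarrival_time) + 1
```

With a measured recovery time of 25 time units and a fault every 10
time units, `replication_degree(25, 10)` gives 4. If recovery finishes
before the next fault arrives, as in `replication_degree(5, 10)`, two
replicas are enough in this model.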