ARS System - Performance Analysis - Phase I
Tests Performed
1. Co-located fault-free
WorkloadTest, the WebLogic application server, God and CreditCard running
on the same machine; no other replicas; the database server on a separate
machine on campus.
<<< Figure 1: graph co-located fault-free >>>
2. Distributed fault-free - communication baseline
Multiple machines:
- WorkloadTest on machine 1
- God and CreditCard on machine 2
- WebLogic application server on machine 3 (primary server)
- WebLogic application server on machine 4 (backup server)
- Database server on a separate machine on campus
A loop of 100 iterations was executed, calling isAlive() on the server. This
call takes no arguments and does not do any processing or database
access. Therefore, the roundtrip time corresponds to the communication
time between client and server.
<<< Figure 2: graph >>>
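A minimal sketch of this measurement loop follows. The interface name and the way the server stub is obtained are assumptions for illustration; only the isAlive() call and the 100 iterations come from the description above.

    // Hypothetical sketch of the communication-baseline loop of test 2.
    public final class CommunicationBaselineSketch {

        private static final int ITERATIONS = 100;

        /** Minimal remote view assumed for this sketch. */
        interface ApplicationServer {
            boolean isAlive();   // no arguments, no processing, no database access
        }

        static void measureRoundtrips(ApplicationServer server) {
            for (int i = 0; i < ITERATIONS; i++) {
                long start = System.currentTimeMillis();
                server.isAlive();
                long roundtrip = System.currentTimeMillis() - start;
                // With a no-op call, the measured time is essentially the
                // client-server communication time.
                System.out.println("iteration " + i + ": " + roundtrip + " ms");
            }
        }
    }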
3. Distributed fault-free
Multiple machines:
- WorkloadTest on machine 1
- God and CreditCard on machine 2
- WebLogic application server on machine 3 (primary server)
- WebLogic application server on machine 4 (backup server)
- Database server on a separate machine on campus
A loop of 100 iterations was executed. Each iteration called three
business methods in sequence: getFlights(), makeReservation() and buyTickets().
The spikes in executions 52, 57, 81 and 83 are the result of a problem
in the database connection that caused a timeout and a transaction
rollback on the server. Because the database is considered fault-free in
the context of this project, we will not try to explain the spikes or
propose solutions or improvements to avoid them.
<<< Figure 3: graph >>>
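The workload loop of tests 3 and 4 can be sketched the same way. The interface and method signatures below are assumptions; only the three-call sequence and the 100 iterations come from the description above.

    // Hypothetical sketch of the workload loop of tests 3 and 4 (fault injection aside).
    public final class WorkloadLoopSketch {

        private static final int ITERATIONS = 100;

        /** Minimal business view assumed for this sketch. */
        interface ReservationService {
            Object getFlights();
            Object makeReservation();
            void buyTickets();
        }

        static void runWorkload(ReservationService service) {
            for (int i = 0; i < ITERATIONS; i++) {
                long start = System.currentTimeMillis();
                service.getFlights();        // 1st business call
                service.makeReservation();   // 2nd business call
                service.buyTickets();        // 3rd business call
                long elapsed = System.currentTimeMillis() - start;
                System.out.println("iteration " + i + ": " + elapsed + " ms");
            }
        }
    }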
4. Distributed with fault-injection
Multiple machines:
- WorkloadTest on machine 1
- God and CreditCard on machine 2
- WebLogic application server on machine 3 (primary server)
- WebLogic application server on machine 4 (backup server)
- Database server on a separate machine on campus
A loop of 100 iterations was executed. Each iteration called three
business methods in sequence: getFlights(), makeReservation() and buyTickets().
After 20 executions, a fault was injected on the primary server. 180
seconds later, a fault was injected on the other server. From then on,
a fault was injected every 180 seconds, alternating between the two
servers.
<<< Figure 4: graph >>>
Findings and Conclusions
The results allow us to conclude, directly or indirectly, the following:
- In the distributed scenario, response time is not as steady as in
the co-located scenario. Several aspects of the infrastructure have
non-deterministic response times and contribute to the variance
observed in Figures 2 and 3:
- Time to send a message using Ethernet and 802.11b (wireless) is
non-deterministic (message loss can also occur).
- Time for the database server to respond to a request can vary
based on current load and existence of cached data from previous
commands.
- Time to execute a user request in a J2EE server may vary based
on the ability to reuse an EJB or database connection from the
respective pool. Also, the garbage collector may run and interfere with
response times.
- Response times for transactions should be specified as average
values; it is not feasible to define worst-case values because the
infrastructure cannot provide "real-time" determinism.
- Communication overhead is the most time-consuming portion of a
user "roundtrip" operation.
- The first request to an application server after restart takes
longer than the average, probably due to internal initialization
procedures still going on and the need to fill up pools of EJBs and
database connections.
- Garbage collection on the application server consumes from 3 to 7
milliseconds. Given the frequency of the GC operation and the cost of
the network overhead, this does not appear to be a significant factor in
the overall response time.
Next Improvements - Phases II and III
The performance of the system may be improved by taking the following
actions:
- Each EJB request, in isolation, requires four accesses to the
application server:
1. A JNDI lookup for the EJB, which returns the home object (factory).
2. A call to create() on the home object to create an instance of the
EJB, which returns a reference to the EJB object.
3. A call to the business method itself (e.g. makeReservation()).
4. A call to remove() to inform the server that this client won't need
that EJB instance anymore.
Therefore, the total execution time of a user operation can be reduced
if we avoid some of the four calls mentioned above; a sketch of this
optimization appears after this list. In the specific context of the
ARS, that is possible:
- Because we're using Stateless Session Beans (SLSBs), the call to remove() is not necessary. This
conclusion follows from how an SLSB operates internally to the EJB
container: for each request, an instance is drawn from a pool of
instances; that instance executes the call and returns the results to
the client, and the container then returns the instance to the pool.
SLSBs are not passivated and do not retain state, so the call to remove()
does not alter the availability of instances in the pool.
- The EJB object returned by the create() call can be cached
inside the client application. Then, when a subsequent business call is
made to the same EJB, steps 1 and 2 are not necessary.
- In the CreditCard application, a connection to the database is
opened for every call. We can use a long-lived database connection and
avoid the time it takes to open a connection, which is very significant.
- In some cases, database access takes a significant amount of
time. We have identified some opportunities to optimize our SQL
statements, for example, by merging two select statements into one.
- The client-side class that handles communication with the
server maintains a list of available servers, obtained by querying
the naming service when the client is initialized. When a request
sent to a server fails, that server is removed from the
list. When the list goes down to zero items, the naming service
is queried again to obtain a refreshed list of available servers. Querying
the naming service impacts the total response time of the transaction.
To minimize this impact, we want to query the naming service as soon as
the size of the list goes from two down to one element. The goal is to
avoid the situation where the client has no server to talk to, which is
the worst case in terms of impact on execution time (a sketch of this
list management appears after this list).
- Load balancing: the current situation, without load balancing, is as
follows: when we have primary server A and replica B, all client
applications send requests to server A; server B only gets requests if
server A fails. If we distribute the load across all available servers,
the overall throughput of the system should increase and the average
response time should decrease. Therefore, we plan to implement and test
load balancing by:
- Having the client call each available server in a round-robin
fashion. Thus, if the client has servers A, B and C on its list of
available servers, it will make the first call to A, the second to B,
the third to C, the fourth to A, and so on.
- Making the very first call of each client execution to a random
server among those available. The goal is to avoid the situation
where each client execution makes just one or two calls to the
server (an unlikely but possible operational profile); in that
case, requests would concentrate on server A (or, more generally, on
the first servers in the list).
- Creating a script that generates the synthetic workload
corresponding to 20+ simultaneous clients.
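To make the first item above concrete, the sketch below contrasts the four-access call sequence with the proposed optimization (cached EJB object, no remove()). It is a minimal sketch: the JNDI name "ejb/Reservation" and the interface names are assumptions for illustration; only the lookup/create/business-call/remove sequence and the caching idea come from the items above.

    import javax.naming.InitialContext;
    import javax.rmi.PortableRemoteObject;

    // Hypothetical client-side helper. The JNDI name and interface names are
    // assumptions; the technique (cache the EJB object and skip remove() for
    // stateless session beans) is the one proposed above.
    public final class ReservationClient {

        private Reservation cachedBean;   // EJB object kept between business calls

        // Current pattern: four interactions with the application server per request.
        public void makeReservationCurrent() throws Exception {
            InitialContext ctx = new InitialContext();
            Object ref = ctx.lookup("ejb/Reservation");                    // 1. JNDI lookup (home object)
            ReservationHome home =
                    (ReservationHome) PortableRemoteObject.narrow(ref, ReservationHome.class);
            Reservation bean = home.create();                              // 2. create() on the home
            bean.makeReservation();                                        // 3. business method
            bean.remove();                                                 // 4. remove()
        }

        // Proposed pattern: lookup and create only once, and no remove(), since a
        // stateless session bean instance returns to the container's pool anyway.
        public void makeReservationOptimized() throws Exception {
            if (cachedBean == null) {
                InitialContext ctx = new InitialContext();
                Object ref = ctx.lookup("ejb/Reservation");
                ReservationHome home =
                        (ReservationHome) PortableRemoteObject.narrow(ref, ReservationHome.class);
                cachedBean = home.create();
            }
            cachedBean.makeReservation();   // only the business call goes to the server
        }

        // Minimal remote views assumed for this sketch (the real ones extend
        // EJBHome and EJBObject).
        interface ReservationHome { Reservation create() throws Exception; }
        interface Reservation {
            void makeReservation() throws Exception;
            void remove() throws Exception;
        }
    }

Similarly, a minimal sketch of the client-side server list with the proposed round-robin selection and early refresh. Class and method names are assumptions; the real client obtains the list from the God naming service.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the client-side server list. Only the behavior
    // (round-robin selection, removal on failure, refresh when one server is
    // left) comes from the text; names and structure are assumptions.
    public final class ServerList {

        private final List<String> servers = new ArrayList<String>();
        private int next = 0;   // the very first index could also be randomized, as proposed above

        // Round-robin: each call returns the next server in the list.
        public synchronized String nextServer() {
            if (servers.isEmpty()) {
                refreshFromNamingService();
            }
            if (servers.isEmpty()) {
                throw new IllegalStateException("no application servers available");
            }
            String server = servers.get(next % servers.size());
            next = (next + 1) % servers.size();
            return server;
        }

        // Called when a request sent to a server fails.
        public synchronized void markFailed(String server) {
            servers.remove(server);
            if (servers.size() <= 1) {   // refresh before the list becomes empty
                refreshFromNamingService();
            }
        }

        // Placeholder: queries the naming service and re-populates 'servers'
        // with the currently registered replicas.
        private void refreshFromNamingService() {
        }
    }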
About the Fault Injector
The God application is the replication manager and naming service for the
ARS system. For the performance analysis, the role of fault injector was
also given to God.
The following method was added to God's interface:
scheduleShutdown(long interval)
This method would spawn a thread for each registered application-server
replica. Each thread would periodically invoke stopServer() on RestartServer,
an RMI application that runs on each machine where a replica of
the application server is running. See the red connectors in the diagram
below.
RestartServer is responsible for restarting the application server in
case of failures, but in the fault injection context, it was used to
kill the application server instead. It would do that by calling a
script (forceShutdown.cmd)
that would interrupt the execution of the WebLogic application server on
that machine.
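A minimal sketch of this mechanism follows, assuming a hypothetical RMI registry name for RestartServer and a plain thread per replica; only scheduleShutdown(long interval), the one-thread-per-replica design and the periodic stopServer() call come from the description above.

    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical sketch of the fault-injection logic added to God.
    public final class FaultInjectorSketch {

        /** Remote view of RestartServer assumed for this sketch. */
        public interface RestartServerRemote extends Remote {
            void stopServer() throws RemoteException;
        }

        // Spawns one thread per registered application-server replica; each thread
        // periodically asks that replica's RestartServer to kill the WebLogic server.
        public void scheduleShutdown(final long interval, String[] replicaHosts) {
            for (final String host : replicaHosts) {
                Thread injector = new Thread(new Runnable() {
                    public void run() {
                        try {
                            RestartServerRemote restart = (RestartServerRemote)
                                    Naming.lookup("//" + host + "/RestartServer");
                            while (true) {
                                Thread.sleep(interval);   // e.g. 180 000 ms in test 4
                                restart.stopServer();     // runs forceShutdown.cmd on that machine
                            }
                        } catch (Exception e) {
                            e.printStackTrace();          // stop injecting on this replica
                        }
                    }
                });
                injector.setDaemon(true);
                injector.start();
            }
        }
    }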