Fault-Tolerant Baseline Architecture - Run Time

Scenarios are generated by taking our initial baseline, identifying where we can and cannot test faults, and determining what the correct behavior of the system should be in each case.  Faults are injected by killing the server process on a given machine with kill -9.  At the moment this is handled by the replication manager, through a public method in the fault-injector client, which is still under development.

Our initial use case was simple: the user entered a name and a list of stocks, this data was stored in the database, and the user could retrieve it later.  Users now have a more advanced GUI, and their interaction with the system is more involved.  A user logs in; an existing user receives their portfolio and starts receiving updates “immediately”.  A new user first creates a portfolio by selecting stocks, and upon creation the client starts receiving updates on the stocks in that profile.  At this point both new and existing users are receiving a stream of updated data from the server they are connected to, and can at any time add stocks to or remove stocks from their profile.

 

Scenarios:

 

  1. The case of a missed initial connection is trivial; the client just keeps trying until it connects.

  2. A user connects and is prompted for a username.  The server crashes before the user enters their username.  The client receives a RemoteException, “fails over” to one of the other servers, and upon connection re-sends the request (transparently to the user).

  3. A user is connected and has entered their username.  At this point a new user is prompted to create a new profile.  The server crashes between confirmation and the setting of the profile.  The client receives a RemoteException, “fails over” to one of the other servers, and re-sends the request.  If the profile has already been written, it remains persisted.  If it has not, the repeated request persists it.  From there operation continues.

  4. A user is connected, logged in, and, if new, has set their profile.  Now it is time to get the profile.  If the server crashes, the client receives a RemoteException, fails over, and re-sends the request.  This is safe because we have already ensured that any set-profile request keeps trying until it is persisted.

  5. A user is connected, logged in, has received their profile, and is about to call getAllStocks.  This involves not only getting the data, but also subscribing to the updates.  The client again receives the RemoteException within one second of the server crash, and so fails over.  The client retains the request it made and re-sends it.  The repeated request is safe because the server handles repeat subscriptions.  The getAllStocks method is safe to repeat because it overwrites the client's data with up-to-date information, which is our goal in the first place.

  6. A user is connected, logged in, has received their profile, and is getting updates.  Upon a server crash, the client simply fails over to another server, which continues sending out the data feed to all subscribed listeners.  The client re-subscribes and continues to receive updates in approximately real time.

  7. A user is connected, logged in, has received their profile, is getting stock updates, and then wants to update their profile to contain more or fewer stocks.  This is handled in the same way as profile creation, since it is a set-profile request.
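Scenarios 2 through 7 all follow the same client-side pattern: catch the RemoteException, move to another server, and re-send the same request.  The sketch below illustrates that pattern only; it is not our client code.  The names RemoteCall and FailoverInvoker are hypothetical, the next server is simply the next one in the list (the random choice described under Fault-injection is left out), and the retry count is bounded by maxAttempts so the example terminates.

```java
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical request abstraction: runs one request against the named server. */
interface RemoteCall<T> {
    T invoke(String server) throws RemoteException;
}

/** Hypothetical helper that re-sends a request after failing over. */
final class FailoverInvoker {
    private final List<String> servers;
    private int current = 0; // index of the server we are currently connected to

    FailoverInvoker(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("server list is empty");
        }
        this.servers = new ArrayList<>(servers);
    }

    String currentServer() {
        return servers.get(current);
    }

    /**
     * Sends the request to the current server. On RemoteException, moves to
     * the next server in the list and re-sends the same request, up to
     * maxAttempts attempts in total.
     */
    <T> T call(RemoteCall<T> request, int maxAttempts) throws RemoteException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RemoteException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return request.invoke(servers.get(current));
            } catch (RemoteException e) {
                last = e;                                  // remember the failure
                current = (current + 1) % servers.size();  // fail over
            }
        }
        throw last; // every attempt failed
    }
}
```

Because the same request object is re-sent unchanged, this pattern is only safe when the server treats repeated requests as harmless, which is the property the scenarios above rely on.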

 

Fault-injection:

The faults we have designated for injection are server crash faults.  A crash may induce message loss, but message loss is not yet a focus of our system’s robustness requirements.  We treat the database, the replication manager, the “external” stock ticker data feed, and the JNDI service as “Sacred”, that is, we assume they do not fail.  Given that these are working properly, we assume that messages are not lost in transit between the servers and the database.  More refined fault injection (with granularity going down to inter-class/inter-bean communication) is expected soon.

At the point at which we want to test a fault, either our standalone fault-injector or the replication manager (with server crash faults enabled) can kill a server.  Because the client communicates continuously with the server it is connected to, the client will “know” within one second that there is a problem on the other end.  The client then picks a random server from the server properties list, and keeps trying to connect to servers until communication is resumed.  At that point, the client re-sends whatever request it was attempting before the crash.  Because the requests in our system are idempotent, repeat requests are not dangerous, and only serve to confirm the request in the event that it was already applied.  If the client tries to add a stock that is already in the profile, or remove one that is already gone, the server handles the repeat request gracefully.
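One simple way to get this idempotent behavior is to store a profile as a set, so that a repeated add or remove leaves the profile unchanged.  The sketch below shows the idea under that assumption; the Profile class and its method names are hypothetical and are not taken from our server code.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical server-side profile with idempotent add and remove. */
final class Profile {
    private final Set<String> stocks = new TreeSet<>();

    /** Returns true if the stock was added, false if it was already present. */
    synchronized boolean addStock(String symbol) {
        return stocks.add(symbol);
    }

    /** Returns true if the stock was removed, false if it was already gone. */
    synchronized boolean removeStock(String symbol) {
        return stocks.remove(symbol);
    }

    /** Returns a read-only copy of the current stock symbols. */
    synchronized Set<String> stocks() {
        return Collections.unmodifiableSet(new TreeSet<>(stocks));
    }
}
```

With this shape, a client that re-sends addStock("IBM") after a fail-over ends up with the same profile whether or not the first request was applied before the crash.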

On the server side, when a server is killed, it is re-spawned automatically by the replication manager.  The state of the stock cache is copied over to the new server, and a queue of stock updates is transferred as well, until the newly raised server is consistent with the others.  At that point, the “new” server is re-registered with the JNDI service, and the client, if fail-over occurs again, has the chance to reconnect to this “new” server.
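The recovery step can be pictured as "load a snapshot of the cache, then replay the queued updates".  The sketch below illustrates that idea only; the actual transfer mechanism is not specified here, and StockUpdate, ReplicaState, and recover are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical stock update: the latest price for one symbol. */
final class StockUpdate {
    final String symbol;
    final double price;

    StockUpdate(String symbol, double price) {
        this.symbol = symbol;
        this.price = price;
    }
}

/** Hypothetical in-memory stock cache held by one server replica. */
final class ReplicaState {
    private final Map<String, Double> cache = new HashMap<>();

    /** Applies one update; a later update for a symbol overwrites an earlier one. */
    void apply(StockUpdate update) {
        cache.put(update.symbol, update.price);
    }

    /** Returns a copy of the cache, suitable for handing to a new replica. */
    Map<String, Double> snapshot() {
        return new HashMap<>(cache);
    }

    /**
     * Builds the state of a newly raised replica: start from the copied
     * snapshot, then drain the queue of updates that arrived after the
     * snapshot was taken, applying them in order.
     */
    static ReplicaState recover(Map<String, Double> snapshot, Queue<StockUpdate> pending) {
        ReplicaState replica = new ReplicaState();
        replica.cache.putAll(snapshot);
        StockUpdate update;
        while ((update = pending.poll()) != null) {
            replica.apply(update);
        }
        return replica;
    }
}
```

Once the queue is drained the new replica holds the same cache contents as the live one, which is the point at which it would be safe to register it with the naming service.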

To avoid a client bouncing back and forth between the same two servers, the client randomly selects a server at which to start trying to connect (ensuring it isn’t the one that just dropped it), then proceeds iteratively through the list of servers until connection is resumed.
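The selection rule can be sketched as follows: pick a random starting index that is not the failed server, then walk the whole list cyclically from there.  This is an illustration, not our client code; ServerSelector and connectionOrder are hypothetical names, and the sketch assumes the failed server stays in the list (at a later position) because it may have been restarted by the time the client reaches it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper that computes the order in which to try servers. */
final class ServerSelector {

    /**
     * Returns every server exactly once, starting at a random server other
     * than the one that just failed (when there is more than one server).
     */
    static List<String> connectionOrder(List<String> servers, String failed, Random rng) {
        int n = servers.size();
        if (n == 0) {
            throw new IllegalArgumentException("server list is empty");
        }
        int failedIdx = servers.indexOf(failed);
        int start;
        if (n == 1 || failedIdx < 0) {
            start = rng.nextInt(n);      // nothing to exclude
        } else {
            start = rng.nextInt(n - 1);  // choose among the n - 1 other servers
            if (start >= failedIdx) {
                start++;                 // skip over the failed server's index
            }
        }
        List<String> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            order.add(servers.get((start + i) % n));
        }
        return order;
    }
}
```

The client would then try each server in the returned order until one accepts the connection.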

Summary:

A user can connect, activate or create a profile, and then receive a stream of update data.  The user can change which data from the feed they receive.  Upon any server failure, the server is restarted by the replication manager, while the client fails over to another server on the list.  If a client was in the middle of a request, that request is sent again.  If the request involves anything that was already committed, the server handles the repeat gracefully, keeping the process transparent to the user.