Fault Tolerant Design - Feb 28th, 2005
========================================

Fault detection:
----------------
- CORBA COMM_FAILURE exception at client
  - raised because communication is lost (client does not get responses
    from the server)
  - QUESTION: is it possible to change the timeout after which this
    exception is raised?

Fail-over:
----------
- Have multiple GameServer objects concurrently running on multiple server
  machines
- When client loses connection to original GameServer, client will contact
  another GameServer to re-establish a connection and resume the Game
  - GameServer objects will be numbered, e.g. GameServer-1, GameServer-2, etc.
  - Client needs to know ahead of time how many GameServers exist
- Tricky case occurs when clients attempt to reconnect to GameServers and
  clients re-establish connections with different servers
  e.g. Clients A and B reconnect to GameServer-2 and clients C and D
  reconnect to GameServer-3
  - Allow games to continue across multiple Game objects
  - Multiple servers will attempt to access/modify the same database data
    - use MySQL LOCK TABLES and UNLOCK TABLES to keep DB consistent
      - this will require us to modify our database structure so that there
        is a separate set of tables for each game ID
        e.g. Players1, GameState1, Bomb1 for Game ID = 1 and
        Players2, GameState2, Bomb2 for Game ID = 2
        This lets other games keep modifying the database; with
        LOCK/UNLOCK TABLES on a single global Players table, games would
        block each other
    - Lock tables before reading game state, make DB updates, unlock tables
- Will need to add a new method to the GameServer (resume game) that clients
  will call when trying to reconnect. This will create a new Game object if
  necessary and will resume the game.
- We will also need to decide how long to "pause" resumed games to wait for
  the other clients to rejoin the game (i.e.
  don't immediately resume the game if only 1 client has rejoined because
  there won't be anyone to play against)

Recovery:
---------
- Server restarts
  - Server needs to re-add itself to the CORBA naming service and overwrite
    the old entry (must change naming service in case the server starts up
    on a different port number)
  - Servers need to know ahead of time which server number they are
    e.g. GameServer-1 vs. GameServer-2
- Client program crashes and restarts
  - We don't attempt to rejoin the existing game in this case since the
    client will no longer remember its client ID
  - When the client process starts up, it will attempt to join a completely
    new game
- Write a script ("mr.script") that automatically restarts dead servers
  - "mr.script" is a wrapper for each server that will restart ONLY that
    particular server replica

Checkpointing:
--------------
- Our servers do not store any state (state is entirely stored in the
  database), therefore we do not need to perform any checkpointing

Fault Scenarios:
----------------
Assuming we allow clients to play in the same game via multiple GameServers:

Middle-tier crash scenario:
- All clients in Game 1 are connected to the game via GameServer-1
- The process running GameServer-1 and Game 1 crashes
- Clients receive CORBA COMM_FAILURE exception
- Clients independently attempt to connect to other GameServers
  - Clients pick another GameServer to connect to at random (this prevents
    the load on the next numerical GameServer from spiking as all clients
    attempt to migrate to it)
- Other GameServer objects will each create a new Game object for the clients
  - each server knows to do this based on the sequence number and Game ID
    the client sends
- Clients resume playing their game via the new Game objects

Middle-tier restart scenario:
- GameServer that restarts will overwrite its old entry in the naming service
  and will wait for new connections
  - we do not attempt to move original connections back to the original
    server as this will increase latency for game play
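
Sketch: client-side fail-over
The fail-over notes say a client that loses its GameServer picks another one at random from the known, numbered set. A minimal Python sketch of that selection follows. It is an illustration only: `CommFailure` is a stand-in for the CORBA COMM_FAILURE exception, and `connect` is a hypothetical callable that would resolve the server through the naming service and call the planned resume-game method.

```python
import random


class CommFailure(Exception):
    """Stand-in for the CORBA COMM_FAILURE system exception (hypothetical)."""


def server_names(count):
    """Names follow the numbering scheme in the notes: GameServer-1..N."""
    return [f"GameServer-{i}" for i in range(1, count + 1)]


def fail_over(failed, count, connect, rng=random):
    """Try the remaining servers in random order; return the first that answers."""
    candidates = [name for name in server_names(count) if name != failed]
    rng.shuffle(candidates)  # spreads reconnecting clients across replicas
    for name in candidates:
        try:
            connect(name)  # hypothetical: look up the server and resume the game
            return name
        except CommFailure:
            continue  # this replica is unreachable too; try the next one
    raise CommFailure("no GameServer reachable")
```

The client only needs the total server count, which matches the requirement that it know ahead of time how many GameServers exist.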
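
Sketch: per-game table locking
The database notes propose one set of tables per game ID and a lock / update / unlock sequence around each change. The Python sketch below shows one way to build those statements. The table prefixes come from the notes; `execute` is a hypothetical callable (for example a database cursor's execute method), and taking WRITE locks on all three tables is an assumption, not a decision recorded in the notes.

```python
TABLE_PREFIXES = ("Players", "GameState", "Bomb")


def game_tables(game_id):
    """Return the per-game table names, e.g. Players1, GameState1, Bomb1."""
    game_id = int(game_id)  # reject non-numeric IDs before building SQL
    return [f"{prefix}{game_id}" for prefix in TABLE_PREFIXES]


def lock_statement(game_id):
    """Build a MySQL LOCK TABLES statement covering one game's tables."""
    return "LOCK TABLES " + ", ".join(f"{t} WRITE" for t in game_tables(game_id))


def update_game(execute, game_id, updates):
    """Lock the game's tables, apply the updates, and always unlock."""
    execute(lock_statement(game_id))
    try:
        for sql in updates:
            execute(sql)
    finally:
        execute("UNLOCK TABLES")  # released even if an update fails
```

Because each game locks only its own tables, a server working on Game 1 would not block a server working on Game 2.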
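
Sketch: when to un-pause a resumed game
The notes leave open how long a resumed game should wait for other clients to rejoin. The rule below is one possible policy written in Python, not a decision from the notes: resume at once when everyone is back, otherwise wait up to a deadline and resume only if enough players have returned. The defaults `min_players=2` and `max_wait_s=30.0` are placeholder assumptions.

```python
def should_resume(rejoined, expected, waited_s, min_players=2, max_wait_s=30.0):
    """Decide whether a paused game may continue.

    rejoined: clients that have reconnected so far
    expected: clients that were in the game before the failure
    waited_s: seconds elapsed since the first client rejoined
    """
    if rejoined >= expected:
        return True  # everyone is back, no reason to keep waiting
    # After the deadline, continue only if there is someone to play against.
    return waited_s >= max_wait_s and rejoined >= min_players
```

With these placeholder values, a lone client never resumes a four-player game, while two clients resume it once 30 seconds have passed.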