Activity Tracing Component: Preprocessing Traces in a Distributed Storage System

Adrian Ng

Introduction: Real-time abnormality detection in distributed systems is a desirable property for a smart, autonomous, self-managing system such as the self-star system [1]. To achieve this property, the system must incorporate a monitoring infrastructure that first gathers end-to-end traces indicating system health and then preprocesses them efficiently.

Background Information: The self-star system includes the Activity Tracing Component (ATC) to gather system health traces. The ATC consists of source-level instrumentation across the system as well as a server (the ATC server) that saves the raw trace records into a database.

Motivation: One downside of the ATC is that its database schema is too primitive: it only represents the fields of the raw records, and it lacks a systematic way to store computed or mined traces. To achieve self-manageability, the ATC server also needs the ability to preprocess the traces and to save them to the database systematically and efficiently. In other words, instead of just storing the raw trace records, the server should extract semantic knowledge from a set of related traces as soon as it receives them, and then save the preprocessed traces to the database in a systematic and permanent way. Not only would this allow real-time and efficient subsequent querying [2] (from the ATC client's point of view), but, more importantly, it would enable us to mine the data more efficiently.

Project Goals: To achieve the intended advantages, we propose an enhancement to the ATC server. However, the enhancement presents several challenges, which we strive to answer. (Note that the enhancement discussed here is similar to the advantages offered by a streaming database.)

1. Can we serve client requests more efficiently?
If so, by how much and under what circumstances?

2. How quick can the preprocessing be? That is, how long does the client need to wait before it can query the preprocessed traces? What are the different tradeoffs that affect the "real-timeness"?

3. How much overhead would the enhancement introduce to the ATC server? If there is overhead, how do we quantify it?

4. How much does the enhancement ease the data pruning process?

While these questions are not trivial, we will address some of them in this project.

1. Can we serve client requests more efficiently? While the answer is obvious, we will provide a convincing explanation, if not one supported with empirical data.

2. What are the tradeoffs that affect the "real-timeness" of the preprocessing? The more aggressively we preprocess the traces, the more likely the enhanced ATC is to drop traces. Because we can tolerate a wide time window (on the order of days) during which the data is being processed, true real-timeness does not concern us much. However, we are interested in the interaction among real-timeness, record response time (the time it takes the ATC to flush a trace record to the database after the record arrives), the number of records dropped, and the load on the server (the rate at which the raw traces arrive).

Note that for efficiency optimization, we could have used a mechanism to prioritize and preprocess only important information, in addition to other smart mechanisms to prune and compress the data. However, at the current stage, we are more interested in providing a general infrastructure for the intended advantages.
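The enhancement described in the Motivation, storing mined summaries alongside raw records rather than only the raw fields, can be sketched with a hypothetical two-table schema. All table and column names here (`raw_trace`, `derived_trace`, `request_id`, and so on) are illustrative assumptions, not the actual ATC schema:

```python
import sqlite3

# Hypothetical schema: raw trace records plus a table for derived
# (preprocessed) traces, so mined results are stored systematically
# instead of being recomputed by every client query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_trace (
    request_id TEXT,      -- groups related records end to end
    node       TEXT,      -- which server emitted the record
    event      TEXT,      -- e.g. 'recv', 'disk_read', 'reply'
    ts_ms      INTEGER    -- timestamp in milliseconds
);
CREATE TABLE derived_trace (
    request_id TEXT PRIMARY KEY,
    n_records  INTEGER,   -- how many raw records were summarized
    latency_ms INTEGER    -- end-to-end latency mined from the set
);
""")

def preprocess(request_id):
    """Summarize all raw records of one request into one derived row."""
    n, latency = conn.execute(
        "SELECT COUNT(*), MAX(ts_ms) - MIN(ts_ms) "
        "FROM raw_trace WHERE request_id = ?", (request_id,)).fetchone()
    conn.execute("INSERT OR REPLACE INTO derived_trace VALUES (?, ?, ?)",
                 (request_id, n, latency))

# Simulated arrival of a set of related trace records for one request.
for event, ts in [("recv", 100), ("disk_read", 130), ("reply", 180)]:
    conn.execute("INSERT INTO raw_trace VALUES (?, ?, ?, ?)",
                 ("req-1", "node-a", event, ts))
preprocess("req-1")

n, latency = conn.execute(
    "SELECT n_records, latency_ms FROM derived_trace "
    "WHERE request_id = 'req-1'").fetchone()
print(n, latency)  # 3 records summarized, 80 ms end-to-end
```

With such a schema, a client asking for per-request latency reads one precomputed row instead of scanning and joining raw records, which is the querying advantage the proposal argues for.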
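The tradeoff in question 2 between preprocessing aggressiveness, response time, and dropped records can be illustrated with a toy queueing model of the ATC server's ingest path. The rates, queue capacity, and drop policy below are all assumptions for illustration, not measurements of the real system:

```python
from collections import deque

def simulate(arrival_rate, service_rate, queue_cap, duration):
    """Toy fixed-rate model of the ATC ingest queue.

    arrival_rate and service_rate are records per time unit; the cost of
    preprocessing is folded into service_rate, so more aggressive
    preprocessing corresponds to a lower service_rate. Returns
    (records_dropped, max_queue_depth).
    """
    queue = deque()
    dropped = 0
    max_depth = 0
    arrive_acc = serve_acc = 0.0
    for _ in range(duration):
        # Records arrive; when the bounded queue is full they are dropped.
        arrive_acc += arrival_rate
        while arrive_acc >= 1.0:
            arrive_acc -= 1.0
            if len(queue) < queue_cap:
                queue.append(1)
            else:
                dropped += 1
        # The server preprocesses and flushes records at service_rate.
        serve_acc += service_rate
        while serve_acc >= 1.0 and queue:
            serve_acc -= 1.0
            queue.popleft()
        max_depth = max(max_depth, len(queue))
    return dropped, max_depth

# Light load: the server keeps up, so nothing is dropped.
light = simulate(arrival_rate=2, service_rate=3, queue_cap=10, duration=100)
# Heavy load: arrivals outpace preprocessing, the queue saturates,
# and records are dropped.
heavy = simulate(arrival_rate=5, service_rate=3, queue_cap=10, duration=100)
print(light, heavy)
```

Even this crude model exhibits the interaction the proposal wants to study: once the arrival rate exceeds the effective service rate, queue depth (and hence record response time) grows until the bounded queue saturates, after which the loss shows up as dropped records instead.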