Activity Tracing Component: Preprocessing Traces in a Distributed Storage System

Adrian Ng

Introduction: Real-time abnormality detection in distributed systems is a desirable property for a smart, autonomous, self-managing system such as the self-star system [1]. To achieve this property, the system must incorporate a monitoring infrastructure that first gathers end-to-end traces indicating system health and then preprocesses them efficiently.

Background Information: The self-star system includes the Activity Tracing Component (ATC) to gather system health traces. The ATC consists of source-level instrumentation across the system as well as a server (the ATC server) that saves the raw trace records into a database.

Motivation: One downside of the ATC is that its database schema is too primitive: it only represents the fields of the raw records, and it lacks a systematic way to store computed or mined traces. To achieve self-manageability, the ATC server also needs the ability to preprocess the traces and to save them to the database systematically and efficiently. In other words, instead of just storing the raw trace records, the server should extract semantic knowledge from a set of related traces as soon as it receives them, and then save the preprocessed traces to the database in a systematic and permanent way. Not only would this allow real-time and efficient subsequent querying [2] (from the ATC client's point of view), but, more importantly, it would enable us to mine the data more efficiently.

Project Goals: To achieve the intended advantages, we propose an enhancement to the ATC server. However, the enhancement presents several challenges, which we strive to answer. (Note that the enhancement discussed here is similar to the advantages offered by a streaming database.)

1. Can we serve client requests more efficiently?
If so, by how much and under what circumstances?

2. How quick can the preprocessing be? That is, how long does the client need to wait before it can query the preprocessed traces? What are the different tradeoffs that affect the "real-timeness"?

3. How much overhead would the enhancement introduce to the ATC server? If there is overhead, how do we quantify it?

4. How much does the enhancement ease the data pruning process?

While these questions are not trivial, we will address some of them in this project.

1. Can we serve client requests more efficiently? While the answer is obvious, we will provide a convincing explanation, if not one supported with empirical data.

2. What are the tradeoffs that affect the "real-timeness" of the preprocessing? The more aggressively we preprocess the traces, the more likely the enhanced ATC is to drop traces. Because we can tolerate a wide time window (on the order of days) during which the data is being processed, true real-timeness does not concern us much. However, we are interested in the interaction among real-timeness, record response time (the time it takes the ATC to flush a trace record to the database after the record arrives), the number of records dropped, and the load on the server (the rate at which the raw traces arrive).

Note that for efficiency optimization, we could have used a mechanism to prioritize and preprocess only important information, in addition to other smart mechanisms to prune and compress the data. However, at the current stage, we are more interested in providing a general infrastructure for the intended advantages.
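The enhancement described in the Motivation, storing mined summaries alongside raw records rather than only the raw fields, can be sketched with a hypothetical two-table schema. All table and column names here (`raw_trace`, `derived_trace`, `request_id`, and so on) are illustrative assumptions, not the actual ATC schema:

```python
import sqlite3

# Hypothetical schema: raw trace records plus a table for derived
# (preprocessed) traces, so mined results are stored systematically
# instead of being recomputed by every client query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_trace (
    request_id TEXT,      -- groups related records end to end
    node       TEXT,      -- which server emitted the record
    event      TEXT,      -- e.g. 'recv', 'disk_read', 'reply'
    ts_ms      INTEGER    -- timestamp in milliseconds
);
CREATE TABLE derived_trace (
    request_id TEXT PRIMARY KEY,
    n_records  INTEGER,   -- how many raw records were summarized
    latency_ms INTEGER    -- end-to-end latency mined from the set
);
""")

def preprocess(request_id):
    """Summarize all raw records of one request into one derived row."""
    n, latency = conn.execute(
        "SELECT COUNT(*), MAX(ts_ms) - MIN(ts_ms) "
        "FROM raw_trace WHERE request_id = ?", (request_id,)).fetchone()
    conn.execute("INSERT OR REPLACE INTO derived_trace VALUES (?, ?, ?)",
                 (request_id, n, latency))

# Simulated arrival of a set of related trace records for one request.
for event, ts in [("recv", 100), ("disk_read", 130), ("reply", 180)]:
    conn.execute("INSERT INTO raw_trace VALUES (?, ?, ?, ?)",
                 ("req-1", "node-a", event, ts))
preprocess("req-1")

n, latency = conn.execute(
    "SELECT n_records, latency_ms FROM derived_trace "
    "WHERE request_id = 'req-1'").fetchone()
print(n, latency)  # 3 records summarized, 80 ms end-to-end
```

With such a schema, a client asking for per-request latency reads one precomputed row instead of scanning and joining raw records, which is the querying advantage the proposal argues for.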
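The tradeoff in question 2 between preprocessing aggressiveness, response time, and dropped records can be illustrated with a toy queueing model of the ATC server's ingest path. The rates, queue capacity, and drop policy below are all assumptions for illustration, not measurements of the real system:

```python
from collections import deque

def simulate(arrival_rate, service_rate, queue_cap, duration):
    """Toy fixed-rate model of the ATC ingest queue.

    arrival_rate and service_rate are records per time unit; the cost of
    preprocessing is folded into service_rate, so more aggressive
    preprocessing corresponds to a lower service_rate. Returns
    (records_dropped, max_queue_depth).
    """
    queue = deque()
    dropped = 0
    max_depth = 0
    arrive_acc = serve_acc = 0.0
    for _ in range(duration):
        # Records arrive; when the bounded queue is full they are dropped.
        arrive_acc += arrival_rate
        while arrive_acc >= 1.0:
            arrive_acc -= 1.0
            if len(queue) < queue_cap:
                queue.append(1)
            else:
                dropped += 1
        # The server preprocesses and flushes records at service_rate.
        serve_acc += service_rate
        while serve_acc >= 1.0 and queue:
            serve_acc -= 1.0
            queue.popleft()
        max_depth = max(max_depth, len(queue))
    return dropped, max_depth

# Light load: the server keeps up, so nothing is dropped.
light = simulate(arrival_rate=2, service_rate=3, queue_cap=10, duration=100)
# Heavy load: arrivals outpace preprocessing, the queue saturates,
# and records are dropped.
heavy = simulate(arrival_rate=5, service_rate=3, queue_cap=10, duration=100)
print(light, heavy)
```

Even this crude model exhibits the interaction the proposal wants to study: once the arrival rate exceeds the effective service rate, queue depth (and hence record response time) grows until the bounded queue saturates, after which the loss shows up as dropped records instead.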