Supplemental Information for:
Why Do Upgrades Fail And What Can We Do About It?
T. Dumitraş and P. Narasimhan
Abstract
Enterprise-system upgrades are unreliable and often result in
downtime or data loss. Errors in the upgrade procedure, such as
broken dependencies, constitute the leading cause of upgrade
failures. We propose a novel upgrade-centric fault model, based on
data from three independent sources, which focuses on the impact of
procedural errors rather than software defects.
We show that current approaches for upgrading enterprise systems,
such as rolling upgrades, are vulnerable to these faults because the
upgrade is not an atomic operation and it risks breaking hidden
dependencies among the distributed system-components.
Research paper:
T. Dumitraş and P. Narasimhan. Why Do Upgrades Fail And What Can We Do About It? Toward Dependable, Online Upgrades in Enterprise Systems. In ACM/IFIP/USENIX Conference on Middleware, Urbana-Champaign, IL, Nov.-Dec. 2009.
In this clustering
dendrogram, the leaves correspond to the
55 faults reported in the user study (u), survey (s) and field
study (f). Each vertical line links two clusters into a larger
cluster, and its position on the X-axis indicates the mean
inter-fault distance. For example, two or three identical
faults, reported in different experiments or studies, form a
cluster with mean distance = 0. A link with a significantly
larger distance than the links below suggests the presence of
a natural cluster.
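To make the construction of such a dendrogram concrete, here is a minimal sketch of agglomerative clustering. It assumes average linkage, which the phrase "mean inter-fault distance" suggests but this page does not state outright. The fault labels and distances are hypothetical toy values, not data from the studies.

```python
from itertools import combinations

def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.

    labels: list of leaf names.
    dist:   dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns a list of merges (cluster_a, cluster_b, mean_distance),
    where each cluster is a tuple of leaf names.
    """
    clusters = [(name,) for name in labels]
    merges = []

    def mean_distance(c1, c2):
        # Mean of all pairwise leaf distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest mean distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: mean_distance(*p))
        merges.append((c1, c2, mean_distance(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges

# Hypothetical toy data: u1 and s1 are the same fault reported in two studies.
labels = ["u1", "s1", "f1", "f2"]
pairs = {
    ("u1", "s1"): 0.0, ("u1", "f1"): 0.8, ("u1", "f2"): 0.9,
    ("s1", "f1"): 0.8, ("s1", "f2"): 0.9, ("f1", "f2"): 0.2,
}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = average_linkage(labels, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy input, the identical faults merge at distance 0, and the final link sits far above the two links below it, which is the pattern that hints at two natural clusters.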
Each cluster, highlighted by a rectangle (left side),
corresponds to a specific type of upgrade fault.
This fault model has four distinct categories:
(1) simple configuration errors (e.g. typos);
(2) semantic configuration errors (e.g. misunderstood effects of
parameters);
(3) broken environmental dependencies (e.g. library or port
conflicts); and
(4) data-access errors, which render the persistent data
partially unavailable.
The cophenetic correlation coefficient, which measures the
correlation between the inter-fault distance and the distance
in the dendrogram, is 0.85.
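As an illustration of how such a coefficient can be computed, the sketch below takes the cophenetic distance between two leaves to be the height of the link that first joins them, then computes the Pearson correlation against the original distances. The labels, distances, and merge list are hypothetical toy values, so the result differs from the 0.85 reported for the actual fault data.

```python
from itertools import combinations
from math import sqrt

def cophenetic_distances(merges):
    """Map each leaf pair to the height at which the dendrogram first joins it."""
    coph = {}
    for left, right, height in merges:
        for a in left:
            for b in right:
                coph[frozenset((a, b))] = height
    return coph

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cophenetic_correlation(labels, dist, merges):
    coph = cophenetic_distances(merges)
    keys = [frozenset(p) for p in combinations(labels, 2)]
    return pearson([dist[k] for k in keys], [coph[k] for k in keys])

# Hypothetical toy input: pairwise distances between four faults.
labels = ["u1", "s1", "f1", "f2"]
dist = {frozenset(k): v for k, v in {
    ("u1", "s1"): 0.0, ("u1", "f1"): 0.8, ("u1", "f2"): 0.9,
    ("s1", "f1"): 0.8, ("s1", "f2"): 0.9, ("f1", "f2"): 0.2,
}.items()}

# Hypothetical dendrogram for those faults: (left cluster, right cluster, height).
merges = [
    (("u1",), ("s1",), 0.0),
    (("f1",), ("f2",), 0.2),
    (("u1", "s1"), ("f1", "f2"), 0.85),
]

c = cophenetic_correlation(labels, dist, merges)
print(round(c, 3))
```

A value close to 1 means the dendrogram preserves the original pairwise distances well; lower values mean the tree distorts them.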
Principal-component analysis (right side) creates a two-dimensional
shadow of the fault clusters, which suggests that the four types
of upgrade faults do not overlap.
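The two-dimensional projection can be sketched as follows. This is a minimal standard-library version that finds the two leading principal components by power iteration with deflation, which is one of several ways to compute them and is not necessarily what the authors used. The feature vectors are hypothetical: this page does not say how faults were encoded numerically.

```python
from math import sqrt

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(m, iterations=500):
    """Leading eigenpair of a symmetric matrix via power iteration."""
    d = len(m)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    l1, v1 = top_eigenvector(cov)
    # Deflation: remove the first component so the second becomes the leader.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    _, v2 = top_eigenvector(deflated)
    projected = [(sum(a * b for a, b in zip(r, v1)),
                  sum(a * b for a, b in zip(r, v2))) for r in x]
    return projected, (v1, v2)

# Hypothetical feature vectors: four groups of two faults each.
faults = [
    (0.0, 0.0, 0.0), (0.2, 0.1, 0.0),
    (4.0, 0.0, 0.1), (4.1, 0.2, 0.0),
    (0.0, 2.0, 0.1), (0.1, 2.1, 0.0),
    (4.0, 2.0, 1.0), (4.2, 2.1, 0.9),
]

points, (v1, v2) = pca_2d(faults)
for p in points:
    print(round(p[0], 2), round(p[1], 2))
```

If the groups stay apart in the projected coordinates, as they do for this toy input, the projection gives visual support for non-overlapping clusters. The reverse does not hold in general, since a 2-D projection can hide separation that exists in the dropped dimensions.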
Annotated fault list: [XLS]
A preliminary version of this fault model is described in the technical report CMU-PDL-08-115: [PDF]
Contact: Tudor Dumitraş