Implementing online software upgrades (changes in the behavior,
configuration, code, data or topology of a running application)
is one of the most exciting unsolved problems in distributed systems.
This functionality is essential for enabling the self-regulating,
autonomic management and maintenance of enterprise computer systems.
The biggest challenges are: maintaining the existing (and potentially
unknown) dependencies between distributed components and services;
handling API evolution; performing upgrades that span mutually
distrustful administrative domains; transferring state, which may
require long-running data migration and conversion tasks that execute
in parallel with regular requests for the same data; assessing and
minimizing the impact of upgrades on the running services while
improving the value of the infrastructure according to well-defined
metrics; and tolerating faults during the upgrade process.