Tudor Dumitraş
ECE Department
Carnegie Mellon University
Pittsburgh, PA 15213










Current Project

  • Dependable, Dynamic Upgrades in Distributed Systems

    Implementing online software upgrades (changes in the behavior, configuration, code, data or topology of a running application) is essential for enabling the self-regulating, autonomic management and maintenance of enterprise computer systems. Such dynamic change-management is difficult to perform because of the complex interactions between the distributed components:

    • The dependencies between system components are not always well documented and are very hard to track. A dynamic upgrading system must be careful not to disable existing applications by breaking unknown dependencies, while updating all the components required by the new version of the application being installed.
    • When upgrading distributed systems, this problem is even more acute because there are additional sources of dependencies (e.g., networking protocols, middleware, routes). For example, upgrading a component to a version that exposes a modified RPC API (e.g., a new COM interface, a modified CORBA object or a WSDL method with different parameters) requires patching all the entities that reference the upgraded component, taking into account the fact that sometimes the old and new APIs may be incompatible. 
    • A related problem, specific to distributed systems, is that sometimes upgrades have to be performed across mutually-distrustful administrative domains, while preserving the same correctness invariants and coherence of the forward and reverse dependencies.
    • Dynamic upgrades must preserve the correctness of the system, which often requires the transfer of state, composed of persistent and even transient data. Many applications require massive amounts of data to be converted to new schemas, which happens over a long period of time during which clients may request transactions involving the same data being converted.
    • Dynamic upgrading systems must also assess the impact of the changes on the running services and determine the most opportune moment to apply the upgrade to avoid significant penalties due to degraded performance and dependability, while improving the value of the infrastructure according to some well-defined metrics.
    • The upgrading process must be reliable and tolerate faults without the loss of data or functionality.
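
    As a rough illustration of the dependency problem listed above, the following C sketch computes which components would need attention when one component's interface changes, by walking the reverse dependencies transitively. The dep_graph type, the fixed-size adjacency matrix and the upgrade_set function are hypothetical names introduced only for this example, and the transitive walk conservatively assumes that every patched component may in turn affect its own clients. It is not the project's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COMPONENTS 16

/* Hypothetical dependency graph: depends_on[a][b] is true when
 * component a references (depends on) component b. */
typedef struct {
    size_t count;
    bool depends_on[MAX_COMPONENTS][MAX_COMPONENTS];
} dep_graph;

/* Marks in affected[] every component that directly or transitively
 * references `upgraded`, plus `upgraded` itself, and returns how many
 * components were marked. Uses an explicit stack for a depth-first
 * walk over the reverse dependency edges. */
size_t upgrade_set(const dep_graph *g, size_t upgraded,
                   bool affected[MAX_COMPONENTS])
{
    size_t stack[MAX_COMPONENTS];
    size_t top = 0, marked = 0;

    assert(g->count <= MAX_COMPONENTS && upgraded < g->count);

    for (size_t i = 0; i < MAX_COMPONENTS; i++)
        affected[i] = false;

    affected[upgraded] = true;
    stack[top++] = upgraded;
    marked = 1;

    while (top > 0) {
        size_t current = stack[--top];
        /* Every client that references `current` must be patched too. */
        for (size_t client = 0; client < g->count; client++) {
            if (g->depends_on[client][current] && !affected[client]) {
                affected[client] = true;   /* marked once, pushed once */
                stack[top++] = client;
                marked++;
            }
        }
    }
    return marked;
}
```

    For a graph in which component 1 references component 0 and component 2 references component 1, upgrading component 0 marks all three, while an unrelated component stays untouched. A real upgrading system would additionally have to discover the undocumented dependencies, which is the hard part described above.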
  • Related projects

 





 


Past Graduate Research Projects

  • Versatile Dependability

    This work is part of the Middleware for Embedded Adaptive Dependability (MEAD) project. The goal of this project is to enhance distributed CORBA applications with new capabilities, including transparent yet tunable fault tolerance in real time; proactive dependability; and resource-aware system adaptation to crash, communication and timing faults, with scalable and fast fault detection and fault recovery.

    Versatile dependability defines a hierarchy of low-level and high-level control knobs. Low-level knobs control the internal fault-tolerant mechanisms of the infrastructure and typically correspond to discrete (e.g., the degree of replication) or even non-countable sets (e.g., replication styles). In contrast, high-level knobs should regulate external properties (e.g., scalability, availability) that are relevant to the system’s users and hide internal implementation details, and they should have a linear transfer characteristic, with unsurprising effects for the users.

    I have implemented the first version of MEAD. My contributions to the project include:

    • Defining and implementing a "control knob" for tuning the system scalability;
    • Designing a mechanism for switching between active and passive replication on-the-fly;
    • Discovering the "magical 1% effect": the unpredictability (in terms of uncontrollably high end-to-end latencies) of a fault-tolerant CORBA application is isolated to 1% of the remote invocations.

    The MEAD trace

    The MEAD trace contains 9.1 Gbytes of experimental data about 960 different configurations of our system. The results are described in my paper:



  • Stochastic Communication
    Stochastic communication is a new communication paradigm for on-chip networks. Unlike traditional system-on-chip (SoC) communication architectures, which are organized around shared buses, the network-on-chip (NoC) approach places the modules of an SoC at the nodes of a regular structure (for example, a rectangular grid) and connects them with a micro-network. This requires more sophisticated communication protocols, which have to take into account the almost random faults that are specific to modern deep-sub-micron (DSM) technologies and cannot be handled by current CAD tools. Relaxing the requirement of 100% correctness for devices and interconnects would drastically reduce design costs, but it also requires that SoCs be designed with some degree of system-level fault tolerance. Stochastic communication defines a new class of protocols for on-chip networks, based on a randomized broadcast algorithm. Our results show that stochastic communication is resilient to the faults specific to DSM technologies, while maintaining a constant or gracefully degrading latency. The design methodology associated with stochastic communication provides fault tolerance and high performance while drastically simplifying the task of the designer. The wide range of applicability of this method, combined with the current trend to have several clock/frequency/voltage domains on a single chip, leads us to believe that our technique will create a major paradigm shift in SoC design. Stochastic communication continues to be developed in the System-Level Design research group at Carnegie Mellon.
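
    To give a feel for a randomized broadcast on a rectangular grid, here is a minimal C simulation sketch. It is a generic gossip-style broadcast written for illustration, not the protocol from our papers: the 4x4 grid size, the per-link forwarding probability p_forward, the per-transmission fault probability p_fault, and the names broadcast_rounds, transmit and rng_uniform are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GRID_W 4
#define GRID_H 4
#define TILES (GRID_W * GRID_H)

/* Small linear congruential generator so that runs are reproducible. */
static uint32_t rng_state = 1u;

static double rng_uniform(void)
{
    rng_state = rng_state * 1664525u + 1013904223u;
    return (rng_state >> 8) / 16777216.0;   /* value in [0, 1) */
}

/* One transmission attempt: the sender forwards with probability
 * p_forward, and the message is lost with probability p_fault. */
static bool transmit(double p_forward, double p_fault)
{
    return rng_uniform() < p_forward && rng_uniform() >= p_fault;
}

/* Simulates a broadcast starting at tile `source` and returns the number
 * of rounds until every tile holds the message, or -1 if that does not
 * happen within max_rounds. */
int broadcast_rounds(int source, double p_forward, double p_fault,
                     int max_rounds)
{
    static const int dx[4] = { 1, -1, 0, 0 };
    static const int dy[4] = { 0, 0, 1, -1 };
    bool has_msg[TILES] = { false };
    int informed = 1;

    assert(source >= 0 && source < TILES);
    has_msg[source] = true;

    for (int round = 1; round <= max_rounds; round++) {
        bool next[TILES];
        memcpy(next, has_msg, sizeof next);

        /* Only tiles informed before this round may transmit in it. */
        for (int t = 0; t < TILES; t++) {
            if (!has_msg[t])
                continue;
            int x = t % GRID_W, y = t / GRID_W;
            for (int d = 0; d < 4; d++) {
                int nx = x + dx[d], ny = y + dy[d];
                if (nx < 0 || nx >= GRID_W || ny < 0 || ny >= GRID_H)
                    continue;
                int n = ny * GRID_W + nx;
                if (!next[n] && transmit(p_forward, p_fault)) {
                    next[n] = true;
                    informed++;
                }
            }
        }
        memcpy(has_msg, next, sizeof has_msg);
        if (informed == TILES)
            return round;
    }
    return -1;
}
```

    With p_forward = 1.0 and p_fault = 0.0 the sketch degenerates to deterministic flooding, which takes 6 rounds from a corner tile of the 4x4 grid. Probabilistic forwarding or a nonzero fault rate can only take as long or longer, and the interesting question is how gracefully the latency degrades as the fault rate grows.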

Assistive Technologies 

  • Eye of the Beholder: Text-Recognition System for the Visually-Impaired
    Blind and visually-impaired people cannot access essential information in the form of written text in our environment (e.g., on restaurant menus, street signs, door labels, product names and instructions, expiration dates). We have developed a mobile text-recognition system capable of extracting written information from a wide variety of sources and communicating it on-demand to the user. The user needs no additional hardware except an ordinary, Internet-enabled mobile camera-phone - a device that many visually-impaired individuals already own. This approach fills a gap in assistive technologies for the visually-impaired because it makes users aware of textual information not available to them through any other means.
  • Color-Blindness Correction System
    Tests performed during the past 50 years on the few people with one normal eye and another one affected by color-blindness have led to the creation of an approximate model for simulating the effects of this genetic deficiency. Based on this model, we have discovered that, by applying color-space filtering and processing, images can be enhanced for color-blind vision. In normal images, there may be some patterns that are completely hidden from an eye with deficient vision. However, by applying our technique, these patterns are revealed, with only minimal changes to the content of the images. This result also shows that perception-based image and video encoding is possible, by keeping only the color information that is relevant to the viewer's eye.

Undergraduate Research Projects

  • Open Source Atomic Broadcast Package
    Atomic Broadcast is a very important communication primitive for reliable, distributed systems. The specification of Atomic Broadcast states that the same messages should be delivered in the same order at all the receivers. This is a very difficult problem, proven impossible to solve deterministically in a totally asynchronous system if even one process may crash. I took part in an effort to develop Atom, an open source Atomic Broadcast package for UNIX, based on unreliable failure detectors.
  • The Argo search engine

    This was an undergraduate project to develop a Web search engine and crawler in Java. Argo was able to answer a request in 0.3 s (in 2002), to maintain a distributed database and to analyze the indexed sites in parallel.

Misc

What does this C program print out?

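#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int val = 3;

/* Print the given value and terminate the calling process. */
void Exit(int val) {
    printf("%d", val);
    exit(0);
}

/* SIGUSR1 handler: report the global val and exit. */
void usr1_handler(int sig) {
    Exit(val);
}

int main() {
    int pid;

    signal(SIGUSR1, usr1_handler);
    if ((pid = fork()) == 0) {
        setpgid(0, 0);   /* the child becomes leader of a new process group */
        if (fork())
            Exit(val + 1);
        else
            Exit(val - 1);
    }
    kill(-pid, SIGUSR1);   /* signal the process group whose ID is pid */
}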

If you think you know the answer ... think again. Then read this page.

 

Courses