Tuesday, May 11, 2004
Hamerschlag Hall D-210
With each technology generation, we are experiencing an increased rate
of cosmically-induced soft errors in our chips. In the past, the impact
of such errors could be minimized through protection of large memory
structures. Unfortunately, such techniques alone are becoming insufficient
to maintain adequately low error rates. Although, to a very rough approximation,
the fault rate per transistor is not changing much, the increasing number
of transistors is resulting in an ever-increasing raw rate of bit upsets.
Thus, we are starting to see a dark side to Moore's Law in which the
increased functionality we get with our exponentially increasing number
of transistors is being countered with an exponentially increasing soft
error rate. This will take increasing effort and cost to cope with.
In this talk I will describe the severity of the soft error problem
as well as techniques to estimate a processor's soft error rate. These
estimates should help designers choose appropriate error protection schemes
for various structures within a microprocessor. A key aspect of our soft
error analysis is that some single-bit faults (such as those occurring
in the branch predictor) will not produce an error in a program's output.
We define a structure's architectural vulnerability factor (AVF) as the
probability that a fault in that particular structure will result in
an error in the final output of a program. A structure's error rate is
the product of its raw error rate, as determined by process and circuit
technology, and the AVF. Unfortunately, computing AVFs of complex structures,
such as the instruction queue, can be quite involved. To guide such complex
AVF calculation, we identify numerous cases, such as prefetches, dynamically
dead code, and wrong-path instructions, in which a fault will not affect
correct execution. Our simulations using these techniques show that the
AVFs of a McKinley-like microprocessor's instruction queue and execution
units are 29% and 9%, respectively.
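As a rough illustration of the relationship stated above (a structure's error rate is its raw error rate times its AVF), the following minimal sketch applies the two AVFs quoted in the abstract. The raw error rates are made-up placeholders in arbitrary units, and the function and variable names are hypothetical; none of them come from the talk.

```python
def structure_error_rate(raw_error_rate, avf):
    """Return raw_error_rate scaled by the AVF (a probability in [0, 1])."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return raw_error_rate * avf


# AVFs are the figures quoted in the abstract; the raw rates are
# placeholder values (arbitrary units) chosen only for illustration.
structures = {
    "instruction_queue": {"raw_error_rate": 100.0, "avf": 0.29},
    "execution_units": {"raw_error_rate": 100.0, "avf": 0.09},
}

effective_rates = {
    name: structure_error_rate(s["raw_error_rate"], s["avf"])
    for name, s in structures.items()
}

for name, rate in effective_rates.items():
    print(f"{name}: {rate:.1f} (arbitrary units)")
```

With equal assumed raw rates, the instruction queue contributes roughly three times the effective error rate of the execution units, which suggests how AVF estimates could guide where protection is spent first.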
Shubu Mukherjee is the Director of Intel's FACT group in Hudson, Massachusetts.
The Fault Aware Computing Technology (FACT) group is involved with various
aspects of soft error measurement, detection, and recovery techniques
in current and future machines. In the past, he worked for Digital Equipment
Corporation for ten days and Compaq Computer Corporation for three years.
At Compaq, he worked on fault tolerance techniques for Alpha processors
and was one of the architects of the Alpha 21364 interconnection network.
He received his B.Tech. from the Indian Institute of Technology, Kanpur
and his M.S. and Ph.D. from the University of Wisconsin-Madison. He has received
a number of outstanding achievement awards in the past few years.