RAS Design Practices in Commercial Microprocessors
Tuesday October 3, 2006
Hamerschlag Hall 1112
4:30 pm
Nhon Quach
AMD
As the dimensions of transistor devices become smaller and more
susceptible to soft errors with each technology generation, ensuring the
reliability of a commercial microprocessor is becoming a bigger challenge.
The task is further complicated by the trend toward integrating larger
caches and more processor cores on a single socket.
This presentation consists of two parts. In the first part, I will
introduce the general concepts (soft errors, error rates, reliability vs.
availability, etc.) and considerations behind strategies used to implement
RAS (reliability, availability, and serviceability) features in current
commercial microprocessors. I will examine past design practices in the
area of RAS, present the challenges facing current designers, and discuss
future trends. I will use examples at the end to tie these concepts
together.
In the second part, time permitting, I will delve deeply into two topics
to illustrate the concepts presented in the first part: thread-level
redundancy (TLR) and data poisoning. TLR is an efficient
microarchitectural redundancy technique to protect a processor against
soft errors. The technique shows great potential as a way to combat soft
errors in the combinational logic and has been widely researched and
reported in the literature. However, in this presentation, I will show
that the technique, as proposed in the literature, is unlikely to be
adopted in commercial microprocessors due to its high verification
complexity. Data poisoning is a way to mitigate the impact of double-bit
ECC errors in the processor. I will describe its benefits and
implementation cost in this part of the presentation. I chose this
technique because it is relatively new and serves as a good example of the
graceful degradation discussed in the first part of the presentation.
Nhon Quach has over 15 years of industry experience in various aspects of
microprocessor design at different companies including AMD, Oracle, and
Intel. He is currently an architect working on the processor-specific
features (including RAS) of AMD's next-generation processor. At Oracle, he
set hardware and software optimization strategies for various server
platforms in the virtual OS server technology group. Prior to joining
Oracle, he was the system architect as well as the manager of the Itanium
system architecture group at Intel and defined the RAS architecture for
the first-generation Itanium processor. His Ph.D. research work at
Stanford on high-speed integer addition, integrated rounding for FP
addition, and systematic rounding for FP multiplication has been widely
adopted in the FP hardware design community.