Self-Repair of Uncore Components in Robust System-on-Chips

Yanjing Li (Intel Labs/Stanford)

Thursday, September 19, 4:30pm-5:30pm
Hamerschlag Hall 1107

Abstract

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this talk, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques, using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are: 1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 20%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact. 2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance. 3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault. Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

Bio

Yanjing Li is a research scientist at Intel Labs and a visiting scholar at Stanford University. She received her Ph.D. in Electrical Engineering from Stanford University in 2013, and both a M.S. in Mathematical Sciences and a B.S. in Electrical and Computer Engineering from Carnegie Mellon University. Yanjing has authored and presented award-winning papers at IEEE International Test Conference and IEEE VLSI Test Symposium. She has also been awarded an Intel Divisional Recognition Award. Her research interests include robust system design, validation and test, and computer architecture.