Reliable Processors and Systems

From James Hoe

This research investigates the impact of soft-error tolerance in future deep-submicron microprocessor designs and examines different options to achieve the desired level of protection against soft errors. This research effort is supported in part by NSF through a CAREER Award.

The TRUSS Project (Total Reliability Using Scalable Servers) develops a reliable, available, and serviceable (RAS) hardware platform based on a distributed cluster of commodity blade servers. The goal of the project is to leverage the cost-effectiveness of commodity processor and memory modules in a reliable server design that achieves both performance and cost scalability. This research effort is supported in part by NSF through an ITR Award and by Intel Corp. (See the TRUSS Project page at http://www.ece.cmu.edu/~truss.)

  • Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors. B. T. Gold, B. Falsafi, and J. C. Hoe. Pacific Rim International Symposium on Dependable Computing (PRDC), November 2009.
  • OpenSPARC: An Open Platform for Hardware Reliability Experimentation. I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve and J. Torrellas. Fourth Workshop on Silicon Errors in Logic - System Effects (SELSE), April 2008. (pdf)
  • Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding. J. Kim, N. Hardavellas, K. Mai, B. Falsafi and J. C. Hoe. International Symposium on Microarchitecture (MICRO), December 2007. (pdf)
  • PAI: A Lightweight Mechanism for Single-Node Memory Recovery in DSM Servers. J. Kim, J. C. Smolens, B. Falsafi and J. C. Hoe. Pacific Rim International Symposium on Dependable Computing (PRDC), December 2007. (pdf)
  • Detecting Emerging Wearout Faults. J. C. Smolens, B. T. Gold, J. C. Hoe, B. Falsafi, and K. Mai. The Third Workshop on Silicon Errors in Logic - System Effects (SELSE), April 2007. (pdf)
  • Reunion: Complexity-Effective Multicore Redundancy. J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe. International Symposium on Microarchitecture (MICRO), December 2006. (pdf)
  • TRUSS: Reliable, Scalable Server Architecture. B. T. Gold, J. C. Smolens, J. Kim, E. S. Chung, V. Liaskovitis, E. Nurvitadhi, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. IEEE Micro, Volume 25, Number 6, November/December 2005. (pdf)
  • Fingerprinting: Bounding Soft-Error-Detection Latency and Bandwidth. J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. IEEE Micro, Volume 24, Number 6, November/December 2004. (pdf) (note: Top Picks version of ASPLOS 2004.)
  • Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures. J. C. Smolens, J. Kim, J. C. Hoe, and B. Falsafi. International Symposium on Microarchitecture (MICRO), November 2004. (pdf)
  • Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth. J. C. Smolens, B. T. Gold, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2004. (pdf)
  • Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery. J. Ray, J. C. Hoe and B. Falsafi. International Symposium on Microarchitecture (MICRO), December 2001. (pdf)