Carnegie Mellon University

October 09, 2025

Detecting and Correcting Soft Errors in Space

By Amber Frantz

Krista Burns

The Institute of Electrical and Electronics Engineers (IEEE) has awarded Franz Franchetti, professor of electrical and computer engineering and associate dean for research, and Ken Mai, principal systems scientist for electrical and computer engineering, the IEEE High Performance Extreme Computing (HPEC) Best Paper Award.

Their paper, “Towards an Algorithm-based Approach for Soft Error Tolerance Using Interval Arithmetic,” was presented at the 29th annual IEEE HPEC virtual conference this September and was selected as the top paper out of nearly 200 submissions to HPEC.

“It’s an honor to receive this recognition from IEEE,” said Franchetti. “The High Performance Extreme Computing Conference brings together the best minds in performance. We are honored to have been able to present our findings.”

Large-scale deployment of semiconductor technology in both space and terrestrial applications has led to increased vulnerability to radiation-induced soft errors. Soft errors occur when high-energy particles, such as protons or neutrons, strike a chip and cause a charge disturbance that temporarily changes its state or corrupts data. These errors can lead to system failures, and protecting against them is especially crucial in space or upper-atmosphere conditions, where radiation levels are significant.

In their paper, Franchetti and Mai present a novel algorithm-based approach that uses interval arithmetic and forward error analysis to help a system detect and correct these temporary soft errors. As a proof of concept, the research team built a test chip that implements the technique on a fast Fourier transform (FFT) systolic array datapath.
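As a rough illustration of the idea, and not the authors' implementation (which targets an FFT datapath in hardware), the Python sketch below encloses a dot product in a floating-point interval and flags a result that falls outside it. The helper names `flip_bit` and `interval_dot`, and the choice of a dot product as the computation, are assumptions made for this example; `math.nextafter` requires Python 3.9 or later.

```python
import math
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y


def interval_dot(xs, ys):
    """Enclose the dot product of xs and ys in an interval [lo, hi].

    After every operation the bounds are pushed outward by one unit in the
    last place, so rounding error alone cannot move the result outside.
    """
    lo = hi = 0.0
    for x, y in zip(xs, ys):
        p = x * y
        lo = math.nextafter(lo + math.nextafter(p, -math.inf), -math.inf)
        hi = math.nextafter(hi + math.nextafter(p, math.inf), math.inf)
    return lo, hi


xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
lo, hi = interval_dot(xs, ys)

result = sum(x * y for x, y in zip(xs, ys))  # fault-free value: 32.0
faulty = flip_bit(result, 52)                # simulated exponent bit flip: 64.0

print(lo <= result <= hi)  # True: the clean result lies inside the interval
print(lo <= faulty <= hi)  # False: the corrupted result is flagged
```

In this sketch, a flip in a low-order mantissa bit could leave the value inside the interval and go unnoticed, but the resulting error would then be no larger than the interval width.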

Traditional approaches to mitigating the effects of soft errors rely on system-level hardware redundancy for error detection and correction, using triple modular redundancy (TMR) or dual modular redundancy (DMR) schemes. However, both TMR and DMR incur significant costs in chip area, power, and performance.
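To make the cost of these schemes concrete, here is a generic textbook-style sketch in Python; it is not taken from the paper, and the function names `tmr_vote` and `dmr_check` are hypothetical. TMR runs three copies of a computation and takes a bitwise majority vote, which masks a fault in any single copy. DMR runs two copies and can only report that they disagree.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results.

    Each output bit takes the value held by at least two of the inputs,
    so an upset in one copy is outvoted by the other two.
    """
    return (a & b) | (a & c) | (b & c)


def dmr_check(a: int, b: int) -> bool:
    """Return True when two redundant results agree.

    A mismatch signals an error but cannot tell which copy is wrong.
    """
    return a == b


good = 0b1011
upset = good ^ 0b0100  # one copy suffers a single bit flip

print(tmr_vote(good, upset, good) == good)  # True: the fault is corrected
print(dmr_check(good, upset))               # False: the fault is only detected
```

The price is visible in the structure itself: every result is computed three times (or twice), plus the voter or comparator.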

An alternative approach to mitigating soft errors is algorithm-based fault tolerance (ABFT), a technique that builds error detection and correction into the algorithm itself. The work in this paper is inspired by ABFT and proposes a quantitative evaluation of a new hardware redundancy approach, demonstrating that floating-point interval arithmetic and forward error analysis of a specific computation can assist in error detection and redundancy.
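One well-known form of ABFT attaches checksums to matrix operations. The Python sketch below is a minimal example of that general idea and not the method from the paper; the function names are hypothetical, and integers are used so that the checksum comparison is exact. It checks a matrix-vector product by comparing the sum of the outputs against a value computed independently from the column sums of the matrix.

```python
def matvec(A, x):
    """Plain matrix-vector product y = A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def checksum_row(A):
    """Column sums of A, that is, the row vector (1, ..., 1) times A."""
    return [sum(col) for col in zip(*A)]


def abft_ok(A, x, y):
    """Check that sum(y) matches (column sums of A) dot x."""
    expected = sum(c * xi for c, xi in zip(checksum_row(A), x))
    return sum(y) == expected


A = [[1, 2], [3, 4]]
x = [5, 6]
y = matvec(A, x)          # [17, 39]

print(abft_ok(A, x, y))   # True: checksums agree

y[0] ^= 1 << 3            # simulate a single bit flip in one output
print(abft_ok(A, x, y))   # False: the corruption is detected
```

With floating-point data the two sides would generally differ by rounding error, so the comparison would need a tolerance; choosing that tolerance is where a bound on rounding error becomes relevant.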

“We hope to test our approach in low Earth orbit in the coming months,” explains Mai. “Our unique protection technique can significantly improve the energy efficiency and robustness of space systems, making space exploration available to more types of researchers for lower costs.”