# The Efficacy of Error Mitigation Techniques for DRAM Retention Failures

Samira Khan\*<sup>\$</sup>, Donghyuk Lee\*, Yoongu Kim\*, Alaa R. Alameldeen<sup>\$</sup>, Chris Wilkerson<sup>\$</sup>, and Onur Mutlu\*

\*Carnegie Mellon University, <sup>\$</sup>Intel Labs

### DRAM Scaling Problem

- DRAM is critical for performance
- Demand for high capacity
- Scaling enabled higher capacity
- Scaling of DRAM results in failures
- Intermittent failures are hard to detect



Longer manufacture-time tests, Lower yield, Higher cost

#### Vision: Online Profiling

- Detect and mitigate errors at runtime
  - After the system has become operational
- Reduces cost of testing, increases yield, enables scaling
- We analyze the efficacy of system-level techniques
  - Using experimental data from real DIMMs



#### DRAM Intermittent Failures









### Efficacy of System-level Detection and Mitigation

#### **1** Testing

1. Write some pattern in the module
2. Wait until refresh interval
3. Read and verify

[Figure: x-axis **Number of Rounds**, 100 to 1000]

Only a few rounds can discover most of the failures

Even after hundreds of rounds, a small number of new cells keep failing
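The write, wait, read-and-verify loop can be sketched as a small simulation. This is a toy model, not the test infrastructure used on the real DIMMs: `SimulatedModule`, its per-cell failure probabilities, and `profile` are hypothetical names introduced only for illustration.

```python
import random


class SimulatedModule:
    """Toy DRAM model (hypothetical): each weak cell flips with its own
    probability during one refresh interval."""

    def __init__(self, num_cells, weak_cells, seed=0):
        self.rng = random.Random(seed)
        self.data = [0] * num_cells
        # Assumed input: {cell index: probability of failing in a round}
        self.weak_cells = weak_cells

    def write(self, pattern):
        self.data = list(pattern)

    def wait_refresh_interval(self):
        # Retention loss is modeled as a bit flip in a weak cell.
        for cell, p_fail in self.weak_cells.items():
            if self.rng.random() < p_fail:
                self.data[cell] ^= 1

    def read(self):
        return list(self.data)


def profile(module, pattern, rounds):
    """Run `rounds` test rounds; return the set of cells seen failing and
    the cumulative count of distinct failing cells after each round."""
    seen = set()
    history = []
    for _ in range(rounds):
        module.write(pattern)            # 1. write some pattern
        module.wait_refresh_interval()   # 2. wait one refresh interval
        observed = module.read()         # 3. read and verify
        seen |= {i for i, (w, r) in enumerate(zip(pattern, observed)) if w != r}
        history.append(len(seen))
    return seen, history
```

In this model a cell with a low per-round failure probability may first appear only after many rounds, which is the qualitative behavior the section describes; the probabilities themselves are made up.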

#### **2** Guardbanding



Even a large guardband (5X) cannot detect 5-15% of the intermittently failing cells
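A minimal sketch of why a guardband can miss intermittently failing cells, assuming guardbanding means testing at a refresh interval that is a multiple of the operating interval. The retention times, the 64 ms operating interval, and the function names below are illustrative assumptions, not measurements from the poster.

```python
def cells_failing_at(retention_ms, interval_ms):
    """Cells whose retention time is shorter than the refresh interval."""
    return {cell for cell, t in retention_ms.items() if t < interval_ms}


def guardband_coverage(retention_at_test, retention_later, operating_ms, guardband):
    """Compare cells flagged by a guardbanded test with cells that fail later
    at the operating interval.

    Both retention arguments are hypothetical {cell: retention time in ms}
    snapshots; a cell whose retention time changes between the two snapshots
    models an intermittent failure.
    """
    flagged = cells_failing_at(retention_at_test, operating_ms * guardband)
    later_failures = cells_failing_at(retention_later, operating_ms)
    if not later_failures:
        return flagged, later_failures, 1.0
    covered = len(later_failures & flagged) / len(later_failures)
    return flagged, later_failures, covered


# Made-up example: cell 2 looked healthy during the test but degrades later.
at_test = {0: 500, 1: 200, 2: 1000, 3: 900}
later = {0: 500, 1: 40, 2: 50, 3: 900}
flagged, later_failures, covered = guardband_coverage(at_test, later, 64, 5)
```

Here the 5X test flags cell 1 but not cell 2, so only half of the later failures were caught in advance. The fraction is an artifact of the invented numbers.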

#### **3** Error Correcting Code

Additional information to detect error and correct data

[Figure: Probability vs. **Number of Rounds** (up to 1000); series: SECDED; SECDED, 2X Guardband]

SECDED code reduces error rate by 100 times

**Combination of techniques** reduces error rate by 10<sup>7</sup> times

A combination of mitigation techniques is much more effective
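To make SECDED (single-error correction, double-error detection) concrete, here is a minimal sketch using an extended Hamming(8,4) code. The 4-bit data width is chosen for brevity and is an assumption of this example; the poster does not state which code width was evaluated, and `secded_encode` and `secded_decode` are illustrative names.

```python
def secded_encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout: index 0 is the overall parity bit; indices 1, 2, 4 are Hamming
    parity bits; indices 3, 5, 6, 7 carry the data.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    for bit in c[1:]:
        c[0] ^= bit  # overall parity makes the whole word even
    return c


def secded_decode(word):
    """Return (data, status), status in {"ok", "corrected", "uncorrectable"}."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of set positions locates a single flip
    overall = 0
    for bit in c:
        overall ^= bit  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1  # syndrome 0 here means the parity bit at index 0
        status = "corrected"
    else:
        return None, "uncorrectable"  # even number of flips: detect only
    return [c[3], c[5], c[6], c[7]], status
```

Any single flipped bit is corrected and any two flipped bits are reported as uncorrectable; three or more flips can be silently miscorrected, which is one reason to pair ECC with the other techniques.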

#### **Towards an Online Profiling System**

#### **Key Observations so far:**

1. Testing alone cannot detect all possible failures


2. Combination of ECC and other mitigation techniques is much more effective
   - But degrades performance
3. Testing can help to reduce the ECC strength
   - Even if we start with a higher strength ECC

1. **Initially Protect DRAM** with Strong ECC
2. **Periodically Test** Parts of DRAM
3. Mitigate errors and reduce ECC

**Test**: Run tests periodically after a short interval at smaller regions of memory
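One way to picture this flow is a round-robin loop over memory regions. The sketch below is a hypothetical policy, not the authors' design: the `Region` record, the two ECC levels, and the `clean_rounds_needed` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    ecc: str = "strong"          # every region starts under strong ECC
    clean_rounds: int = 0        # consecutive test rounds without a failure
    failing_cells: set = field(default_factory=set)


def profiling_step(regions, index, test_region, clean_rounds_needed=3):
    """Test one small region, update its protection, return the next index.

    `test_region(index)` is an assumed callback that returns the set of cells
    that failed in this round (empty if none).
    """
    region = regions[index]
    failures = test_region(index)
    if failures:
        # Mitigate: remember the failing cells and keep strong protection.
        region.failing_cells |= failures
        region.clean_rounds = 0
        region.ecc = "strong"
    else:
        region.clean_rounds += 1
        if region.clean_rounds >= clean_rounds_needed:
            region.ecc = "weak"  # reduce ECC strength after repeated clean tests
    return (index + 1) % len(regions)
```

Because testing alone cannot detect all possible failures, a region that has been downgraded is still revisited on later passes and returns to strong ECC as soon as a failure appears.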