Learning Based Fault Detection Through Indirect System Observation

Mitchell Martin

Paul Bogdan

Radu Marculescu

Shawn Blanton
Integrated systems increasingly need schemes that ensure they operate reliably, free of faults. Conventional fault detection is typically performed off-line, using special test modes of operation controlled by expensive test equipment. Such approaches are not viable for systems already deployed and operating in the field, where faults can develop over time due to, for example, negative bias temperature instability (NBTI). Machine learning offers an alternative approach for on-line fault detection within an integrated system. Assuming that subtle, non-catastrophic faults adversely affect system operation, it is conceivable that models can be learned from system data that distinguish faulty systems from their fault-free counterparts. A machine learning module within the system is thus envisioned that gathers appropriate data, learns the model, monitors operation using the model, and flags anomalies that result from faults manifesting during in-field operation.
This framework has been implemented to detect faults in a processor design with a network-on-chip (NoC) communication architecture. Faulty NoCs are created with packet-drop rates that do not cause catastrophic behavior but do degrade communication performance, for contrast with their fault-free counterparts. K-nearest neighbor and decision tree classifiers are used to classify NoC operation as either fault-free or faulty, using data that measures the node-to-node throughput and latency for each pair of processors. The K-nearest neighbor classifier (Fig. 2) produces a higher accuracy than the decision tree (Fig. 1) in most scenarios, although both perform satisfactorily. Current work is examining how this information can be used to localize the fault so that corrective actions can be taken to mitigate any long-term negative effects on system operation.
Fig. 1. Decision-tree classification of 3,000 NoCs. Green points are correctly classified as faulty, yellow points are correctly classified as fault-free, and red points are incorrectly classified.
Fig. 2. K-nearest neighbor classification of the same 3,000 NoCs.
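
As a rough illustration of the classification step described above, the following minimal sketch trains a K-nearest neighbor classifier on synthetic per-pair throughput and latency features and labels each simulated NoC as fault-free or faulty. It is not the authors' implementation: the toy traffic model in `simulate_noc`, the drop-rate range, the noise levels, the number of node pairs, and the choice of k = 5 are all assumptions made purely for illustration.

```python
import random
from collections import Counter


def simulate_noc(drop_rate, rng, n_pairs=4):
    """Return [throughput, latency, ...] for n_pairs node pairs.

    Hypothetical toy model (an assumption, not from the source): dropped
    packets lower throughput and raise latency through retransmission.
    """
    features = []
    for _ in range(n_pairs):
        throughput = (1.0 - drop_rate) * rng.gauss(1.0, 0.02)
        latency = rng.gauss(10.0, 0.2) / (1.0 - drop_rate)
        features.extend([throughput, latency])
    return features


def make_dataset(n, rng):
    """Build n labeled samples, alternating fault-free (0) and faulty (1)."""
    data = []
    for i in range(n):
        faulty = i % 2 == 1
        # Assumed non-catastrophic drop rates for the faulty class.
        drop_rate = rng.uniform(0.05, 0.15) if faulty else 0.0
        data.append((simulate_noc(drop_rate, rng), int(faulty)))
    return data


def knn_predict(train, x, k=5):
    """Majority vote among the k training samples closest to x."""
    def sq_dist(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], x))

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=5):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_dataset(200, rng)
test = make_dataset(100, rng)
acc = accuracy(train, test)
print(f"k-NN accuracy on synthetic NoC data: {acc:.2f}")
```

In a deployed monitor, `knn_predict` would presumably be called on freshly measured throughput and latency values, with a prediction of 1 flagging the NoC as faulty. Because the features here are on different scales, latency dominates the distance; normalizing each feature before computing distances would be a natural refinement.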