A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware

Curtis Storlie, Sarah E. Michalak, Heather M. Quinn, Andrew J. DuBois, Steven A. Wender, David H. DuBois

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

A soft error is an undesired change in an electronic device's state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first Petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.
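The definition of silent data corruption above can be illustrated with a minimal sketch. It is not taken from the article: the helper `flip_bit`, the sample values, and the choice of bit 52 are all hypothetical, and the bit flip is simulated in software rather than induced by a neutron strike. It shows only how a single inverted bit in a stored number changes a computed result without any exception or warning being raised.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted.

    Hypothetical helper for illustration; bit 0 is the least significant
    mantissa bit and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the bytes as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Assumed toy workload: summing four stored values.
values = [1.0, 2.0, 3.0, 4.0]
clean_sum = sum(values)

# Simulate a soft error in the second value: bit 52 is the lowest
# exponent bit, so inverting it here doubles 2.0 to 4.0.
corrupted = list(values)
corrupted[1] = flip_bit(corrupted[1], 52)
corrupted_sum = sum(corrupted)

# No error is raised; the only symptom is a different numerical result.
print(clean_sum, corrupted_sum)
```

Nothing in the corrupted run signals a problem, which is why such errors can only be found by comparing against a known-correct result or a redundant computation.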

Original language: English (US)
Pages (from-to): 429-440
Number of pages: 12
Journal: Journal of the American Statistical Association
Volume: 108
Issue number: 502
DOI: 10.1080/01621459.2013.770694
State: Published - 2013
Externally published: Yes

Fingerprint

Soft Error
Reliability Analysis
Bayesian Analysis
Neutron
High Performance
Hardware
Computing
Crash
Censored Survival Data
Interval-censored Data
Mixed Effects
Supercomputer
Microprocessor
Flip
Computational Results
Experimental Data
Electronics
Uncertainty
Testing
Configuration

Keywords

  • Accelerated testing
  • Cox proportional hazards
  • Gaussian process
  • Mixed effects
  • Neutron beam
  • Silent data corruption
  • Stochastic search variable selection

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware. / Storlie, Curtis; Michalak, Sarah E.; Quinn, Heather M.; DuBois, Andrew J.; Wender, Steven A.; DuBois, David H.

In: Journal of the American Statistical Association, Vol. 108, No. 502, 2013, p. 429-440.

Storlie, Curtis ; Michalak, Sarah E. ; Quinn, Heather M. ; DuBois, Andrew J. ; Wender, Steven A. ; DuBois, David H. / A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware. In: Journal of the American Statistical Association. 2013 ; Vol. 108, No. 502. pp. 429-440.
@article{31c867f519c84bea907a0eb7005508c9,
title = "A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware",
abstract = "A soft error is an undesired change in an electronic device's state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first Petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.",
keywords = "Accelerated testing, Cox proportional hazards, Gaussian process, Mixed effects, Neutron beam, Silent data corruption, Stochastic search variable selection",
author = "Curtis Storlie and Michalak, {Sarah E.} and Quinn, {Heather M.} and DuBois, {Andrew J.} and Wender, {Steven A.} and DuBois, {David H.}",
year = "2013",
doi = "10.1080/01621459.2013.770694",
language = "English (US)",
volume = "108",
pages = "429--440",
journal = "Journal of the American Statistical Association",
issn = "0162-1459",
publisher = "Taylor and Francis Ltd.",
number = "502",
}

TY - JOUR

T1 - A Bayesian reliability analysis of neutron-induced errors in high performance computing hardware

AU - Storlie, Curtis

AU - Michalak, Sarah E.

AU - Quinn, Heather M.

AU - DuBois, Andrew J.

AU - Wender, Steven A.

AU - DuBois, David H.

PY - 2013

Y1 - 2013

AB - A soft error is an undesired change in an electronic device's state, for example, a bit flip in computer memory, that does not permanently affect its functionality. In microprocessor systems, neutron-induced soft errors can cause crashes and silent data corruption (SDC). SDC occurs when a soft error produces a computational result that is incorrect, without the system issuing a warning or error message. Hence, neutron-induced soft errors are a major concern for high performance computing platforms that perform scientific computation. Through accelerated neutron beam testing of hardware in its field configuration, the frequencies of failures (crashes) and of SDCs in hardware from the Roadrunner platform, the first Petaflop supercomputer, are estimated. The impact of key factors on field performance is investigated and estimates of field reliability are provided. Finally, a novel statistical approach for the analysis of interval-censored survival data with mixed effects and uncertainty in the interval endpoints, key features of the experimental data, is presented. Supplementary materials for this article are available online.

KW - Accelerated testing

KW - Cox proportional hazards

KW - Gaussian process

KW - Mixed effects

KW - Neutron beam

KW - Silent data corruption

KW - Stochastic search variable selection

UR - http://www.scopus.com/inward/record.url?scp=84890078418&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84890078418&partnerID=8YFLogxK

U2 - 10.1080/01621459.2013.770694

DO - 10.1080/01621459.2013.770694

M3 - Article

AN - SCOPUS:84890078418

VL - 108

SP - 429

EP - 440

JO - Journal of the American Statistical Association

JF - Journal of the American Statistical Association

SN - 0162-1459

IS - 502

ER -