Correcting Soft Errors Online in Fast Fourier Transform
Abstract
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.
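To illustrate the difference between checking after the whole transform and checking while it runs, the sketch below verifies each sub-transform of a decomposed DFT as soon as it completes. It is not the scheme from the paper: it assumes the simplest possible checksum (the identity that the DFT outputs sum to n times the first input), a naive O(n^2) DFT in place of FFTW, and hypothetical names (`dft`, `checksum_ok`, `checked_sub_dfts`, and the `fault` parameter used to inject an error for demonstration).

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT of a list of complex numbers (stand-in for a real FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def checksum_ok(x, y, tol=1e-9):
    """Check the identity sum(y) == n * x[0], which holds for any DFT pair.

    The tolerance is scaled by the input magnitude to absorb round-off.
    """
    n = len(x)
    scale = max(1.0, sum(abs(v) for v in x))
    return abs(sum(y) - n * x[0]) <= tol * n * scale


def checked_sub_dfts(x, n1, fault=None):
    """Run the n2 = len(x) / n1 sub-DFTs of a two-stage decomposition.

    Each sub-DFT is verified immediately after it is computed, so a corrupted
    run stops before the remaining sub-DFTs are started.
    fault = (block, index, delta) is a hypothetical injected error.
    """
    n2 = len(x) // n1
    results = []
    for b in range(n2):
        sub_in = x[b::n2]          # every n2-th sample, length n1
        sub_out = dft(sub_in)
        if fault is not None and fault[0] == b:
            sub_out[fault[1]] += fault[2]   # simulate a soft error
        if not checksum_ok(sub_in, sub_out):
            raise ArithmeticError(f"soft error detected in sub-DFT {b}")
        results.append(sub_out)
    return results
```

Calling `checked_sub_dfts(x, 4)` on a length-8 input returns two verified sub-transforms, while passing `fault=(1, 2, 1.0)` raises as soon as the second sub-transform fails its check. The all-ones checksum is chosen only for brevity; it depends on the first input element being intact, so it says nothing about errors in the input data itself.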
Recommended Citation
X. Liang et al., "Correcting Soft Errors Online in Fast Fourier Transform," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2017, Denver, CO), Association for Computing Machinery (ACM), Nov 2017.
The definitive version is available at https://doi.org/10.1145/3126908.3126915
Meeting Name
International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17 (2017: Nov. 12-17, Denver, CO)
Department(s)
Computer Science
Keywords and Phrases
Algorithm-Based Fault Tolerance; DFT; FFT; FFTW; Soft Errors
International Standard Book Number (ISBN)
978-145035114-0
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2017 Association for Computing Machinery (ACM), All rights reserved.
Publication Date
12 Nov 2017
Comments
This work is partially supported by NSF grants OAC-1305624 and CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the MOST key project 2017YFB0202100.