Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs
Abstract
Extensive research has been done on developing and optimizing algorithm-based fault tolerance (ABFT) schemes for systolic arrays and general-purpose microprocessors. However, little has been done on developing and optimizing ABFT schemes for heterogeneous systems with GPU accelerators. While existing ABFT schemes can correct computing errors such as 1+1=3, we find that they cannot correct many memory storage errors. In this paper, we first develop a new ABFT scheme for Cholesky decomposition that can correct both computing errors and storage errors at the same time, and then develop several optimization techniques to reduce the fault tolerance overhead of ABFT on heterogeneous systems with GPU accelerators. Experimental results demonstrate that our fault-tolerant Cholesky decomposition can correct both computing errors and storage errors in the middle of the computation and can achieve better performance than the state-of-the-art vendor-provided Cholesky decomposition routine in CULA R18.
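To illustrate the general checksum idea that ABFT builds on, the following is a minimal sketch of how two column checksums (one plain, one weighted by row index) can locate and repair a single corrupted matrix element. It is a generic illustration of checksum-based correction and is not the scheme developed in the paper; the function names `column_checksums` and `detect_and_correct` are hypothetical, and the sketch assumes at most one corrupted element per column and checksums that are themselves intact.

```python
def column_checksums(matrix):
    """Return (plain, weighted) column checksums; row i has weight i + 1."""
    n_cols = len(matrix[0])
    plain = [sum(row[j] for row in matrix) for j in range(n_cols)]
    weighted = [sum((i + 1) * row[j] for i, row in enumerate(matrix))
                for j in range(n_cols)]
    return plain, weighted


def detect_and_correct(matrix, plain, weighted, tol=1e-9):
    """Repair at most one corrupted element per column, in place.

    Assumption: the stored checksums are correct. Returns the list of
    (row, col) positions that were corrected.
    """
    new_plain, new_weighted = column_checksums(matrix)
    fixed = []
    for j in range(len(plain)):
        d1 = new_plain[j] - plain[j]        # error magnitude in column j
        if abs(d1) <= tol:
            continue                        # column j matches its checksum
        d2 = new_weighted[j] - weighted[j]  # error magnitude times (row + 1)
        row = round(d2 / d1) - 1            # recover the corrupted row index
        matrix[row][j] -= d1                # subtract the error to restore
        fixed.append((row, j))
    return fixed


# Small demonstration on a symmetric positive definite matrix.
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
plain, weighted = column_checksums(A)
A[1][2] += 7.0                              # simulated storage error
corrected = detect_and_correct(A, plain, weighted)
print(corrected)                            # [(1, 2)]
```

The plain checksum gives the size of the error and the ratio of the weighted to the plain discrepancy gives its row, which is why two checksums are enough to correct one error per column. In floating-point practice the tolerance `tol` would need to account for rounding error, which this sketch does not attempt.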
Recommended Citation
J. Chen et al., "Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs," Proceedings of the 30th International Parallel and Distributed Processing Symposium (2016, Chicago, IL), pp. 993-1002, Institute of Electrical and Electronics Engineers (IEEE), Jul 2016.
The definitive version is available at https://doi.org/10.1109/IPDPS.2016.81
Meeting Name
30th International Parallel and Distributed Processing Symposium, IPDPS 2016 (2016: May 23-27, Chicago, IL)
Department(s)
Computer Science
Keywords and Phrases
Cholesky Decomposition; CULA; Fault Tolerance; GPUs; MAGMA; Offline ABFT; Online ABFT
International Standard Book Number (ISBN)
978-150902140-6
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2016 Institute of Electrical and Electronics Engineers (IEEE), All rights reserved.
Publication Date
18 Jul 2016
Comments
This work is partially supported by the NSF grants CCF-1305622, ACI-1305624, CCF-1513201, and the SZSTI basic research program JCYJ20150630114942313.