Silent Data Corruption Resilient Two-Sided Matrix Factorizations
Abstract
This paper presents an algorithm-based fault tolerance (ABFT) method to harden three two-sided matrix factorizations against soft errors: reduction to Hessenberg form, tridiagonal form, and bidiagonal form. These two-sided factorizations are usually prerequisites for computing eigenvalues and eigenvectors and the singular value decomposition. ABFT has been shown to work on the three main one-sided matrix factorizations (LU, Cholesky, and QR), but extending it to two-sided factorizations is non-trivial because there is no obvious offline, problem-specific way to maintain checksums. We therefore develop an online, algorithm-specific checksum scheme and show how to systematically adapt the two-sided factorization algorithms used in the LAPACK and ScaLAPACK packages to introduce algorithm-based fault tolerance. The resulting ABFT scheme can detect and correct arithmetic errors continuously during the factorizations, which allows timely error handling. Detailed analysis and experiments show the cost and the gain in resilience. We demonstrate that our scheme covers a significant portion of the operations of the factorizations. Our checksum scheme achieves high error detection coverage and error correction coverage compared to the state of the art, with low overhead and high scalability.
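For readers unfamiliar with checksum-based fault tolerance, the general idea can be illustrated on plain matrix multiplication. The sketch below is not the paper's online scheme for two-sided factorizations; it is the classical row and column checksum construction, written in pure Python, and every function name in it (`matmul`, `with_column_checksum`, `with_row_checksum`, `locate_error`) is a hypothetical helper introduced only for this illustration. It assumes at most one corrupted entry in the data part of the product.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Check the checksums of an encoded product.

    c_full is the product of a column-checksummed matrix and a
    row-checksummed matrix, so its last row and last column should equal
    the column sums and row sums of the data part.
    Returns None if all checksums agree, or (row, col, delta) for a single
    corrupted data entry, where delta is the amount to subtract to repair it.
    """
    n_rows = len(c_full) - 1
    n_cols = len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if abs(sum(c_full[i][:n_cols]) - c_full[i][n_cols]) > tol]
    bad_cols = [j for j in range(n_cols)
                if abs(sum(c_full[i][j] for i in range(n_rows))
                       - c_full[n_rows][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        delta = sum(c_full[i][:n_cols]) - c_full[i][n_cols]
        return (i, j, delta)
    raise ValueError("error pattern is not a single corrupted data entry")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] += 5                 # simulate a soft error in one entry
    i, j, delta = locate_error(c_full)
    c_full[i][j] -= delta             # repair the entry in place
    print(locate_error(c_full))       # None once the checksums agree again
```

The mismatching row checksum and column checksum intersect at the corrupted entry, and the size of the mismatch gives the correction. Keeping such checksums valid through the steps of a two-sided factorization is the harder problem that the paper addresses.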
Recommended Citation
P. Wu et al., "Silent Data Corruption Resilient Two-Sided Matrix Factorizations," Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (2017, Austin, TX), pp. 415 - 427, Association for Computing Machinery (ACM), Jan 2017.
The definitive version is available at https://doi.org/10.1145/3018743.3018750
Meeting Name
22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP '17 (2017: Feb. 4-8, Austin, TX)
Department(s)
Computer Science
International Standard Book Number (ISBN)
978-145034493-7
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2017 Association for Computing Machinery (ACM), All rights reserved.
Publication Date
26 Jan 2017
Comments
This work is partially supported by the NSF ACI-1305624, CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).