Fault Tolerant One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs
Abstract
Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking schemes are inefficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) their checksum calculation, which is based on a special type of matrix multiplication, is far from efficient. To overcome these limitations, we design an efficient ABFT approach that provides stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, we make the checking scheme more efficient by prioritizing checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x through better utilization of GPUs.
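The idea of protecting a matrix with checksums in two dimensions can be illustrated with a small sketch. The code below is not the paper's implementation: it is a generic, CPU-only illustration in plain Python, and the function names (`column_checksums`, `row_checksums`, `locate_and_correct`) and the tolerance `tol` are hypothetical. It assumes unweighted sum checksums and at most one corrupted data element. A mismatching row checksum and a mismatching column checksum together pinpoint the faulty element, which is what a checksum in only one dimension cannot do on its own.

```python
def column_checksums(a):
    """Return the sum of each column of matrix a (a list of rows)."""
    return [sum(col) for col in zip(*a)]


def row_checksums(a):
    """Return the sum of each row of matrix a."""
    return [sum(row) for row in a]


def locate_and_correct(a, col_sums, row_sums, tol=1e-9):
    """Detect, locate, and repair a single corrupted element of a in place.

    Returns the (row, column) index of the repaired element, or None if
    every checksum still matches. Assumption: at most one data element is
    wrong and the stored checksums themselves are intact.
    """
    bad_rows = [i for i, row in enumerate(a)
                if abs(sum(row) - row_sums[i]) > tol]
    bad_cols = [j for j, col in enumerate(zip(*a))
                if abs(sum(col) - col_sums[j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data-element error; cannot correct")
    i, j = bad_rows[0], bad_cols[0]
    # The row-checksum mismatch equals the error added to element (i, j).
    a[i][j] -= sum(a[i]) - row_sums[i]
    return (i, j)


# Usage: encode, inject a simulated soft error, then verify and repair.
a = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
col_sums = column_checksums(a)
row_sums = row_checksums(a)
a[1][2] += 8.0  # simulated soft error
position = locate_and_correct(a, col_sums, row_sums)
```

In an actual ABFT decomposition the checksums would be updated alongside the factorization steps rather than recomputed from scratch, and practical schemes often use weighted checksums as well; both are outside the scope of this sketch.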
Recommended Citation
J. Chen et al., "Fault Tolerant One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs," Proceedings of the 30th ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (2018, Dallas, TX), pp. 854-865, Association for Computing Machinery (ACM), Mar 2019.
The definitive version is available at https://doi.org/10.1109/SC.2018.00071
Meeting Name
30th ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18 (2018: Nov. 11-16, Dallas, TX)
Department(s)
Computer Science
Keywords and Phrases
Algorithm-Based Fault Tolerance; GPU; Heterogeneous System; Linear Algebra; Matrix Decomposition
International Standard Book Number (ISBN)
978-153868384-2
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2019 Association for Computing Machinery (ACM), All rights reserved.
Publication Date
11 Mar 2019