Computer Science Faculty Research & Creative Works

Towards Practical Algorithm based Fault Tolerance in Dense Linear Algebra

Panruo Wu
Qiang Guan
Nathan DeBardeleben
Sean Blanchard
Dingwen Tao
Xin Liang, Missouri University of Science and Technology
For full list of authors, see publisher's website.

Abstract

Algorithm based fault tolerance (ABFT) attracts renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. As the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper, we seek to close the gap by directly using a comprehensive architectural fault model and devising a comprehensive ABFT scheme that can tolerate multiple architectural faults of various kinds. We implement the new ABFT scheme in High Performance Linpack (HPL) to demonstrate its feasibility in a large-scale high-performance benchmark. We conduct architectural fault injection experiments and large-scale experiments to empirically validate its fault tolerance and to demonstrate the overhead of error handling, respectively.

Recommended Citation

P. Wu et al., "Towards Practical Algorithm based Fault Tolerance in Dense Linear Algebra," Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (2016, Kyoto, Japan), pp. 31 - 42, Association for Computing Machinery (ACM), May 2016.

The definitive version is available at https://doi.org/10.1145/2907294.2907315

Meeting Name

25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC '16 (2016: May 31-Jun. 4, Kyoto, Japan)

Department(s)

Computer Science

Comments

This work is partially supported by the NSF grants CCF-1305622, ACI-1305624, CCF-1513201, the SZSTI basic research program JCYJ20150630114942313, and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund (the second phase).

International Standard Book Number (ISBN)

978-145034314-5

Document Type

Article - Conference proceedings

Document Version

Final Version

File Type

text

Language(s)

English

Publication Date

31 May 2016

Included in

Computer Sciences Commons
