Computer Science Faculty Research & Creative Works

Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-Iteration of NWChem

Tasmia Reza
Kristopher Keipert
Sheng Di
Xin Liang, Missouri University of Science and Technology
Jon Calhoun
Franck Cappello

Abstract

Future exascale systems are expected to be characterized by more frequent failures than current petascale systems. This places increased importance on the application to minimize the amount of time wasted due to recomputation when recovering from a checkpoint. Typically, HPC applications checkpoint at iteration boundaries. However, for applications that have a high per-iteration cost, checkpointing inside the iteration limits the amount of recomputation. This paper analyzes the performance and accuracy of using lossy compressed checkpointing in the computational chemistry application NWChem. Our results indicate that lossy compression is an effective tool for reducing the sub-iteration checkpoint size. Moreover, compression error tolerances that yield acceptable deviation in accuracy and iteration count are quantified.

Recommended Citation

T. Reza et al., "Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-Iteration of NWChem," Proceedings of DRBSD-5 2019: 5th International Workshop on Data Analysis and Reduction for Big Scientific Data - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 23 - 27, Association for Computing Machinery (ACM), Nov 2019.

The definitive version is available at https://doi.org/10.1109/DRBSD-549595.2019.00009

Meeting Name

2019 IEEE/ACM 5th International Workshop on Data Analysis and Reduction for Big Scientific Data, DRBSD-5 @ SC'19 (2019: Nov. 17, Denver, CO)

Department(s)

Computer Science

Comments

This material is based upon work supported by the National Science Foundation under Grant No. SHF-1910197 and Grant No. SHF-1619253. This work is supported by the US Department of Energy under subaward No. 9F-60179. This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC. The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.

Keywords and Phrases

Checkpoint-Restart; Coupled-Cluster Singles and Doubles; Lossy Data Compression; NWChem

International Standard Book Number (ISBN)

978-172816017-7

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Publication Date

01 Nov 2019
