Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-Iteration of NWChem
Future exascale systems are expected to experience more frequent failures than current petascale systems. This makes it more important for applications to minimize the time wasted on recomputation when recovering from a checkpoint. Typically, HPC applications checkpoint at iteration boundaries. However, for applications with a high per-iteration cost, checkpointing inside the iteration limits the amount of recomputation. This paper analyzes the performance and accuracy of lossy compressed checkpointing in the computational chemistry application NWChem. Our results indicate that lossy compression is an effective tool for reducing the sub-iteration checkpoint size. Moreover, we quantify the compression error tolerances that yield acceptable deviations in accuracy and iteration count.
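The idea of an error-bounded lossy checkpoint can be illustrated with a minimal sketch. It is not the compressor evaluated in the paper and does not use any NWChem interface. It assumes a simple scheme: uniform quantization to a grid chosen from an absolute error tolerance, followed by lossless deflate. The function names `compress_checkpoint` and `decompress_checkpoint`, the tolerance value, and the synthetic data are all hypothetical.

```python
import math
import struct
import zlib


def compress_checkpoint(values, abs_tol):
    """Quantize floats to a grid of width 2*abs_tol, then deflate the codes.

    Illustrative only: the reconstruction error is at most abs_tol per value
    (up to floating-point rounding), because each value snaps to the nearest
    grid point and grid points are 2*abs_tol apart.
    """
    step = 2.0 * abs_tol
    codes = [round(v / step) for v in values]          # integer grid indices
    raw = struct.pack(f"<{len(codes)}q", *codes)       # 64-bit signed ints
    return zlib.compress(raw, 9)


def decompress_checkpoint(blob, abs_tol):
    """Invert compress_checkpoint: inflate the codes and rescale to floats."""
    step = 2.0 * abs_tol
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 8}q", raw)
    return [c * step for c in codes]


if __name__ == "__main__":
    # Synthetic, smoothly varying data standing in for sub-iteration state.
    state = [math.exp(-0.001 * i) * math.sin(0.01 * i) for i in range(10000)]
    tol = 1e-6  # hypothetical absolute error tolerance

    blob = compress_checkpoint(state, tol)
    restored = decompress_checkpoint(blob, tol)

    max_err = max(abs(a - b) for a, b in zip(state, restored))
    print("uncompressed bytes:", 8 * len(state))
    print("compressed bytes:  ", len(blob))
    print("max abs error:     ", max_err)
```

Loosening the tolerance shrinks the range of integer codes and so the checkpoint size, at the cost of a larger perturbation in the restored state. That is the trade-off between checkpoint size and deviation in accuracy and iteration count that the abstract describes.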
T. Reza et al., "Analyzing the Performance and Accuracy of Lossy Checkpointing on Sub-Iteration of NWChem," Proceedings of DRBSD-5 2019: 5th International Workshop on Data Analysis and Reduction for Big Scientific Data - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 23-27, Association for Computing Machinery (ACM), Nov 2019.
The definitive version is available at https://doi.org/10.1109/DRBSD-549595.2019.00009
2019 IEEE/ACM 5th International Workshop on Data Analysis and Reduction for Big Scientific Data, DRBSD-5 @ SC'19 (2019: Nov. 17, Denver, CO)
Keywords and Phrases
Checkpoint-Restart; Coupled-Cluster Singles and Doubles; Lossy Data Compression; NWChem
International Standard Book Number (ISBN)
Article - Conference proceedings
© 2019 Association for Computing Machinery (ACM), All rights reserved.
01 Nov 2019