FT-ISort: Efficient Fault Tolerance for Introsort
Abstract
Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than nave TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.
Recommended Citation
S. Li et al., "FT-ISort: Efficient Fault Tolerance for Introsort," International Conference for High Performance Computing, Networking, Storage and Analysis (2019, Denver, CO), Association for Computing Machinery (ACM), Nov 2019.
The definitive version is available at https://doi.org/10.1145/3295500.3356195
Meeting Name
International Conference for High Performance Computing, Networking, Storage and Analysis, SC '19 (2019: Nov. 17-19, Denver, CO)
Department(s)
Computer Science
Keywords and Phrases
Algorithm based Fault Tolerance; Comparison Errors; Fault Tolerant Sorting; Introsort; Soft Errors
International Standard Book Number (ISBN)
978-145036229-0
International Standard Serial Number (ISSN)
2167-4329; 2167-4337
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2019 Association for Computing Machinery (ACM), All rights reserved.
Publication Date
17 Nov 2019
Comments
The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357, and supported by the National Science Foundation under Grant No. 1619253. This work was also supported by National Science Foundation CCF 1513201 and National Key Research and Development Programs No. 2017YFB0202100.