Abstract
The checksum technique is a low-cost method for detecting errors in matrix operations performed by processor arrays. Because it detects faults only at problem termination, it is not an effective fault-tolerance technique for large-scale matrix multiplication. This paper presents a new algorithm, the ID algorithm, which minimizes fault-detection latency. In the ID algorithm, a fault is detected as soon as it occurs rather than at problem termination. For n^2 processors, the fault-latency time of the ID algorithm is 1/n of that of the checksum algorithm, with a run-time penalty of O(n log_2 n) in an n x n matrix operation. The new algorithm has better performance in terms of error coverage and expected run time in large-scale matrix multiplications such as those found in signal and image processing, weather prediction, and finite element analysis.
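As a rough illustration of the checksum idea the abstract refers to, the sketch below multiplies a column-checksum matrix by a row-checksum matrix and verifies that the row and column sums of the product stay consistent. It is not the paper's ID algorithm, which this record does not describe. It is a sequential, single-process sketch using an outer-product formulation, and the function names, the `fault_at` parameter, and the simulated fault are all assumptions made for illustration. Checking after every outer-product step, rather than once at the end, is shown only to convey what detecting a fault within one iteration could look like.

```python
def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def checksums_ok(c):
    """Check a full-checksum matrix: the last column must hold the row sums
    and the last row must hold the column sums of the data part."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(
        sum(row[j] for row in c[:-1]) == c[-1][j] for j in range(len(c[0]))
    )
    return rows_ok and cols_ok


def multiply_checked(a, b, fault_at=None):
    """Multiply a by b as a sum of outer products, checking after each step.

    fault_at is a hypothetical hook: if it equals the step index k, one
    entry of the partial product is corrupted to simulate a fault.
    Returns (c, step), where step is the first step at which the check
    failed, or None if every check passed.
    """
    ac = with_column_checksum(a)
    br = with_row_checksum(b)
    c = [[0] * len(br[0]) for _ in ac]
    for k in range(len(b)):
        # Rank-1 update: add column k of ac times row k of br.
        for i in range(len(ac)):
            for j in range(len(br[0])):
                c[i][j] += ac[i][k] * br[k][j]
        if fault_at == k:
            c[0][0] += 1  # simulated fault (illustrative assumption)
        if not checksums_ok(c):
            return c, k
    return c, None


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(multiply_checked(a, b))
    print(multiply_checked(a, b, fault_at=0)[1])
```

The check works on partial results because the checksum property is linear: every rank-1 update of a column-checksum vector with a row-checksum vector is itself a full-checksum matrix, so their running sum is too. Integer entries are used here so that the equality tests are exact; floating-point data would need a tolerance.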
Recommended Citation
C. E. Hong and B. M. McMillin, "Fault-tolerant Parallel Matrix Multiplication With One Iteration Fault Detection Latency," Proceedings - International Computer Software and Applications Conference, pp. 665-672, article no. 170258, Institute of Electrical and Electronics Engineers, Jan 1991.
The definitive version is available at https://doi.org/10.1109/CMPSAC.1991.170258
Department(s)
Computer Science
Keywords and Phrases
Application-oriented fault tolerance; Multicomputers
International Standard Serial Number (ISSN)
0730-3157
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2023 Institute of Electrical and Electronics Engineers, All rights reserved.
Publication Date
01 Jan 1991