Error-detecting algorithms can determine when, at run time, a program deviates from its expected behavior due to a hardware, software or communication error. In a fixed interconnect multiprocessor system, the error detecting ability heavily depends on the number of faults, which is bounded, and their spatial distribution. Otherwise multiple fault occurrences can mask each other. This paper provides a general method for computing the overall system failure bound, the maximal fault index, from the system topology and local communication patterns. The result of the computation is used to design a mapping of processes to processor groups such that multiple processor failures preserve the error-detecting ability of the algorithm. We show the problem of finding the maximal fault index to be NP-Hard and show, for certain regular topologies, that the problem yields polynomially computable embeddings. Finally, we give an example of mapping an error detecting matrix relaxation algorithm derived from program verification using Changeling.
Schollmeyer, Martina and McMillin, Bruce M., "A General Method for Maximizing the Error-Detecting Ability of Distributd Algorithms" (1993). Computer Science Technical Reports. 38.
© 1993 University of Missouri--Rolla, All rights reserved.
23 Jun 1993