Electrical and Computer Engineering Faculty Research & Creative Works

Fault Tolerant Memory Design for HW/SW Co-Reliability in Massively Parallel Computing Systems

Minsu Choi, Missouri University of Science and Technology
N. J. Park
Koshy M. George
Byoungjae Jin
Nohpill Park
Yong-Bin Kim
Fabrizio Lombardi

Abstract

This paper proposes and discusses a highly dependable embedded fault-tolerant memory architecture for high-performance massively parallel computing applications, together with its dependability assurance techniques. The proposed fault-tolerant memory provides two distinct repair mechanisms: permanent laser redundancy reconfiguration during the wafer probe stage in the factory, to enhance manufacturing yield, and dynamic BIST/BISD/BISR (built-in self-test, self-diagnosis, self-repair)-based reconfiguration of redundant resources in the field, to maintain high field reliability. System reliability, which is determined mainly by the hardware configuration demanded by software and by field reconfiguration/repair using unused processor and memory modules, is referred to as HW/SW Co-reliability. Various system configuration options, in terms of parallel processing unit size and processor/memory intensity, are also introduced, and their HW/SW Co-reliability characteristics are discussed. A modeling and assurance technique for HW/SW Co-reliability, with emphasis on dependability assurance based on combinatorial modeling suited to the proposed memory design, is developed and validated by extensive parametric simulations. This enables the design and implementation of memory-reliability-optimized, highly reliable, fault-tolerant, field-reconfigurable massively parallel computing systems.
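
As a rough illustration of what combinatorial reliability modeling of redundant memory can look like, the sketch below computes a k-out-of-n reliability figure: the probability that at least k of n identical, independent modules are still working. This is a generic textbook model and not the model developed in the paper. The function names, the exponential lifetime assumption, and all parameter names are hypothetical choices made for this example.

```python
from math import comb, exp


def module_reliability(failure_rate: float, hours: float) -> float:
    """Reliability of one module at time `hours`.

    Assumption for this sketch: a constant failure rate, which gives an
    exponential lifetime, R(t) = exp(-lambda * t).
    """
    return exp(-failure_rate * hours)


def k_out_of_n_reliability(n: int, k: int, r: float) -> float:
    """Probability that at least k of n independent modules work.

    Each module works with probability r. With n - k spare modules, the
    system survives as long as no more than n - k modules have failed.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    # Sum the binomial probabilities of exactly i working modules, i = k..n.
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))
```

For instance, with two modules of reliability 0.9 where only one is required, `k_out_of_n_reliability(2, 1, 0.9)` evaluates to about 0.99, compared with 0.81 when both modules are required. A realistic model of the proposed memory would also have to account for factory repair, in-field BIST/BISD/BISR coverage, and the software-demanded configuration, none of which this sketch captures.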

Recommended Citation

M. Choi et al., "Fault Tolerant Memory Design for HW/SW Co-Reliability in Massively Parallel Computing Systems," Proceedings of the 2nd IEEE International Symposium on Network Computing and Applications (2003, Cambridge, MA), pp. 341 - 348, IEEE Computer Society, Apr 2003.

The definitive version is available at https://doi.org/10.1109/NCA.2003.1201173

Meeting Name

2nd IEEE International Symposium on Network Computing and Applications (2003: Apr. 16-18, Cambridge, MA)

Department(s)

Electrical and Computer Engineering

Keywords and Phrases

Built-In Self Test; Computer Architecture; Fault Tolerance; Fault Tolerant Computer Systems; Integrated Circuit Design; Maintenance; Manufacture; Memory Architecture; Network Architecture; Parallel Architectures; Probes; Redundancy; Reliability; Repair; Software Reliability; Dependability Assurance; Design and Implementations; Fault Tolerant Systems; Hardware Configurations; Massively Parallel Computing; Parallel Processing; Production Facility; Reliability Characteristics; Parallel Processing Systems; Manufacturing; Production Facilities

International Standard Book Number (ISBN)

978-0769519388

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Apr 2003
