Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures

Abstract

The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent reliability assurance. Therefore, single-error tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are being rapidly introduced in high-end GPUs targeting high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures such as the active mask stack (AMS), register file (REG), and local memory (MEM) have yet to be studied. Also, the implications of single-error tolerance for various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address these issues, this work explores a novel Monte Carlo simulation framework to enumerate and analyze well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we inject faults into the whole memory (AMS, REG, and MEM combined) in a structure-oblivious fashion. We then categorize and analyze each structure's relative fault sensitivity and error contribution factor. Finally, we study the implications of single-error tolerance for the memory structures by considering eight possible ECC (Error Correction Code) profiles. Results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; therefore, ECC protection of REG is the most critical and cost-effective.
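
The structure-oblivious injection and per-structure categorization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework. The structure sizes, the per-structure error probabilities (a stand-in for actually running a GPGPU workload with a flipped bit), and all function names are hypothetical. The two metrics are computed under an assumed reading: sensitivity as errors per injected fault within a structure, and contribution as a structure's share of all observed errors. The eight ECC profiles are assumed to be the 2^3 protect/unprotect combinations of AMS, REG, and MEM, with ECC taken to correct every single-bit fault in a protected structure.

```python
import itertools
import random
from collections import Counter

# Hypothetical structure sizes in bits (illustrative only, not from the paper).
SIZES = {"AMS": 1_000, "REG": 6_000, "MEM": 3_000}

# Hypothetical chance that one flipped bit corrupts the program output.
# A real study would run the workload instead of drawing from these values.
TOY_ERROR_PROB = {"AMS": 0.2, "REG": 0.5, "MEM": 0.1}


def locate(bit):
    """Map a flat bit index over the combined memory to its structure."""
    for name, size in SIZES.items():
        if bit < size:
            return name
        bit -= size
    raise IndexError("bit index outside the combined memory")


def run_trials(n, seed=0):
    """Inject n single-bit faults uniformly over AMS+REG+MEM combined."""
    rng = random.Random(seed)
    total_bits = sum(SIZES.values())
    injected, errors = Counter(), Counter()
    for _ in range(n):
        # Structure-oblivious: pick a bit first, classify it afterwards.
        name = locate(rng.randrange(total_bits))
        injected[name] += 1
        if rng.random() < TOY_ERROR_PROB[name]:
            errors[name] += 1
    return injected, errors


def ecc_profiles():
    """All 2^3 = 8 subsets of structures that could be ECC-protected."""
    names = list(SIZES)
    return [frozenset(combo)
            for r in range(len(names) + 1)
            for combo in itertools.combinations(names, r)]


def residual_errors(errors, protected):
    """Errors left when single-bit faults in protected structures are corrected."""
    return sum(count for name, count in errors.items() if name not in protected)


if __name__ == "__main__":
    injected, errors = run_trials(100_000)
    total_errors = sum(errors.values())
    for name in SIZES:
        sensitivity = errors[name] / injected[name]
        contribution = errors[name] / total_errors
        print(f"{name}: sensitivity={sensitivity:.3f} contribution={contribution:.3f}")
    for profile in ecc_profiles():
        label = "+".join(sorted(profile)) or "none"
        print(f"ECC on {label}: residual errors = {residual_errors(errors, profile)}")
```

Because faults are drawn over the combined bit space, larger structures receive proportionally more injections, which is why sensitivity (a per-fault rate) and contribution (a share of all errors) can rank structures differently in such a model.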

Meeting Name

International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation: SAMOS XIII (2013: Jul. 15-18, Agios Konstantinos, Greece)

Department(s)

Electrical and Computer Engineering

Keywords and Phrases

Computer Graphics; Computer Simulation; Monte Carlo Methods; Personal Computers; Program Processors; Architectural Vulnerability Factor; Error Correction Codes; Fault Sensitivity; General Purpose Computing on GPU; General-Purpose Computing; Graphics Processing Unit; High Performance Computing; Monte-Carlo Simulations; Computer Graphics Equipment

International Standard Book Number (ISBN)

978-1479901036

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2013 Institute of Electrical and Electronics Engineers (IEEE), All rights reserved.

Publication Date

01 Jul 2013
