Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures
The GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent reliability assurance. Therefore, single-error tolerance schemes such as SECDED (Single Error Correcting, Double Error Detecting) codes are rapidly being introduced in high-end GPUs targeting high-performance general-purpose computing. However, the relative fault sensitivity and error contribution of critical on-chip memory structures such as the active mask stack (AMS), register file (REG), and local memory (MEM) have yet to be studied. Likewise, the implications of single-error tolerance for various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address these issues, this work explores a novel Monte Carlo simulation framework to enumerate and analyze well-converged fault injection data. Instead of estimating the AVF (Architectural Vulnerability Factor) of each structure individually, we inject faults into the whole memory (AMS, REG, and MEM combined) in a structure-oblivious fashion. We then categorize the results by structure and analyze each structure's relative fault sensitivity and error contribution factor. Finally, we study the implications of single-error tolerance for the memory structures by considering eight possible ECC (Error Correction Code) profiles. Results show that the relative fault sensitivity and error contribution of REG are the highest among the considered memory structures; therefore, ECC protection of REG is the most critical and cost-effective.
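The structure-oblivious injection idea described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' framework: the structure sizes, the per-structure error probabilities, and the two metric definitions (errors per injected fault, and share of all errors) are assumptions made for illustration. It also assumes that the eight ECC profiles correspond to the 2^3 protect/do-not-protect choices over AMS, REG, and MEM, and that ECC corrects every single-bit fault in a protected structure.

```python
import itertools
import random

# Hypothetical structure sizes in bits (illustrative only, not from the paper).
SIZES = {"AMS": 1_000, "REG": 16_000, "MEM": 8_000}
# Hypothetical chance that one bit flip in a structure corrupts the output.
# A real framework would run the workload on a GPU simulator instead.
P_ERROR = {"AMS": 0.30, "REG": 0.20, "MEM": 0.05}


def locate(bit, sizes=SIZES):
    """Map a bit index in the combined memory to the structure that owns it."""
    for name, size in sizes.items():
        if bit < size:
            return name
        bit -= size
    raise ValueError("bit index out of range")


def inject(trials, seed=0):
    """Inject one fault per trial, uniformly over the combined memory."""
    rng = random.Random(seed)
    total_bits = sum(SIZES.values())
    faults = dict.fromkeys(SIZES, 0)
    errors = dict.fromkeys(SIZES, 0)
    for _ in range(trials):
        name = locate(rng.randrange(total_bits))  # structure-oblivious pick
        faults[name] += 1
        if rng.random() < P_ERROR[name]:  # toy stand-in for a workload run
            errors[name] += 1
    return faults, errors


def summarize(faults, errors):
    """Assumed metrics: errors per fault, and share of all observed errors."""
    total = sum(errors.values())
    sensitivity = {s: errors[s] / faults[s] if faults[s] else 0.0 for s in SIZES}
    contribution = {s: errors[s] / total if total else 0.0 for s in SIZES}
    return sensitivity, contribution


def ecc_profiles(errors):
    """Residual error count for each of the 2**3 = 8 protection choices."""
    names = list(SIZES)
    residual = {}
    for mask in itertools.product([False, True], repeat=len(names)):
        protected = frozenset(n for n, m in zip(names, mask) if m)
        residual[protected] = sum(e for n, e in errors.items() if n not in protected)
    return residual


if __name__ == "__main__":
    faults, errors = inject(100_000)
    sensitivity, contribution = summarize(faults, errors)
    print("sensitivity:", sensitivity)
    print("contribution:", contribution)
    for protected, left in sorted(ecc_profiles(errors).items(), key=lambda kv: kv[1]):
        print(sorted(protected), left)
```

Because faults are sampled uniformly over the combined bit space, larger structures receive proportionally more injections, so a structure's share of errors reflects both its size and its per-fault sensitivity.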
R. Shah et al., "Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures," Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (2013, Agios Konstantinos, Greece), pp. 271-278, Institute of Electrical and Electronics Engineers (IEEE), Jul 2013.
The definitive version is available at http://dx.doi.org/10.1109/SAMOS.2013.6621134
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation: SAMOS XIII (2013: Jul. 15-18, Agios Konstantinos, Greece)
Electrical and Computer Engineering
Keywords and Phrases
Computer Graphics; Computer Simulation; Monte Carlo Methods; Personal Computers; Program Processors; Architectural Vulnerability Factor; Error Correction Codes; Fault Sensitivity; General Purpose Computing on GPU; General-Purpose Computing; Graphics Processing Unit; High Performance Computing; Monte-Carlo Simulations; Computer Graphics Equipment
Article - Conference proceedings
© 2013 Institute of Electrical and Electronics Engineers (IEEE), All rights reserved.