Computer Science Faculty Research & Creative Works

FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Kai Zhao
Sheng Di
Sihuan Li
Xin Liang, Missouri University of Science and Technology
Yujia Zhai
Jieyang Chen
For full list of authors, see publisher's website.

Abstract

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
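The abstract does not spell out the individual checksum schemes, so the following is only a minimal sketch of the general idea behind checksum-based ABFT for convolution, not the authors' method. It assumes a direct "valid" convolution over a single input and relies on linearity: the sum of all output channels should equal the convolution of the input with the sum of all kernels. The names `conv2d` and `checksum_ok`, the tensor shapes, and the tolerances are hypothetical choices made for illustration.

```python
import numpy as np

def conv2d(d, w):
    """Direct 'valid' convolution (cross-correlation form).

    d: input of shape (C, H, W); w: kernels of shape (M, C, R, S).
    Returns an output of shape (M, H-R+1, W-S+1).
    """
    c, h, wd = d.shape
    m, c2, r, s = w.shape
    assert c == c2, "input and kernel channel counts must match"
    out = np.zeros((m, h - r + 1, wd - s + 1))
    for i in range(h - r + 1):
        for j in range(wd - s + 1):
            patch = d[:, i:i + r, j:j + s]
            # Contract each kernel with the patch over (C, R, S).
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def checksum_ok(d, w, out, rtol=1e-8, atol=1e-8):
    """Check the output against a kernel checksum (illustrative only).

    By linearity, summing the M output channels must match convolving
    the input with the single kernel obtained by summing all M kernels.
    """
    w_sum = w.sum(axis=0, keepdims=True)   # checksum kernel, shape (1, C, R, S)
    expected = conv2d(d, w_sum)[0]         # convolution with the checksum kernel
    actual = out.sum(axis=0)               # checksum of the computed output
    return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))

# Usage: a clean result passes; a single corrupted output element is detected.
rng = np.random.default_rng(0)
d = rng.standard_normal((3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
out = conv2d(d, w)
clean = checksum_ok(d, w, out)
out[2, 1, 1] += 1.0                        # simulate a soft error in the output
corrupted = checksum_ok(d, w, out)
```

Because the check only compares the output with one extra convolution, it does not depend on how the main convolution was computed, which is in the spirit of the abstract's point about supporting any convolution implementation. This sketch only detects an error; locating and correcting it would need additional checksums.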

Recommended Citation

K. Zhao et al., "FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks," IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 7, pp. 1677 - 1689, Institute of Electrical and Electronics Engineers (IEEE), Jul 2021.

The definitive version is available at https://doi.org/10.1109/TPDS.2020.3043449

Department(s)

Computer Science

Keywords and Phrases

Algorithm-Based Fault Tolerance; Deep Learning; High-Performance Computing; Reliability; Silent Data Corruption

International Standard Serial Number (ISSN)

1045-9219; 1558-2183

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Jul 2021
