Deep Spatiotemporal Fusion Network for Vision-Based Robotic Inspection of Structures


The convolutional neural networks (CNNs) commonly deployed for semantic understanding of visual inspection data can, in general, learn robust spatial features. However, they cannot capture the temporal dependencies that characterize the video data collected by robotic inspection systems. As a result, they struggle with challenges arising from cross-view illumination variation, perspective differences, scale changes, background clutter, and occlusion. Their performance is further degraded by motion blur and other distortions induced by rapid camera movement. This study addresses these challenges by extending visual scene understanding from the still-image domain to the video domain through cross-frame information fusion. A deep end-to-end network is developed by integrating an encoder–decoder-based CNN with a long short-term memory (LSTM)-based recurrent neural network for pixel-level semantic labeling of sequential visual inspection data. The proposed multishot architecture jointly learns discriminative fusion features, leading to a rich understanding of the complex spatiotemporal dynamics. The approach is validated with two case studies involving automatic segmentation of structural elements in robotic building and bridge inspection videos. Two multishot fusion techniques are proposed, based on sequence-to-one and sequence-to-sequence architectures. Additionally, two fusion schemes, based on sum-of-scores and Bayesian updating rules, are examined for aggregating the multiple label maps produced at each time step by an overlapping sliding-window inference scheme. A comprehensive performance evaluation showed that, compared with a baseline single-shot approach, multishot fusion improved the intersection over union (IoU) score by 4.6% for building component segmentation and by 13.3% for bridge component segmentation.
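The two label-map fusion rules named in the abstract can be illustrated with a minimal NumPy sketch. It assumes that each overlapping window yields a per-pixel softmax score map of shape (H, W, C) for the same frame, and that Bayesian updating treats each map as independent evidence under a uniform prior. These layouts, the function names (sum_of_scores, bayesian_update, iou), and the independence assumption are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def sum_of_scores(prob_maps):
    """Fuse K score maps of shape (K, H, W, C) by summing class scores per pixel.

    Returns the fused label map (H, W) and the averaged scores (H, W, C).
    """
    fused = prob_maps.sum(axis=0)
    return fused.argmax(axis=-1), fused / len(prob_maps)


def bayesian_update(prob_maps, prior=None, eps=1e-8):
    """Sequentially update a per-pixel class posterior.

    Each map is treated as independent evidence (assumption); the posterior
    is multiplied by the map's scores and renormalized after every update.
    """
    _, h, w, c = prob_maps.shape
    posterior = np.full((h, w, c), 1.0 / c) if prior is None else prior.copy()
    for p in prob_maps:
        posterior = posterior * (p + eps)                    # eps avoids zeroing a class
        posterior /= posterior.sum(axis=-1, keepdims=True)   # renormalize each pixel
    return posterior.argmax(axis=-1), posterior


def iou(pred, target, cls):
    """Intersection over union for one class between two label maps."""
    inter = np.logical_and(pred == cls, target == cls).sum()
    union = np.logical_or(pred == cls, target == cls).sum()
    return inter / union if union else float("nan")
```

The two rules can disagree. Summation averages the evidence, so a single very confident but conflicting window has a limited effect. The multiplicative update lets such a window overturn several moderately confident ones, because a near-zero score strongly suppresses a class.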


Civil, Architectural and Environmental Engineering


Office of the Assistant Secretary for Research and Technology, Grant 69A3551747126

Keywords and Phrases

Building and bridge inspection; Convolutional neural network; Deep learning; Long short-term memory-based recurrent neural network; Multishot fusion; Spatiotemporal analysis


Document Type

Article - Journal

© 2024 Elsevier; International Federation of Automatic Control (IFAC), All rights reserved.

Publication Date

01 Jul 2024