Abstract

With the continuous growth of IoT applications, service failures are inevitable. Because IoT services are complex and dynamic, root cause analysis (RCA) following an alert can help resolve faults quickly. However, the metrics generated by microservices (e.g., CPU utilization, memory usage) and the dynamic topologies generated by calls between Application Programming Interfaces (APIs) have different time scales. Moreover, device status is an important aspect of RCA in IoT. Together, these factors make it extremely challenging to learn the failure features of microservice metrics and API calls. We therefore propose a novel framework for collaborative identification of root cause analysis (CIRCA), which identifies the most likely root cause path, the one with the highest fault scores (weights). Specifically, we use both microservice-level and API-level root cause identification (RCI) models to obtain the fault score of each node in the path. We prove that the root cause path inference problem is NP-hard, and we then propose a topology-based weighted variable neighborhood search (TWVNS) algorithm that infers the optimal root cause path from the two-level scores and call topologies. Our experiments demonstrate that CIRCA achieves satisfactory RCI and path inference results on four public datasets.

Department(s)

Computer Science

Publication Status

Early Access

Keywords and Phrases

IoT Microservice Architecture; Path Inference; Root Cause Analysis; Root Cause Identification

International Standard Serial Number (ISSN)

1939-1374

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers; Computer Society, All rights reserved.

Publication Date

01 Jan 2025
