Abstract
Managing and diagnosing faults in microservice architectures is challenging. Anomaly detection and root cause analysis (RCA) can help, because anomalies often indicate underlying problems that can lead to system failures. This work presents an integrated solution that extracts knowledge of the microservice architecture, detects anomalies, and identifies their root causes. Our approach combines latency thresholds with other techniques to learn the normal behavior of the system and detect deviations that point to faults. Once deviations are identified, a hybrid RCA method integrates empirical data analysis with knowledge of the system’s architecture to accurately trace the root causes of these anomalies. The solution was validated using trace log data from the microservices system of an Internet Service Provider (ISP).
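
To make the latency-threshold idea concrete, the following is a minimal sketch, not the method evaluated in this work. It assumes that each service's normal behavior can be summarized by a threshold of the baseline mean plus `k` standard deviations; the abstract does not state how thresholds are derived, so that rule, the function names `learn_thresholds` and `detect_anomalies`, the service names, and the sample latencies are all hypothetical.

```python
from statistics import mean, pstdev


def learn_thresholds(baseline, k=3.0):
    """Learn one latency threshold per service from fault-free baseline data.

    baseline: dict mapping service name -> list of latencies in milliseconds.
    Assumption (not from the source): threshold = mean + k * population std dev.
    """
    return {svc: mean(vals) + k * pstdev(vals) for svc, vals in baseline.items()}


def detect_anomalies(spans, thresholds):
    """Return the (service, latency_ms) pairs whose latency exceeds the threshold.

    Spans for services without a learned threshold are skipped.
    """
    return [
        (svc, lat)
        for svc, lat in spans
        if svc in thresholds and lat > thresholds[svc]
    ]


# Hypothetical baseline latencies (ms) observed during normal operation.
baseline = {
    "auth": [10, 12, 11, 9, 13],
    "billing": [100, 110, 95, 105, 90],
}
thresholds = learn_thresholds(baseline)

# Hypothetical new trace spans to check against the learned thresholds.
new_spans = [("auth", 12), ("auth", 40), ("billing", 104)]
anomalies = detect_anomalies(new_spans, thresholds)
print(anomalies)
```

In this sketch only the 40 ms `auth` span is flagged. Flagged spans of this kind would then be passed to the RCA step, which additionally draws on the extracted architecture knowledge.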