Abstract
Managing and diagnosing faults in microservices architectures is challenging, especially in a service provider environment that hosts third-party services. Anomaly detection can help, because anomalies often indicate underlying problems that can lead to system failures. We develop an integrated solution that extracts knowledge of the microservice architecture and uses that knowledge to provide context for the anomalies it detects. Our approach combines latency thresholds with the temporal distribution of latency anomalies to determine the normal behavior of a system and to detect deviations that point to faults. We validated the proposed solution using data from an Internet Service Provider’s microservices system. We identified critical components that act as key points of failure during fault conditions. Combining architecture mining with anomaly detection enabled us to analyse anomalies in depth.