Anomaly Detection and Root Cause Analysis for Microservice Systems
A research thesis addresses key limitations in automated anomaly detection and root cause analysis (RCA) for microservice systems. It introduces integrated methods that draw on multiple data types and establishes standardized benchmarking frameworks. The work combines anomaly detection with RCA, incorporates event data alongside traditional metrics, and removes the dependency on service call graphs while advancing causal inference techniques.
Microservice architectures have become foundational to modern cloud infrastructure, yet their distributed complexity creates substantial operational challenges. This thesis tackles a fundamental problem: when systems fail, identifying what went wrong and why remains computationally and analytically difficult. Traditional approaches fragment the problem by separating anomaly detection from root cause analysis, creating cascading errors when initial detection proves imprecise.
The research contributes to DevOps and site reliability engineering by proposing integrated frameworks that process observability data holistically. By incorporating event data, such as API calls and configuration changes, alongside metrics and logs, the work captures a more complete picture of system behavior. Removing the service call graph requirement is particularly valuable, because many organizations lack complete dependency mappings. The proposed methods, BARO, EventADL, and TORAI, are methodological advances with demonstrated effectiveness on real systems.
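To make the idea of graph-free diagnosis concrete, here is a minimal sketch that ranks candidate metrics purely by how far they deviate after a suspected anomaly onset, with no service call graph involved. It is an illustration under stated assumptions, not the algorithm used by BARO, EventADL, or TORAI: the function names, the metric names, the sample values, and the `split` index (assumed to come from some upstream detector) are all hypothetical.

```python
from statistics import median, quantiles


def robust_deviation(baseline, observed):
    """Largest distance of observed values from the baseline median, in IQR units."""
    q1, _, q3 = quantiles(baseline, n=4)
    iqr = (q3 - q1) or 1e-9  # guard against a flat baseline (IQR of zero)
    med = median(baseline)
    return max(abs(x - med) / iqr for x in observed)


def rank_candidates(series, split):
    """Rank metric names by deviation after index `split` (assumed anomaly onset).

    Only the time series themselves are used; no dependency graph is required.
    """
    scores = {
        name: robust_deviation(values[:split], values[split:])
        for name, values in series.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical metrics: seven normal samples followed by three post-onset samples.
metrics = {
    "checkout_latency_ms": [100, 102, 98, 101, 99, 103, 100, 180, 210, 240],
    "cart_cpu_pct": [40, 42, 41, 39, 40, 43, 41, 44, 42, 45],
    "payment_error_rate": [1, 1, 2, 1, 1, 2, 1, 2, 1, 2],
}

ranking = rank_candidates(metrics, split=7)
print(ranking)
```

Median and interquartile range are used here instead of mean and standard deviation so that a few outliers in the baseline window do not distort the score. This is a design choice for the sketch, and a real system would also need to handle an imprecise `split`.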
The benchmarking contribution through RCAEval addresses a critical gap inhibiting progress in this field. Standardized datasets and evaluation frameworks enable fair comparison of competing approaches and accelerate research velocity. This is especially important for causal inference-based methods, which dominate current literature but lack systematic evaluation of their effectiveness and robustness.
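As an illustration of what a standardized evaluation might compute, the sketch below scores ranked root cause lists against known ground truth using top-k accuracy, a measure commonly used in the RCA literature. The summary does not state which metrics RCAEval reports, so the function names and the sample failure cases here are assumptions made for illustration only.

```python
def accuracy_at_k(rankings, truths, k):
    """Fraction of failure cases whose true root cause appears in the top k."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def average_at_k(rankings, truths, k):
    """Mean of accuracy_at_k over cutoffs 1..k, summarizing overall ranking quality."""
    return sum(accuracy_at_k(rankings, truths, i) for i in range(1, k + 1)) / k


# Hypothetical benchmark: each case has a method's ranked output and the true cause.
rankings = [
    ["cart", "payment", "checkout"],
    ["payment", "cart", "checkout"],
    ["checkout", "payment", "cart"],
    ["payment", "checkout", "cart"],
]
truths = ["cart", "cart", "cart", "payment"]

for k in (1, 2, 3):
    print(f"AC@{k} = {accuracy_at_k(rankings, truths, k):.2f}")
print(f"Avg@3 = {average_at_k(rankings, truths, 3):.2f}")
```

Running every method over the same cases and the same scoring function is what makes the comparison fair: differences in the numbers then reflect the methods rather than the datasets or the evaluation code.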
For industry, faster and more accurate RCA directly reduces mean time to resolution (MTTR) and limits revenue loss from outages. DevOps teams, platform engineers, and observability vendors stand to benefit from these methodological advances. The standardized benchmark in particular lets smaller research teams and startups contribute without building evaluation infrastructure from scratch. Future work should focus on deploying these methods in production environments and addressing the latency challenges of real-time anomaly detection at scale.
- Integrated anomaly detection and RCA frameworks outperform separated approaches by handling imprecise detection gracefully.
- Event data integration captures configuration and API changes often missed by traditional metric-only analysis.
- Eliminating service call graph requirements makes RCA applicable to organizations lacking complete dependency mappings.
- Standardized benchmark datasets enable fair comparison and accelerate research progress in microservice failure diagnosis.
- Causal inference approaches require systematic evaluation to clarify their real-world effectiveness and computational efficiency.