Abstract
In this paper, we present our approach to building a scalable, real-time inference platform for large-scale time-series anomaly detection and root-cause analysis, developed as part of an AI for Operations (AIOps) tool. AIOps aims to ease the manual, time-consuming activities of DevOps engineers who monitor and troubleshoot production systems. Such a system must operate in real time, detecting anomalies across a plethora of time-series metrics and logs from production systems in order to provide timely alerts and likely root causes for quick remediation; it therefore requires low-latency operation. The system must also scale to the vast amounts of data processed by the ETL and ML inference jobs that the solution needs. In this work, we show how we engineered and scaled an AI research proof of concept (POC) into a solution that supports a massive search engine system, achieving a 30x reduction in latency. We also evaluate different inference platforms, including Apache Airflow, a serverless REST API, and the Spark engine, and present the improvements we achieved along with our assessments of these commonly used ML inference platforms in terms of feasibility and cost for an AIOps solution.