How to cost-effectively and reliably build infrastructure for machine learning
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.
|How to cost-effectively and reliably build infrastructure for machine learning
|Osman Sarood (Mist Systems)
|Strata Data Conference
|Make Data Work
|New York, New York
|September 11-13, 2018
Mist Systems consumes several terabytes of telemetry data every day from its wireless access points (APs) deployed all over the world. A significant portion of this telemetry is consumed by machine learning algorithms, which are essential to the smooth operation of some of the world's largest WiFi deployments. Mist applies machine learning to incoming telemetry to detect and attribute anomalies, a nontrivial problem that requires exploring multiple dimensions. Although the infrastructure is small compared to that of the tech giants, it is growing very rapidly. Most of Mist's anomaly detection and attribution is done in real time, which can require significant resources and can quickly become cost prohibitive.

Mist's data pipeline starts with Kafka, where all incoming telemetry data is buffered. The company has two main real-time processing engines: Apache Storm and an in-house real-time time series aggregation framework, Live-aggregators. Mist's Storm topologies host the bulk of its machine learning algorithms; they consume telemetry data from Kafka, apply domain-specific models to estimate metrics such as throughput, capacity, and coverage for each WiFi client, and write those estimates back to Kafka. Live-aggregators reads the estimated metrics, aggregates them using different groupings (e.g., 10-minute average throughput per organization), and writes the results to Cassandra. Other topologies consume the aggregated data to detect and attribute anomalies, and the API queries Cassandra to serve these aggregates and anomalies to the end user.

Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million, a two-thirds reduction.
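The Live-aggregators rollup described above is, at its core, a windowed group-by. A minimal sketch of that idea follows; the record schema, field names, and the `aggregate_throughput` function are hypothetical illustrations, not Mist's actual code:

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute aggregation window


def aggregate_throughput(records):
    """Roll per-client throughput estimates up into 10-minute
    averages per organization, the kind of grouping a framework
    like Live-aggregators computes before writing to Cassandra.

    Each record is assumed to look like:
        {"org": "acme", "ts": 1536653000, "throughput_mbps": 12.5}
    Returns {(org, window_start): average_throughput}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (org, window) -> [total, count]
    for rec in records:
        window = int(rec["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        bucket = sums[(rec["org"], window)]
        bucket[0] += rec["throughput_mbps"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

In a streaming deployment this state would be keyed and flushed per window rather than computed in one batch, but the grouping logic is the same.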
Spot instances are on average 80% cheaper than traditional on-demand instances but can be terminated at any time with only a two-minute warning. Handling that volatility is difficult for most real-time applications, and especially for machine learning applications. Osman also covers the monitoring and alerting strategy for Mist's applications and explains why monitoring and alerting are critical to ensuring reliability. He shares his experience using Amazon's spot fleet and explains how Mist identified which EC2 instance types (memory intensive versus compute intensive) to use, given that different instance types have different spot price profiles and there is a risk of being outbid and compromising cluster stability. You'll also discover the impact of losing spot instances on real-time platforms like Storm versus microservices running on top of Mesos.

Seeing is believing: Osman concludes with a demo of terminating spot instances from Mist's production Storm and Mesos clusters, which run entirely on spot instances, and illustrates the impact by examining real-time health metrics. He also details how many spot instance terminations Mist can endure in each of its Storm and Mesos clusters and the overprovisioning required to ensure the company always has enough capacity for high availability.
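Spot fleet, also mentioned above, can spread capacity across multiple instance types and pools using its diversified allocation strategy, so that a price spike in one pool does not take down the whole cluster. A sketch of such a request is shown below; the role ARN, AMI, subnets, instance types, and target capacity are placeholders, not Mist's configuration:

```
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
    "TargetCapacity": 40,
    "AllocationStrategy": "diversified",
    "LaunchSpecifications": [
      { "InstanceType": "r4.xlarge",  "ImageId": "ami-12345678", "SubnetId": "subnet-11111111" },
      { "InstanceType": "r4.2xlarge", "ImageId": "ami-12345678", "SubnetId": "subnet-11111111" },
      { "InstanceType": "c4.2xlarge", "ImageId": "ami-12345678", "SubnetId": "subnet-22222222" }
    ]
  }
}
```

Choosing between memory-intensive (r-family) and compute-intensive (c-family) pools, as the talk discusses, is what determines how correlated the fleet's termination risk is.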
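The two-minute warning mentioned above is delivered through EC2's instance metadata service, which starts answering at the `spot/instance-action` path shortly before reclamation. A minimal sketch of a watcher that polls for the notice and triggers a drain callback follows; the `drain` hook and polling interval are assumptions for illustration, not Mist's implementation:

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# EC2 populates this instance-metadata path roughly two minutes
# before reclaiming a spot instance; until then it returns 404.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def seconds_until_termination(payload, now):
    """Parse an instance-action document such as
    {"action": "terminate", "time": "2018-09-11T08:22:00Z"}
    and return the number of seconds remaining."""
    doc = json.loads(payload)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (deadline.replace(tzinfo=timezone.utc) - now).total_seconds()


def watch_for_termination(drain, poll_interval=5.0):
    """Poll the metadata endpoint; once a notice appears, invoke
    drain(seconds_left) so the node can stop accepting work
    (e.g., deactivate a Storm worker or drain a Mesos agent)."""
    while True:
        try:
            with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
                body = resp.read().decode()
            drain(seconds_until_termination(body, datetime.now(timezone.utc)))
            return
        except urllib.error.URLError:
            pass  # no notice yet (404) or metadata service unreachable
        time.sleep(poll_interval)
```

Running a watcher like this on every spot node gives the cluster scheduler those two minutes to rebalance work before capacity disappears.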