The Observatorium: Combining Machine Learning and Observability to Improve Incident Response
Talk Title | The Observatorium: Combining Machine Learning and Observability to Improve Incident Response |
Speakers | Alex Kass (Engineering Manager, DigitalOcean) |
Conference | Open Source Summit + ELC Europe |
Conf Tag | |
Location | Lyon, France |
Date | Oct 27-Nov 1, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
At DigitalOcean, a global hosting company predicated on providing building blocks for developers, the proliferation of microservices necessary to support a worldwide cloud creates a unique-yet-universal conundrum: while the internal code is decidedly custom to DigitalOcean, the incidents that arise are common to many companies.
In the Observability group, open source tools like Prometheus, Kafka, and Spark play critical roles in feeding data into a central application called The Observatorium, whose primary goal is to reduce MTTD/R by curating information intelligently. By combining distributed platform data engineering with predictive machine learning, all through open source tools, the team surfaces the signals that first responders need, helping to improve detection times and reduce service downtime.
In this talk, the speaker will describe in detail the architecture of The Observatorium and how its creative amalgamation of OSS tools has measurably improved the company’s overall reliability.