Scaling Resilient Systems: A Journey into Slack's Database Service
Monitoring and observability are important concepts, especially in complex and distributed systems. Redundancy and defensive programming are important as well, but sometimes they are not enough. Desig …
Talk Title | Scaling Resilient Systems: A Journey into Slack's Database Service |
Speakers | Guido Iaquinti (Site Reliability Engineer, Freelance), Rafael Chacon (Staff Software Engineer, Slack) |
Conference | KubeCon + CloudNativeCon North America |
Conf Tag | |
Location | San Diego, CA, USA |
Date | Nov 15-21, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Monitoring and observability are important concepts, especially in complex and distributed systems. Redundancy and defensive programming are important as well, but sometimes they are not enough. Designing systems to minimize the blast radius when the unexpected happens is often the key.In this talk, Rafael and Guido will share an overview about how Slack designed, built, scaled and then iterated to improve its distributed database service based on top of Vitess, now a CNCF project. The Databases team at Slack scaled a Vitess cluster from 0 to spikes of 2.7 Million queries per second. This journey has taught us how to operate a database cluster with more than 2000 nodes and expecting to growth to more than 3500 in the next 12 months.