December 20, 2019

Scaling Resilient Systems: A Journey into Slack's Database Service

Talk Title Scaling Resilient Systems: A Journey into Slack's Database Service
Speakers Guido Iaquinti (Site Reliability Engineer, Freelance), Rafael Chacon (Staff Software Engineer, Slack)
Conference KubeCon + CloudNativeCon North America
Conf Tag
Location San Diego, CA, USA
Date Nov 15-21, 2019
URL Talk Page
Slides Talk Slides
Video

Monitoring and observability are important concepts, especially in complex and distributed systems. Redundancy and defensive programming are important as well, but sometimes they are not enough. Designing systems to minimize the blast radius when the unexpected happens is often the key. In this talk, Rafael and Guido will share an overview of how Slack designed, built, scaled, and then iterated on its distributed database service, which is built on top of Vitess, now a CNCF project. The Databases team at Slack scaled a Vitess cluster from 0 to spikes of 2.7 million queries per second. This journey has taught us how to operate a database cluster with more than 2000 nodes, which is expected to grow to more than 3500 in the next 12 months.
