Who guards the guardians? Designing for resilience in cluster orchestrators
Preetha Appan outlines various failure modes, ranging from network failures to entire server failures, in Nomad, an open source scheduler that supports heterogeneous workloads.
Talk Title | Who guards the guardians? Designing for resilience in cluster orchestrators |
Speakers | Preetha Appan (HashiCorp) |
Conference | Velocity |
Conf Tag | Build resilient systems at scale |
Location | New York, New York |
Date | September 20-22, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Cluster orchestrators enable reliable, repeatable application deployments and provide fault tolerance without operator intervention. These orchestrators are themselves complex distributed systems, like the applications they manage. The blast radius when a cluster orchestrator fails is huge: it could take down all your applications. Designing resilience into the orchestrator is a unique challenge given its critical operational role. Preetha Appan outlines various failure modes, ranging from network failures to entire server failures, in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves treating the problem as a trade-off among three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that impact the availability of applications managed by the scheduler, along with lessons learned that apply to building any complex distributed system.