November 28, 2019

Who guards the guardians? Designing for resilience in cluster orchestrators

Preetha Appan outlines various failure modes ranging from network failures to entire server failures in Nomad, an open source scheduler that supports heterogeneous workloads.

Talk Title: Who guards the guardians? Designing for resilience in cluster orchestrators
Speakers: Preetha Appan (HashiCorp)
Conference: Velocity
Conf Tag: Build resilient systems at scale
Location: New York, New York
Date: September 20-22, 2016

Cluster orchestrators enable reliable, repeatable application deploys and provide fault tolerance without operator intervention. Like the applications they manage, these orchestrators are themselves complex distributed systems. The blast radius of an orchestrator failure is huge: it could take down all of your applications. Designing resilience into the orchestrator is therefore a unique challenge, given how operationally critical it is. Preetha Appan outlines various failure modes, ranging from network failures to entire server failures, in Nomad, an open source scheduler that supports heterogeneous workloads. You’ll discover how building graceful degradation and resilience to address these failures involves treating the problem as a trade-off between three system features: correctness, performance, and availability. Along the way, Preetha shares examples of design decisions that affect the availability of applications managed by the scheduler, along with lessons learned that apply to building any complex distributed system.