December 3, 2019

Availability, latency, and cost: Withstanding regional outages

Multiregion deployments can improve both availability and latency, and they can cost far less than you might expect. Aaron Blohowiak draws on his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach.

Talk Title: Availability, latency, and cost: Withstanding regional outages
Speakers: Aaron Blohowiak (Netflix)
Conference: Velocity
Conf Tag: Build resilient systems at scale
Location: New York, New York
Date: September 20-22, 2016

Running in multiple regions serves your users better through increased availability and lower latency, and it won’t cost as much as you think. Netflix has turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics, both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove the company to refine its approach, and its understanding, as it matured. Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now it’s a matter of routine that usually concludes with a brief “all is well” email.

Once you’ve decided to go multiregion, three major questions arise: How many regions do you need? How should you steer users to regions? And how do you actually perform the failover? Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach. He also offers an overview of the design considerations and system models Netflix used to make those decisions.
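To give a feel for the "how many regions" question, here is a minimal back-of-the-envelope sketch. It is not taken from the talk and is not Netflix's model. It assumes traffic is split evenly across regions, every region peaks at the same time, and the goal is to survive the loss of exactly one region. The function names `failover_overhead` and `per_region_capacity` are hypothetical, chosen only for this illustration.

```python
def failover_overhead(regions: int) -> float:
    """Fraction of extra capacity needed so the surviving regions can
    absorb one failed region's traffic (assumes an even traffic split)."""
    if regions < 2:
        raise ValueError("need at least two regions to fail over")
    # Each of the (regions - 1) survivors takes an equal share of the
    # failed region's load, so each needs 1 / (regions - 1) headroom.
    return 1.0 / (regions - 1)


def per_region_capacity(peak_load: float, regions: int) -> float:
    """Capacity each region must provision so that any (regions - 1)
    of them can serve the full global peak load."""
    if regions < 2:
        raise ValueError("need at least two regions to fail over")
    return peak_load / (regions - 1)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} regions: {failover_overhead(n):.0%} extra capacity")
```

Under these simplified assumptions, two regions need 100% headroom, three need 50%, and four need about 33%, which is one way to see why adding regions can make resiliency cheaper than it first appears. Real traffic is rarely evenly split or peaking simultaneously, so actual numbers will differ.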
