Availability, latency, and cost: Withstanding regional outages
Multiregion deployments can improve both availability and latency, and they can cost far less than you think. Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach.
Talk Title | Availability, latency, and cost: Withstanding regional outages |
Speakers | Aaron Blohowiak (Netflix) |
Conference | Velocity |
Conf Tag | Build resilient systems at scale |
Location | New York, New York |
Date | September 20-22, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Running in multiple regions serves your users better through increased availability and lower latency, and it won’t cost as much as you think. Netflix has turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics, both at a high level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove the company to refine its approach, and its understanding, as it matured. Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now it’s a matter of routine that usually concludes with a brief “all is well” email.

Once you’ve decided to go multiregion, three major questions arise: How many regions do you need? How should you steer users to regions? And how do you actually perform the failover?

Aaron Blohowiak dives into his experience operating in multiple regions at scale at Netflix and shares the algebraic models, code, and incident management playbooks the company has developed to tame, refine, and leverage its approach. He also offers an overview of the design considerations and system models Netflix used to make those decisions.