December 16, 2019


Canary in a coal mine: Building infrastructure resiliency with canary data reloads



Talk Title: Canary in a coal mine: Building infrastructure resiliency with canary data reloads
Speakers: Ann Kilzer (Indeed)
Conference: O’Reilly Velocity Conference
Conf Tag: Build Resilient Distributed Systems
Location: San Jose, California
Date: June 20-22, 2017

Remember the old practice of the canary in the coal mine, where miners used fragile feathered friends as a failure detector for toxic gases? In software, a canary run is a trial executed on one machine before the rest of the cluster runs. Ann Kilzer explains how Indeed created a canary service built on Consul’s key-value store to improve the resilience of data reloads in any infrastructure. She covers the original implementation of the canary reloader, the corner cases encountered, how the canary framework was extended into a general library, and lessons learned along the way.

Ann’s team implemented canary reloads in Indeed’s search engine, a Java application that runs on multiple homogeneous servers within global data centers. The search engine periodically reloads the search index, a mission-critical binary data file that grows daily. While a reload is underway, memory usage doubles, because the old version must be retained to serve ongoing requests. As part of capacity planning, the team anticipated a situation where every server in a given data center would attempt to load the search index and each would fail with an OutOfMemoryError, causing an outage.

The canary service addresses this possibility and accounts for corner cases such as server outages not caused by index reloads. The team also ensured that the system maintained liveness, since its goal is to serve the freshest data possible to its customers. Over time, they fine-tuned the algorithm, learning from production issues and accounting for growth.

Other teams asked to incorporate canary reloading into their projects, so the team created a reusable library. Clients add their reloading logic to a Java Callable, configure a few parameters, and pass the Callable to the library code. The library abstracts away the details of leader election and locking, so other teams can avoid the same kind of outage in any data-reloading service.
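
The canary idea described above can be illustrated with a minimal sketch. This is not Indeed's library: `KeyValueStore`, `InMemoryStore`, `CanaryReloader`, `Outcome`, and the key layout are all hypothetical names invented for illustration, and an in-memory map stands in for Consul's key-value store. The sketch assumes the store offers an atomic "write only if absent" operation, which is what lets exactly one server claim the canary role for a given data version.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for a shared key-value store such as Consul's. */
interface KeyValueStore {
    /** Atomically writes the value only if the key is absent; true if this caller wrote it. */
    boolean putIfAbsent(String key, String value);
    void put(String key, String value);
    /** Returns the stored value, or null if the key is absent. */
    String get(String key);
}

/** In-memory implementation, for demonstration only. */
final class InMemoryStore implements KeyValueStore {
    private final ConcurrentMap<String, String> map = new ConcurrentHashMap<>();
    public boolean putIfAbsent(String key, String value) { return map.putIfAbsent(key, value) == null; }
    public void put(String key, String value) { map.put(key, value); }
    public String get(String key) { return map.get(key); }
}

enum Outcome { RELOADED, WAITING_FOR_CANARY, SKIPPED, FAILED }

/** Hypothetical reloader: one server tries a new version first; the rest follow its result. */
final class CanaryReloader {
    private final KeyValueStore store;
    private final String serverId;

    CanaryReloader(KeyValueStore store, String serverId) {
        this.store = store;
        this.serverId = serverId;
    }

    Outcome tryReload(String version, Callable<?> reloadLogic) {
        String ownerKey = "canary/" + version + "/owner";   // assumed key layout
        String statusKey = "canary/" + version + "/status"; // assumed key layout

        // The first server to claim the owner key becomes the canary for this version.
        if (store.putIfAbsent(ownerKey, serverId)) {
            try {
                reloadLogic.call();
                store.put(statusKey, "SUCCESS");
                return Outcome.RELOADED;
            } catch (Exception | OutOfMemoryError e) {
                // Record the failure so the other servers do not repeat it.
                store.put(statusKey, "FAILED");
                return Outcome.FAILED;
            }
        }

        // Every other server acts only once the canary has published a result.
        String status = store.get(statusKey);
        if (status == null) {
            return Outcome.WAITING_FOR_CANARY;
        }
        if ("FAILED".equals(status)) {
            return Outcome.SKIPPED;
        }
        try {
            reloadLogic.call();
            return Outcome.RELOADED;
        } catch (Exception e) {
            return Outcome.FAILED;
        }
    }
}
```

A caller would construct one `CanaryReloader` per server and invoke `tryReload` on each reload cycle, retrying later when it returns `WAITING_FOR_CANARY`. The sketch deliberately leaves out the corner cases the talk mentions: a canary that crashes before publishing a status would block the others forever, so a real system would need some expiry or re-election mechanism to preserve liveness.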
