December 20, 2019

Rebuilding the airplane in flight. . .safely

Rewriting the key software component of your platform from scratch is always intimidating, especially when you guarantee 100% uptime, your platform is in the critical application delivery path, and your environment is highly distributed. Shannon Weyrick discusses NS1's recent DNS server rewrite and the steps the company took to roll it out across its globally distributed network with no downtime.

Talk Title: Rebuilding the airplane in flight. . .safely
Speakers: Shannon Weyrick (NS1)
Conference: O’Reilly Velocity Conference
Conf Tag: Building and maintaining complex distributed systems
Location: San Jose, California
Date: June 12-14, 2018
URL: Talk Page
Slides: Talk Slides

In 2017, NS1 embarked on a ground-up rewrite of its advanced DNS server software. The research, planning, and execution took over a year in total. Shannon Weyrick details the challenges NS1 encountered and the key points the company had to consider to engineer and deploy the new server across its global managed DNS network with no negative impact or downtime for its customers. Shannon shares background on the decision to move forward with a rewrite (including DNSSEC and scaling requirements) and on the research into technologies that balance performance, functionality, and engineering velocity. The talk also covers phased milestones with a hybrid release approach that eased product iteration and built operational experience, a system for tee-testing traffic to verify correctness during deployment, and the use of an anycast network for fault isolation during rollout. Along the way, Shannon discusses the many minor successes, failures, setbacks, and delays that you may face day to day and offers tips and advice to support you in your own quests to rebuild your airplanes in flight. You’ll leave with an appreciation for the challenges involved in planning and executing a large-scale rewrite and deployment of critical path software across a widely distributed network.
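The abstract names tee-testing without describing how NS1 built it. As a rough illustration of the general idea only (not NS1's system), the sketch below sends each query to both the existing server and the rewrite, always answers from the existing one, and records any disagreement for later review. Every name here (`Tee`, `make_server`, `OLD_ZONE`, `NEW_ZONE`) is hypothetical, and the "servers" are in-memory lookups rather than real DNS servers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical simplified types: a query is (name, record type),
# an answer is a sorted tuple of record values.
Query = Tuple[str, str]
Answer = Tuple[str, ...]
Handler = Callable[[Query], Answer]


@dataclass
class Tee:
    """Duplicate each query to a candidate server and compare the answers."""
    primary: Handler      # existing server; its answer is what clients receive
    candidate: Handler    # rewritten server under test; its answer is never served
    mismatches: List[Tuple[Query, Answer, object]] = field(default_factory=list)

    def handle(self, query: Query) -> Answer:
        expected = self.primary(query)
        try:
            actual: object = self.candidate(query)
        except Exception as exc:
            # A crash in the candidate must not affect production traffic.
            actual = exc
        if actual != expected:
            self.mismatches.append((query, expected, actual))
        return expected


def make_server(zone: Dict[Query, List[str]]) -> Handler:
    """Build a toy in-memory 'server' (assumption: answer order is irrelevant)."""
    def serve(query: Query) -> Answer:
        return tuple(sorted(zone.get(query, ())))
    return serve


# Toy data using documentation-reserved addresses.
OLD_ZONE = {("example.com", "A"): ["192.0.2.1", "192.0.2.2"]}
NEW_ZONE = {
    ("example.com", "A"): ["192.0.2.2", "192.0.2.1"],
    ("example.com", "AAAA"): ["2001:db8::1"],
}

tee = Tee(primary=make_server(OLD_ZONE), candidate=make_server(NEW_ZONE))
```

Calling `tee.handle(("example.com", "A"))` returns the existing server's answer and records nothing, because both servers agree once ordering is normalized. A query for the `AAAA` record is still answered from the existing server, but the disagreement lands in `tee.mismatches`. A production setup would more likely mirror traffic at the network layer and compare asynchronously so the candidate adds no latency.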
