Honey, I shrunk the database: Resilience and recoverability in cloud native services
Sidney Shek and Jeff Farber explain how to use techniques like event sourcing, CQRS, and CRDTs to mitigate unpredictable failures that stem from humans and increasingly complex architectures in the cloud native world (microservices, anyone?). You'll learn implementation tips and tricks based on their successes (and failures) in building out the Identity platform that underpins Atlassian Cloud.
Talk Title | Honey, I shrunk the database: Resilience and recoverability in cloud native services |
Speakers | Sidney Shek (Atlassian), Jeff Farber (Atlassian) |
Conference | O’Reilly Software Architecture Conference |
Conf Tag | Engineering the Future of Software |
Location | Berlin, Germany |
Date | November 5-7, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Mistakes happen, even in the cloud. Your database may now be a managed service, globally replicated with 12 copies and automatic failover, and you may have blocked SSH into prod to prevent an accidental ‘rm -rf’, but code and config bugs can still destroy your data. In the world of SaaS, these bugs may not even be yours; they could come from a complex network of easy-to-adopt, hard-to-debug dependencies. You need to architect and build your systems to be resilient to, and more importantly recoverable from, these types of failures.

Sidney Shek and Jeff Farber explore patterns for handling and recovering from failures that they’ve used in the Identity platform at Atlassian, covering design, implementation, and operational considerations for each:

- Event sourcing and command-query responsibility segregation (CQRS), and the importance of supporting “rebootstrapping” of downstream systems
- Commutative/convergent replicated data types (CRDTs) for replicating data, and why they chose state transfer over operation transfer
- Using multiple independent technologies to avoid single points of failure (e.g., event storage in S3 versus Cassandra)
- Localized validation through signing and caching, and how to handle rapid invalidation of data
- Building “recovery” services, which may require more thought than the main functionality itself
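To make the event sourcing and “rebootstrapping” idea from the abstract concrete, here is a minimal Python sketch. It is not the speakers' implementation: the `Event` schema, the `EventLog` and `EmailView` classes, and the `rebootstrap` function are all hypothetical names chosen for illustration, and an in-memory list stands in for a durable event store. It assumes only that the write side is an append-only log and that the read side can be thrown away and rebuilt by replaying that log.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """An immutable fact appended to the log (hypothetical schema)."""
    seq: int
    kind: str          # "UserCreated", "EmailChanged", or "UserDeleted"
    user_id: str
    email: str = ""


class EventLog:
    """Write side: append-only; events are never updated or removed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, user_id: str, email: str = "") -> Event:
        event = Event(len(self._events) + 1, kind, user_id, email)
        self._events.append(event)
        return event

    def read_from(self, seq: int = 1) -> List[Event]:
        """Return all events with sequence number >= seq, in order."""
        return [e for e in self._events if e.seq >= seq]


class EmailView:
    """Read side: a disposable projection of user_id -> current email."""

    def __init__(self) -> None:
        self.emails: Dict[str, str] = {}
        self.last_seq = 0  # highest sequence number applied so far

    def apply(self, event: Event) -> None:
        if event.kind in ("UserCreated", "EmailChanged"):
            self.emails[event.user_id] = event.email
        elif event.kind == "UserDeleted":
            self.emails.pop(event.user_id, None)
        self.last_seq = event.seq


def rebootstrap(log: EventLog) -> EmailView:
    """Rebuild the read model from scratch by replaying the whole log."""
    view = EmailView()
    for event in log.read_from(1):
        view.apply(event)
    return view


if __name__ == "__main__":
    log = EventLog()
    log.append("UserCreated", "u1", "a@example.com")
    log.append("EmailChanged", "u1", "b@example.com")
    log.append("UserCreated", "u2", "c@example.com")
    log.append("UserDeleted", "u2")

    # Simulate a lost or corrupted read model: discard it and replay.
    view = rebootstrap(log)
    print(view.emails, view.last_seq)
```

Because the log is the source of truth, a bug in the projection can in principle be fixed in code and the view rebuilt, rather than repaired in place. A production system would also need durable storage, snapshots, and idempotent handling of redelivered events, none of which this sketch attempts.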