Honey, I shrunk the database: Resilience and recoverability in cloud native services

Sidney Shek and Jeff Farber explain how to use techniques like event sourcing, CQRS, and CRDTs to mitigate unpredictable failures that stem from humans and increasingly complex architectures in the cloud native world (microservices, anyone?). You'll learn implementation tips and tricks based on their successes (and failures) in building out the Identity platform that underpins Atlassian Cloud.


Talk Title	Honey, I shrunk the database: Resilience and recoverability in cloud native services
Speakers	Sidney Shek (Atlassian), Jeff Farber (Atlassian)
Conference	O’Reilly Software Architecture Conference
Conf Tag	Engineering the Future of Software
Location	Berlin, Germany
Date	November 5-7, 2019

Mistakes happen, even in the cloud. Your database may now be a managed service, globally replicated with 12 copies and automatic failover, and you may have blocked SSH into prod to prevent an accidental ‘rm -rf’, but code and config bugs can still destroy your data. In the world of SaaS, these bugs may not even be yours; they could come from a complex network of easy-to-adopt, hard-to-debug dependencies. You need to architect and build your systems to be resilient to and, more importantly, recoverable from these types of failures.

Sidney Shek and Jeff Farber explore patterns for handling and recovering from failures that they’ve used in the Identity platform at Atlassian, covering design, implementation, and operational considerations: event sourcing and command-query responsibility segregation (CQRS), and the importance of supporting “rebootstrapping” of downstream systems; commutative/convergent replicated data types (CRDTs) for replicating data, and why they chose state transfer over operation transfer; using multiple independent technologies to avoid single points of failure (e.g., event storage in S3 versus Cassandra); localized validation through signing and caching, and how to handle rapid invalidation of data; and building “recovery” services, which may require more thought than the main functionality itself.
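
To make the event sourcing, CQRS, and “rebootstrapping” ideas in the abstract concrete, here is a minimal Python sketch. It is not the speakers' implementation: the `Event`, `EventStore`, and `UserEmailView` names, the event kinds, and the in-memory storage are all hypothetical, chosen only for illustration. It assumes the append-only event log is the source of truth and that the query-side view is derived data that can be discarded and rebuilt by replaying the log.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """An immutable fact appended to the log (hypothetical schema)."""
    seq: int
    kind: str       # "UserCreated", "EmailChanged", or "UserDeleted"
    user_id: str
    email: str = ""


class EventStore:
    """Append-only log: the source of truth on the command (write) side."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, user_id: str, email: str = "") -> Event:
        event = Event(len(self._events) + 1, kind, user_id, email)
        self._events.append(event)
        return event

    def read_from(self, seq: int = 1) -> List[Event]:
        """Return all events with sequence number >= seq, in order."""
        return self._events[seq - 1:]


class UserEmailView:
    """Query-side read model: derived data that can be rebuilt at any time."""

    def __init__(self) -> None:
        self.emails: Dict[str, str] = {}
        self.last_seq = 0

    def apply(self, event: Event) -> None:
        if event.kind in ("UserCreated", "EmailChanged"):
            self.emails[event.user_id] = event.email
        elif event.kind == "UserDeleted":
            self.emails.pop(event.user_id, None)
        self.last_seq = event.seq

    def rebootstrap(self, store: EventStore) -> None:
        """Discard the derived state and replay the full log from the start."""
        self.emails.clear()
        self.last_seq = 0
        for event in store.read_from(1):
            self.apply(event)


store = EventStore()
view = UserEmailView()
for e in (
    store.append("UserCreated", "u1", "a@example.com"),
    store.append("UserCreated", "u2", "b@example.com"),
    store.append("EmailChanged", "u1", "c@example.com"),
    store.append("UserDeleted", "u2"),
):
    view.apply(e)

expected = dict(view.emails)

# Simulate a bug that wipes the downstream view, then recover from the log.
view.emails.clear()
view.rebootstrap(store)
```

After the replay, `view.emails` matches the state it held before the simulated wipe. In this sketch a bug in a downstream consumer costs only a rebuild, because nothing authoritative lives outside the log.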

