Orchestrating chaos: Applying database research in the wild

Lineage-driven fault injection (LDFI), a novel approach to automating failure testing, can greatly reduce the number of faults that must be explored via fault injection. Peter Alvaro explores LDFI's theoretical roots in the database research notion of provenance and presents early results from the field and opportunities for near- and long-term future research.


Talk Title	Orchestrating chaos: Applying database research in the wild
Speakers	Peter Alvaro (UC Santa Cruz)
Conference	O’Reilly Velocity Conference
Conf Tag	Build Resilient Distributed Systems
Location	San Jose, California
Date	June 20-22, 2017
URL	Talk Page
Slides
Video	Talk Video

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites regularly run failure drills in which faults are deliberately injected into production or staging systems. While fault injection infrastructures are becoming relatively mature, existing approaches either explore the combinatorial space of potential failures randomly or exploit the “hunches” of domain experts to guide the search. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer's intuition and scale only with human effort.

Lineage-driven fault injection (LDFI), a novel approach to automating failure testing, uses existing tracing or logging infrastructures to work backward from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. Peter Alvaro explores LDFI’s theoretical roots in the database research notion of provenance and presents early results from the field and opportunities for near- and long-term future research.
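The pruning idea in the abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the implementation presented in the talk. Assume the lineage of a good outcome has already been extracted from traces as a list of "supports", each a set of components that together were sufficient to produce the outcome. A fault combination is only worth injecting if it breaks every known support, so the interesting candidates are the minimal sets that intersect all of them. The helper `minimal_hitting_sets`, the component names, and the brute-force search are all hypothetical choices made for this example.

```python
from itertools import combinations


def minimal_hitting_sets(supports, max_faults):
    """Return minimal fault sets that intersect every known support.

    supports   -- list of frozensets; each is one redundant way the good
                  outcome was produced (hypothetical lineage data).
    max_faults -- largest number of simultaneous faults to consider.
    """
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            cand = frozenset(candidate)
            # A superset of a smaller hitting set adds no new information.
            if any(prev <= cand for prev in found):
                continue
            # Keep the candidate only if it breaks every support.
            if all(cand & support for support in supports):
                found.append(cand)
    return found


# Hypothetical lineage: a request succeeded via the load balancer and
# either of two replicas, so there are two redundant supports.
supports = [
    frozenset({"lb", "replica_a"}),
    frozenset({"lb", "replica_b"}),
]

hypotheses = minimal_hitting_sets(supports, max_faults=2)
for h in hypotheses:
    print(sorted(h))
```

In this toy case there are six possible fault combinations of size one or two, but only two of them (failing `lb`, or failing both replicas) can defeat every known way of reaching the good outcome. Only those would need to be injected; the rest leave at least one support intact.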

Rethinking stream processing with Apache Kafka: Applications versus clusters and streams versus databases

Unified Monitoring of Containers and Microservices [I]

Scaling a user delivery network for real-time audience targeting

Queueing Theory, In Practice: Performance Modelling in Cloud-Native Territory [I]

Automating and Testing Production Ready Kubernetes Clusters in the Public Cloud

Building containerized Spark on a solid foundation with Quobyte and Kubernetes