Orchestrating chaos: Applying database research in the wild
Lineage-driven fault injection (LDFI), a novel approach to automating failure testing, can greatly reduce the number of faults that must be explored via fault injection. Peter Alvaro explores LDFI’s theoretical roots in the database research notion of provenance and presents early results from the field and opportunities for near- and long-term future research.
Talk Title | Orchestrating chaos: Applying database research in the wild |
Speakers | Peter Alvaro (UC Santa Cruz) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build Resilient Distributed Systems |
Location | San Jose, California |
Date | June 20-22, 2017 |
URL | Talk Page |
Slides | |
Video | Talk Video |
Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, an increasing number of large-scale sites regularly run failure drills in which faults are deliberately injected into production or staging systems.

While fault injection infrastructures are becoming relatively mature, existing approaches either explore the combinatorial space of potential failures randomly or exploit the “hunches” of domain experts to guide the search. Random strategies waste resources testing “uninteresting” faults, while programmer-guided approaches are only as good as the programmer’s intuition and scale only with human effort.

Lineage-driven fault injection (LDFI), a novel approach to automating failure testing, uses existing tracing or logging infrastructure to work backward from good outcomes, identifying redundant computations that allow it to aggressively prune the space of faults that must be explored via fault injection. Peter Alvaro explores LDFI’s theoretical roots in the database research notion of provenance and presents early results from the field and opportunities for near- and long-term future research.
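To make the pruning idea concrete, here is a minimal sketch in Python. It is not the implementation presented in the talk. It assumes the lineage of a good outcome has already been extracted from traces as a list of "supports", where each support is a set of components that together were sufficient to produce the outcome. The component names, the `minimal_hitting_sets` helper, and the brute-force enumeration are all hypothetical choices made for illustration. The only fault combinations worth injecting are those that break every support at once.

```python
from itertools import combinations


def minimal_hitting_sets(supports, max_faults):
    """Return the minimal fault sets (size <= max_faults) that intersect
    every support, i.e. that could plausibly prevent the good outcome."""
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            cand = set(candidate)
            # A superset of a smaller hitting set adds no new information.
            if any(h <= cand for h in found):
                continue
            # Keep the candidate only if it disables every redundant path.
            if all(cand & s for s in supports):
                found.append(cand)
    return found


# Hypothetical lineage: a write succeeds if it reaches replica A over
# link A, or replica B over link B (two redundant computations).
supports = [
    frozenset({"link_a", "replica_a"}),
    frozenset({"link_b", "replica_b"}),
]

hypotheses = minimal_hitting_sets(supports, max_faults=2)

# For comparison: every combination of up to two faults over the four
# components that appear in the lineage.
all_components = sorted(set().union(*supports))
brute_force = [
    set(c)
    for size in (1, 2)
    for c in combinations(all_components, size)
]

for h in hypotheses:
    print(sorted(h))
print(len(hypotheses), "targeted experiments vs", len(brute_force), "exhaustive")
```

In this toy lineage no single fault can prevent the outcome, because each one leaves a redundant path intact, so only pairs that cut both paths remain as candidates. Enumerating combinations is exponential in the number of components, so a sketch like this only illustrates the idea and would not scale to a real system's lineage.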