The A in SRE: Architecting for reliability

Upfront architecture is essential to ensure reliability. Ideally, the system design starts with defining clear service-level objectives (SLOs) that translate into the right architecture to avoid gold-plating or costly redesigns after the system is live. Marco van der Linden and Tom Hofte explain how to define clear SLOs and apply architectural patterns to design a system that works as promised.


Talk Title	The A in SRE: Architecting for reliability
Speakers	Marco van der Linden (Xebia), Tom Hofte (Xebia)
Conference	O’Reilly Software Architecture Conference
Conf Tag	Engineering the Future of Software
Location	New York, New York
Date	February 24-26, 2020

Site reliability engineering (SRE) has become a popular discipline within organizations for improving the reliability of their IT landscape. Typically, SRE focuses on existing services, improving their reliability by optimizing operational procedures and the feedback loops to the teams. In some situations, you need to change the architecture to improve the reliability of your service. However, these architectural redesigns are costly and could have been avoided had the SLOs been clear from the beginning.

If your objectives are not clear, or not defined at all, you run one of two risks: implementing too few measures to make your system reliable, or implementing too many, which leads to an overly complex system that can itself easily become unreliable. To count as clear, SLOs must be, among other things, understandable, measurable, and reachable within the context of the service. These criteria help get the SLOs accepted within an organization, help teams select the right stability patterns, and justify to the organization why specific architectural stability patterns are needed. Subsequently, observability patterns built around the three pillars (event logs, metrics, and tracing) can be applied to make the system observable, so that the SLOs can be measured.

Drawing on their real-world experience, Marco van der Linden and Tom Hofte demonstrate how to design reliable and observable systems based on clear SLOs. You’ll work in teams on a fictional case to define SLOs, apply stability patterns to ensure system reliability, and make the system observable. Join in to learn how to define clear SLOs and translate them into a reliable and observable system using well-established architectural patterns.
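
To make "measurable" concrete, here is a minimal sketch of how an availability SLO might be written down and checked against request counts. It is not taken from the talk: the `AvailabilitySLO` class, the helper functions, the "checkout" service name, and the 99.9% target over 30 days are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical availability objective: a target ratio over a time window."""
    service: str
    target: float      # e.g. 0.999 means 99.9% of requests should succeed
    window_days: int   # length of the evaluation window

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget, so it is rejected here.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")

    def error_budget_minutes(self) -> float:
        """Downtime allowed in the window if the service were fully down."""
        return (1.0 - self.target) * self.window_days * 24 * 60


def measured_availability(good: int, total: int) -> float:
    """Fraction of successful requests (the service-level indicator)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def budget_remaining(slo: AvailabilitySLO, good: int, total: int) -> float:
    """Share of the error budget still unspent; negative means the SLO is missed."""
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Assumed example values, not real measurements.
slo = AvailabilitySLO(service="checkout", target=0.999, window_days=30)
print(slo.error_budget_minutes())                # roughly 43.2 minutes
print(measured_availability(999_500, 1_000_000)) # 0.9995
print(budget_remaining(slo, 999_500, 1_000_000)) # roughly 0.5
```

Expressing the objective as a ratio of good to total events keeps it understandable and directly measurable from metrics, and the remaining budget gives a simple signal for when additional stability patterns may be justified.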