February 1, 2020


Building successful site reliability engineering in large enterprises


Implementing site reliability engineering (SRE) doesn't have to be intimidating, and it isn't only for cloud-native organizations. Liz Fong-Jones and Dave Rensin share eight key lessons that Google's customer reliability engineering team learned while helping large enterprises adopt SRE as an operations engineering model.


Talk Title	Building successful site reliability engineering in large enterprises
Speakers	Liz Fong-Jones (Honeycomb), Dave Rensin (Google)
Conference	O’Reilly Velocity Conference
Conf Tag	Building and maintaining complex distributed systems
Location	New York, New York
Date	October 1-3, 2018
URL	Talk Page
Slides	Talk Slides
Video	Talk Video

Google’s customer reliability engineering team is a specialized group of SREs who go into the world and teach enterprise customers of public cloud infrastructure, via their actual production systems, how to “do SRE” in their orgs. In the team’s two years of existence, its members have found that some things they thought would be hard weren’t, while others were nigh on impossible. The team has written many postmortems and learned a bunch of lessons you can only learn the hard way. Liz Fong-Jones and Dave Rensin share eight of these key lessons.



Ansible for SRE teams


February 1, 2020

Ansible is a "batteries included" automation, configuration management, and orchestration tool that's fast to learn and flexible enough for any architecture. Join James Meickle to get started with Ansible, with an eye toward sustainable development in cloud environments.

Bulk image processing using Kubernetes


February 1, 2020

Mike Newswanger explains how he used Kubernetes and Google Cloud to burst and extend the capacity of a physical infrastructure for optimizing almost 10 million images in less than two weeks.

Migrating Spotify's runtime to Kubernetes


January 30, 2020

Spotify recently finished migrating all of its services from bare-metal hardware to cloud hosts on GCP. It is now moving from merely cloud-hosted to cloud-native by migrating those services to Kubernetes. James Wen discusses the work involved, lessons learned, and pitfalls encountered in moving services onto Kubernetes.

Cassandra versus cloud databases


January 25, 2020

Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra's strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.

Network Reliability Engineering (NRE) and DevNetOps


January 17, 2020

If big changes begin inside-out and not have-do-be but be-do-have, then goals of automation require us to focus on our core being and behaviors instead of products, tools and programmability/APIs. To …

Keynote: Kubernetes, Istio, Knative: The New Open Cloud Stack


January 11, 2020

Kubernetes has succeeded in its initial mission. Launched by Google as an open source platform built on the foundations of Borg, Kubernetes has grown into an enterprise platform, with a strong communi …