December 3, 2019

Building successful site reliability engineering in large enterprises

Implementing site reliability engineering (SRE) doesn't have to be intimidating, and it isn't only for cloud-native organizations. Liz Fong-Jones and Dave Rensin share eight key lessons that Google's customer reliability engineering team learned while helping large enterprises adopt SRE as an operations engineering model.


Talk Title	Building successful site reliability engineering in large enterprises
Speakers	Liz Fong-Jones (Honeycomb), Dave Rensin (Google)
Conference	Velocity
Conf Tag	Build resilient systems at scale
Location	New York, New York
Date	September 20-22, 2016

Google’s customer reliability engineering team is a specialized group of SREs who go into the world and teach enterprise customers of public cloud infrastructure—via their actual production systems—how to “do SRE” in their orgs. In the team’s two years of existence, its members have found that some things they thought would be hard weren’t, while others were nigh on impossible. The team has written many postmortems and learned a bunch of lessons you can only learn the hard way. Liz Fong-Jones and Dave Rensin share eight of these key lessons. Topics include:

Ansible for SRE teams

December 3, 2019

Ansible is a "batteries included" automation, configuration management, and orchestration tool that's fast to learn and flexible enough for any architecture. Join James Meickle to get started with Ansible, with an eye toward sustainable development in cloud environments.

Bulk image processing using Kubernetes

December 3, 2019

Mike Newswanger explains how he used Kubernetes and Google Cloud to burst and extend the capacity of a physical infrastructure to optimize almost 10 million images in less than two weeks.

Migrating Spotify's runtime to Kubernetes

November 30, 2019

Spotify recently completed the migration of all its services from bare-metal hardware to cloud hosts on GCP. It is now moving from merely cloud hosted to cloud native by migrating those services to Kubernetes. James Wen discusses the work involved, lessons learned, and pitfalls encountered in moving services onto Kubernetes.

A practical guide to monitoring and alerting with time series at scale

November 28, 2019

Monitoring only sucks when the cost of maintenance scales proportionally with the size of the system being monitored. Recently, tools have emerged that assist with scaling out monitoring configurations sublinearly with the size of the system. Jamie Wilkinson explores time series-based alerting and offers practical examples that can be employed in your environment today.

Knative: Kubernetes, serverless, and you

December 1, 2019

It's a Kubernetes world. Join Ryan Gregg to learn about Knative, an open source collaboration between Google and other industry leaders to define the future of serverless on Kubernetes. Knative solves the difficult but boring aspects of running modern cloud applications on Kubernetes.

Lessons learned migrating HealthCare.gov to Terraform

December 1, 2019

Christian Monaghan explains how he and his team successfully migrated HealthCare.gov, America's largest government website, to the cloud infrastructure provisioning tool Terraform, shares lessons learned along the way, and details how you can effectively use Terraform for your next project.