A good SRE is hard to find; or, The power of apprenticeship
February 13, 2020
Rowan Cota explains how BuzzFeed created a strong SRE team by growing the engineers it needed instead of waiting for them to fall out of the sky, and how you can too. Rowan turns narrative examples into a framework that anyone can use to harness the power of growing potential to diversify and strengthen their teams.
High Altitude, Low Risk: Measuring Reliability in the Cloud Using Open Source Technology
February 5, 2020
With the financial convenience and flexibility of per-instance spend that cloud hosting allows, it follows that companies of all sizes have migrated their resources to the virtual world, putting their …
Building successful site reliability engineering in large enterprises
February 1, 2020
Implementing site reliability engineering (SRE) doesn't have to be intimidating, and it isn't only for cloud-native organizations. Liz Fong-Jones and Dave Rensin share eight key lessons Google's customer reliability engineering team learned helping large enterprises adopt SRE as an operations engineering model.
Frankenstein's microservices: How to avoid the monster
January 31, 2020
Many companies adopt microservices to break down monoliths, but they soon uncover a hidden cost: How do you manage all these new interconnected things popping up? Michael Hamrah explains how to avoid creating Frankenstein's monster by understanding elements of a microservice platform, so you can sleep at night.
Sell cron, buy Airflow: Modern data pipelines in finance
January 29, 2020
Quantopian integrates financial data from vendors around the globe. As the scope of its operations outgrew cron, the company turned to Apache Airflow, a distributed scheduler and task executor. James Meickle explains how, in less than six months, Quantopian was able to rearchitect brittle crontabs into resilient, recoverable pipelines defined in code to which anyone could contribute.
How to cost-effectively and reliably build infrastructure for machine learning
January 22, 2020
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.
Hudi: Unifying storage and serving for batch and near-real-time analytics
January 22, 2020
Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.
Design and analysis of the world's most advanced microprocessors using Jupyter notebooks
January 9, 2020
Kerim Kalafala and Nicholai L'Esperance share their experiences using Jupyter notebooks as a critical aid in designing the next generation of IBM Power and Z processors, focusing on analytics on graphs consisting of hundreds of millions of nodes. Along the way, Kerim and Nicholai explain how they leverage Jupyter notebooks as part of their overall design system.
Mozilla's journey from the data center to the cloud
December 28, 2019
Michael Van Kleeck leads a frank discussion of Mozilla's multiyear journey to take all of its apps from the data center to the cloud. Join in to hear about the adventure, in which Mozilla vanquishes a multitude of organizational and technical challenges and emerges ready to empower its mission of protecting the open internet.
A retrospective on retrospectives: How to be a nonexpert expert in system resilience
December 23, 2019
Jessica DeVita tells the story of how a team at Microsoft challenged themselves to retrospect their retrospectives and shares what they learned about applying human factors ideas to software development.
From dandelion to tree: Scaling Slack
December 22, 2019
In 2016, Slack faced a problem: the load on its backend servers had increased by 1,000x. Bing Wei explains how rearchitecting the system with lazy loading, a publish/subscribe model, and an edge cache service overcame the problem with zero downtime, improved latency, and led to gains in reliability and availability.
Executive Briefing: What you need to know about fast data
December 9, 2019
Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully.
SRv6LB: Leveraging IPv6, Segment Routing, and VPP for Very Fast, Reliable, and Efficient Distributed Data Center Workload Balancing
December 1, 2019
In this talk, we present performance and scalability numbers from our open source implementation of the Maglev data-plane (part of Google's load balancing architecture) in fd.io/VPP, …