reliability

A good SRE is hard to find; or, The power of apprenticeship

February 13, 2020

Rowan Cota explains how BuzzFeed created a strong SRE team by growing the engineers it needed instead of waiting for them to fall out of the sky, and how you can too. Rowan turns narrative examples into a framework that anyone can use to harness the power of growing potential to diversify and strengthen their teams.

Workshop: Cloud-native Network Functions (CNF) Seminar

February 6, 2020

Two of the fastest-growing Linux Foundation projects, ONAP (part of LF Networking) and Kubernetes (part of CNCF), are coming together in the next-generation telecom architecture. Telcos are engaging …

High Altitude, Low Risk: Measuring Reliability in the Cloud Using Open Source Technology

February 5, 2020

With the financial convenience and flexibility of per-instance spend that cloud hosting allows, it follows that companies of all sizes have migrated their resources to the virtual world, putting their …

Building successful site reliability engineering in large enterprises

February 1, 2020

Implementing site reliability engineering (SRE) doesn't have to be intimidating, and it isn't only for cloud-native organizations. Liz Fong-Jones and Dave Rensin share eight key lessons Google's customer reliability engineering team learned helping large enterprises adopt SRE as an operations engineering model.

Frankenstein's microservices: How to avoid the monster

January 31, 2020

Many companies adopt microservices to break down monoliths, but they soon uncover a hidden cost: How do you manage all these new interconnected things popping up? Michael Hamrah explains how to avoid creating Frankenstein's monster by understanding elements of a microservice platform, so you can sleep at night.

Sell cron, buy Airflow: Modern data pipelines in finance

January 29, 2020

Quantopian integrates financial data from vendors around the globe. As the scope of its operations outgrew cron, the company turned to Apache Airflow, a distributed scheduler and task executor. James Meickle explains how in less than six months, Quantopian was able to rearchitect brittle crontabs into resilient, recoverable pipelines defined in code to which anyone could contribute.

How to cost-effectively and reliably build infrastructure for machine learning

January 22, 2020

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.

Hudi: Unifying storage and serving for batch and near-real-time analytics

January 22, 2020

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Platform Approach for SDN Predictive Management Using AI and ML

January 19, 2020

AI and ML are the latest hot buzzwords, not only from the top technology companies but also from businesses large and small around the world. AT&T has promoted the cause of AI/ML over the last 20 years …

Network Reliability Engineering (NRE) and DevNetOps

January 17, 2020

If big changes begin inside-out and not have-do-be but be-do-have, then goals of automation require us to focus on our core being and behaviors instead of products, tools and programmability/APIs. To …

A New Software Engineering Methodology for Creating Resilient Microservices

January 15, 2020

Worldwide business houses are increasingly mandated to be agile and reliable in their operations, offerings and outputs. As IT is the greatest enabler of businesses, IT professionals and professors ac …

Lightning Talk: CoreDNS Over gRPC: Reliable Service Discovery for Kubernetes

January 11, 2020

While service discovery in Kubernetes may be provided via multiple mechanisms, DNS is the most commonly used and highly recommended for its ease of use. One challenge with DNS-based service discovery …

Container Platforms as Equalizers: Running Health Services Across the World

January 9, 2020

Praekelt.org creates and operates a number of health and youth-related services which are hosted on containerised clusters around the world, often in countries without an established cloud provider pr …

Design and analysis of the world's most advanced microprocessors using Jupyter notebooks

January 9, 2020

Kerim Kalafala and Nicholai L'Esperance share their experiences using Jupyter notebooks as a critical aid in designing the next generation of IBM Power and Z processors, focusing on analytics on graphs consisting of hundreds of millions of nodes. Along the way, Kerim and Nicholai explain how they leverage Jupyter notebooks as part of their overall design system.

Highly Available Kubernetes Clusters - Best Practices

January 6, 2020

Everyone running a Kubernetes cluster in production wants reliability and high availability. Many clusters may implement a multi-master setup, but often this is not enough to consider a cluster highly …

Keynote: High Reliability Infrastructure Migrations

December 30, 2019

For companies with high availability requirements (99.99% uptime or higher), running new software in production comes with a lot of risks. But it's possible to make significant infrastructure changes …

Mozilla's journey from the data center to the cloud

December 28, 2019

Michael Van Kleeck leads a frank discussion of Mozilla's multiyear journey to take all of its apps from the data center to the cloud. Join in to hear about the adventure, in which Mozilla vanquishes a multitude of organizational and technical challenges and emerges ready to empower its mission of protecting the open internet.

Enterprise Machine Learning on K8s: Lessons Learned and the Road Ahead

December 24, 2019

Kubernetes as a platform is being asked to support an ever-increasing range of workloads, including machine learning and big data processing. These new workloads introduce challenges for both end …

A retrospective on retrospectives: How to be a nonexpert expert in system resilience

December 23, 2019

Jessica DeVita tells the story of how a team at Microsoft challenged themselves to retrospect their retrospectives and shares what they learned about applying human factors ideas to software development.

Day 2 with Stateful Applications - Implementing a Data Protection Strategy

December 23, 2019

As teams start to onboard mission-critical applications into production, there's a need to address day-2 concerns. Dealing with regulatory requirements, user error, ransomware and cluster upgrades - r …

From dandelion to tree: Scaling Slack

December 22, 2019

In 2016, Slack faced a problem: the load on its backend servers had increased by 1,000x. Bing Wei explains how rearchitecting the system with lazy loading, a publish/subscribe model, and an edge cache service overcame the problem with zero downtime, improved latency, and led to gains in reliability and availability.

Reliability from the ground up: Designing for five nines

December 20, 2019

Astrid Atkinson discusses techniques for building systems that are resilient by design.

Switching the Engine (DNS) in Kubernetes: Benchmarks and Possibilities

December 20, 2019

DNS is one of the core components making Kubernetes run. It's essential for most services and service discovery. It's critical, underappreciated and overlooked at the same time. With the recent switch …

Deep Dive: NATS

December 10, 2019

The NATS project and its ecosystem have been continuously evolving since joining the CNCF and in this session we will share a retrospective of what is the current state of the art and overall directio …

Executive Briefing: What you need to know about fast data

December 9, 2019

Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully.

SRv6LB: Leveraging IPv6, Segment Routing, and VPP for Very Fast, Reliable, and Efficient Distributed Data Center Workload Balancing

December 1, 2019

In this talk, we present performance and scalability numbers from our open source implementation of the Maglev data-plane (part of Google's load balancing architecture as defined in [1]) in fd.io/VPP, …

SIG Scheduling Deep Dive Bobby Salamat &

November 29, 2019

Please join us for an in-depth understanding of the Kubernetes Scheduler and its advanced features. In this presentation we talk about the internals of the Kubernetes Scheduler and how it keeps track of the clust …

Challenges to Writing Cloud Native Applications

November 27, 2019

Cloud native means designing software explicitly for the cloud, not trying to deploy to the cloud in retrospect - shoving a single replica of a monolith into Kubernetes won't cut it. Developing for …