reliability

Going serverless: Security outside the box

January 22, 2020

The advent of serverless technologies and infrastructure as code has changed how we build and deploy security services, empowering teams to create low-cost, scalable, and secure services to protect organizations. Drawing on their real-world experiences, Jack Naglieri and Austin Byers explore tools and techniques for successfully building, deploying, and debugging serverless security applications.

Interconnection Track: Creating a Centralized Database for Peering and Colocation Services

January 20, 2020

Everyone in the industry needs access to reliable and accurate colocation and network data. PeeringDB is the most commonly used source for this information but its …

How to make a lion bulletproof: Setting up site reliability engineering (SRE) in a global financial organization

January 19, 2020

Did you read the O'Reilly book about Google SREs but doubt that SRE will work for your more traditional or more regulated company? Janna Brummel and Robin van Zijll explain how they implemented SRE in a global financial organization, providing an overview of methods and technologies and sharing lessons learned from a year of doing SRE.

Minesweeper and Propane: Two Tools for Improving Network Reliability

January 17, 2020

Over the past 4 decades, networks have become increasingly complex as scalability, quality of service, robustness, and fault-tolerance requirements have grown to m …

Application scaling over the edge: Microservice architecture in industrial applications

January 14, 2020

Driven by the need for data analytics in Industry 4.0, edge computing is gaining momentum to bring intelligence to the devices at the network's edge. Fei Li offers insights on a microservice-based architecture that keeps analytics applications on edge devices while dynamically utilizing resources on the cloud to achieve resilience and scalability in critical industrial applications.

Customer-centric observability

January 8, 2020

With the recent flourishing of observability systems, there's no shortage of things to monitor. Sadly, humans have limited capacity to process them all. Mark McBride outlines three key metrics (request rate, success rate, and the latency histogram) that provide a high-level abstraction of the customer experience. If these three metrics are good, your system is healthy from a customer perspective.

CephFS: The Stable Distributed Filesystem

January 7, 2020

Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. The POSIX-compatible CephFS was declared stable in its ea …

Thriving under a continuous self-inflicted DDoS attack

January 6, 2020

New Relic customers send monitoring data to New Relic servers every minute: a continuous firehose of data. Drawing on his experience at New Relic, Kevin Beck shares best practices for building a streaming service based on Apache Kafka, self-monitoring for reliability and fault tolerance, and building a DevOps culture that anticipates and prevents outages.

Geospatial big data analysis at Uber

January 3, 2020

Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

Comparison of FOSS Distributed Storage

December 31, 2019

Marian will compare the performance and reliability of some of the most used distributed storage systems: Ceph, GlusterFS, DRBD + NFS, OrangeFS, and MooseFS. In this talk you will not only see some s …

From Zero to Hero: Scalable 4K Video Encoding with Kubernetes and Other Open Source Tools

December 31, 2019

Hygo Reinaldo (Xite Networks): Encoding 4K videos can be very challenging due to aspects like encoding time, …

Intro to Ceph, the Distributed Storage System

December 26, 2019

Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. With an advanced placement algorithm, active storage node …

IBM LinuxONE: The Largest Scalable Linux Server

December 23, 2019

The modernization possibilities on the most scalable compute platform for secure data-driven workloads. Open source has become a hub for innovation. New use cases such as containers, new classes of dat …

Jupyter notebooks and production data science workflows

December 22, 2019

Jupyter notebooks are a great tool for exploratory analysis and early development, but what do you do when it's time to move to production? A few years ago, the obvious answer was to export to a pure Python script, but now there are other options. Andrew Therriault dives into real-world cases to explore alternatives for integrating Jupyter into production workflows.

Database reliability engineering: What, why, and how?

December 16, 2019

SRE is becoming quite the ubiquitous term, but what about DBRE? Laine Campbell and Charity Majors dive into DBRE, exploring the paths to this craft and how to culturally evolve and support it. Laine and Charity focus on organizational scale, self-service, and force multipliers in recoverability, observability, availability, security, release management, and infrastructure.

Continuous Integration at Scale on Kubernetes [B]

December 10, 2019

eBay has a large community of developers working on several thousand applications at any time. To improve developer productivity, we offer Continuous Integration As A Service (CIAAS). This system prov …

Big data science, the IoT, and the transportation sector

December 5, 2019

Wael Elrifai leads a journey through the design and implementation of a predictive maintenance platform for Hitachi Rail. The industrial internet, the IoT, data science, and big data make for an exciting ride.

The Architecture of a Multi-Cloud Environment with Kubernetes [I]

December 4, 2019

Kubernetes is an orchestration platform that enables running distributed systems, which are designed with the philosophy of spreading wide to best prepare for outages. This is achieved by deploying yo …

Kubernetes in the Datacenter: Squarespace's Journey Towards Self-Service Infrastructure [I]

December 2, 2019

As Squarespace's engineering organization evolved, microservices became an obvious solution to quickly deliver new features and improve infrastructure reliability. We encountered significant challenge …

Webhooks for Automated Updates [B]

December 2, 2019

In most software projects, there is a tremendous focus on increasing efficiency and reliability. Rolling updates in Kubernetes is a really good example of how real-time updates to applications can be …

The state of Spark in the cloud

November 29, 2019

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as a baseline.

What Happens When Something Goes Wrong? On Kubernetes Reliability [I]

November 28, 2019

One of the best features of Kubernetes is that it can automatically recover from various failures and keep your application working despite unfavorable circumstances. There are moments when this w …

Kubernetes Scheduling Features or How Can I Make the System Do What I Want? [I]

November 27, 2019

Each user has their own set of requirements and constraints on where their Pods should be placed in a cluster. Some want to increase utilization, so they want to pack Pods as densely as possible. Othe …

Databases and Docker: A survival guide

November 26, 2019

Containers are great ephemeral vessels for your applications. But what about the data that drives your business? It must survive containers coming and going, maintain its availability and reliability, and grow when you need it. Alvin Richards does some live coding to show key strategies to help you survive the transition to production.

Tales from Lastminute.com Machine Room: Our Journey Towards a Full On-Premise Kubernetes Architecture in Production [I]

November 25, 2019

We sell travel services to more than 10 million customers worldwide in 15 languages across 35 countries, through hundreds of micro-services. What happens if you challenge the way you deliver your pr …

Kubernetes-Defined Monitoring [I]

November 23, 2019

Over the past few years we've all learned how Kubernetes can dramatically change the process of deploying an application, improve reliability, and accelerate operations. As Kubernetes matures, I belie …