November 26, 2019

Debugging distributed systems

Distributed systems are hard. They are complicated, hard to understand, and very challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them. Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures.


Talk Title	Debugging distributed systems
Speakers	Donny Nadolny (PagerDuty)
Conference	Velocity
Conf Tag	Build resilient systems at scale
Location	Santa Clara, California
Date	June 21-23, 2016

Despite our best efforts, our systems fail. Sometimes it’s our fault: code that we wrote, bugs that we caused. But sometimes the fault is with systems that we have no direct control over. Distributed systems are hard. They are complicated, hard to understand, and very challenging to manage. But they are critical to modern software, and when they have problems, we need to fix them.

ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems like Kafka and Spark. It is used by PagerDuty for many critical systems, and for five months it failed a lot.

Donny Nadolny looks at what it takes to debug a problem in a distributed system like ZooKeeper, walking attendees through the process of finding and fixing one cause of many of these failures. Donny explains how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you will want to know about TCP, including an example of machines having a different view of the state of a TCP stream. If you are interested in distributed systems and how they can fail, this session is for you.

Sightseeing, venues, and friends: Predictive analytics with Spark ML and Cassandra

November 17, 2019

Which venues have similar visiting patterns? How can we detect when a user is on vacation? Can we predict which venues will be favorited by users by examining their friends' preferences? Natalino Busa explains how these predictive analytics tasks can be accomplished by using Spark SQL, Spark ML, and just a few lines of Scala code.

Stream analytics in the enterprise: A look at Intel's internal IoT implementation

November 17, 2019

Moty Fania shares Intel's IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform's architecture.

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

October 25, 2019

Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intel's IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Apache Eagle: Secure Hadoop in real time

November 21, 2019

Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.

Simple, fast, and flexible risk aggregation in Hadoop

November 17, 2019

Value at risk (VaR) is a widely used risk measure. Because VaR is not additive, the VaR of an aggregate cannot be obtained by summing the VaRs of its parts, so traditional database aggregation functions do not work for reporting VaR at an aggregate level. Deenar Toraskar explains how Hive complex data types and user-defined functions can be used to provide simple, fast, and flexible VaR aggregation.
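
To illustrate why VaR cannot simply be summed, here is a minimal Python sketch. It is not the speaker's Hive implementation. It assumes a simple historical-simulation VaR, and the function names (`historical_var`, `aggregate_pnl`) and the scenario numbers are hypothetical. Each desk keeps a vector of profit-and-loss values, one per scenario; the vectors are summed element-wise first, and VaR is computed on the combined vector.

```python
def historical_var(pnl, tail=0.2):
    """Historical-simulation VaR, reported as a positive loss.

    Sorts the scenario P&L from worst to best and reads off the value
    at the `tail` fraction of the way in (an assumed, simplified
    quantile convention).
    """
    ordered = sorted(pnl)
    index = min(int(round(tail * len(ordered))), len(ordered) - 1)
    return -ordered[index]


def aggregate_pnl(*vectors):
    """Element-wise sum of per-scenario P&L vectors."""
    return [sum(values) for values in zip(*vectors)]


# Hypothetical P&L per scenario for two desks (same 10 scenarios).
desk_a = [-10, 5, 3, -2, 8, 1, -4, 6, 2, -1]
desk_b = [4, -9, 2, 3, -3, 5, 1, -6, 2, 0]

var_a = historical_var(desk_a)
var_b = historical_var(desk_b)
var_total = historical_var(aggregate_pnl(desk_a, desk_b))

print("sum of desk VaRs:", var_a + var_b)
print("VaR of combined book:", var_total)
```

In this made-up data the losses on the two desks partly offset each other, so the VaR of the combined book differs from the sum of the desk VaRs. That is why an aggregation has to carry the whole P&L vector rather than a single number per row.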