The secret sauce behind LinkedIn's self-managing Kafka clusters

LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.


Talk Title	The secret sauce behind LinkedIn's self-managing Kafka clusters
Speakers	Jiangjie Qin (LinkedIn)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018
URL	Talk Page
Slides	Talk Slides
Video

Kafka clusters should be run with rack awareness, a balanced utilization of all the hardware resources (CPU, disk, and network), and automatic recovery from failures. When running small clusters, the above requirements can be satisfied manually (e.g., by SREs monitoring the broker resource distribution and manually reassigning replicas from hot brokers to less loaded brokers). However, doing so becomes prohibitively expensive with a large Kafka deployment. LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity, and LinkedIn has been exploring ways to do so. There are a few challenges to be addressed: Jiangjie Qin explains how LinkedIn addressed the above problems and shares lessons learned from operating Kafka at scale with minimum human intervention. This experience can also be applied to the management of other stateful distributed systems. While there are existing products that help balance resource utilization in a distributed system, most of these are application agnostic and perform the rebalance by migrating the entire application process. While this works well for stateless systems, it typically falls short when it comes to stateful systems (e.g., Kafka) due to the large amount of state associated with the process. LinkedIn’s solution addresses this problem by trying to understand the application and migrating only a partial state—a solution that could be useful in any stateful distributed system.

The secret sauce behind LinkedIn's self-managing Kafka clusters

Apache Kafka + Apache Mesos = Highly scalable streaming microservices

Stream processing with Kafka

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Everything you Need to Know about Using GPUs with Kubernetes

Inside Kubernetes Resource Management (QoS) Mechanics and Lessons from the Field

Enabling Hyperledger Fabric Business Network Across Multiple Clouds using ChainDigit Decentralised Management Console