The secret sauce behind LinkedIn's self-managing Kafka clusters
LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
Talk Title | The secret sauce behind LinkedIn's self-managing Kafka clusters |
Speakers | Jiangjie Qin (LinkedIn) |
Conference | Strata Data Conference |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 6-8, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Kafka clusters should be run with rack awareness, a balanced utilization of all the hardware resources (CPU, disk, and network), and automatic recovery from failures. When running small clusters, the above requirements can be satisfied manually (e.g., by SREs monitoring the broker resource distribution and manually reassigning replicas from hot brokers to less loaded brokers). However, doing so becomes prohibitively expensive with a large Kafka deployment. LinkedIn runs more than 1,800+ Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity, and LinkedIn has been exploring ways to do so. There are a few challenges to be addressed: Jiangjie Qin explains how LinkedIn addressed the above problems and shares lessons learned from operating Kafka at scale with minimum human intervention. This experience can also be applied to the management of other stateful distributed systems. While there are existing products that help balance resource utilization in a distributed system, most of these are application agnostic and perform the rebalance by migrating the entire application process. While this works well for stateless systems, it typically falls short when it comes to stateful systems (e.g., Kafka) due to the large amount of state associated with the process. LinkedIn’s solution addresses this problem by trying to understand the application and migrating only a partial state—a solution that could be useful in any stateful distributed system.