Cruise Control: Effortless management of Kafka clusters
Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.
Talk Title | Cruise Control: Effortless management of Kafka clusters |
Speakers | Adem Efe Gencer (LinkedIn) |
Conference | Strata Data Conference |
Conf Tag | Big Data Expo |
Location | San Francisco, California |
Date | March 26-28, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Kafka incurs significant management overhead. Growing cluster sizes, the increasing volume and diversity of user traffic, and aging network and server components all add to this overhead. The resulting rise in hardware failures and load imbalance causes frequent service interruptions and a poor user experience. Reactive mitigation in particular becomes insufficient, because failures also affect the other services that depend on Kafka. Getting near-optimal performance from such an infrastructure service, maintaining its availability in the face of cascading failures, and achieving these objectives with minimal management overhead are critical but nontrivial tasks.

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control. Gencer begins by outlining Cruise Control’s approach to monitoring load distribution in clusters, identifying imbalances, and fixing them using replica and leadership movements. He then explains how Cruise Control detects fail-stop broker failures and SLO violations without human intervention and examines a more aggressive scenario, where Cruise Control proactively identifies and mitigates potential service disruptions.
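The monitor, detect, and rebalance loop described in the abstract can be illustrated with a toy sketch. This is not Cruise Control's implementation or API: the function names, the single scalar load per replica, the 10% tolerance, and the greedy strategy are all assumptions made for illustration. It also ignores leadership movements and real placement constraints, such as keeping replicas of the same partition on different brokers.

```python
from statistics import mean


def broker_loads(assignment, replica_load):
    """Sum the (hypothetical) load of the replicas hosted on each broker."""
    return {
        broker: sum(replica_load[r] for r in replicas)
        for broker, replicas in assignment.items()
    }


def is_balanced(loads, tolerance=0.10):
    """A cluster counts as balanced if every broker is within
    `tolerance` (as a fraction) of the mean broker load."""
    avg = mean(loads.values())
    return all(abs(load - avg) <= tolerance * avg for load in loads.values())


def rebalance(assignment, replica_load, tolerance=0.10, max_moves=100):
    """Greedily move replicas from the most loaded broker to the least
    loaded one until the cluster is balanced or no useful move remains.

    Returns the new assignment and the list of (replica, source, destination)
    moves. The input assignment is not modified.
    """
    assignment = {broker: set(replicas) for broker, replicas in assignment.items()}
    moves = []
    for _ in range(max_moves):
        loads = broker_loads(assignment, replica_load)
        if is_balanced(loads, tolerance):
            break
        src = max(loads, key=loads.get)
        dst = min(loads, key=loads.get)
        gap = loads[src] - loads[dst]
        # Moving a replica with 0 < load < gap strictly narrows the gap
        # between the two brokers; anything else would not help.
        candidates = [
            r for r in sorted(assignment[src]) if 0 < replica_load[r] < gap
        ]
        if not candidates:
            break
        # Prefer the replica that brings the two brokers closest together.
        replica = min(candidates, key=lambda r: abs(replica_load[r] - gap / 2))
        assignment[src].remove(replica)
        assignment[dst].add(replica)
        moves.append((replica, src, dst))
    return assignment, moves


# Example cluster: broker b1 hosts most of the load, b3 is empty.
example_assignment = {"b1": ["t0", "t1", "t2", "t3"], "b2": ["t4"], "b3": []}
example_load = {"t0": 40, "t1": 30, "t2": 20, "t3": 10, "t4": 20}
new_assignment, planned_moves = rebalance(example_assignment, example_load)
```

In this example the sketch plans two replica moves, after which every broker carries the same load. A production system would weigh several resources at once (for example CPU, disk, and network) and limit how much data is moved at a time.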