September 28, 2019


Managing Large-Scale Kubernetes Clusters Effectively and Reliably

Talk Title Managing Large-Scale Kubernetes Clusters Effectively and Reliably
Speakers Yong Zhang (Senior Software Engineer, Ant Financial), Zhixian Lin (Senior Software Engineer, Ant Financial)
Conference KubeCon + CloudNativeCon
Location Shanghai, China
Date Jun 23-26, 2019

As the business grows, we need to deploy Kubernetes in several data centers around the world. A single data center has more than ten thousand Nodes. The critical challenge we face is how to manage several large-scale Kubernetes clusters across data centers efficiently and reliably. In this talk, we will share our experience and practices in automating large-scale cluster management. First, we will introduce fully automated Node lifecycle management, and how to automatically discover and recover from Node failures using NPD (Node Problem Detector), autoscalers, and a customized Operator. Then we will share our experience and solutions for Kubernetes cluster deployment and upgrades. Finally, we will present our risk prevention and control system, built on Prometheus and an Operator, which is the cornerstone of reliability thanks to its automatic fault detection and isolation.
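
To make the Node failure discovery idea more concrete, here is a minimal sketch of how a remediation loop might classify Nodes from their status conditions and decide which ones to isolate. It is not the speakers' implementation: the function names, the `max_unavailable` cap, and the policy itself are assumptions made for illustration. It works on plain dictionaries shaped like Kubernetes Node objects, and it assumes NPD is reporting the `KernelDeadlock` and `ReadonlyFilesystem` conditions.

```python
# Hypothetical sketch: the names and the policy are illustrative assumptions,
# not the implementation described in the talk.

# Node condition types assumed to be reported by Node Problem Detector.
PROBLEM_CONDITIONS = {"KernelDeadlock", "ReadonlyFilesystem"}


def node_problems(node):
    """Return the sorted problem labels found in a Node's status conditions."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready" and status != "True":
            # Ready is healthy only when its status is "True".
            problems.append("NotReady")
        elif ctype in PROBLEM_CONDITIONS and status == "True":
            # Problem conditions are unhealthy when their status is "True".
            problems.append(ctype)
    return sorted(problems)


def plan_remediation(nodes, max_unavailable=1):
    """Pick unhealthy Nodes to isolate, capped at max_unavailable per pass."""
    plan = []
    for node in nodes:
        problems = node_problems(node)
        if problems and len(plan) < max_unavailable:
            plan.append((node["metadata"]["name"], problems))
    return plan


if __name__ == "__main__":
    nodes = [
        {"metadata": {"name": "node-a"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-b"},
         "status": {"conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "KernelDeadlock", "status": "True"}]}},
        {"metadata": {"name": "node-c"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
    for name, problems in plan_remediation(nodes, max_unavailable=2):
        print(name, problems)
```

The cap on how many Nodes are isolated per pass is one simple way to keep an automated repair loop from taking out a large share of a cluster on a false alarm. A real Operator would act on the plan through the Kubernetes API, for example by cordoning and draining the selected Nodes.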
