No More Chaos: Audit and Inspect Kubernetes at Scale
Talk Title | No More Chaos: Audit and Inspect Kubernetes at Scale |
Speakers | 陈杰 (Technical Expert, Alibaba Cloud), 马金晶 (Senior Development Engineer, Ant Financial) |
Conference | KubeCon + CloudNativeCon |
Conf Tag | |
Location | Shanghai, China |
Date | Jun 23-26, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Accurate fault detection and efficient issue analysis are important for the availability and stability of Kubernetes clusters. However, Kubernetes contains a huge number of resources, events, and metrics. In our cluster, we noticed that Kubernetes generates thousands of metric data points per second, which makes it challenging to find the root cause in this ocean of data, not to mention analysis, data visualization, and alarms. In this talk, we will share our experience and practices of auditing and inspecting Kubernetes at web scale. We’ll first talk about how we design metrics to reflect the stability of Kubernetes, and how we consume these metrics and set up streaming alarms. We will use real cases to demo how we aggregate and analyze this metrics data. Finally, we will share Alibaba's practices in building an automatic system for real-time data inspection and analysis for Kubernetes.