October 6, 2019

1-5-10: How to Fast Recover Container Failure at Large Scale - XiongHuan, Alibaba

Talk Title 1-5-10: How to Fast Recover Container Failure at Large Scale - XiongHuan, Alibaba
Speakers Huan Xiong (Senior Engineer, Alibaba)
Conference KubeCon + CloudNativeCon
Conf Tag
Location Shanghai, China
Date Jun 23-26, 2019
URL Talk Page
Slides Talk Slides
Video

In the cloud era, container-based applications in the enterprise are growing rapidly, and the likelihood of container failure is amplified accordingly by manual operations, hardware failures, and similar causes. Guaranteeing the reliability of containers at scale without increasing resource investment is therefore a major challenge for cloud platforms. Alibaba has run millions of containers and has put forward the 1-5-10 theory for recovering from container-related failures: MTTD (mean time to detect) is 1 min, MTTI (mean time to identify) is 5 min, and MTTR (mean time to resolve) is 10 min. In this session we’ll discuss how to increase the reliability of large-scale container deployments with 1-5-10: 1. How to build an efficient local agent that detects problems within 1 min; 2. How to diagnose container problems intelligently using an expert knowledge base; 3. How to recover from container problems automatically in a failure-driven way.
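To make the 1-5-10 targets concrete, here is a minimal Python sketch that averages per-incident timings and compares them with the three budgets. It is an illustration only, not the tooling presented in the talk: the `Incident` record, the function names, and the sample numbers are hypothetical, and it assumes each time is measured in minutes from the onset of the failure, which the abstract does not specify.

```python
from dataclasses import dataclass
from statistics import mean

# Targets in minutes, taken from the abstract: detect 1, identify 5, resolve 10.
TARGETS_MIN = {"detect": 1.0, "identify": 5.0, "resolve": 10.0}


@dataclass
class Incident:
    # Hypothetical record. Assumption: each field is minutes elapsed
    # since the failure began, not the duration of the individual phase.
    detected_at: float
    identified_at: float
    resolved_at: float


def mean_times(incidents):
    """Return the mean time to detect, identify, and resolve, in minutes."""
    return {
        "detect": mean(i.detected_at for i in incidents),
        "identify": mean(i.identified_at for i in incidents),
        "resolve": mean(i.resolved_at for i in incidents),
    }


def meets_1_5_10(incidents):
    """Report, per phase, whether the mean time is within its target."""
    means = mean_times(incidents)
    return {phase: means[phase] <= TARGETS_MIN[phase] for phase in TARGETS_MIN}


# Made-up sample data for illustration only.
incidents = [Incident(0.5, 3.0, 8.0), Incident(1.0, 6.0, 14.0)]
report = meets_1_5_10(incidents)
print(report)
```

With this sample data the detection and identification means stay within budget while the resolution mean (11 min) exceeds the 10 min target, so only the `resolve` entry is reported as missed.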
