1-5-10: How to Fast Recover Container Failure at Large Scale - XiongHuan, Alibaba
Talk Title | 1-5-10: How to Fast Recover Container Failure at Large Scale - XiongHuan, Alibaba |
Speakers | Huan Xiong (Senior Engineer, Alibaba) |
Conference | KubeCon + CloudNativeCon |
Location | Shanghai, China |
Date | Jun 23-26, 2019 |
URL | Talk Page |
Slides | Talk Slides |
In the cloud era, enterprises run a rapidly growing number of container-based applications, and the likelihood of container failure grows with them, driven by manual operations, hardware faults, and other causes. Guaranteeing the reliability of containers at scale without increasing resource investment is therefore a major challenge for cloud platforms. Alibaba has run millions of containers and has put forward the 1-5-10 theory for recovering from container-related failures: an MTTD (mean time to detect) of 1 minute, an MTTI (mean time to identify) of 5 minutes, and an MTTR (mean time to resolve) of 10 minutes. This session discusses how to increase the reliability of large-scale container deployments with 1-5-10: 1. how to build an efficient local agent that detects problems within 1 minute; 2. how to diagnose container problems intelligently using an expert knowledge base; 3. how to recover from container problems automatically in a failure-driven way.
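To make the 1-5-10 targets concrete, here is a minimal sketch of how a set of incidents could be checked against them. It is not taken from the talk: the `Incident` record, its field names, and the helper functions are hypothetical, and it assumes that all three times are measured from the moment the failure starts, which the abstract does not specify.

```python
from dataclasses import dataclass
from statistics import mean

# Targets stated in the abstract, expressed in seconds.
TARGETS = {"MTTD": 60, "MTTI": 300, "MTTR": 600}


@dataclass
class Incident:
    # Hypothetical record; every field is a timestamp in seconds.
    started: float     # when the failure began
    detected: float    # when monitoring noticed it
    identified: float  # when the cause was identified
    resolved: float    # when the container was recovered


def one_five_ten(incidents):
    """Mean detect/identify/resolve times.

    Assumption: each time is measured from the start of the failure.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    return {
        "MTTD": mean(i.detected - i.started for i in incidents),
        "MTTI": mean(i.identified - i.started for i in incidents),
        "MTTR": mean(i.resolved - i.started for i in incidents),
    }


def meets_targets(metrics):
    """Map each metric name to True if it is within its 1-5-10 target."""
    return {name: metrics[name] <= limit for name, limit in TARGETS.items()}


# Illustrative data only, not measurements from Alibaba.
incidents = [
    Incident(started=0, detected=30, identified=200, resolved=500),
    Incident(started=0, detected=90, identified=400, resolved=800),
]
metrics = one_five_ten(incidents)
report = meets_targets(metrics)
```

With the illustrative data above, detection and identification meet their targets on average, while resolution (650 seconds) exceeds the 10-minute target.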