February 6, 2020

318 words 2 mins read

Container orchestrator to DL workload, Bing's approach: FrameworkLauncher


Microsoft Bing runs large, complex workflows and services, but no existing solution met its needs, so it created and open-sourced FrameworkLauncher. Kai Liu, Yuqi Wang, and Bin Wang explore the solution, built to orchestrate workloads on YARN through a single interface and without changes to the workloads themselves, covering large-scale long-running services, batch jobs, and streaming jobs.

Talk Title Container orchestrator to DL workload, Bing's approach: FrameworkLauncher
Speakers Kai Liu (BING) (Microsoft), Yuqi Wang (Microsoft), Bin Wang (Microsoft)
Conference O’Reilly Artificial Intelligence Conference
Conf Tag Put AI to Work
Location San Jose, California
Date September 10-12, 2019
URL Talk Page
Slides Talk Slides
Video

Microsoft Bing has a large-scale deployment of Hadoop, Spark, Kafka, and other open source technologies, with more than 1 million cores and 4 million GBs of RAM. It needs to run large, complex workflows and services on top of this stack, but orchestrating containers for workflows and services at that scale is challenging, and no existing solution fully met its needs. The company therefore created and open-sourced FrameworkLauncher, which has a proven track record in Bing's large-scale production environment and has been partially open-sourced as a core component of the Open Platform for AI.

Kai Liu, Yuqi Wang, and Bin Wang explore the main feature set of FrameworkLauncher:

High availability: all launcher and Hadoop components are recoverable and work preserving, so user services are designed to remain uninterrupted when components shut down, crash, upgrade, or stay down for a long time.

High usability: no user code changes are needed to run an existing executable inside a container.

Support for service and batch job requirements, such as GPU scheduling, port scheduling, and gang scheduling, among others.

Cluster-provider-related features, such as AskMode to extend machine maintenance time for uninterrupted workloads, workload deployment, and a launcher watchdog and alerting.
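To make the "no user code changes" point concrete, here is a minimal, hypothetical sketch in Python of how an existing executable might be submitted to a FrameworkLauncher-style REST endpoint as a declarative framework description. The endpoint address, resource path, framework name, and JSON field names below are illustrative assumptions rather than the project's confirmed API; the real request format is defined by the FrameworkLauncher documentation.

```python
# Hypothetical sketch: submitting an unchanged executable to a
# FrameworkLauncher-style REST endpoint. The URL, resource path, and
# JSON field names are illustrative assumptions, not the confirmed API.
import json

import requests

LAUNCHER_URL = "http://launcher.example.com:9086"  # assumed launcher address
FRAMEWORK_NAME = "word-count-batch"                # hypothetical framework name

# Declarative description of the workload: the existing executable is
# wrapped in a container entry point, with resource requests attached.
framework_spec = {
    "taskRoles": {
        "worker": {
            "taskNumber": 4,                        # e.g. gang-schedule 4 tasks
            "taskService": {
                "entryPoint": "python word_count.py --input /data",
                "resource": {
                    "cpuNumber": 2,
                    "memoryMB": 8192,
                    "gpuNumber": 1,                 # GPU scheduling request
                    "portNumber": 1,                # port scheduling request
                },
            },
        }
    }
}

# Register the framework; the launcher would then negotiate containers
# from YARN and keep the tasks running.
response = requests.put(
    f"{LAUNCHER_URL}/v1/Frameworks/{FRAMEWORK_NAME}",
    headers={"Content-Type": "application/json"},
    data=json.dumps(framework_spec),
)
response.raise_for_status()
print(f"Submitted framework {FRAMEWORK_NAME}: HTTP {response.status_code}")
```

The workload's own code (the placeholder word_count.py here) stays untouched; concerns such as GPU, port, and gang scheduling live entirely in the declarative specification handled by the launcher.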
