Multi-Cloud Machine Learning Data and Workflow with Kubernetes
Talk Title | Multi-Cloud Machine Learning Data and Workflow with Kubernetes |
Speakers | Fei Xue (Product Manager, Ant Financial), Lei Xue (Infrastructure Tech Lead, Momenta) |
Conference | KubeCon + CloudNativeCon |
Conf Tag | |
Location | Shanghai, China |
Date | Jun 23-26, 2019 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Autonomous vehicles require hardware-accelerated machine learning for critical problems such as tracking and classification. Momenta trains ML models in on-prem regions and public clouds, each of which comes with different GPUs and network interfaces (InfiniBand, RoCE). In this talk we discuss how we use Kubernetes to build a multi-cloud ML platform, in particular: how we manage training data across different environments, how we address multi-user and gang scheduling, and how we support heterogeneous hardware.
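
The abstract names gang scheduling without describing it. As a rough illustration of the general idea only (not the scheduler presented in the talk), the sketch below admits a distributed training job only if every one of its workers can be placed at the same time on nodes with the requested GPU type; otherwise it places none. The `Node` and `Job` types, the `gang_schedule` function, and the GPU type strings are all hypothetical names chosen for this example, and the first-fit placement is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    gpu_type: str   # hypothetical label, e.g. "v100"
    free_gpus: int


@dataclass
class Job:
    name: str
    workers: int          # number of workers that must start together
    gpus_per_worker: int
    gpu_type: str


def gang_schedule(job: Job, nodes: List[Node]) -> Optional[Dict[int, str]]:
    """Place all workers of a job or none of them.

    Returns a mapping of worker index to node name when every worker fits,
    and None (leaving the nodes untouched) when at least one does not.
    """
    # Plan against a scratch copy so a failed attempt has no side effects.
    free = {n.name: n.free_gpus for n in nodes if n.gpu_type == job.gpu_type}
    placement: Dict[int, str] = {}

    for worker in range(job.workers):
        # First-fit: take the first matching node with enough free GPUs.
        target = next(
            (name for name, gpus in free.items() if gpus >= job.gpus_per_worker),
            None,
        )
        if target is None:
            return None  # one worker cannot be placed, so reject the whole gang
        free[target] -= job.gpus_per_worker
        placement[worker] = target

    # Every worker fits: commit the reservation to the real node state.
    by_name = {n.name: n for n in nodes}
    for node_name in placement.values():
        by_name[node_name].free_gpus -= job.gpus_per_worker
    return placement
```

The all-or-nothing check is what distinguishes this from scheduling pods one at a time: a partially placed training job would hold GPUs while waiting for its remaining workers, which can deadlock a shared cluster when several such jobs compete.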