Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning
| Talk Title | Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning |
| Speakers | Huamin Chen (Principal Software Engineer, Red Hat), Yuan Zhou (Senior Software Development Engineer, Intel) |
| Conference | KubeCon + CloudNativeCon North America |
| Conf Tag | |
| Location | Seattle, WA, USA |
| Date | Dec 9-14, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Large training datasets used by machine learning frameworks such as Kubeflow are usually stored in low-cost, high-capacity object stores such as Amazon S3 or Google Cloud Storage. However, S3's rate limiting and slow download speeds significantly hinder training performance and limit compute scalability. We introduce Nezha and explain how it improves Kubeflow training. Nezha is an open source, community-driven, and highly collaborative project built by storage and big data engineers. Nezha is based on the Kubernetes Initializer mechanism: it rewrites the Pod spec, adds a sidecar S3 cache, and redirects the Pod to the local cache for faster data access. Nezha is self-contained and easy to use: it requires no modification to existing applications and no user-visible Pod changes. Nezha improves big data application performance; our initial Kubeflow benchmark results on the MNIST dataset show that Nezha achieves roughly a 2x speedup.
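The Pod rewrite the abstract describes can be sketched as a Kubernetes manifest. The following is a minimal, hypothetical illustration of the sidecar caching pattern, not Nezha's actual output: the container names, images, and cache endpoint are assumptions made for clarity.

```yaml
# Hypothetical result of Nezha's Pod spec rewrite: a training Pod with
# an injected S3 cache sidecar. Images and ports here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: mnist-training
spec:
  containers:
  - name: trainer
    image: kubeflow/tf-mnist:latest      # original training container
    env:
    # Redirected from the remote S3 endpoint to the local sidecar cache,
    # so the application itself needs no changes
    - name: S3_ENDPOINT
      value: "http://127.0.0.1:9000"
  - name: s3-cache                       # sidecar added by the initializer
    image: example/s3-cache:latest       # hypothetical cache image
    ports:
    - containerPort: 9000
```

Because both containers share the Pod's network namespace, the trainer reaches the cache over localhost; cache hits avoid S3's rate limits and network latency entirely.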