Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning
| Talk Title | Nezha: A Kubernetes Native Big Data Accelerator For Machine Learning |
| Speakers | Huamin Chen (Principal Software Engineer, Red Hat), Yuan Zhou (Senior Software Development Engineer, Intel) |
| Conference | KubeCon + CloudNativeCon North America |
| Conf Tag | |
| Location | Seattle, WA, USA |
| Date | Dec 9-14, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Large training datasets used by machine learning frameworks such as Kubeflow are usually stored in low-cost, high-capacity object stores such as Amazon S3 or Google Cloud Storage. However, S3's rate limiting and slow download speeds significantly hinder training performance and limit compute scalability. We introduce Nezha and explain how it improves Kubeflow training. Nezha is an open source, community-driven, and highly collaborative project built by storage and big data engineers. Nezha is based on the Kubernetes Initializer mechanism: it rewrites the Pod spec, adds a sidecar S3 cache, and redirects the Pod to the local cache for faster data access. Nezha is self-contained and easy to use: it requires no modification to existing applications and no user-visible Pod changes. Nezha improves big data application performance; our initial Kubeflow benchmark results on the MNIST dataset show that Nezha achieves roughly a 2x speedup.
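The Pod rewrite the abstract describes can be sketched as a Kubernetes manifest. The following is a minimal, hypothetical illustration of the sidecar caching pattern, not Nezha's actual output: the container names, images, and cache endpoint are assumptions made for clarity.

```yaml
# Hypothetical result of Nezha's Pod spec rewrite: a training Pod with
# an injected S3 cache sidecar. Images and ports here are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: mnist-training
spec:
  containers:
  - name: trainer
    image: kubeflow/tf-mnist:latest      # original training container
    env:
    # Redirected from the remote S3 endpoint to the local sidecar cache,
    # so the application itself needs no changes
    - name: S3_ENDPOINT
      value: "http://127.0.0.1:9000"
  - name: s3-cache                       # sidecar added by the initializer
    image: example/s3-cache:latest       # hypothetical cache image
    ports:
    - containerPort: 9000
```

Because both containers share the Pod's network namespace, the trainer reaches the cache over localhost; cache hits avoid S3's rate limits and network latency entirely.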