November 24, 2019


Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I]


Talk Title Building Distributed TensorFlow Using Both GPU and CPU on Kubernetes [I]
Speakers Huizhi Zhao (Software Engineer, Caicloud), Zeyu Zheng (Chief Data Scientist, Caicloud)
Conference CloudNativeCon + KubeCon Europe
Location Berlin Congress Center
Date Mar 28-30, 2017
URL Talk Page
Slides Talk Slides

Big Data and Machine Learning have become extremely hot topics in recent years. Google has announced its AI-centric strategy and released the deep learning toolkit TensorFlow, which soon became the most popular open source toolkit for deep learning applications. However, training a large deep learning model on a single machine without a GPU can take years. To accelerate training, we built a distributed TensorFlow system on Kubernetes that supports both CPUs and GPUs. In this presentation, I'd like to share our experience building this distributed TensorFlow system on Kubernetes. First, I'll briefly introduce TensorFlow and how it supports distributed model training. The original distribution mechanism, however, lacks many components needed for production use, such as scheduling, monitoring, and life-cycle management. In the rest of the presentation, I'll focus on how to leverage Kubernetes to solve those problems. The solution involves three components. First, I'll introduce how to schedule TensorFlow jobs in a cluster with both CPUs and GPUs. Then I'll share our experience managing the life cycle of a distributed TensorFlow job. Finally, I'll describe our efforts to lower the bar for using distributed TensorFlow.
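For context, the "original distribution mechanism" the abstract refers to is TensorFlow's parameter-server architecture, in which the cluster layout is declared with a `ClusterSpec` and each process runs a server for one role. A minimal sketch of that layout is below; the host names follow a hypothetical Kubernetes service naming scheme and are not from the talk:

```python
# Sketch of a TensorFlow parameter-server cluster definition, the kind of
# manual setup that Kubernetes-based scheduling aims to automate.
# Host names are hypothetical (e.g. pods behind a "tfjob" service).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps-0.tfjob.svc:2222"],          # parameter server, typically on CPU
    "worker": ["worker-0.tfjob.svc:2222",   # workers, typically on GPU nodes
               "worker-1.tfjob.svc:2222"],
})

# Each pod would then start a server for its own role and index, e.g.:
#   server = tf.distribute.Server(cluster, job_name="worker", task_index=0)

print(cluster.num_tasks("worker"))  # -> 2
```

Without an orchestrator, every process must be given this spec plus its own role and index by hand; the talk's point is that Kubernetes can generate and manage this wiring, along with scheduling and life-cycle handling, across CPU and GPU nodes.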
