December 8, 2019

Lightning Talk: Scaling Distributed Deep Learning with Service Discovery: How CoreDNS Helps Distributed TensorFlow Tasks

Talk Title: Lightning Talk: Scaling Distributed Deep Learning with Service Discovery: How CoreDNS Helps Distributed TensorFlow Tasks
Speakers: Yong Tang (Director of Engineering, MobileIron)
Conference: KubeCon + CloudNativeCon Europe
Conf Tag:
Location: Copenhagen, Denmark
Date: Apr 30-May 4, 2018
URL: Talk Page
Slides: Talk Slides
Video:

Training models with modern deep learning architectures is often computationally intensive and requires an efficient distributed system at scale. In the distributed machine learning community, such systems often have special requirements that involve additional effort. This talk discusses using CoreDNS for service discovery on distributed TensorFlow clusters that solve deep learning problems. While CoreDNS is widely used for service discovery in Kubernetes, its plugin-based design also allows it to be extended and deployed easily in non-traditional distributed systems. Deployed in the cloud (AWS), our distributed TensorFlow clusters have become considerably more robust against partial node failures thanks to CoreDNS. The deployment has also been simplified so that non-DevOps users (e.g., machine learning researchers) can launch and execute deep learning tasks with ease.
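
To make the idea of DNS-based service discovery concrete, here is a minimal Python sketch of how a node might turn DNS answers into a TensorFlow cluster description. It is not the setup presented in the talk. The DNS names (`workers.example.internal`, `ps.example.internal`), the port 2222, the helper names `resolve_ipv4` and `build_tf_config`, and the assumption that the DNS server returns one A record per live node are all hypothetical. The only TensorFlow-specific part is the layout of the `TF_CONFIG` environment variable (a `cluster` map plus a `task` entry).

```python
import json
import socket


def resolve_ipv4(name):
    """Return the IPv4 addresses behind a DNS name, sorted as strings.

    Assumption: the DNS server answers with one A record per live node.
    """
    _, _, addresses = socket.gethostbyname_ex(name)
    return sorted(addresses)


def build_tf_config(service_names, task_type, task_index,
                    port=2222, resolve=resolve_ipv4):
    """Build a TF_CONFIG-style dict from DNS names.

    service_names maps a job name ("worker", "ps") to a (hypothetical)
    DNS name. Sorting the addresses keeps the ordering identical on every
    node that receives the same set of records.
    """
    cluster = {}
    for job, name in service_names.items():
        cluster[job] = ["{}:{}".format(ip, port) for ip in resolve(name)]
    return {"cluster": cluster,
            "task": {"type": task_type, "index": task_index}}


if __name__ == "__main__":
    # Stand-in resolver so the sketch runs without a DNS server.
    fake_records = {
        "workers.example.internal": ["10.0.0.2", "10.0.0.1"],
        "ps.example.internal": ["10.0.1.1"],
    }
    config = build_tf_config(
        {"worker": "workers.example.internal", "ps": "ps.example.internal"},
        task_type="worker",
        task_index=0,
        resolve=lambda name: sorted(fake_records[name]),
    )
    # A launcher script could export this JSON as the TF_CONFIG variable.
    print(json.dumps(config, indent=2))
```

In a real deployment the default `resolve_ipv4` would query whatever DNS server the node is configured to use. A node that disappears from the DNS answer is simply left out of the cluster list the next time the configuration is built.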
