December 30, 2019

Distributed TensorFlow on Hops

Fabio Buso offers demonstrations of frameworks for building distributed TensorFlow applications on the Hops platform and walks you through the whole model lifecycle, from debugging and visualizing models on TensorBoard to parallel experimentation and distributed training (with the help of Spark) to model deployment and inferencing using TensorFlow Serving and Kubernetes.


Talk Title	Distributed TensorFlow on Hops
Speakers	Fabio Buso (Logical Clocks AB)
Conference	O’Reilly Open Source Convention
Conf Tag	Put open source to work
Location	Portland, Oregon
Date	July 16-19, 2018

Methods that scale with computation are the future of AI. Hyperscale AI companies produce the most accurate models and train their models faster with distributed deep learning. Fabio Buso shares the latest developments in distributed TensorFlow and shows how distribution can both massively reduce training time and enable parallel experimentation for hyperparameter optimization.

You’ll explore different distributed architectures for TensorFlow, including the parameter server and “ring allreduce” models, with a focus on open source TensorFlow frameworks that leverage Apache Spark to manage distributed training, such as Yahoo’s TensorFlowOnSpark, Uber’s Horovod, and the Hops model. Fabio also covers the different programming models supported and highlights the importance of cluster support for managing GPUs as a resource.

To this end, he demonstrates how Hops, an open source distribution of Hadoop with support for GPUs as a resource, can run TensorFlow applications from a Jupyter notebook using Apache Spark for distribution. He then walks you through an end-to-end demo of distributed TensorFlow, from training to model deployment and inferencing with TensorFlow Serving, using a well-known large machine learning dataset (9M images, a 1 TB extended version of ImageNet). The demo covers how to debug, monitor, and visualize training with TensorBoard and how to deploy and use trained models for inferencing on Kubernetes.
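The abstract names the “ring allreduce” model without describing it. As a rough illustration only, the sketch below simulates the idea in a single Python process: each worker's gradient vector is split into one chunk per worker, partial sums travel around the ring (scatter-reduce), and the finished chunks then travel around the ring again (allgather). The function name `ring_allreduce` and the list-of-lists representation are assumptions made for this example; this is not the code used by Horovod, TensorFlowOnSpark, or Hops, and it does no real networking.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring allreduce over equal-length gradient lists.

    worker_grads[i] is the local gradient vector of (hypothetical) worker i.
    Returns one list per worker; every list holds the element-wise sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold vectors of the same length")

    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(k * length) // n for k in range(n + 1)]

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    # Work on copies so the caller's data is left untouched.
    bufs = [list(g) for g in worker_grads]

    # Phase 1, scatter-reduce: in each step worker i sends chunk (i - step)
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot the payloads first so all sends in a step are simultaneous.
        sends = [((i - step) % n, None) for i in range(n)]
        sends = [(k, [bufs[i][j] for j in chunk(k)])
                 for i, (k, _) in enumerate(sends)]
        for i, (k, payload) in enumerate(sends):
            dst = (i + 1) % n
            for j, value in zip(chunk(k), payload):
                bufs[dst][j] += value

    # After phase 1, worker i holds the fully summed chunk (i + 1) % n.
    # Phase 2, allgather: pass the finished chunks around the ring,
    # overwriting the stale partial sums on each receiver.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, None) for i in range(n)]
        sends = [(k, [bufs[i][j] for j in chunk(k)])
                 for i, (k, _) in enumerate(sends)]
        for i, (k, payload) in enumerate(sends):
            dst = (i + 1) % n
            for j, value in zip(chunk(k), payload):
                bufs[dst][j] = value

    return bufs


if __name__ == "__main__":
    grads = [
        [1, 2, 3, 4],
        [10, 20, 30, 40],
        [100, 200, 300, 400],
    ]
    for worker_id, summed in enumerate(ring_allreduce(grads)):
        print(worker_id, summed)  # every worker prints [111, 222, 333, 444]
```

In this scheme each worker only ever talks to its neighbour and sends 2 × (n − 1) chunks in total, so the data sent per worker stays roughly constant as workers are added. A parameter server, by contrast, has every worker exchange gradients with a central node.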

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

December 5, 2019

Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

Distributed deep learning with containers on heterogeneous GPU clusters

November 26, 2019

Deep learning model performance relies on the underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as an orchestration layer that manages the containers used to train and deploy deep learning models on GPU clusters.

Deep learning with TensorFlow and Spark using GPUs and Docker containers

December 10, 2019

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms together with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a stack of TensorFlow and Spark with NVIDIA CUDA on Docker containers in a multitenant environment.

Distributed training of deep learning models

December 10, 2019

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling

November 30, 2019

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Apache Spark programming

November 29, 2019

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark's streaming capabilities and machine learning APIs.