December 5, 2019

316 words 2 mins read

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

Distributed deep learning can increase the productivity of AI practitioners and reduce the time to market for trained models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed deep learning applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

Talk Title: Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops
Speakers: Jim Dowling (Logical Clocks)
Conference: Strata Data Conference
Conference Tag: Making Data Work
Location: London, United Kingdom
Date: May 22-24, 2018
URL: Talk Page
Slides: Talk Slides

State-of-the-art deep learning systems at hyperscale AI companies attack the toughest problems with distributed deep learning. Distributed deep learning systems help both AI researchers and practitioners be more productive and enable the training of models that would be intractable on a single GPU server. Hadoop provides the platform support needed for distributed deep learning with TensorFlow and Spark, offering a unified feature store and resource management for GPUs.

Jim Dowling explores recent developments in supporting distributed deep learning on Hadoop, in particular Hops, a distribution of Hadoop with support for distributed metadata. Jim discusses the need for better support for Python and for GPUs as a managed resource and demonstrates how to build a feature store with Hive, Kafka, and Spark. Jim also explains why on-premises distributed deep learning is gaining traction and how commodity GPUs provide lower-cost access to massive amounts of GPU resources.

Distributed deep learning can both massively reduce training time and enable parallel experimentation through large-scale hyperparameter optimization. Jim offers an overview of recent transformative open source TensorFlow frameworks that leverage Apache Spark to manage distributed training, such as Yahoo's TensorFlowOnSpark, Uber's Horovod platform, and Hops's tfspark. These frameworks reduce both training time and neural network development time through parallel experimentation on different models across hundreds of GPUs, as is typically done in hyperparameter sweeps.
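The parallel-experimentation pattern described above can be sketched in a few lines of PySpark: the driver enumerates hyperparameter combinations and Spark fans out one training task per combination, each of which would run on its own GPU in a GPU-aware cluster such as Hops. The snippet below is a minimal illustration only, not the Hops, tfspark, or TensorFlowOnSpark API; the toy model, synthetic data, and grid values are placeholders.

```python
# A minimal sketch (assumed names, toy model, synthetic data) of Spark-driven
# parallel hyperparameter experimentation: one Spark task per hyperparameter
# combination, each training an independent TensorFlow model.
import itertools

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession


def train_one(params):
    """Train a small Keras model for one (learning_rate, units) combination."""
    learning_rate, units = params

    # Synthetic data stands in for features read from a real feature store.
    x_train = np.random.rand(1024, 784).astype("float32")
    y_train = np.random.randint(0, 10, size=(1024,))

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)
    return {"lr": learning_rate, "units": units,
            "accuracy": float(history.history["accuracy"][-1])}


if __name__ == "__main__":
    spark = SparkSession.builder.appName("hyperparam-sweep").getOrCreate()

    # Cartesian product of the hyperparameter values to explore in parallel.
    grid = list(itertools.product([1e-2, 1e-3, 1e-4], [64, 128, 256]))

    # One Spark task per combination; on a GPU-aware cluster (e.g. Hops with
    # GPUs as a scheduled resource), each task would land on its own GPU.
    results = (spark.sparkContext
               .parallelize(grid, numSlices=len(grid))
               .map(train_one)
               .collect())

    best = max(results, key=lambda r: r["accuracy"])
    print("best configuration:", best)
    spark.stop()
```

Frameworks such as Horovod address the complementary case: instead of running independent experiments in parallel, they distribute a single training job across many GPUs using ring-allreduce to synchronize gradients.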
