Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops
Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.
| Talk Title | Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops |
| Speakers | Jim Dowling (Logical Clocks) |
| Conference | Strata Data Conference |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | May 22-24, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
State-of-the-art deep learning systems at hyperscale AI companies attack the toughest problems with distributed deep learning. Distributed deep learning helps both AI researchers and practitioners be more productive and enables the training of models that would be intractable on a single GPU server. Hadoop provides the platform support needed for distributed deep learning with TensorFlow and Spark, offering a unified feature store and resource management for GPUs.

Jim Dowling explores recent developments in supporting distributed deep learning on Hadoop, in particular Hops, a distribution of Hadoop with support for distributed metadata. Jim discusses the need for better support for Python and for GPUs as a schedulable resource and demonstrates how to build a feature store with Hive, Kafka, and Spark. Jim also explains why on-premises distributed deep learning is gaining traction and how commodity GPUs provide lower-cost access to massive amounts of GPU resources.

Distributed deep learning can both massively reduce training time and enable parallel experimentation, such as large-scale hyperparameter optimization. Jim offers an overview of recent transformative open source TensorFlow frameworks that leverage Apache Spark to manage distributed training, including Yahoo's TensorFlowOnSpark, Uber's Horovod, and Hops's tfspark. These frameworks reduce both training time and neural network development time by running parallel experiments on different models across hundreds of GPUs, as is typically done in hyperparameter sweeps.
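At its simplest, the feature store pattern the talk demonstrates amounts to computing features with Spark and persisting them to a Hive-backed table that training jobs can later read. The following is a minimal sketch under that assumption; the database, table, and column names (`raw.click_events`, `feature_store.user_features`, and so on) are illustrative placeholders, not the schema used in the talk.

```python
# Minimal sketch: engineer features with Spark SQL and persist them to a
# Hive-backed table acting as a simple feature store. Table, database, and
# column names are hypothetical, not the talk's actual schema.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("feature-store-sketch")
         .enableHiveSupport()  # requires access to a Hive metastore
         .getOrCreate())

# Raw events, e.g. previously ingested from Kafka into a Hive/Parquet table.
events = spark.table("raw.click_events")

# Aggregate per-user features.
user_features = (events
                 .groupBy("user_id")
                 .agg(F.count("*").alias("num_clicks_7d"),
                      F.countDistinct("item_id").alias("distinct_items_7d"),
                      F.max("event_time").alias("last_active")))

# Persist to the feature store; training jobs later read it back with
# spark.table("feature_store.user_features").
(user_features.write
 .mode("overwrite")
 .saveAsTable("feature_store.user_features"))

spark.stop()
```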
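Frameworks like Horovod replace a single-GPU training loop with data-parallel workers that average gradients via ring-allreduce. The sketch below uses Horovod's Keras bindings to illustrate the idea; the model and dataset are placeholders, and exact APIs vary across Horovod and TensorFlow versions, so treat it as a rough outline rather than the setup shown in the talk.

```python
# Minimal sketch of Horovod-style data-parallel training with Keras.
# Assumes Horovod is installed and the job is launched with one process
# per GPU, e.g. `horovodrun -np 4 python train_hvd.py`.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to a single local GPU.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate with the number of workers and wrap the
# optimizer so gradients are averaged with ring-allreduce.
optimizer = tf.keras.optimizers.Adam(1e-3 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(x_train, y_train,
          epochs=3,
          verbose=1 if hvd.rank() == 0 else 0,  # log only on rank 0
          callbacks=callbacks)
```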
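Parallel experimentation can also be expressed directly in Spark: each hyperparameter configuration becomes one task, so a cluster with a GPU per executor evaluates the whole grid concurrently. This is a hedged sketch of that idea, not the Hops/tfspark API itself; the grid, the `train_and_evaluate` helper, and the tiny MNIST model are hypothetical stand-ins.

```python
# Minimal sketch: fan a hyperparameter grid out across Spark executors,
# train one small Keras model per configuration, and collect validation
# scores on the driver. Grid and model are illustrative placeholders.
import itertools
from pyspark.sql import SparkSession


def train_and_evaluate(params):
    """Train one model for a single hyperparameter configuration."""
    import tensorflow as tf  # imported on the executor

    learning_rate, num_units = params
    (x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
    x_train, x_val = x_train / 255.0, x_val / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(num_units, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_val, y_val, verbose=0)
    return {"learning_rate": learning_rate, "units": num_units,
            "val_accuracy": float(accuracy)}


if __name__ == "__main__":
    spark = SparkSession.builder.appName("hyperparam-sweep").getOrCreate()
    grid = list(itertools.product([1e-2, 1e-3, 1e-4], [64, 128, 256]))

    # Each grid point becomes one Spark task; with one GPU per executor,
    # the nine configurations train in parallel rather than sequentially.
    results = (spark.sparkContext
               .parallelize(grid, numSlices=len(grid))
               .map(train_and_evaluate)
               .collect())

    best = max(results, key=lambda r: r["val_accuracy"])
    print("Best configuration:", best)
    spark.stop()
```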