December 5, 2019

316 words 2 mins read

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

Distributed deep learning can increase the productivity of AI practitioners and reduce the time to market for trained models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed deep learning applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

Talk Title: Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops
Speakers: Jim Dowling (Logical Clocks)
Conference: Strata Data Conference
Conference Tag: Making Data Work
Location: London, United Kingdom
Date: May 22-24, 2018
URL: Talk Page
Slides: Talk Slides

State-of-the-art deep learning systems at hyperscale AI companies attack the toughest problems with distributed deep learning. Distributed deep learning systems help both AI researchers and practitioners be more productive and enable the training of models that would be intractable on a single GPU server. Hadoop provides the platform support needed for distributed deep learning with TensorFlow and Spark, offering a unified feature store and resource management for GPUs.

Jim Dowling explores recent developments in supporting distributed deep learning on Hadoop, in particular Hops, a distribution of Hadoop with support for distributed metadata. Jim discusses the need for better support for Python and for GPUs as a managed resource and demonstrates how to build a feature store with Hive, Kafka, and Spark. Jim also explains why on-premises distributed deep learning is gaining traction and how commodity GPUs provide lower-cost access to massive amounts of GPU resources.

Distributed deep learning can both massively reduce training time and enable parallel experimentation through large-scale hyperparameter optimization. Jim offers an overview of recent transformative open source TensorFlow frameworks that leverage Apache Spark to manage distributed training, such as Yahoo's TensorFlowOnSpark, Uber's Horovod platform, and Hops's tfspark. These frameworks reduce both training time and neural network development time through parallel experimentation on different models across hundreds of GPUs, as is typically done in hyperparameter sweeps.
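The parallel-experimentation pattern described above can be sketched in a few lines of PySpark: the driver enumerates hyperparameter combinations and Spark fans out one training task per combination, each of which would run on its own GPU in a GPU-aware cluster such as Hops. The snippet below is a minimal illustration only, not the Hops, tfspark, or TensorFlowOnSpark API; the toy model, synthetic data, and grid values are placeholders.

```python
# A minimal sketch (assumed names, toy model, synthetic data) of Spark-driven
# parallel hyperparameter experimentation: one Spark task per hyperparameter
# combination, each training an independent TensorFlow model.
import itertools

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession


def train_one(params):
    """Train a small Keras model for one (learning_rate, units) combination."""
    learning_rate, units = params

    # Synthetic data stands in for features read from a real feature store.
    x_train = np.random.rand(1024, 784).astype("float32")
    y_train = np.random.randint(0, 10, size=(1024,))

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    history = model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)
    return {"lr": learning_rate, "units": units,
            "accuracy": float(history.history["accuracy"][-1])}


if __name__ == "__main__":
    spark = SparkSession.builder.appName("hyperparam-sweep").getOrCreate()

    # Cartesian product of the hyperparameter values to explore in parallel.
    grid = list(itertools.product([1e-2, 1e-3, 1e-4], [64, 128, 256]))

    # One Spark task per combination; on a GPU-aware cluster (e.g. Hops with
    # GPUs as a scheduled resource), each task would land on its own GPU.
    results = (spark.sparkContext
               .parallelize(grid, numSlices=len(grid))
               .map(train_one)
               .collect())

    best = max(results, key=lambda r: r["accuracy"])
    print("best configuration:", best)
    spark.stop()
```

Frameworks such as Horovod address the complementary case: instead of running independent experiments in parallel, they distribute a single training job across many GPUs using ring-allreduce to synchronize gradients.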
