January 21, 2020

Leveraging Spark and deep learning frameworks to understand data at scale

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.


Talk Title	Leveraging Spark and deep learning frameworks to understand data at scale
Speakers	Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018
URL	Talk Page
Slides	Talk Slides

The increasing complexity of learning algorithms and deep neural networks, combined with the size of the data and the number of parameters, has made it challenging to exploit existing large-scale data processing pipelines for training and inference. Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. You’ll explore different tools and frameworks, ranging from Spark for preprocessing to deep learning frameworks for training and inference, with attention to how the nuances of each dataset affect algorithm optimization techniques, framework choice, and scale.

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL

January 21, 2020

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to use BigDL, a distributed deep learning framework, on Apache Spark to predict a candidate's probability of applying to specific jobs based on their résumé.

Using Kubernetes to Offer Scalable Deep Learning on Alibaba Cloud

January 8, 2020

Running deep learning (DL) jobs requires an end-to-end workflow to accelerate model training iteratively. It must be scalable on massive data and computational resources, and be framework-agnostic to rel …

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

December 5, 2019

Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense

December 5, 2019

Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system's overall results.

Using Siamese CNNs for removing duplicate entries from real estate listing databases

December 4, 2019

Aggregating geospecific real estate databases produces duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach that identifies duplicate entries by analyzing the images that accompany real estate listings, using a transfer learning Siamese architecture based on the VGG-16 CNN topology.

Cuttlefish: Lightweight primitives for online tuning

November 28, 2019

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.