January 25, 2020


Data operations problems created by deep learning and how to fix them (sponsored by MapR)


Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successfully completing deep learning projects, along with solutions, while walking you through a customer use case.

Talk Title Data operations problems created by deep learning and how to fix them (sponsored by MapR)
Speakers Jim Scott (NVIDIA)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 11-13, 2018
URL Talk Page
Slides Talk Slides
Video

The exponential growth in compute available for deep learning has opened the door to creating and testing hundreds or thousands more models than were possible in the past. These models use and generate data in both batch and real-time workloads, for training as well as scoring. As the data becomes enriched and model parameters are explored, there is a real need to version everything, including the data. Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successfully completing projects in this space, along with solutions, while walking you through a customer use case.

The customer started with two input files for their core research area. This quickly grew to more than 75 input files in nine different data formats, including CSV, HDF5, and PKL, among others. Certain formats caused their own problems, and iterating on both the models and the data created a data versioning problem. The total number of models and parameter sets grew rapidly, and when combined with the data versioning issues, frustrations escalated.

As model creation and management advanced, limitations and issues arose around notebook applications like Jupyter, as well as around workflow management for keeping track of an execution pipeline. The total volume of log output grew quickly, and significant data movement was occurring: source data moving to the GPUs, log data back to storage, and then log data out to the machines handling the distributed compute for postmodel analytics that evaluated the performance characteristics of the models. The problems expanded when preparing models for production deployment and adapting them for real time, not just training and testing. Orchestration of the systems was a big problem in the early stages, and it was discovered that more thought was required to accommodate further model development.

Jim concludes with a follow-on covering later model deployment and scoring with canary and decoy models leveraging the rendezvous architecture. This session is sponsored by MapR.
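The versioning pain described above, with 75+ input files in formats such as CSV, HDF5, and PKL and rapidly multiplying parameter sets, can be illustrated with a minimal sketch. The snippet below is not drawn from the talk; all names and the manifest layout are hypothetical. It fingerprints each input file by content hash and appends those hashes, together with the run's parameter set, to a JSON-lines manifest, so a model run can later be matched to the exact data it consumed.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Content hash of a single input file, used as its version identifier."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_run(data_dir: str, params: dict, manifest_path: str = "runs.jsonl") -> dict:
    """Append one training-run record: input file versions plus the parameter set."""
    inputs = {
        str(p): file_fingerprint(p)
        for p in sorted(Path(data_dir).glob("*"))
        if p.is_file() and p.suffix.lower() in {".csv", ".hdf5", ".h5", ".pkl"}
    }
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "params": params,
    }
    with open(manifest_path, "a") as out:
        out.write(json.dumps(record) + "\n")
    return record


# Example: log a run before training so data and parameters can be matched later.
# record_run("data/", {"learning_rate": 1e-3, "batch_size": 64, "epochs": 20})
```

A content-addressed manifest like this is only one possible approach; dedicated data-versioning tools or a versioned file system (as MapR provides) solve the same problem at larger scale.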
