Effective sampling methods within TensorFlow input functions
Many real-world machine learning applications require generative or reductive sampling of data. Laxmi Prajapat and William Fletcher demonstrate sampling techniques applied to training and testing data directly inside the input function using the tf.data API.
|Talk Title|Effective sampling methods within TensorFlow input functions|
|Speakers|Laxmi Prajapat (Datatonic), William Fletcher (Datatonic)|
|Conference|O’Reilly TensorFlow World|
|Location|Santa Clara, California|
|Date|October 28-31, 2019|
Many real-world machine learning applications require generative or reductive sampling of data. At training time, sampling may address class imbalance (e.g., the rarity of positives in a binary classification problem or a sparse user-item interaction matrix), augment the data stored on file, or simply improve efficiency.
Laxmi Prajapat and William Fletcher explore several sampling techniques in the context of recommender systems, using tools available in the tf.data API, and detail which methods suit particular data and hardware demands. They present quantitative results, along with a closer examination of potential pros and cons.
The naive approach, a precomputed subsample of the data, makes for a fast input function, but the subsample is fixed once it has been computed. Taking advantage of random samples requires doing the sampling inside the input function itself. Laxmi and William consider how to select from a large dataset containing all possible inputs, and how to generate samples in memory using tf.random, exploiting hash tables where appropriate. These methods grant additional flexibility and reduce data preparation workloads.
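To make the idea of in-memory sampling concrete, here is a minimal sketch of negative sampling for a recommender inside a tf.data input function. It is not the speakers' implementation: the catalogue size `NUM_ITEMS`, the toy interaction data, and the helper names `pair_key`, `add_negatives`, and `input_fn` are all assumptions made for illustration. Candidate items are drawn with `tf.random.uniform`, and a `tf.lookup.StaticHashTable` filters out pairs that are already known positives.

```python
import tensorflow as tf

NUM_ITEMS = 1000  # hypothetical catalogue size (assumption)

# Hypothetical observed (user, item) interactions, i.e. the positives on file.
users = tf.constant([0, 0, 1, 2], dtype=tf.int64)
items = tf.constant([10, 42, 7, 42], dtype=tf.int64)


def pair_key(user, item):
    # Encode a (user, item) pair as one int64 so membership is a single lookup.
    return user * NUM_ITEMS + item


# Hash table of known positives: lookup returns 1 if seen, 0 otherwise.
seen = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=pair_key(users, items),
        values=tf.ones_like(users),
    ),
    default_value=0,
)


def add_negatives(user, item, num_neg=4):
    # Draw candidate negative items uniformly at random, in memory.
    candidates = tf.random.uniform([num_neg], maxval=NUM_ITEMS, dtype=tf.int64)
    user_rep = tf.fill([num_neg], user)
    # Reject candidates that collide with a known positive for this user.
    is_new = tf.equal(seen.lookup(pair_key(user_rep, candidates)), 0)
    neg_users = tf.boolean_mask(user_rep, is_new)
    neg_items = tf.boolean_mask(candidates, is_new)
    # Emit the positive (label 1) followed by its sampled negatives (label 0).
    all_users = tf.concat([tf.expand_dims(user, 0), neg_users], axis=0)
    all_items = tf.concat([tf.expand_dims(item, 0), neg_items], axis=0)
    labels = tf.concat(
        [tf.ones([1], tf.float32), tf.zeros_like(neg_items, dtype=tf.float32)],
        axis=0,
    )
    features = {"user": all_users, "item": all_items}
    return tf.data.Dataset.from_tensor_slices((features, labels))


def input_fn(batch_size=8):
    ds = tf.data.Dataset.from_tensor_slices((users, items))
    ds = ds.shuffle(buffer_size=4)
    ds = ds.flat_map(add_negatives)  # expand each positive into 1 + k examples
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Because the random draw happens inside the pipeline, the negatives should differ on each pass over the data, which a precomputed subsample cannot offer. The trade-off in this sketch is that rejected candidates are simply dropped, so a positive may receive fewer than `num_neg` negatives.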