December 24, 2019

641 words · 4 mins read

Faster ML over joins of tables


Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

Talk Title: Faster ML over joins of tables
Speakers: Arun Kumar (University of California, San Diego)
Conference: Strata Data Conference
Conf Tag: Big Data Expo
Location: San Francisco, California
Date: March 26-28, 2019
URL: Talk Page
Slides: Talk Slides

Most relational/tabular datasets in real-world data-driven applications are multitable, connected by key-foreign key (KFK) relationships. Yet almost all ML training tools are designed for single-table data. This disconnect forces ML users to join all base tables and materialize a single table before ML. For example, a recommender system has at least three tables: ratings, users, and products. Building, say, a content-based classifier requires joining all three to materialize one table. Alas, such join materialization can blow up the data in size, wasting memory and storage, while the redundancy the joins introduce inflates ML training runtimes, often by an order of magnitude or more. These slowdowns, in turn, hurt ML user productivity.

First, inspired by database query optimization, Arun shows how to "avoid joins physically" (i.e., not materialize the KFK joins but instead push ML computations down through the joins to the base tables). This technique, factorized ML, can dramatically reduce memory usage and runtimes for several ML methods, such as popular generalized linear models, k-means clustering, and matrix factorization. Crucially, the ML model obtained, including its accuracy, is unaffected. In the recommender system example, this means ML executes directly on the three base tables; a minimal code sketch of the rewrite appears below. Arun explains how this general technique can be realized in various system environments, including in-database ML, ML on Spark, and in-memory R and Python, and he generalizes it to arbitrary ML methods written in bulk matrix algebra. Arun presents software prototypes in both R and Python, including sample code for a few factorized ML methods, to show how ML users can reap these benefits. The technique was adopted or explored for internal use cases by LogicBlox, Microsoft, and Google; Oracle explored it for a banking customer's use case; and Avito of Russia is exploring the Python tool for production ecommerce use cases.

Second, Arun connects learning theory with KFK joins to show that in some cases you can also "avoid joins logically." By this he means a rather radical capability: some of the foreign tables being joined can be ignored outright without significantly reducing ML classifier accuracy. In the recommender system example, you could sometimes skip the products table during training, for instance. Arun explains why this is even possible using the theory of the bias-variance trade-off and discusses the pros and cons for accuracy and interpretability, including how to mitigate the downsides. He distills this analysis into an easy-to-understand decision rule based on the numbers of tuples in the joining tables, which lets ML users quickly decide, given their error tolerance, whether a foreign table can be avoided without even looking into the table's data; a sketch of such a rule also appears below. This technique is even more widely applicable, since it is agnostic to both the ML classifier (linear models, trees, neural networks, etc.) and the system environment, and it has seen adoption in practice by numerous companies, including LogicBlox, Facebook, and MakeMyTrip.
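To make the "push ML computations through joins" idea concrete, here is a minimal NumPy sketch of one gradient step of least-squares linear regression over a two-table KFK join, computed without materializing the join. All table names, sizes, and variable names here are illustrative assumptions for this post, not Arun's actual prototype API:

```python
import numpy as np

# Hypothetical schema: a large "fact" table S (e.g., ratings) with n rows,
# and a small foreign table R (e.g., users) with m rows, joined on a key.
rng = np.random.default_rng(0)
n, m, d_s, d_r = 100_000, 1_000, 4, 6

X_s = rng.normal(size=(n, d_s))   # features stored in S
X_r = rng.normal(size=(m, d_r))   # features stored in R
fk = rng.integers(0, m, size=n)   # foreign key: maps each S row to an R row
y = rng.normal(size=n)            # training labels, stored with S

w_s, w_r = np.zeros(d_s), np.zeros(d_r)

# The materialized approach (what factorized ML avoids) would first build
# X = np.hstack([X_s, X_r[fk]]), an n x (d_s + d_r) matrix full of
# redundant copies of R's rows.

# Factorized predictions: compute partial inner products on each base
# table, then combine only the 1-D results through the key mapping.
pred = X_s @ w_s + (X_r @ w_r)[fk]
err = pred - y

# Factorized gradient: for R's weights, first aggregate the errors per
# R row (a group-by on the key), then multiply by the small table once.
# This equals X_r[fk].T @ err but never forms the n x d_r blow-up.
grad_s = X_s.T @ err / n
grad_r = X_r.T @ np.bincount(fk, weights=err, minlength=m) / n

w_s -= 0.1 * grad_s
w_r -= 0.1 * grad_r
```

The same rewrite pattern (push down, then aggregate by key) is what lets factorized ML return exactly the model the materialized join would produce.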
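The "avoid joins logically" decision rule can be sketched just as compactly. The rule below, keyed on the ratio of tuple counts, is a simplified stand-in with a hypothetical threshold; the actual rule and its tolerance-dependent thresholds come from Arun's bias-variance analysis:

```python
def can_avoid_join(n_fact: int, n_foreign: int,
                   tuple_ratio_threshold: float = 20.0) -> bool:
    """Illustrative tuple-count rule: if the fact table has many more
    tuples than the foreign table, the foreign key column alone tends to
    carry the foreign features' signal, so the foreign table can often be
    dropped with little loss in classifier accuracy. The threshold value
    here is a placeholder encoding the user's error tolerance."""
    return n_fact / n_foreign >= tuple_ratio_threshold

# E.g., 100,000 ratings joined against 1,000 products: the ratio is 100,
# so the products table is a candidate to skip during training.
print(can_avoid_join(n_fact=100_000, n_foreign=1_000))  # True
```

Note that this check reads only the tables' cardinalities, which is what makes it usable "without even looking into the table's data."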
