Learning from incomplete, imperfect data with probabilistic programming
Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
| Talk Title | Learning from incomplete, imperfect data with probabilistic programming |
| Speakers | Mike Lee Williams (Cloudera Fast Forward Labs) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 14-16, 2017 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Michael begins by introducing Bayesian inference and using the approach to solve a famous problem, the German Tank Problem, in three lines of code. (The code is so simple that you won’t need to be a programmer or a mathematician to understand it; a sketch of this style of solution appears after this abstract.) Michael then offers an overview of two working Fast Forward Labs product prototypes that crucially depend on Bayesian inference: one that supports decisions about consumer loans and one that models the future of the NYC real estate market. These prototypes highlight the advantages and use cases of the Bayesian approach, which include domains where data is scarce, where prior institutional knowledge is important, and where quantifying risk is crucial.

But as you’ll see, this naive approach to implementing Bayesian inference has serious limitations and is only useful for tiny problems. Michael explores the challenges involved in speeding it up and shares solutions ranging from classics like Metropolis-Hastings and other Markov chain Monte Carlo (MCMC) methods to modern industrial-strength algorithms like NUTS and ADVI (a minimal Metropolis-Hastings sketch appears below). These algorithms are complicated, and implementing them so that they give the right answer quickly is difficult.

That difficulty brings us to the real subject of this talk: probabilistic programming, a family of languages that define fundamental probabilistic ideas such as random variables and probability distributions as primitive objects, which makes code short, simple, and declarative. These languages also come with expert-written, blazing-fast implementations of the latest and greatest inference algorithms built right in. Michael examines a handful of probabilistic programming languages, taking a particularly close look at Stan and PyMC3, comparing their performance and deployment trade-offs and showing how the German Tank Problem and the consumer loan and NYC real estate problems could be solved using them.
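To make the "three lines of code" claim concrete, here is a minimal sketch of a naive grid-approximation solution to the German Tank Problem in Python. The observed serial numbers and the upper bound of the grid are illustrative assumptions, not data from the talk.

```python
import numpy as np

# Illustrative observed serial numbers (an assumption, not the talk's data).
observed = np.array([10, 256, 202, 97])

# Hypotheses for N, the total number of tanks; 10,000 is an assumed upper bound.
N = np.arange(observed.max(), 10_000)

# Likelihood of k serials drawn uniformly from 1..N is (1/N)^k; normalize to get the posterior.
posterior = (1.0 / N) ** len(observed)
posterior /= posterior.sum()

print("Posterior mean of N:", (N * posterior).sum())
```

The core computation really is about three lines: enumerate the hypotheses, weight each by its likelihood, and normalize. Its cost grows with the size of the grid, which is exactly the scaling limitation the abstract goes on to describe.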
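For the sampling algorithms mentioned above, here is a minimal random-walk Metropolis-Hastings sketch that targets the same posterior. It treats N as continuous for simplicity; the proposal scale, iteration count, and burn-in length are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = np.array([10, 256, 202, 97])  # illustrative data, as above

def log_post(n):
    # Log posterior up to a constant: (1/n)^k when n >= max(observed), else impossible.
    return -np.inf if n < observed.max() else -len(observed) * np.log(n)

samples, n = [], float(observed.max())
for _ in range(50_000):
    proposal = n + rng.normal(scale=50.0)  # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio), computed in log space.
    if np.log(rng.uniform()) < log_post(proposal) - log_post(n):
        n = proposal
    samples.append(n)

print("Posterior mean of N:", np.mean(samples[5_000:]))  # discard burn-in
```

Unlike the grid approach, the sampler never enumerates every hypothesis, which is what lets MCMC methods scale to models with many parameters. Tuning proposals and diagnosing convergence is where the difficulty lies; modern algorithms like NUTS largely automate those choices.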
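Finally, a sketch of how the same model might look in PyMC3, one of the two languages the talk examines closely. The model follows directly from the problem statement; the prior's upper bound and the sampler settings are assumptions for illustration, not the talk's code.

```python
import pymc3 as pm

observed = [10, 256, 202, 97]  # illustrative serial numbers, as above

with pm.Model():
    # Prior over the total number of tanks; the upper bound is an assumption.
    n = pm.DiscreteUniform("n", lower=max(observed), upper=10_000)
    # Each observed serial number is a uniform draw from 1..n.
    pm.DiscreteUniform("serials", lower=1, upper=n, observed=observed)
    # n is discrete, so PyMC3 assigns a Metropolis step rather than NUTS here.
    trace = pm.sample(5_000, tune=1_000)

print("Posterior mean of N:", trace["n"].mean())
```

The model is declarative: it states the prior and the likelihood and leaves the choice and implementation of the inference algorithm to the library, which is the core promise of probabilistic programming described above.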