The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
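To make the "nothing more than noise" point concrete, here is a minimal simulation sketch. It is not taken from the talk. It assumes 1,000 hypothetical comparisons in which both groups are drawn from the same distribution, so every "significant" result is a false positive by construction. The function names, sample sizes, and the use of a Bonferroni correction are illustrative assumptions, not techniques the abstract attributes to the speaker.

```python
import random
from statistics import NormalDist, fmean


def null_p_value(rng, n):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Known unit variance in both groups, so the difference of means
    # has standard error sqrt(2 / n).
    z = (fmean(a) - fmean(b)) / (2.0 / n) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def count_false_positives(num_tests=1000, n=200, alpha=0.05, seed=0):
    """Run num_tests comparisons with no real effect and count 'discoveries'."""
    rng = random.Random(seed)
    p_values = [null_p_value(rng, n) for _ in range(num_tests)]
    naive = sum(p < alpha for p in p_values)                   # no correction
    bonferroni = sum(p < alpha / num_tests for p in p_values)  # corrected threshold
    return naive, bonferroni


if __name__ == "__main__":
    naive, corrected = count_false_positives()
    print(f"uncorrected 'discoveries': {naive} of 1000")
    print(f"after Bonferroni correction: {corrected} of 1000")
```

With no true effect anywhere, roughly 5% of the uncorrected tests should still fall below 0.05, while the corrected threshold should reject almost none. Bonferroni is only one of several possible corrections and is used here because it is the simplest to state.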
Talk Title | The dangers of statistical significance when studying weak effects in big data: From natural experiments to p-hacking |
Speakers | Robert Grossman (University of Chicago) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. When the effect is weak and the dataset is large, however, there are many ways to "discover" an effect that is in fact nothing more than noise. Robert Grossman explores three case studies and shares best practices that make it less likely you will be accused of p-hacking.
The first case study concerns mutations in breast cancer and some of the complexities of understanding rare mutations and combinations of rare mutations. In the second, Robert examines different methods for determining whether exposure of pregnant women to particulate matter (solid and liquid particles suspended in air) affects the health of newborns. The third looks at a well-known published paper offering evidence for ESP.
Robert extracts several techniques from these three case studies that have consistently proved useful and discusses how best to use them in practice. Topics include: