Automated data exploration: Building efficient analysis pipelines with dask
Data exploration usually entails making endless one-use exploratory plots. Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Victor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.
| Talk Title | Automated data exploration: Building efficient analysis pipelines with dask |
|---|---|
| Speakers | Victor Zabalza (ASI Data Science) |
| Conference | Strata Data Conference |
| Conf Tag | Making Data Work |
| Location | London, United Kingdom |
| Date | May 23-25, 2017 |
The first step in any data science project is understanding the available data. To this end, data scientists spend a significant part of their time carrying out data quality assessments and data exploration. Although this is a crucial step, it usually means repeating a series of menial tasks before the data scientist gains an understanding of the dataset and can progress to the next stages of the project.

Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets, built to overcome this drudge work, enabling efficient data exploration and kickstarting data science projects. The tool generates a summary for each dataset that includes:

- general information about the dataset, including the data quality of each column;
- the distribution of each column through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables;
- a 2D distribution between pairs of columns;
- a correlation coefficient matrix for all numerical columns.

Victor explains how building this tool has provided a unique view into the full Python data stack, from the parallelized analysis of a data frame within a dask custom execution graph to interactive visualization with Jupyter widgets and Plotly. He also explains why it will become essential in the first steps of every data science project, cutting down the time data scientists spend making one-use exploratory graphs and getting them to insights from the data more quickly.
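To make the idea of a dask custom execution graph concrete, here is a minimal, hypothetical sketch (not the actual package's API). Dask represents computations as plain dictionaries mapping keys to `(function, *args)` tuples, where string arguments naming other keys are resolved first; the toy scheduler below walks such a graph to produce per-column summary statistics of the kind the talk describes. All function and key names here are illustrative assumptions, and only the standard library is used.

```python
from statistics import mean, stdev

def null_fraction(col):
    # Fraction of missing (None) values in a column: a basic data quality check.
    return sum(v is None for v in col) / len(col)

def clean(col):
    # Drop missing values so distribution statistics can be computed.
    return [v for v in col if v is not None]

def summary(col):
    # Basic distribution statistics for one numeric column.
    return {"mean": mean(col), "stdev": stdev(col),
            "min": min(col), "max": max(col)}

def execute(graph, key):
    # Minimal recursive scheduler for a dask-style graph: each task is a
    # (function, *args) tuple; string args that name other keys are
    # evaluated first, so shared inputs are expressed once and reused.
    task = graph[key]
    if isinstance(task, tuple):
        func, *args = task
        resolved = [execute(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task

# A toy column with one missing value, and the graph describing its summary.
column = [1.0, 2.0, None, 4.0, 5.0]
graph = {
    "raw": column,
    "nulls": (null_fraction, "raw"),
    "clean": (clean, "raw"),
    "stats": (summary, "clean"),
}

print(execute(graph, "nulls"))   # fraction of missing values -> 0.2
print(execute(graph, "stats"))
```

In the real tool, dask's own scheduler plays the role of `execute`, running independent branches of the graph in parallel across the columns of a data frame, and the resulting statistics feed the interactive Jupyter-widget plots.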