December 5, 2019

Automated data exploration: Building efficient analysis pipelines with dask

Data exploration usually entails making endless one-use exploratory plots. Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Victor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.


Talk Title	Automated data exploration: Building efficient analysis pipelines with dask
Speakers	Victor Zabalza (ASI Data Science)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 23-25, 2017

The first step in any data science project is understanding the available data. To this end, data scientists spend a significant part of their time on data quality assessment and data exploration. Although this step is crucial, it usually means repeating a series of menial tasks before the data scientist understands the dataset and can move on to the next steps in the project.

Victor Zabalza shares a Python package, based on dask execution graphs and interactive visualization in Jupyter widgets, built to overcome this drudge work, enabling efficient data exploration and kickstarting data science projects. The tool generates a summary for each dataset that includes general information about the dataset, including the data quality of each column; the distribution of each column through statistics and plots (histogram, CDF, KDE), optionally grouped by other categorical variables; a 2D distribution between pairs of columns; and a correlation coefficient matrix for all numerical columns.

Victor explains how building this tool has provided a unique view into the full Python data stack, from the parallelized analysis of a data frame within a custom dask execution graph to interactive visualization with Jupyter widgets and Plotly. He also explains why it will become essential in the first steps of every data science project: it cuts down the time data scientists spend making one-use exploratory graphs and gets them to deriving insights from the data more quickly.
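To make the idea of a per-dataset summary computed through an execution graph more concrete, here is a minimal, dependency-free sketch. It is not the package's actual implementation: every function name (`count_missing`, `build_graph`, `run`, `summarize`), the key naming scheme, and the sample data are hypothetical. The graph is laid out as a dict of tuples in the style of dask's task graphs, but it is evaluated here by a small stand-in function rather than a dask scheduler, and it covers only missing-value counts, means, histograms, and pairwise Pearson correlation.

```python
import math


def count_missing(values):
    """Data-quality check: number of missing (None) entries."""
    return sum(v is None for v in values)


def drop_missing(values):
    return [v for v in values if v is not None]


def mean(values):
    return sum(values) / len(values)


def histogram(values, bins):
    """Equal-width bin counts between the column's min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation over rows where both values are present.

    Assumes neither column is constant (otherwise the denominator is zero).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = mean([x for x, _ in pairs])
    my = mean([y for _, y in pairs])
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)


def build_graph(columns, bins=2):
    """Build a {key: data or (function, *args)} task graph for all columns."""
    graph = {}
    names = sorted(columns)
    for name in names:
        graph[name] = columns[name]  # raw column data
        graph[name + "-missing"] = (count_missing, name)
        graph[name + "-clean"] = (drop_missing, name)
        graph[name + "-mean"] = (mean, name + "-clean")
        graph[name + "-hist"] = (histogram, name + "-clean", bins)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            graph["corr-" + a + "-" + b] = (pearson, a, b)
    return graph


def run(graph, key, cache):
    """Stand-in evaluator: resolve a key, computing each task at most once."""
    if key not in cache:
        task = graph[key]
        if isinstance(task, tuple) and task and callable(task[0]):
            func, *args = task
            resolved = [
                run(graph, a, cache) if isinstance(a, str) and a in graph else a
                for a in args
            ]
            cache[key] = func(*resolved)
        else:
            cache[key] = task
    return cache[key]


def summarize(columns, bins=2):
    graph = build_graph(columns, bins)
    cache = {}  # shared so "-clean" results are reused by mean and histogram
    return {key: run(graph, key, cache) for key in graph if key not in columns}


# Hypothetical sample data, with one missing value in "age".
columns = {"age": [20, 30, None, 40], "income": [1.0, 2.0, 2.5, 3.0]}
summary = summarize(columns)
print(summary["age-missing"], summary["age-mean"], summary["age-hist"])
print(summary["corr-age-income"])
```

Expressing the summary as a graph means shared intermediate results, such as the cleaned column, are computed once and reused, and independent tasks (one per column or column pair) are the kind of work a parallel scheduler could run concurrently.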

Tags: automated, dataset, data science, dask, jupyter, visualization, python, pipeline

Developer on the rise: Blurring the line between developer and data scientist with PixieDust

November 26, 2019

Ready to dip your toe into data science? Va Barbosa explains why you should start with notebooks and PixieDust, a new open source library that helps data scientists and developers work more efficiently in Jupyter Notebook and Apache Spark.

A contextual real-time bidding engine for search engine marketing

November 10, 2019

Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot.

Making architecture choices for small and big data problems

November 4, 2019

Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.

Shifting left for continuous quality in an Agile data world

November 2, 2019

Data warehouses are critical in driving business decisions, with SQL predominantly used to build ETL pipelines. While the technology has shifted from RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.

The enterprise geospatial platform: A perfect fusion of cloud and open source technologies

November 1, 2019

Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies.

From data dinosaurs to data stars in five weeks: Lessons from completing 80 data science projects

December 3, 2019

More organizations are becoming aware of the value of data and want to get started and scale up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization.