January 10, 2020

From BI to big data; Or, There and back again

Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query.

Talk Title From BI to big data; Or, There and back again
Speakers Francesco Mucio (Francescomuc.io)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date April 30-May 2, 2019

There is always a point where a growing company has to accept that its infrastructure must change so that it does not hinder further growth. This was also the case for the BI infrastructure at Zalando. To be future-proof, we decided to embrace the cloud, the data lake, and big data. Not so fast. Moving the Business Intelligence team away from the coziness of RDBMS, ACID transactions, and years of experience required a lot of effort; even the most motivated BI engineers can get lost when presented with the skills, tools, and patterns they need. First we had to learn new words to talk with our Big Data colleagues (or even to google things), then we had to learn a new language to explain ourselves, and finally we started building our data pipelines (or are they just ETL processes?). In this presentation Francesco and Alberto will show how to:

  • Identify Bronze, Silver and Gold data and what these labels mean for a BI practitioner.
  • Convert an SQL query to Spark syntax.
  • Process streaming data with Structured Streaming and SparkSQL.
  • Generate surrogate keys in a distributed world.
  • And more.

These are problems we had to tackle early to give our engineers the confidence to step into Spark and the cloud. In very little time they naturally started using Scala alongside SparkSQL. Looking back at our journey, we noticed how much time we could have saved if we had had recommendations or best practices for these problems. We are trying to share them here.
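To make one of the listed problems concrete, here is a minimal sketch of generating surrogate keys without a central sequence. It is not taken from the talk, and the speakers' actual approach may differ. It assumes each partition numbers its own rows and packs the partition id into the upper bits of the key, the same layout Spark documents for `monotonically_increasing_id()`. The function names and the sample customer ids are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from typing import Dict, List

# Assumption: 33 low bits hold the per-partition row counter, mirroring the
# layout Spark documents for monotonically_increasing_id().
ROW_BITS = 33


def surrogate_key(partition_id: int, row_number: int) -> int:
    """Combine a partition id and a local row number into one unique integer."""
    if partition_id < 0:
        raise ValueError("partition_id must be non-negative")
    if not 0 <= row_number < (1 << ROW_BITS):
        raise ValueError("row_number does not fit in ROW_BITS")
    return (partition_id << ROW_BITS) | row_number


def assign_keys(partitions: List[List[str]]) -> Dict[str, int]:
    """Map every natural key to a surrogate key.

    Each partition only needs its own index and a local counter, so no
    coordination between partitions is required.
    """
    keys: Dict[str, int] = {}
    for partition_id, rows in enumerate(partitions):
        for row_number, natural_key in enumerate(rows):
            keys[natural_key] = surrogate_key(partition_id, row_number)
    return keys


# Hypothetical customer ids spread over two partitions.
example = assign_keys([["cust-a", "cust-b"], ["cust-c"]])
```

Keys produced this way are unique but not consecutive, and they depend on how rows are split across partitions, which is a noticeable change for anyone used to a database sequence.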