Shifting left for continuous quality in an Agile data world
Data warehouses are critical in driving business decisions, with SQL the dominant language for building ETL pipelines. While the technology has shifted from RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.
|Talk Title||Shifting left for continuous quality in an Agile data world|
|Speakers||Avinash Padmanabhan (Intuit)|
|Conference||Strata + Hadoop World|
|Conf Tag||Big Data Expo|
|Location||San Jose, California|
|Date||March 14-16, 2017|
Adoption of data pipelines based on Hadoop and massively parallel processing (MPP) databases like Vertica and Redshift has risen sharply, but the journey of automated testing in these pipelines and other big data projects has been rough. To a large extent, the business logic is implemented in SQL scripts, and quality checks on those scripts have so far been manual. Unit testing is nonexistent, and other excellence metrics, such as code coverage for SQL scripts, are not clearly defined. A further obstacle is that most data engineers and analysts are more comfortable with SQL than with languages like Java or Python, which have established testing standards. If you are building a data pipeline, you should bake in these engineering best practices to ensure that it has the best possible business impact.
Avinash Padmanabhan describes how his team at Intuit is driving change in the way it builds and tests extract-transform-load (ETL) jobs. Avinash presents an automation solution that both data and quality engineers can use to build quality into the data pipeline. He explains how to use Docker to virtualize end-to-end data infrastructure pipelines inside local development environments with low overhead and faster feedback, so that problems can be fixed early in development rather than late in the QA stage or, worse, in production.
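To make the idea of unit testing a SQL script concrete, here is a minimal sketch in Python. It is not the solution presented in the talk: it assumes an in-memory SQLite database as a lightweight stand-in for a local (for example, Docker-hosted) warehouse, and the `orders` and `daily_revenue` tables, the transformation SQL, and the helper names are all hypothetical.

```python
import sqlite3

# Hypothetical ETL step under test: aggregate completed orders into daily revenue.
TRANSFORM_SQL = """
INSERT INTO daily_revenue (customer_id, order_date, revenue)
SELECT customer_id, order_date, SUM(amount)
FROM orders
WHERE status = 'completed'
GROUP BY customer_id, order_date
"""


def make_test_db():
    """Create an empty in-memory database with the (assumed) source and target tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            customer_id INTEGER, order_date TEXT, amount REAL, status TEXT
        );
        CREATE TABLE daily_revenue (
            customer_id INTEGER, order_date TEXT, revenue REAL
        );
        """
    )
    return conn


def run_case(order_rows):
    """Load fixture rows, run the transformation, and return the target table contents."""
    conn = make_test_db()
    try:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", order_rows)
        conn.execute(TRANSFORM_SQL)
        return conn.execute(
            "SELECT customer_id, order_date, revenue "
            "FROM daily_revenue ORDER BY customer_id, order_date"
        ).fetchall()
    finally:
        conn.close()


def test_transform_ignores_cancelled_orders():
    """Cancelled orders must not contribute to revenue; completed ones are summed."""
    result = run_case(
        [
            (1, "2017-03-14", 10.0, "completed"),
            (1, "2017-03-14", 5.0, "completed"),
            (1, "2017-03-14", 99.0, "cancelled"),
            (2, "2017-03-15", 7.5, "completed"),
        ]
    )
    assert result == [(1, "2017-03-14", 15.0), (2, "2017-03-15", 7.5)]


if __name__ == "__main__":
    test_transform_ignores_cancelled_orders()
    print("ok")
```

Because each test builds its own small fixture and runs in seconds on a developer machine, a failure surfaces before the script reaches QA. Against a real MPP database, the same pattern would need that database's own driver and SQL dialect in place of SQLite.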