January 8, 2020


Mastering data with Spark and machine learning



Talk Title Mastering data with Spark and machine learning
Speakers Sonal Goyal (Nube)
Conference Strata Data Conference
Conf Tag Making Data Work
Location London, United Kingdom
Date April 30-May 2, 2019
URL Talk Page
Slides Talk Slides
Video

Enterprise data on customers, vendors, and products is often siloed and represented differently across diverse systems, hurting analytics, compliance, regulatory reporting, and 360-degree views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Further, each source and data type has its own schema and format, data volumes run into millions of records, and linking similar records is a fuzzy, computationally expensive matching exercise, making this a challenging undertaking.

Sonal Goyal offers an overview of the design and architecture of a modern master data application built on Spark, Cassandra, machine learning, and Elastic. The application unifies nontransactional master data across multiple domains, such as customer, organization, and product, and multiple systems, such as ERP, CRM, and the custom applications of different business units, using the Spark Data Source API and machine learning. Sonal explains how the abstraction offered by the Data Source API lets users consume and manipulate the different datasets easily. After the required attributes are aligned, Spark clusters and classifies probable matches using a human-in-the-loop feedback system. The matched and clustered records are persisted to Cassandra and exposed to data stewards through an AJAX-based GUI. The Spark job also indexes the records to Elastic, which lets data stewards query and search clusters more effectively.

Sonal covers the end-to-end flow, design, and architecture of the different components, as well as the per-source and per-type configuration that supports different and unknown datasets and schemas. Along the way, she details the performance gains using Spark, machine learning for data matching and stewardship, and the role of Cassandra and Elastic in the application.
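To make the fuzzy-matching step concrete, here is a minimal stdlib-only sketch of scoring two records from different source systems for a probable match. The field names, sample records, and threshold are illustrative assumptions, not the talk's actual models; in the application described above this comparison would run at scale inside Spark, with the threshold tuned via steward feedback.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict, fields=("name", "city")) -> float:
    """Average per-field similarity across the aligned attributes."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

# The same customer represented differently in two source systems.
crm = {"name": "Acme Corporation", "city": "London"}
erp = {"name": "ACME Corp.", "city": "London"}

score = match_score(crm, erp)
is_probable_match = score > 0.8  # illustrative threshold
print(round(score, 2), is_probable_match)
```

A real pipeline would first block records into candidate groups (to avoid comparing every pair across millions of rows) and feed uncertain scores to the human-in-the-loop review described above.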
