How a Spark-based feature store can accelerate big data adoption in financial services
Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced (semantic data integration within a data lake, high-performance feature engineering, and metadata governance) and explain how they overcame them.
|Talk Title||How a Spark-based feature store can accelerate big data adoption in financial services|
|Conference||Strata + Hadoop World|
|Conf Tag||Make Data Work|
|Location||New York, New York|
|Date||September 27-29, 2016|
One of the ways to drive enterprise adoption of big data in financial services is to have a central standardized, reusable, transparent, and well-governed library of features (or metrics) that will empower data scientists and business analysts across a range of business problems. This is the central idea behind a feature store: a library of documented features for various analyses based on a shared data model that spans a wide variety of data sources resident within a bank’s data lake. Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, outline three challenges they faced (semantic data integration within a data lake, high-performance feature engineering, and metadata governance) and explain how they overcame them.

The first challenge of building such a feature store is to project the data in a data lake into a common conceptual data model and then generate features from that model. The combination of data variety, formal analytical models, and long project cycles in financial services suggests that applying data modeling to data lakes should yield significant advantages, both as a shared understanding of the domain-specific semantic ontology and as an extensible data integration framework. In the discussed use case, the feature store was powered by one such semantically integrated data model for retail banking.

The second challenge is to enable high-performance feature engineering at a customer level on top of the conceptual data model. There’s significant benefit to partitioning data at the customer level so that calculations don’t incur cross-node chatter on the network. Kaushik and Phil also had to provide an API through which data scientists could access the data model and create parameterized features. To accomplish these objectives, they developed an ETL pipeline in Spark that stored the instance data in Hadoop as a distributed collection of partitioned structured objects per customer.
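The idea of customer-partitioned objects with parameterized features can be sketched in plain Python. This is an illustrative simplification, not the speakers' implementation: the `CustomerRecord` type and `avg_spend` feature are hypothetical names, and in the talk's setting the map over customers would run as a Spark job against partitioned objects rather than a local list.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: each customer's data is materialized as one
# self-contained object, so feature calculations stay within a single
# partition and never require cross-node communication.
@dataclass
class CustomerRecord:
    customer_id: str
    transactions: list = field(default_factory=list)  # (amount, category) pairs

def avg_spend(customer, category=None):
    """Parameterized feature: average transaction amount,
    optionally restricted to one spending category."""
    amounts = [amt for amt, cat in customer.transactions
               if category is None or cat == category]
    return sum(amounts) / len(amounts) if amounts else 0.0

# In production this would be a Spark transformation over the distributed
# collection of customer objects; a local list stands in for it here.
customers = [
    CustomerRecord("c1", [(100.0, "grocery"), (50.0, "fuel")]),
    CustomerRecord("c2", [(200.0, "grocery")]),
]
features = {c.customer_id: avg_spend(c, category="grocery") for c in customers}
```

Because each feature function touches only one customer object, the same code parallelizes trivially once the objects are distributed across partitions.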
They then provided a parallelizable Spark API to access these structured customer objects.

The third challenge is enforcing business metadata governance on the feature store. The agility of analytics and data democratization that a high-performing feature store can unleash has to be balanced by sound metadata governance to prevent complete analytical anarchy. Regulatory pressures make this a necessity. In particular, data lineage, audits, and version control of source code have to be baked into the feature development workflows within the feature store.
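One way to bake lineage, audits, and code versioning into a feature workflow is to record them at registration time. The sketch below is a hypothetical illustration, not the speakers' system: `register_feature`, the registry structure, and the field names are assumptions, but it shows the three governance elements the talk calls out, with source tables captured for lineage, a hash of the defining code for version control, and an append-only history for audits.

```python
import hashlib
from datetime import datetime, timezone

def register_feature(registry, name, source_tables, code):
    """Hypothetical governance hook: record lineage, a code-version
    hash, and a timestamp every time a feature definition changes."""
    version = hashlib.sha256(code.encode()).hexdigest()[:12]
    entry = {
        "name": name,
        "lineage": sorted(source_tables),   # which source tables feed the feature
        "code_version": version,            # ties the feature to its exact source code
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    # Append rather than overwrite, so the registry doubles as an audit trail.
    registry.setdefault(name, []).append(entry)
    return entry

registry = {}
register_feature(registry, "avg_grocery_spend",
                 ["txn.transactions", "cust.accounts"],
                 "def avg_spend(c, category=None): ...")
```

With a record like this per feature, a regulator's question "where did this metric come from, and which code produced it?" has a direct, queryable answer.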