Cloud data lakes: Analytic data warehouses in the cloud
John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS cloud, where FINRA extracts, transforms, and loads over 75B transactions per day. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark, all while maintaining security and data lineage.
Talk Title | Cloud data lakes: Analytic data warehouses in the cloud |
Speakers | John Hitchingham (FINRA) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
The Financial Industry Regulatory Authority (FINRA) is a private sector regulator responsible for analyzing over 90% of the equities and 65% of the options activity in the US to look for fraud, market manipulation, insider trading, and abuse. John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS cloud, which provides storage, query, and catalog capability using S3, EMR, and a FINRA-developed data catalog and management system. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark, all while maintaining security and data lineage.

FINRA implemented the cloud data warehouse to consolidate a series of data silos as part of a two-and-a-half-year all-in migration of FINRA's Market Regulation systems to the cloud. It provides increased operational resiliency in response to market events such as Brexit while giving analysts and data scientists within FINRA increased insight into data.

Using S3 for storage provides a resilient, scalable, cost-effective storage layer for data in the cloud data warehouse. Data is stored in text format for archival queries and in ORC format for performant queries.

The herd data catalog provides a platform-independent way to track data. It supports data versioning, storage of business and technical metadata, and schema information that can be used to query registered data.

AWS EMR provides a scalable, secure compute and query platform for running ETL, batch analytics, and interactive analytics against data stored on S3. Keeping data on S3 provides increased durability, along with the ability to rapidly scale compute up and down to match demand.
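To make the catalog idea concrete, here is a minimal sketch of how a catalog might track versioned data registrations, their metadata, and their schema, and then turn a registered version into a Hive-style table definition that a query engine could use. This is not herd's API: the `Catalog`, `DataVersion`, and `Column` classes, the table name, and the `s3://example-bucket/...` locations are all hypothetical, and the toy catalog lives in memory only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Column:
    name: str
    type: str  # Hive-style type name, e.g. "bigint" or "string"


@dataclass
class DataVersion:
    version: int
    location: str      # storage prefix; the bucket names below are made up
    file_format: str   # "ORC" or "TEXTFILE"
    columns: List[Column]
    metadata: Dict[str, str] = field(default_factory=dict)


class Catalog:
    """Hypothetical in-memory catalog: data set name -> ordered versions."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[DataVersion]] = {}

    def register(
        self,
        name: str,
        location: str,
        file_format: str,
        columns: List[Column],
        metadata: Optional[Dict[str, str]] = None,
    ) -> DataVersion:
        # Each registration of the same name gets the next version number.
        versions = self._entries.setdefault(name, [])
        entry = DataVersion(
            len(versions), location, file_format, list(columns), dict(metadata or {})
        )
        versions.append(entry)
        return entry

    def latest(self, name: str) -> DataVersion:
        return self._entries[name][-1]

    def ddl(self, name: str, version: Optional[int] = None) -> str:
        # Build a Hive-style external table statement from the stored schema.
        versions = self._entries[name]
        entry = versions[-1] if version is None else versions[version]
        cols = ",\n  ".join(f"{c.name} {c.type}" for c in entry.columns)
        return (
            f"CREATE EXTERNAL TABLE {name}_v{entry.version} (\n  {cols}\n)\n"
            f"STORED AS {entry.file_format}\n"
            f"LOCATION '{entry.location}'"
        )


if __name__ == "__main__":
    catalog = Catalog()
    schema = [Column("trade_id", "bigint"), Column("symbol", "string")]
    # Archival copy in text format, then a query-optimized copy in ORC.
    catalog.register("trades", "s3://example-bucket/trades/v0/", "TEXTFILE", schema)
    catalog.register(
        "trades", "s3://example-bucket/trades/v1/", "ORC", schema,
        {"owner": "example-team"},
    )
    print(catalog.ddl("trades"))
```

Because the schema and location live in the catalog rather than in any one engine, the same registration could in principle be used to define tables for different query engines, which is one way to read "platform-independent" here.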