You call it data lake; we call it Data Historian.
There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.
Talk Title | You call it data lake; we call it Data Historian. |
Speakers | Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer) |
Conference | Strata Data Conference |
Conf Tag | Making Data Work |
Location | London, United Kingdom |
Date | May 22-24, 2018 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto’s Data Historian platform, a cloud-based data platform built from open source components that lets users efficiently ingest, process, store, and access datasets without compromising ease of use, governance, or security.

The platform was conceived as a simple tool for moving files that reside on local computer drives and file shares into a central repository. Besides a user-friendly file ingestion interface, the original tool gathered metadata both through user input and automatic parsing of files, and the uploaded content was immediately made available via an API. From those humble beginnings, Data Historian has grown into a full-blown, well-managed data lake and is continuously being enhanced with new features.

Data Historian now provides batch, streaming, and API-based ingestion in addition to simple file ingestion. Metadata is collected at the time of ingestion, making datasets immediately searchable in other tools such as Monsanto’s enterprise metadata management system and the enterprise data catalog. Data in Data Historian can be accessed via an API or SQL queries. Security on datasets is controlled through an existing entitlement workflow based on virtual directory services.

Even though the system is relatively young, it is already being used by several predictive models that query data out of Data Historian through an access API. In addition, descriptive analytics have been enabled via ODBC/JDBC connectivity, allowing traditional BI tools to interact with the datasets directly and increasing the utility of the platform. Like other data lake platforms, Data Historian has numerous other features, such as scheduling and monitoring data loads, archiving data to low-cost storage, automated data deletion based on company data retention policies, and capturing and reporting platform adoption metrics, to name a few.

The platform has been built from open source software running on AWS, including Hadoop on AWS EMR as the processing engine, Sqoop for batch data loads, Oozie for scheduling, Hive and Presto for query processing, AWS Lambda for event triggering, and S3, Glacier, RDS, and DynamoDB for data storage. It is fully integrated with AKAN and VDS (virtual directory service) and uses the OAuth 2.0 security model.

Naghman and Brian explain how Monsanto built this platform, focusing on the technical design and the various phases of the system build. They cover the technical architecture, share insights into why the team chose certain open source components, and discuss lessons learned along the way. They also explain how the system is being used to provide analytics on top of the datasets loaded into it.
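To make the API-based ingestion path concrete, here is a minimal sketch of how a client might push a file plus descriptive metadata to such a platform. The endpoint URL, field names, and token are hypothetical, not Data Historian’s actual API; the sketch only illustrates the pattern of pairing an OAuth 2.0 bearer token with a multipart upload so that metadata is captured at ingestion time.

```python
import requests

# Hypothetical ingestion endpoint and OAuth 2.0 bearer token; Data Historian's
# real API paths and metadata schema are not published in the talk abstract.
INGEST_URL = "https://data-historian.example.com/api/v1/datasets"
ACCESS_TOKEN = "eyJ..."  # issued by the enterprise OAuth 2.0 provider


def ingest_file(path, dataset_name, owner, tags):
    """Upload a local file along with user-supplied metadata."""
    metadata = {
        "name": dataset_name,
        "owner": owner,
        "tags": ",".join(tags),
    }
    with open(path, "rb") as fh:
        response = requests.post(
            INGEST_URL,
            headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
            data=metadata,        # metadata captured at ingestion time
            files={"file": fh},   # the file itself, as multipart content
            timeout=60,
        )
    response.raise_for_status()
    return response.json()        # e.g., the new dataset's identifier


if __name__ == "__main__":
    print(ingest_file("sales_2018.csv", "sales-2018", "data.owner", ["sales", "emea"]))
```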
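The abstract notes that data can also be pulled out through SQL, with Presto handling query processing and ODBC/JDBC exposing it to traditional BI tools. A rough Python equivalent, assuming a reachable Presto coordinator and a hypothetical catalog, schema, and table (the talk does not describe Data Historian’s actual schema), might look like this using the PyHive client.

```python
from pyhive import presto

# Host, catalog, schema, and table names are assumptions for illustration;
# they stand in for wherever Data Historian registers its ingested datasets.
conn = presto.connect(
    host="presto.example.com",
    port=8080,
    username="analyst",
    catalog="hive",
    schema="data_historian",
)

cursor = conn.cursor()
cursor.execute(
    "SELECT dataset_name, ingested_at, row_count "
    "FROM dataset_registry "
    "WHERE ingested_at >= DATE '2018-01-01' "
    "ORDER BY ingested_at DESC "
    "LIMIT 20"
)
for dataset_name, ingested_at, row_count in cursor.fetchall():
    print(dataset_name, ingested_at, row_count)
```

The same queries could be issued from a BI tool over ODBC/JDBC, which is how the abstract describes descriptive analytics being enabled on top of the platform.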
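The archiving and retention features map naturally onto S3 lifecycle rules, given that S3 and Glacier are named as the storage layer. The sketch below, using boto3, shows one plausible way to transition objects to low-cost storage and later expire them; the bucket name, prefix, and day counts are assumptions, since the talk does not publish Monsanto’s actual retention policy.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, prefix, and retention windows, standing in for a
# company data retention policy applied to ingested datasets.
s3.put_bucket_lifecycle_configuration(
    Bucket="data-historian-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-delete",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                # Move aging data to low-cost storage after 90 days...
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # ...and delete it once the retention window elapses.
                "Expiration": {"Days": 2555},  # roughly seven years
            }
        ]
    },
)
```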
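Lambda is listed as the event-triggering component. One common pattern, shown here as an assumption rather than as Data Historian’s actual implementation, is a Lambda handler that fires on an S3 object upload and hands the new file off to downstream metadata registration or processing.

```python
import json
import urllib.parse


def handler(event, context):
    """Triggered by an S3 ObjectCreated event for a newly ingested file."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # In a real deployment this is where metadata registration or a
        # downstream processing job (e.g., an EMR/Oozie workflow) would be
        # kicked off for the newly arrived object.
        print(json.dumps({"bucket": bucket, "key": key, "size": size}))

    return {"status": "ok", "records": len(records)}
```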