You call it data lake; we call it Data Historian.


Talk Title	You call it data lake; we call it Data Historian.
Speakers	Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 22-24, 2018
URL	Talk Page
Slides	Talk Slides

A number of tools make it easy to implement a data lake. However, most lack the essential features that prevent a data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto’s Data Historian platform, a cloud-based data platform built entirely from open source components that lets users efficiently ingest, process, store, and access datasets without compromising ease of use, governance, or security.
The platform was conceived to give Monsanto a simple tool for moving files that reside on local computer drives and file shares into a central repository. Besides a user-friendly file ingestion interface, the original tool also gathered metadata, both through user input and by automatically parsing files, and made uploaded content immediately available via an API. From those humble beginnings, Data Historian has turned into a full-blown, well-managed data lake that is continuously being enhanced with new features.
Data Historian provides batch, streaming, and API-based ingestion in addition to simple file ingestion. Metadata is collected at the time of ingestion, making datasets immediately searchable in other tools, such as Monsanto’s enterprise metadata management system and the enterprise data catalog. Data in Data Historian can be accessed via an API or SQL queries. Security on datasets is controlled through an existing entitlement workflow based on virtual directory services.
Even though the system is relatively young, several predictive models already query data out of Data Historian using an access API. In addition, descriptive analytics have been enabled via ODBC/JDBC connectivity, allowing traditional BI tools to interact with the datasets directly and increasing the utility of the platform.
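To make the idea of collecting metadata at ingestion time concrete, here is a minimal sketch. It is not Data Historian's actual implementation: the `Catalog` and `DatasetRecord` names, the fields, and the in-memory store are all hypothetical, and the only "automatic parsing" shown is reading a CSV header row.

```python
import csv
import hashlib
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; the field names are illustrative only."""
    name: str
    sha256: str
    size_bytes: int
    columns: list
    tags: dict
    ingested_at: str


class Catalog:
    """In-memory stand-in for a metadata store (an assumption, not the real system)."""

    def __init__(self):
        self._records = {}

    def ingest(self, name: str, content: bytes, user_tags: dict) -> DatasetRecord:
        # Automatic parsing: take column names from the CSV header row, if any.
        header = next(csv.reader(io.StringIO(content.decode("utf-8"))), [])
        record = DatasetRecord(
            name=name,
            sha256=hashlib.sha256(content).hexdigest(),
            size_bytes=len(content),
            columns=header,
            tags=dict(user_tags),  # metadata supplied by the user at upload time
            ingested_at=datetime.now(timezone.utc).isoformat(),
        )
        # Registering the record during ingestion makes it searchable right away.
        self._records[name] = record
        return record

    def search(self, term: str) -> list:
        # Case-insensitive match against dataset name, column names, and tag values.
        term = term.lower()
        return [
            r for r in self._records.values()
            if term in r.name.lower()
            or any(term in c.lower() for c in r.columns)
            or any(term in str(v).lower() for v in r.tags.values())
        ]
```

With this sketch, ingesting a small CSV and then calling `search("crop")` returns the dataset as soon as `ingest` has finished, because the catalog entry is written in the same step as the upload.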
Like other data lake platforms, Data Historian has numerous other features, such as scheduling and monitoring data loads, archiving data to low-cost storage, automatically deleting data based on company data retention policies, and capturing and reporting platform adoption metrics. The platform has been built using open source software, including Hadoop and AWS EMR as a processing engine, Sqoop for batch data loads, Oozie for scheduling, Hive and Presto for query processing, Lambda for event triggering, and S3, Glacier, RDS, and DynamoDB for data storage. The platform is also fully integrated with AKAN and VDS (virtual directory service) and utilizes the OAuth 2.0 security model.
Naghman and Brian explain how Monsanto built this platform, focusing on the technical design and the various phases of the system build. They also cover the technical architecture and share insights into why the team chose certain open source components to instantiate the platform, along with lessons learned. Finally, they explain how the system is being used to provide analytics on top of the datasets loaded into it.
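The archiving and retention-based deletion mentioned above can be pictured as a simple age-based rule. The sketch below is an assumption for illustration only: the abstract does not state the thresholds or the mechanism, so `retention_action`, `archive_after_days`, and `delete_after_days` are hypothetical names and values.

```python
from datetime import date
from enum import Enum


class Action(Enum):
    KEEP = "keep"        # leave the dataset in primary storage
    ARCHIVE = "archive"  # move the dataset to low-cost storage
    DELETE = "delete"    # remove the dataset under the retention policy


def retention_action(
    created: date,
    today: date,
    archive_after_days: int = 365,     # hypothetical threshold
    delete_after_days: int = 7 * 365,  # hypothetical threshold
) -> Action:
    """Decide what to do with a dataset based only on its age in days."""
    age_days = (today - created).days
    if age_days >= delete_after_days:
        return Action.DELETE
    if age_days >= archive_after_days:
        return Action.ARCHIVE
    return Action.KEEP
```

A scheduled job could evaluate this rule for every dataset and act on the result. A real policy would likely also consider dataset classification and legal holds, which this sketch leaves out.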

Executive Briefing: BI on big data

How to protect big data in a containerized environment

Smart agriculture: Blending IoT sensor data with visual analytics

Data science in the cloud

Get a farm-to-table view of your data: Track data lineage from source to analytics (sponsored by Syncsort)

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments