Filling the data lake
A major challenge in today's world of big data is getting data into the data lake in a simple, automated way. Coding scripts for each disparate source is time-consuming and difficult to manage. Developers need a process that supports disparate sources by detecting and passing metadata automatically. Chuck Yarbrough and Mark Burnette explain how to simplify and automate your data ingestion process.
| Talk Title | Filling the data lake |
| Speakers | Chuck Yarbrough (Pentaho), Mark Burnette (Pentaho, a Hitachi Group Company) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 29-31, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
A major challenge in today’s world of big data is getting data into the data lake in a simple, automated way. Many organizations use Python or another language to code their way through these processes. The problem is that with disparate sources of data numbering in the thousands, coding a script for each source is time-consuming and extremely difficult to manage and maintain. Developers need the ability to create one process that supports many disparate data sources by detecting and passing metadata through what Pentaho calls “metadata injection.” With this capability, developers can parameterize ingestion processes and automate every step of the data pipeline. Chuck Yarbrough and Mark Burnette outline model-driven ingestion and explain how to simplify and automate your data ingestion processes. This session is sponsored by Pentaho.
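To make the idea concrete, here is a minimal Python sketch of metadata-driven ingestion. It is not Pentaho's metadata injection API; the `SOURCE_METADATA` registry, the `ingest` function, and the SQLite staging store are illustrative assumptions showing how a single parameterized process can load many differently shaped sources, so that onboarding a new source means adding a metadata record rather than writing another script.

```python
# Hypothetical sketch of metadata-driven ingestion (not Pentaho's implementation).
# One generic routine is parameterized by per-source metadata instead of a
# hand-coded script per source.
import csv
import sqlite3

# Hypothetical metadata registry: each record describes one source.
SOURCE_METADATA = [
    {"name": "orders",    "path": "orders.csv",    "delimiter": ",", "target_table": "raw_orders"},
    {"name": "customers", "path": "customers.psv", "delimiter": "|", "target_table": "raw_customers"},
]

def ingest(meta, conn):
    """Load one source into the staging store using only its metadata."""
    with open(meta["path"], newline="") as f:
        reader = csv.reader(f, delimiter=meta["delimiter"])
        header = next(reader)  # detect column names from the source itself
        cols = ", ".join(f'"{c}" TEXT' for c in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS {meta["target_table"]} ({cols})')
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            f'INSERT INTO {meta["target_table"]} VALUES ({placeholders})', reader
        )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("data_lake_staging.db")  # stand-in for the real lake
    for meta in SOURCE_METADATA:  # one loop covers every registered source
        ingest(meta, conn)
```

The key design point the talk abstract describes is visible even in this toy version: the ingestion logic never mentions a specific file or schema, so every step of the pipeline can be driven and automated from the metadata alone.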