October 24, 2019

How to build a successful data lake

It is fashionable today to declare doom and gloom for the data lake. Alex Gorelik discusses best practices for Hadoop data lake success and provides real-world examples of successful data lake implementations in a non-vendor-specific talk.


Talk Title	How to build a successful data lake
Speakers	Alex Gorelik (Waterline Data)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016

Big data and data science promise to bring unprecedented levels of insight and efficiency to everything from working with data to working with customers to curing cancer. To successfully deliver on this promise, traditional enterprises are building data lakes, which bridge the gap between enterprise data warehouses, where data is a precious commodity carefully tended to by professional IT personnel, and the freewheeling culture of modern Internet companies. An enterprise data lake must provide three new capabilities: cost-effective scalable storage and computing; cost-effective data access and governance; and tiered, governed access, based on user needs, skill levels, and applicable data-governance policies. Drawing on a 30-year career developing leading-edge data technology and working with some of the world’s largest enterprises on their thorniest data problems, Alex Gorelik, author of the forthcoming O’Reilly book The Enterprise Data Lake, discusses the considerations and best practices for building data lakes, with examples taken from the world’s leading big data companies and enterprises. Topics include:

It's all about me: From big data models to personalized experience

October 15, 2019

Even though each of us is only 1 of 7 billion, we all want to feel special. Product personalization is transforming the Internet experience to be all about me (and you too!). Yao Morin explores how to create customer-centric experiences through data science and software engineering, using Intuit TurboTax as a case study.

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intel's IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo

October 23, 2019

Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.

Python scalability: A convenient truth

October 21, 2019

Despite Python's popularity throughout the data-engineering and data science workflow, the principles behind its performance and scaling behavior are less understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Self-service, interactive analytics at multipetabyte scale in capital markets regulation on the cloud

October 20, 2019

Scott Donaldson and Matt Cardillo detail the security measures and system architecture needed to bring alive a multipetabyte data warehouse via interactive analytics and directed graphs from several trillions of market events, using HBase, EMR, Hive, Redshift, and S3 technologies in a cost-efficient manner.