November 9, 2019

Architecting a next-generation data platform

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.


Talk Title	Architecting a next-generation data platform
Speakers	Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Mark Grover (Lyft), Gwen Shapira (Confluent)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017

Apache Hadoop is rapidly moving from its batch processing roots to a more flexible platform supporting both batch and streaming workloads. Rapid advancements in the Hadoop ecosystem are causing a dramatic evolution in both the storage and processing capabilities of the Hadoop platform. These advancements include projects like:

While these advancements to the Hadoop platform are exciting, they also add a new array of tools that architects and developers need to understand when architecting solutions with Hadoop.

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics. Along the way, they discuss considerations and best practices for utilizing these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time big data architectures. Topics include:

kafka streaming apache sql spark ecosystem hadoop open source analytics big data

Paint the landscape and secure your data center with Apache Spot

November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture that applies unsupervised machine learning at cloud scale for anomaly detection.

Real-time analytics using Kudu at petabyte scale

November 3, 2019

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Apache Kylin 2.0: From classic OLAP to real-time data warehouse

November 9, 2019

Apache Kylin, which started as a big data OLAP engine, is reaching v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

Driving enterprise open source adoption, from data lake to AI (sponsored by Teradata)

November 6, 2019

It is no surprise that reducing operational IT expenditures and increasing software capabilities are top priorities for large enterprises. Given its advantages, open source software has proliferated across the globe. Ron Bodkin explains how Teradata drives open source adoption inside enterprises using open source data management and AI techniques leveraged across the analytical ecosystem.

Big data for operational insights

November 9, 2019

GoDaddy ingests and analyzes 100,000 EPS of logs, metrics, and events. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growth of more than 10 TB per day to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies

November 2, 2019

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.