January 26, 2020

Architecting a next-generation data platform

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform that leverages recent advancements in open source software, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL along with modern storage engines, to enable new forms of data processing and analytics.


Talk Title	Architecting a next-generation data platform
Speakers	Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018
URL	Talk Page
Slides	Talk Slides

Rapid advancements are driving a dramatic evolution in both the storage and processing capabilities of the open source enterprise data software ecosystem. The resulting storage and processing systems, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL, provide a powerful platform for implementing data processing applications on batch and streaming data. While these advancements are exciting, they also add a new array of tools that architects and developers need to understand when designing modern data processing solutions.

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform that uses these components to reliably integrate multiple data sources, perform real-time and batch data processing, reliably store massive volumes of data, and efficiently query and process large datasets. Along the way, they discuss considerations and best practices for using these components to implement solutions, cover common challenges and how to address them, and provide practical advice for building your own modern, real-time data architectures.

Architecting a next-generation data platform

December 12, 2019

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform that leverages recent advancements in open source software, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL along with modern storage engines, to enable new forms of data processing and analytics.

From flat files to deconstructed database: The evolution and future of the big data ecosystem

January 22, 2020

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

StreamDM: Advanced data science with Spark Streaming

December 5, 2019

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah's Ark Lab and Télécom ParisTech.

Applying petabyte-scale analytics and machine learning to billions of news reading sessions

January 26, 2020

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions of billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.

Edgility - Modelling and Orchestration in Edge Cloud Environments

January 22, 2020

Edge computing is becoming a major component of 5G networks. While edge computing provides an ultra-low-latency solution, it brings with it other challenges. On one hand, edge clouds are small, and thei …

Hudi: Unifying storage and serving for batch and near-real-time analytics

January 22, 2020

Uber needs to provide faster, fresher data to its data consumers and products, which run hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.