January 7, 2020

Genji: A framework for building resilient near-real-time data pipelines

Pinterest has to support real-time decision making while operating on petabyte-scale data. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest's real-time data pipeline (modeled on a quasi-Kappa architecture), describe its impact on the company's systems, tools, and processes, and demonstrate how Pinterest models real-time ads analytics on the platform.


Talk Title	Genji: A framework for building resilient near-real-time data pipelines
Speakers	Swaminathan Sundaramurthy (Salesforce Inc), Mark Cho (Pinterest)
Conference	O’Reilly Velocity Conference
Conf Tag	Build resilient systems at scale
Location	New York, New York
Date	October 2-4, 2017

Pinterest operates on data at petabyte scale. Previously, the company’s fact tables were generated daily using Hadoop, resulting in data that was frequently 24–48 hours old. To support real-time decision making, stats, and analytics, Pinterest modeled its warehouse on a quasi-Kappa architecture, treating batch processing as a special case of stream processing and warehousing data with sub-15-minute lag. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest’s real-time data pipeline, discussing the company’s decision to warehouse data in near real time so that downstream systems can operate on much fresher data, the platform’s architecture, and its impact on Pinterest’s systems, tools, and processes. They conclude by demonstrating how Pinterest models real-time ads analytics use cases on the platform and sharing lessons learned along the way.
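
As a rough illustration of treating batch processing as a special case of stream processing, here is a minimal Python sketch. It is not Pinterest's implementation: the event shape `(epoch_seconds, ad_id)`, the names `window_start` and `count_events`, and the 15-minute window are all assumptions made for this example. The same aggregation function is applied both to a stored log (a bounded stream) and to events arriving one at a time.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape: (epoch_seconds, ad_id). Not taken from the talk.
Event = Tuple[int, str]
Counts = Dict[Tuple[int, str], int]

# Assumed window size, chosen only to echo the sub-15-minute lag in the abstract.
WINDOW_SECONDS = 15 * 60


def window_start(ts: int) -> int:
    """Round a timestamp down to the start of its window."""
    return ts - ts % WINDOW_SECONDS


def count_events(events: Iterable[Event], counts: Optional[Counts] = None) -> Counts:
    """Fold events into per-(window, ad_id) counts.

    The input may be a finite log (batch) or a live iterator (stream);
    the logic is identical either way.
    """
    if counts is None:
        counts = {}
    for ts, ad_id in events:
        key = (window_start(ts), ad_id)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    log = [(0, "ad1"), (10, "ad1"), (900, "ad2"), (950, "ad1")]

    # Batch: replay the whole stored log through the streaming logic.
    batch = count_events(log)

    # Stream: feed events one at a time into a running state.
    live: Counts = {}
    for event in log:
        count_events([event], live)

    print(batch == live)
```

Because a backfill is just a replay of the log through the same function, there is no separate batch code path to keep in sync with the streaming one, which is the core idea behind Kappa-style designs.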

Building big data applications on Azure

January 4, 2020

As big data solutions are rapidly moving to the cloud, it's becoming increasingly important to know how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. Pranav Rastogi walks you through building big data applications on Azure HDInsight and other Azure services.

Data wrangling for insurance

December 4, 2019

Drawing on use cases from Trifacta customers, the speaker explains how to leverage data wrangling solutions in the insurance industry to streamline, strengthen, and improve data analytics initiatives on Hadoop.

Modern Big Data Pipelines over Kubernetes [I]

December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions which don't necessarily benefit from the Hadoop deployme …

Paint the landscape and secure your data center with Apache Spot

November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture that uses unsupervised machine learning at cloud scale for anomaly detection.

SMACK Stack and Beyond - Building Fast Data Pipelines

January 6, 2020

Our world seems to move faster and faster, and so do our requirements for data analytics. For many use cases, such as fraud detection or reacting to sensor data, the response times of traditional batch …

An open source architecture for the IoT

January 5, 2020

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.