January 7, 2020

Genji: A framework for building resilient near-real-time data pipelines

Pinterest has to support real-time decision making while operating on petabyte-scale data. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest's real-time data pipeline (modeled on a quasi-Kappa architecture) and its impact on the company's systems, tools, and processes, then demonstrate how Pinterest models real-time ads analytics on the platform.

Talk Title: Genji: A framework for building resilient near-real-time data pipelines
Speakers: Swaminathan Sundaramurthy (Salesforce Inc), Mark Cho (Pinterest)
Conference: O’Reilly Velocity Conference
Conf Tag: Build resilient systems at scale
Location: New York, New York
Date: October 2-4, 2017

Pinterest operates on data at petabyte scale. Previously, the company’s fact tables were generated daily using Hadoop, resulting in data that was frequently 24–48 hours old. To support real-time decision making, stats, and analytics, Pinterest modeled its warehouse on a quasi-Kappa architecture, treating batch processing as a special case of stream processing and warehousing data with sub-15-minute lag. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest’s real-time data pipeline, discussing the company’s decision to warehouse data in near real time so that downstream systems can operate on much fresher data, the platform’s architecture, and its impact on Pinterest’s systems, tools, and processes. They conclude by demonstrating how Pinterest models real-time ads analytics use cases on the platform and sharing lessons learned along the way.
