Genji: A framework for building resilient near-real-time data pipelines
Pinterest has to support real-time decision making while operating on petabyte-scale data. Swaminathan Sundaramurthy and Mark Cho give an overview of Pinterest's real-time data pipeline, which is modeled on a quasi-Kappa architecture, and of its impact on the company's systems, tools, and processes. They also demonstrate how Pinterest models real-time ads analytics on the platform.
Talk Title | Genji: A framework for building resilient near-real-time data pipelines |
Speakers | Swaminathan Sundaramurthy (Salesforce Inc), Mark Cho (Pinterest) |
Conference | O’Reilly Velocity Conference |
Conf Tag | Build resilient systems at scale |
Location | New York, New York |
Date | October 2-4, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Pinterest operates on data at petabyte scale. Previously, the company’s fact tables were generated daily using Hadoop, resulting in data that was frequently 24–48 hours old. To support real-time decision making, stats, and analytics, Pinterest modeled its warehouse on a quasi-Kappa architecture, which treats batch processing as a special case of stream processing and warehouses data with a lag of under 15 minutes. Swaminathan Sundaramurthy and Mark Cho offer an overview of Pinterest’s real-time data pipeline, discussing the company’s decision to warehouse data in near real time so that downstream systems can operate on much fresher data, the platform’s architecture, and its impact on Pinterest’s systems, tools, and processes. They conclude by demonstrating how Pinterest models real-time ads analytics use cases on the platform and sharing lessons learned along the way.
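The idea of treating batch processing as a special case of stream processing can be illustrated with a small Python sketch. It is not Pinterest's implementation or the Genji API; `AdEvent`, `windowed_counts`, and the 15-minute window size are hypothetical names and values chosen for illustration, and the sketch assumes events arrive in timestamp order. The same function consumes either a finite list (a "batch" replay of history) or an unbounded iterator (a live stream).

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple

# Hypothetical window size, chosen to echo the sub-15-minute lag goal.
WINDOW_SECONDS = 15 * 60


class AdEvent(NamedTuple):
    ts: int      # event time, in seconds since the epoch
    ad_id: str
    kind: str    # e.g. "impression" or "click"


def windowed_counts(
    events: Iterable[AdEvent], window: int = WINDOW_SECONDS
) -> Iterator[Tuple[int, Dict[Tuple[str, str], int]]]:
    """Yield (window_start, counts) each time a window closes.

    Assumes events arrive in non-decreasing timestamp order. The input may
    be a finite list (batch) or an endless iterator (stream); the logic is
    identical, which is the point of the Kappa-style design.
    """
    current_start = None
    counts: Counter = Counter()
    for ev in events:
        start = ev.ts - ev.ts % window  # align to the window boundary
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)  # previous window is complete
            counts = Counter()
        current_start = start
        counts[(ev.ad_id, ev.kind)] += 1
    if current_start is not None:
        # Bounded input ended: flush the last (possibly partial) window.
        yield current_start, dict(counts)


history = [
    AdEvent(0, "a1", "impression"),
    AdEvent(10, "a1", "click"),
    AdEvent(900, "a2", "impression"),
]
batch_result = list(windowed_counts(history))         # bounded replay
stream_result = list(windowed_counts(iter(history)))  # same code, iterator input
```

Because reprocessing history is just running the same function over a bounded input, a backfill and the live path cannot drift apart in logic, which is one commonly cited motivation for Kappa-style designs.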