November 22, 2019

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Talk Title: Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously
Speakers: Henry Cai (Pinterest), Yi Yin (Pinterest)
Conference: Strata Data Conference
Conf Tag: Big Data Expo
Location: San Jose, California
Date: March 6-8, 2018

Pinterest helps people discover, save, and do things that they love. The company has a hundred billion core objects (pins, boards, and users) stored in MySQL at the scale of a hundred terabytes. Most of that data is used to build data-driven products, such as personalized recommendations, A/B experiments, and search indexes. As Pinterest moves toward real-time computation, it faces much more stringent SLA requirements, such as making MySQL data available in S3/Hadoop within 15 minutes and serving DB data incrementally for stream processing. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. The system listens for MySQL binlog changes, publishes the MySQL change logs as an Apache Kafka change stream, and ingests and compacts the stream into columnar tables in S3/Hadoop within 15 minutes.
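
To make the compaction step more concrete, here is a minimal Python sketch of how a stream of row-level changes could be folded into a table snapshot. It is not WaterMill's implementation, which the abstract does not describe in detail. The `ChangeEvent` schema, the `seq` ordering field, and the `compact` function are all hypothetical names chosen for illustration, and the sketch works on in-memory dictionaries rather than on Kafka topics or columnar files.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change (hypothetical schema, assumed for this sketch)."""
    seq: int              # position in the change log; larger means later
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # full row image after the change; None for deletes


def compact(snapshot: Dict[int, dict],
            events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply change events to a snapshot keyed by primary key.

    Events are replayed in log order, so the last change to each key wins.
    The input snapshot is left untouched and a new snapshot is returned.
    """
    result = dict(snapshot)
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.op == "delete":
            result.pop(ev.key, None)
        elif ev.op in ("insert", "update"):
            result[ev.key] = ev.row
        else:
            raise ValueError(f"unknown op: {ev.op!r}")
    return result


# Previous snapshot of a small, made-up table of pins.
base = {
    1: {"id": 1, "board": "recipes"},
    2: {"id": 2, "board": "travel"},
}

# Changes may arrive out of order; `seq` restores the log order.
changes = [
    ChangeEvent(seq=12, op="delete", key=2, row=None),
    ChangeEvent(seq=10, op="update", key=1, row={"id": 1, "board": "baking"}),
    ChangeEvent(seq=11, op="insert", key=3, row={"id": 3, "board": "diy"}),
]

compacted = compact(base, changes)
```

In a real pipeline the snapshot and the change batch would likely be partitioned files rather than dictionaries, but the core idea is the same: replay changes in log order and keep only the latest image of each primary key.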
