January 1, 2020


From Kafka to BigQuery: A guide for delivering billions of daily events

What are the most important considerations when shipping billions of daily events for analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.

Talk Title: From Kafka to BigQuery: A guide for delivering billions of daily events
Speakers: Ofir Sharony (MyHeritage)
Conference: Strata + Hadoop World
Conf Tag: Make Data Work
Location: Singapore
Date: December 6-8, 2016
URL: Talk Page
Slides: Talk Slides
Video:

MyHeritage collects billions of events every day, including request logs from web servers and backend services, events describing user activities across different platforms, and change data capture logs recording every change made to its database records. Delivering these events for analysis is a complex task, requiring a robust and scalable data pipeline. Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics and offers an overview of the system the company decided on: shipping events to Apache Kafka and loading them into Google BigQuery for analysis. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline. For more information, take a look at Ofir's recent blog post on the subject.
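
One of the loading techniques such a pipeline can use is streaming inserts from a Kafka consumer directly into BigQuery. The sketch below illustrates that approach only; it is not MyHeritage's implementation. The topic name, consumer group, table ID, and batch size are hypothetical placeholders, and it assumes the kafka-python and google-cloud-bigquery client libraries.

```python
# Minimal sketch: consume JSON events from Kafka and stream them into BigQuery.
# All names (topic, group, project.dataset.table) are hypothetical placeholders.
import json

from kafka import KafkaConsumer          # pip install kafka-python
from google.cloud import bigquery        # pip install google-cloud-bigquery

TOPIC = "user-activity-events"           # hypothetical topic name
TABLE_ID = "my-project.analytics.events" # hypothetical project.dataset.table

def stream_events_to_bigquery(batch_size: int = 500) -> None:
    """Consume events from Kafka and push them to BigQuery via streaming inserts."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        group_id="bq-loader",                                    # hypothetical group
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        enable_auto_commit=False,
    )
    client = bigquery.Client()
    buffer = []
    for message in consumer:
        buffer.append(message.value)     # each event is a dict matching the table schema
        if len(buffer) >= batch_size:
            errors = client.insert_rows_json(TABLE_ID, buffer)
            if errors:
                raise RuntimeError(f"BigQuery insert errors: {errors}")
            consumer.commit()            # commit offsets only after a successful insert
            buffer = []

if __name__ == "__main__":
    stream_events_to_bigquery()
```

Streaming inserts favor low end-to-end latency, while periodic batch loads trade latency for lower cost and simpler deduplication; weighing such trade-offs is the kind of comparison the talk walks through.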
