Lessons learned building a scalable and extendable data pipeline for Call of Duty
| Talk Title | Lessons learned building a scalable and extendable data pipeline for Call of Duty |
| Speakers | Yaroslav Tkachenko (Activision) |
| Conference | Strata Data Conference |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 11–13, 2018 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
What’s easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data (probably over HTTP), design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse… wait, that’s a lot of moving parts, isn’t it? And you probably want it to be highly scalable and available too. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision. Yaroslav shares lessons learned about scaling pipelines, not only in terms of messages per second but also in terms of supporting more games and more use cases. He also covers message schemas; Apache Kafka organization and tuning; topic naming conventions, structure, and routing; reliable and scalable producers and the ingestion layer; and stream processing.
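To make two of the abstract's themes concrete: structured topic names let routing rules pattern-match on environment, game title, and event type, and a handful of producer settings determine delivery reliability. The sketch below is purely illustrative; the naming scheme and helper function are hypothetical, not the convention actually used at Demonware/Activision, and the config keys are standard Kafka producer options shown without a broker.

```python
# Hypothetical topic naming helper. The segment order
# (env.game.event_type.vN) is an assumption for illustration, not the
# convention described in the talk.
def topic_name(env: str, game: str, event_type: str, version: int) -> str:
    """Build a structured Kafka topic name so downstream routing can
    match on environment, game title, and event type."""
    for part in (env, game, event_type):
        if not part.isidentifier():
            raise ValueError(f"invalid topic component: {part!r}")
    return f"{env}.{game}.{event_type}.v{version}"


# Standard Kafka producer settings commonly used for reliable delivery
# (shown as a plain dict; pass to your Kafka client of choice).
reliable_producer_config = {
    "acks": "all",               # wait for all in-sync replicas to ack
    "enable.idempotence": True,  # avoid duplicates on broker-side retries
    "retries": 5,                # retry transient send failures
}

print(topic_name("prod", "cod_ww2", "match_end", 1))
# prod.cod_ww2.match_end.v1
```

A scheme like this also makes schema evolution visible: bumping the trailing version segment lets old and new consumers coexist on separate topics during a migration.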