October 27, 2019

Building DistributedLog, a high-performance replicated log service

DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter.


Talk Title	Building DistributedLog, a high-performance replicated log service
Speakers	Sijie Guo (StreamNative)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016

Systems such as databases and messaging systems require durability. A common way to provide durability while keeping performance high is to persist updates to system state in a log, which is then used to reconstruct that state after a crash. Logs are also powerful data structures for addressing challenging distributed-systems problems. DistributedLog is a replicated log service built on top of Apache BookKeeper that provides infinite, ordered, append-only streams for building robust real-time systems. It is the foundation of Twitter’s publish-subscribe system and is used widely elsewhere at Twitter, in applications ranging from the transactional database system to the search ingestion pipeline and the real-time data analytics platform. Sijie Guo offers an overview of DistributedLog, detailing why Twitter built it, the technical decisions and challenges behind it, and how Twitter uses it to support workloads with different characteristics, from a strongly consistent distributed database to a real-time data analytics pipeline. Sijie also discusses how Twitter runs the same software stack in multiple data centers to achieve global consistency.
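
The idea of persisting updates to a log and replaying it after a crash can be illustrated with a minimal single-machine sketch. This is not the DistributedLog or BookKeeper API, and it does no replication. The class `LoggedStore`, the JSON record format, and the file name are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


class LoggedStore:
    """Hypothetical key-value store that appends each update to a log file
    before applying it in memory (illustration only, not DistributedLog)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()

    def _replay(self):
        # Rebuild in-memory state by re-applying every logged update in order.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        # Append the update and force it to disk before changing state,
        # so an acknowledged update survives a crash.
        record = json.dumps({"key": key, "value": value})
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.state[key] = value


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "updates.log")
        store = LoggedStore(path)
        store.put("a", 1)
        store.put("b", 2)
        store.put("a", 3)
        # Simulate a restart: a fresh instance sees only the log on disk.
        recovered = LoggedStore(path)
        return recovered.state
```

Calling `demo()` writes three updates, discards the in-memory state, and rebuilds it from the log alone. A replicated log service applies the same append-then-replay pattern, with the log stored on multiple machines rather than in one local file.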

apache twitter messaging data analytics analytics database data center performance book pipeline

What's next for BDAS (the Berkeley Data Analytics Stack)?

October 18, 2019

Michael Franklin offers an overview of the Berkeley Data Analytics Stack, outlines the current directions it's taking, and settles once and for all how BDAS should be pronounced.

High-performance clickstream analytics with Apache Phoenix and HBase

October 25, 2019

Traditional data-warehousing techniques are sometimes limited by the scalability of the implementation tools themselves. Arun Thangamani explains how the architectural approaches taken by tools like Apache Phoenix and HBase enable new, highly scalable live-analytics solutions that use the same traditional techniques, and showcases a successful implementation at CDK.

NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

October 22, 2019

Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.

Putting Kafka into overdrive

October 22, 2019

Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot the performance of your data pipelines, from producer to consumer and from development to production.

Real-time Hadoop: What an ideal messaging system should bring to Hadoop

October 21, 2019

Application messaging isn't new: solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can modern distributed messaging systems like Kafka be considered a replacement for these legacy systems, or are they purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX (Half Day)

October 27, 2019

Jayant Shekhar, Amandeep Khurana, Krishna Sankar, and Vartika Singh guide participants through techniques for building machine-learning apps using Spark MLlib and Spark ML and demonstrate the principles of graph processing with Spark GraphX.