November 18, 2019

401 words 2 mins read

Making sense of exactly-once semantics

Making sense of exactly-once semantics

Exactly-once semantics is a highly desirable property for streaming analytics. Ideally, all applications process events once and never twice, but making such guarantees in general either induces significant overhead or introduces other inconveniences, such as stalling. Flavio Junqueira explores what's possible and reasonable for streaming analytics to achieve when targeting exactly-once semantics.

Talk Title Making sense of exactly-once semantics
Speakers Flavio Junqueira (Dell EMC)
Conference Strata + Hadoop World
Conf Tag Making Data Work
Location London, United Kingdom
Date June 1-3, 2016
URL Talk Page
Slides Talk Slides

Exactly-once delivery is the holy grail of streaming analytics. Having duplicates of events processed in a streaming job is inconvenient and often undesirable depending on the nature of the application. For example, if billing applications miss an event or process an event twice, they could lose revenue or overcharge customers. Guaranteeing that such scenarios never happen is difficult; any project seeking such a property will need to make some choices with respect to availability and consistency. One main difficulty stems from the fact that a streaming pipeline might have multiple stages, and exactly-once delivery needs to happen at each stage. Another difficulty is that intermediate computations could potentially affect the final computation. And once results are exposed, retracting them causes problems. These observations lead to the conclusion that providing a general solution is very difficult, if not impossible. For some classes of applications, however, it is possible to obtain exactly-once behavior. If senders are willing to retry an unbounded number of times and the delivery is idempotent, then we can create a behavior equivalent to exactly-once semantics. One way to achieve the goal of making delivery idempotent is to rely on a distributed consensus primitive to have agreement on what has been delivered. For example, if the output has a key and a value, we can make sure that the key has been produced already. Assuming we rely on persistent queues to propagate events, we can use the queue position to also disambiguate updates using this consensus primitive. Flavio Junqueira explores what’s reasonable for streaming analytics to achieve when targeting exactly-once semantics, using examples based on systems that provide commit-log abstractions (like Apache Kafka and Apache BookKeeper) to demonstrate when it is possible to guarantee strong delivery semantics and how. Flavio shows that for a simple, but important, class of applications, it’s possible to provide strong delivery guarantees along with efficient implementations.

comments powered by Disqus