Distributed real-time highly available stream processing
Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.
Talk Title | Distributed real-time highly available stream processing |
Speakers | Yu-Xi Lim (Teralytics), Michal Wegrzyn (Teralytics) |
Conference | Strata + Hadoop World |
Conf Tag | Make Data Work |
Location | Singapore |
Date | December 6-8, 2016 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Real-time event stream processing is increasingly becoming a staple of data-driven decision making in many organizations. However, providing such data processing capability is not without challenges. Real-time decision-making support systems must be able to handle high event traffic volume and execute sophisticated analyses and must be highly available and scalable with high data consistency and integrity guarantees. However, satisfying all these requirements simultaneously is difficult. Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. It can easily define an event flow (i.e., a stream-in/stream-out transformer) and shard the state, pass state fragments to independent state machines which would process independent stream shards, and then splice resulting state fragments into a global state if necessary. With proper orchestration, this framework becomes capable of performing asynchronous checkpointing and seamless recovery of individually failed instances, leading to a highly reliable, highly available framework. Yu-Xi and Michal illustrate the framework via a case study of Teralytics, a company that analyzes cellular location data to derive transportation patterns and compute transportation-related metrics. Yu-Xi and Michal demonstrate how the algorithm for computing one such metric can be formulated in the framework and how the resulting component reflects the traits given above.