December 1, 2019

Near-real-time ingest with Apache Flume and Apache Kafka at 1 million-events-per-second scale

Vodafone UK's new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT-scale on modest hardware and at a very low cost.


Talk Title	Near-real-time ingest with Apache Flume and Apache Kafka at 1 million-events-per-second scale
Speakers	Tristan Stevens (Cloudera)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 23-25, 2017
URL	Talk Page
Slides	Talk Slides
Video

There are two large obstacles to collecting metadata from a network as large as Vodafone’s (the UK’s second-largest telecoms provider): transporting the sheer volume of data (cumulative bandwidth) and processing it before the data no longer accurately reflects the state of the network (cumulative delay). Fortunately, combining Apache Flume and Apache Kafka using the Flafka pattern provides a means to move data into the EDH (Hadoop cluster) and readily scale the pipeline to address both transient and persistent spikes in data volume. Flume and Kafka are both capable of high-performance, low-latency event processing; however, careful tuning is required in order to achieve performance at this scale. Vodafone has deployed Flume and Kafka across the UK network in a geographically distributed architecture that achieves scale and resilience, having been tuned from around 10,000 events per second on initial deployment to 1,000,000 events per second using a three-node Kafka cluster. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT-scale on modest hardware and at a very low cost. Topics include:

Big data for operational insights

November 9, 2019

GoDaddy ingests and analyzes 100,000 EPS of logs, metrics, and events each day. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of 10+-TB-per-day growth for operational insights of its cloud leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.

A contextual real-time bidding engine for search engine marketing

November 10, 2019

Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot.

Achieving real-time ingestion and analysis of security events through Kafka and Metron

November 10, 2019

Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats.

Paint the landscape and secure your data center with Apache Spot

November 4, 2019

Cesar Berho and Alan Ross offer an overview of open source project Apache Spot (incubating), which delivers next-generation cybersecurity analytics architecture through unsupervised learning using machine-learning techniques at cloud scale for anomaly detection.

Why stream? The advantages of working with streaming data

October 31, 2019

Life doesn't happen in batches. Being able to work with data from continuous events as data streams is a better fit to the way life happens, but doing so presents some challenges. Ellen Friedman examines the advantages and issues involved in working with streaming data, takes a look at emerging technologies for streaming, and describes best practices for this style of work.

Distributed Database DevOps Dilemmas? Kubernetes to the Rescue

December 1, 2019

Distributed databases can make so many things easier for a developer… but not always for DevOps. OK, almost never for DevOps. Kubernetes has come to the rescue with an easy application orchestrati …