October 26, 2019

215 words 2 mins read

Designing a scalable real-time data platform using Akka, Spark Streaming, and Kafka

Designing a scalable real-time data platform using Akka, Spark Streaming, and Kafka

Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis.


Talk Title	Designing a scalable real-time data platform using Akka, Spark Streaming, and Kafka
Speakers	Alex Silva (Pluralsight)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016
URL	Talk Page
Slides	Talk Slides
Video

With the advent of reliable streaming technologies, real-time data pipelines have become a crucial component of any robust data initiative today. Compared to a traditional Hadoop-centric data hub, these real-time stacks provide high-levels of system availability and data integrity coupled with very low latency queries without incurring the overhead of inflexible schemas and batch analysis lag. Alex Silva demonstrates how to use Kafka, Spark Streaming, Akka, and Hadoop to orchestrate a real-time stack and explains how data flows through this system. This real-time data platform combines a mix of open source technologies and home-grown services aimed at providing a full end-to-end solution, starting from flexible data-ingestion protocols to fast data analysis and queries. Topics include:

kafka streaming spark hadoop open source scalable pipeline

comments powered by Disqus

IoT in the enterprise: A look at Intel (IoT) Inside

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intels IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Scala and the JVM as a big data platform: Lessons from Apache Spark

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Scalable schema management for Hadoop and Spark applications

Scalable schema management for Hadoop and Spark applications

October 21, 2019

Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark application more reliable and efficient.

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

October 25, 2019

Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.

How the oil and gas industry is igniting a spark with information fusion and metadata analytics

How the oil and gas industry is igniting a spark with information fusion and metadata analytics

October 24, 2019

Oil and gas organizations are at the forefront of big data, adopting technologies such as Hadoop and Spark to develop next-generation fusion systems. Brian Clark and Marco Ippolito introduce a case study from CGG, a builder of common data models to drive analytics of sensor data and associated metadata from fast-changing big data streams, to show how to derive richer value from big data assets.

NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

October 22, 2019

Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.