October 25, 2019

236 words 2 mins read

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

Fast data made easy with Apache Kafka and Apache Kudu (incubating)

Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.


Talk Title	Fast data made easy with Apache Kafka and Apache Kudu (incubating)
Speakers	Ted Malaska (Capital One), Jeff Holoman (Cloudera)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016
URL	Talk Page
Slides	Talk Slides
Video

Historically, use cases such as time series and mutable-profile datasets have been possible but difficult to achieve efficiently using traditional HDFS storage engines. These solutions might involve complex ingestion paths, deep understanding of file types, and compaction strategies. With the introduction of Kudu, many of these difficulties are eliminated. At the same time, interest in streaming solutions and low-latency analytics has surged with the growing popularity of tools like Apache Kafka. Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.

kafka code streaming apache sql dataset spark introduction analytics hdfs use case

comments powered by Disqus

NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics

October 22, 2019

Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.

IoT in the enterprise: A look at Intel (IoT) Inside

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intels IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

Driving the on-demand economy with predictive analytics

Driving the on-demand economy with predictive analytics

October 25, 2019

The next evolution in the on-demand economy is in predictive analytics fueled by live streams of datain effect knowing what customers want before they do. Eric Frenkiel explains how a real-time trinity of technologiesKafka, Spark, and MemSQLis enabling Uber and others to power their own revolutions with predictive apps and analytics.

Scala and the JVM as a big data platform: Lessons from Apache Spark

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Docker for data scientists

Docker for data scientists

October 25, 2019

Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds familiar, you may have missed the increasing popularity of Linux containers in the DevOps world, in particular Docker. Michelangelo D'Agostino demonstrates why Docker deserves a place in every data scientists toolkit.

Embeddable data transformation for real-time streams

Embeddable data transformation for real-time streams

October 25, 2019

Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.