November 19, 2019

341 words 2 mins read

Building real-time BI systems with HDFS and Kudu

Ruhollah Farchtchi explores best practices for building systems that support ad hoc queries over real-time data and offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for use cases that require fast analytics on rapidly changing data with a simultaneous combination of sequential and random reads and writes.

Talk Title Building real-time BI systems with HDFS and Kudu
Speakers Ruhollah Farchtchi (Zoomdata)
Conference Strata + Hadoop World
Conf Tag Making Data Work
Location London, United Kingdom
Date June 1-3, 2016
URL Talk Page
Slides Talk Slides

One of the key challenges in working with real-time and streaming data is that the format used to capture data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization format that works well for initially bringing data into HDFS, and its native integration with Flume and other tools makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.

Ruhollah Farchtchi explores best practices for dealing with these challenges and with the append-only nature of HDFS and discusses how to make sure data is distributed appropriately, which is challenging with static data and even tougher with real-time, dynamic data. Ruhollah also explains how to deal with updates to existing data, whether due to restatements or a need to compact the data.

Ruhollah then offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for fast analytics on rapidly changing data, demonstrates how Kudu simplifies the architecture of such systems, and reviews a number of lessons learned from working with Kudu, including how to use dictionary attributes to optimize storage of denormalized dimensional data, how to achieve a high degree of query parallelization by distributing data and sizing the number of tablets to the available cores, and how to balance insert rates against read-heavy workloads.
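The talk does not include code, but a minimal sketch of the kind of table design it describes, written against the Apache Kudu Java client, might look like the following. The table name, column names, replica count, and the 24 hash buckets are illustrative assumptions rather than anything from the talk; the point is that dictionary encoding keeps denormalized dimension columns compact, while hash partitioning sizes the number of tablets to the cores available for parallel scans.

import java.util.Arrays;
import java.util.List;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;

public class CreateEventsTable {
    public static void main(String[] args) throws KuduException {
        // Assumed Kudu master address for illustration.
        KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            List<ColumnSchema> columns = Arrays.asList(
                    new ColumnSchema.ColumnSchemaBuilder("event_id", Type.INT64)
                            .key(true).build(),
                    new ColumnSchema.ColumnSchemaBuilder("event_time", Type.UNIXTIME_MICROS)
                            .key(true).build(),
                    // Denormalized dimension attributes: dictionary encoding stores each
                    // distinct value once per block, so low-cardinality string columns stay small.
                    new ColumnSchema.ColumnSchemaBuilder("region", Type.STRING)
                            .encoding(ColumnSchema.Encoding.DICT_ENCODING).build(),
                    new ColumnSchema.ColumnSchemaBuilder("product_category", Type.STRING)
                            .encoding(ColumnSchema.Encoding.DICT_ENCODING).build(),
                    new ColumnSchema.ColumnSchemaBuilder("amount", Type.DOUBLE).build());
            Schema schema = new Schema(columns);

            // Hash-partition on the leading key column. The bucket count determines the
            // number of tablets, which caps scan parallelism; 24 here assumes roughly
            // that many cores across the tablet servers.
            CreateTableOptions options = new CreateTableOptions()
                    .addHashPartitions(Arrays.asList("event_id"), 24)
                    .setNumReplicas(3);

            client.createTable("events", schema, options);
        } finally {
            client.close();
        }
    }
}

As a rough rule of thumb, more hash buckets spread inserts across tablet servers, which helps write-heavy ingest, while fewer, larger tablets reduce per-scan overhead for read-heavy workloads; the right balance depends on the cluster and the query mix.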
