November 9, 2019


Apache Kylin 2.0: From classic OLAP to real-time data warehouse


Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.


Talk Title	Apache Kylin 2.0: From classic OLAP to real-time data warehouse
Speakers	Yang Li (Kyligence)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017

Yang Li explores the latest features of Apache Kylin v2.0 and introduces the technical thinking and designs behind them.

Apache Kylin used to support only star schemas, which is a significant limitation in many real-world cases. By supporting snowflake schemas directly, v2.0 lets users import an arbitrary E-R model into Kylin, supporting the most comprehensive data model out of the box, a big step forward for business deployments.

v2.0 also introduces a new cubing engine based on Spark, a long-wanted feature. Implementing the same layered cubing algorithm, the Spark engine is about twice as fast as the old MapReduce engine, according to experiments.

Since v1.6, Apache Kylin has supported microbatch data loading from Kafka, enabling near-real-time analysis with latency measured in minutes. A demo shows how Twitter messages are analyzed in real time.

As always, Apache Kylin focuses on replacing online calculation with offline precalculation, which sets it apart from other SQL-on-Hadoop solutions. With the ever-growing volume of data, precalculation (and Apache Kylin) may be the only way to ensure a constant query response time on big data.
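The idea of replacing online calculation with offline precalculation, and of building a cube layer by layer, can be illustrated with a toy sketch in plain Python. This is not Kylin's code or API: the function names (`aggregate`, `build_cube`), the sample dimensions, and the single summed measure are all assumptions made for illustration. The sketch aggregates the base cuboid from the fact rows, then derives each smaller cuboid from a cuboid in the layer above instead of rescanning the facts.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(rows, dims, keep):
    """Sum the measure of (key, value) rows, grouped by the dimensions in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in rows:
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


def build_cube(facts, dims):
    """Precompute every cuboid, one layer at a time (hypothetical helper).

    facts: list of (dimension-values tuple, measure)
    dims:  tuple of dimension names, in the same order as the values
    """
    # Layer N: the base cuboid groups by all dimensions and reads the raw facts.
    cube = {dims: aggregate(facts, dims, dims)}
    # Layers N-1 down to 0: each cuboid is computed from a parent cuboid
    # that has exactly one more dimension, never from the raw facts.
    for size in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, size):
            parent = next(
                p for p in cube
                if len(p) == size + 1 and set(child) <= set(p)
            )
            cube[child] = aggregate(cube[parent].items(), parent, child)
    return cube


# Illustrative data only.
dims = ("country", "device", "day")
facts = [
    (("US", "mobile", "Mon"), 3),
    (("US", "desktop", "Mon"), 2),
    (("CN", "mobile", "Tue"), 5),
]
cube = build_cube(facts, dims)

# A "query" is now a dictionary lookup instead of a scan over the facts.
by_country = cube[("country",)]
grand_total = cube[()]
```

With three dimensions the sketch materializes all 2^3 = 8 cuboids, which shows the trade-off in miniature: query time stays flat as the fact table grows, at the cost of build time and storage that grow with the number of dimension combinations.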



Paint the landscape and secure your data center with Apache Spot


November 4, 2019

Cesar Berho and Alan Ross offer an overview of the open source project Apache Spot (incubating), which delivers a next-generation cybersecurity analytics architecture that uses unsupervised machine learning at cloud scale for anomaly detection.

Architecting a next-generation data platform


November 9, 2019

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Real-time analytics using Kudu at petabyte scale


November 3, 2019

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Compressed linear algebra in Apache SystemML


November 7, 2019

Many iterative machine-learning algorithms can operate efficiently only when a large matrix of training data fits in main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Architecting an enterprise data hub in a 110-year-old company


November 9, 2019

Eric Richardson explains how ACS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption by ease of use, covering the architecture, technologies used, the challenges faced and defeated, and problems yet to solve.

Big data for operational insights


November 9, 2019

GoDaddy ingests and analyzes 100,000 EPS of logs, metrics, and events each day. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growing by more than 10 TB per day to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.