Apache Kylin 2.0: From classic OLAP to real-time data warehouse
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Talk Title | Apache Kylin 2.0: From classic OLAP to real-time data warehouse |
Speakers | Yang Li (Kyligence) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Yang Li explores the latest features of Apache Kylin v2.0 and introduces the technical thinking and design behind them.

Apache Kylin used to support only star schema, which was quite a limitation for many real-world cases. By supporting snowflake schema directly, v2.0 lets users import arbitrary E-R models into Kylin, offering the most comprehensive data model support out of the box—a big step forward for business deployments.

Version 2.0 also introduces a new cubing engine based on Spark, a feature many have long wanted. Implementing the same layered cubing algorithm, the Spark engine is roughly twice as fast as the old MapReduce engine in experiments.

Since v1.6, Apache Kylin has supported microbatch data loading from Kafka, enabling near-real-time analysis with minute-level latency. A demo will show how Twitter messages are analyzed in real time.

And as always, Apache Kylin focuses on replacing online calculation with offline precalculation, which sets it apart from other SQL-on-Hadoop solutions. With the ever-growing volume of data, precalculation (and Apache Kylin) may be the only way to ensure a constant query response time on big data.
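As a concrete illustration of the full SQL interface over a snowflake model, the minimal sketch below submits an aggregate query through Kylin's JDBC driver, joining a fact table to a dimension table that itself joins a second-level dimension. The host, project name, credentials, and table/column names are placeholder assumptions for illustration, not details from the talk.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KylinSnowflakeQuery {
    public static void main(String[] args) throws Exception {
        // Kylin ships a JDBC driver; host, project, and credentials below
        // are placeholders for illustration only.
        Class.forName("org.apache.kylin.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:kylin://localhost:7070/sales_project", "ADMIN", "KYLIN")) {

            // A snowflake-style query: the fact table joins a product dimension,
            // which in turn joins a department dimension (hypothetical schema).
            String sql =
                "SELECT d.department_name, SUM(f.sale_amount) AS total_sales " +
                "FROM sales_fact f " +
                "JOIN product_dim p ON f.product_id = p.product_id " +
                "JOIN department_dim d ON p.department_id = d.department_id " +
                "GROUP BY d.department_name";

            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    // Results are answered from the precalculated cube rather than
                    // by scanning raw data, keeping response time roughly constant.
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
}
```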