October 25, 2019

Faster conclusions using in-memory columnar SQL and machine learning

Hadoop's traditional batch technologies are quickly being supplanted by in-memory columnar execution to drive faster data-to-value. Wes McKinney and Jacques Nadeau provide an overview of in-memory columnar execution, survey key related technologies, including Kudu, Ibis, Impala, and Drill, and cover a sample use case using Ibis in conjunction with Apache Drill to deliver real-time conclusions.


Talk Title	Faster conclusions using in-memory columnar SQL and machine learning
Speakers	Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 29-31, 2016

Data ages quickly. The longer it takes for you to reach a conclusion, the less value that conclusion can provide. In-memory columnar execution provides a way to combine Hadoop data scale with real-time response: it is a powerful paradigm for analyzing large amounts of data very quickly. It lets multiple applications share a common data representation and perform operations using SIMD and vectorization.

A number of key big data technologies, including Kudu, Ibis, Drill, and Impala, have or will soon have in-memory columnar capabilities. Wes McKinney and Jacques Nadeau give a quick overview of how each of these tools benefits from in-memory columnar execution and then get practical, going into detail about the capabilities of Ibis and how in-memory execution can speed up key operations.

Wes and Jacques explore Apache Drill as the backdrop for executing high-speed in-memory transformations and machine learning algorithms and demonstrate how a powerful columnar UDF interface can allow organizations to take advantage of the performance of in-memory columnar execution within their custom requirements.
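To make the idea of a columnar layout concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use the Ibis, Drill, Kudu, or Impala APIs; the table, its column names (`user_id`, `latency_ms`, `bytes`), and the helper functions are all hypothetical. It only illustrates the layout difference: a row layout interleaves every field of a record, while a columnar layout keeps each column in one contiguous, typed array, so a query reads only the columns it needs.

```python
from array import array

# Hypothetical sample data in a row-oriented layout: one object per record.
rows = [
    {"user_id": 1, "latency_ms": 12.5, "bytes": 300},
    {"user_id": 2, "latency_ms": 40.0, "bytes": 1500},
    {"user_id": 3, "latency_ms": 7.5, "bytes": 80},
    {"user_id": 4, "latency_ms": 20.0, "bytes": 2500},
]


def to_columns(records):
    """Pivot row records into one contiguous typed array per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),
        "latency_ms": array("d", (r["latency_ms"] for r in records)),
        "bytes": array("q", (r["bytes"] for r in records)),
    }


def mean_latency_rows(records, min_bytes):
    """Row-at-a-time scan: every record object is visited in full."""
    selected = [r["latency_ms"] for r in records if r["bytes"] >= min_bytes]
    return sum(selected) / len(selected) if selected else 0.0


def mean_latency_columns(cols, min_bytes):
    """Columnar scan: reads only the two columns the query refers to."""
    total = 0.0
    count = 0
    for size, latency in zip(cols["bytes"], cols["latency_ms"]):
        if size >= min_bytes:
            total += latency
            count += 1
    return total / count if count else 0.0


columns = to_columns(rows)
print(mean_latency_rows(rows, 1000))
print(mean_latency_columns(columns, 1000))
```

Both functions compute the same answer. The pure-Python loop does not itself use SIMD; the point is that tightly packed, single-type arrays are the kind of input that a native engine can process with vectorized instructions and that several applications could, in principle, share without converting formats.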

apache sql algorithm hadoop big data machine learning performance

Scala and the JVM as a big data platform: Lessons from Apache Spark

October 21, 2019

The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Inside Cigna's big data journey

October 24, 2019

How do you implement Apache Hadoop in a large healthcare company with a mature data-analysis infrastructure? Jeffrey Shmain and Mohammad Quraishi describe Cigna's journey toward big data and Hadoop, including an overview of new Hadoop capabilities like heterogeneous data integration and large-scale machine learning.

Toppling the mainframe: Enterprise-grade streaming under 2 ms on Hadoop

October 19, 2019

What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One's novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.

Hadoop in the cloud: Good fit or round peg in a square hole?

October 25, 2019

Thomas Phelan and Joel Baxter investigate the advantages and disadvantages of running specific Hadoop workloads in different infrastructure environments. Thomas and Joel then provide a set of rules to help users evaluate big data runtime environments and deployment options to determine which is best suited for a given application.

High-performance clickstream analytics with Apache Phoenix and HBase

October 25, 2019

Traditional data-warehousing techniques are sometimes limited by the scalability of the implementation tools themselves. Arun Thangamani explains how the advanced architectural approaches taken by tools like Apache Phoenix and HBase enable new, highly scalable live-analytics solutions using the same traditional techniques, and showcases a successful implementation at CDK.

How the oil and gas industry is igniting a spark with information fusion and metadata analytics

October 24, 2019

Oil and gas organizations are at the forefront of big data, adopting technologies such as Hadoop and Spark to develop next-generation fusion systems. Brian Clark and Marco Ippolito introduce a case study from CGG, a builder of common data models to drive analytics of sensor data and associated metadata from fast-changing big data streams, to show how to derive richer value from big data assets.