Optimizing analytical queries on Cassandra by 100x

Cassandra is one of the most popular datastores in big data and ML applications. Data analysis at scale with fast query response is critical for business needs, and while Cassandra with Spark integration allows running an analytical workload, it can be slow. Shradha Ambekar dives into the challenges faced at Intuit and the solutions her team implemented to improve performance by 100x.


Talk Title	Optimizing analytical queries on Cassandra by 100x
Speakers	Shradha Ambekar (Intuit)
Conference	O’Reilly Open Source Software Conference
Conf Tag	Fueling innovative software
Location	Portland, Oregon
Date	July 15-18, 2019

Cassandra is one of the most popular datastores used in big data, real-time streaming, and machine learning applications. As the volume and velocity of collected data rapidly increase, the speed of data processing and analysis must stay ahead in order to support today’s big data applications and meet end users’ service-level agreement (SLA) expectations. However, the Cassandra storage model makes it difficult and sometimes very inefficient to run analytical queries. Spark with Cassandra integration mitigates this problem. Spark is a very fast in-memory data processing framework used as an execution engine for analytics workloads.
Intuit uses Spark and Cassandra in real-time and ML applications. Writes to Cassandra via Spark scaled very well, but reads (analytical queries) did not. Basic queries (counts) involving an IN clause were taking several minutes and sometimes crashing. Shradha Ambekar shares how her team debugged and solved performance problems at scale by concentrating on two concepts and architecture patterns: predicate pushdown and distributing workloads across partitions.
Spark is very efficient at running analytical queries; however, if predicates are not pushed down to the datastore, the result is a full table scan and disastrous performance. With the Spark-Cassandra connector Catalyst optimizer pushing predicates down to Cassandra for the IN clause, queries completed in a few seconds rather than several minutes (~30 minutes for a few TBs of data), a performance gain of 100x.
But even when predicates are pushed down to the datastore, performance may still suffer if the workload is not distributed across partitions. The modifications made to the Spark-Cassandra connector to create an optimal number of Spark partitions and distribute the workload accordingly resulted in a 5x–10x faster query response, in addition to the performance gain achieved by predicate pushdown.
Join in to discover how Shradha’s team at Intuit debugged performance problems, found the root cause, and implemented the fix, along with lessons learned and best practices for running analytic queries on Cassandra.
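
The two ideas in the abstract, predicate pushdown and spreading the work across partitions, can be illustrated with a small in-memory toy model. This is only a sketch under stated assumptions: it does not use Spark, Cassandra, or the Spark-Cassandra connector, and every name in it (build_table, count_full_scan, count_with_pushdown, split_keys) is hypothetical and not taken from the talk. It shows how much data a count with an IN list touches with and without pushdown, and one simple way the IN-list keys might be divided among parallel tasks.

```python
# Toy model only: a dict stands in for a table keyed by partition key.
# All function names here are hypothetical, not connector APIs.

def build_table(num_keys, rows_per_key):
    """Create a table mapping each partition key to its list of rows."""
    return {
        k: [{"key": k, "value": k * 1000 + i} for i in range(rows_per_key)]
        for k in range(num_keys)
    }

def count_full_scan(table, wanted_keys):
    """No pushdown: read every row, then filter on the compute side."""
    wanted = set(wanted_keys)
    rows_read = 0
    matches = 0
    for rows in table.values():
        for row in rows:
            rows_read += 1
            if row["key"] in wanted:
                matches += 1
    return matches, rows_read

def count_with_pushdown(table, wanted_keys):
    """Pushdown: the store resolves the IN list by key lookup,
    so only the matching partitions are ever read."""
    rows_read = 0
    for k in set(wanted_keys):
        rows_read += len(table.get(k, []))
    # Every row in a fetched partition matches, so matches == rows_read.
    return rows_read, rows_read

def split_keys(wanted_keys, num_tasks):
    """Round-robin the IN-list keys over tasks so that no single
    task has to handle the whole list."""
    tasks = [[] for _ in range(num_tasks)]
    for i, k in enumerate(sorted(set(wanted_keys))):
        tasks[i % num_tasks].append(k)
    return tasks

table = build_table(num_keys=1000, rows_per_key=10)
wanted = [3, 17, 42, 99, 256, 512, 777, 901]

full = count_full_scan(table, wanted)        # (matches, rows_read)
pushed = count_with_pushdown(table, wanted)  # (matches, rows_read)
tasks = split_keys(wanted, num_tasks=4)

print("full scan:", full)
print("pushdown: ", pushed)
print("tasks:    ", tasks)
```

In this toy run both approaches count the same 80 matching rows, but the full scan reads all 10,000 rows while the pushdown version reads only 80. The round-robin split is one simple choice for balancing keys across tasks; the talk does not say how the modified connector chooses its partition count.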

Turn devices into data scientists at the edge

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Mastering data with Spark and machine learning

Veo5G Project: A 5G Use Case for Network Data Analytics

Overview of Data Governance