December 22, 2019

412 words 2 mins read

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Talk Title ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics
Speakers Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Conference Strata Data Conference
Conf Tag Big Data Expo
Location San Francisco, California
Date March 26-28, 2019
URL Talk Page
Slides Talk Slides

Most traditional big data systems store data in a columnar format and rely on sequential scans for query processing. This approach is efficient and cheap, but typical queries are slow and ill-suited to online applications. Search systems such as Elasticsearch instead build inverted indices on the data, which makes simple queries fast; however, they struggle with complex queries, and cost becomes a concern for larger datasets. Igor Canadi and Dhruba Borthakur offer an overview of converged indexing with ROCKSET, a system that combines columnar and search indices to serve queries that are both low latency and complex.

Two ideas make converged indexing performant and cost-effective: using log-structured merge (LSM) trees as the underlying storage engine and exploiting the elasticity and storage hierarchy of the cloud environment. Igor and Dhruba explain how Rockset maps a semistructured document into individual keys and values stored in RocksDB-Cloud's LSM tree. In a B-tree-based system, performance degrades significantly as the number of indices grows: if one database update must be reflected in many indices, it causes many random writes to storage. An LSM engine instead buffers random writes in memory and flushes them as one large sequential write, making it efficient to maintain multiple indices. This enables ROCKSET to efficiently store both columnar and search indices for all columns of a data record.

The second component that makes the architecture economically feasible is the elasticity of the cloud environment. ROCKSET uses an open source technology called RocksDB-Cloud, which stores data durably in S3 but can replicate that data onto local SSDs and into RAM under load. Igor and Dhruba describe the custom-built ROCKSET scheduler, which runs as part of Kubernetes, manages the placement of data across available resources, and requests new resources from the cloud provider when needed.
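The core idea of converged indexing is that one document fans out into key-value entries for several indices, all living in the same LSM tree. A minimal sketch of that mapping (the key layouts and names here are illustrative assumptions, not Rockset's actual encoding):

```python
def converged_index_entries(doc_id, doc):
    """Map one semistructured document into key-value pairs for three indices.

    Key prefixes are hypothetical; they only illustrate how key ordering
    serves different access patterns within a single sorted LSM tree.
    """
    entries = []
    for field, value in doc.items():
        # Row (document) index: doc id first, so one seek reconstructs the document.
        entries.append((("row", doc_id, field), value))
        # Columnar index: field first, so a scan reads one column sequentially.
        entries.append((("col", field, doc_id), value))
        # Inverted index: value first, so a point lookup finds matching documents.
        entries.append((("inv", field, value, doc_id), None))
    return entries

# One document becomes three entries per field; the LSM engine buffers them
# in memory and writes them out as a single sequential batch.
entries = converged_index_entries(7, {"city": "SF", "temp": 62})
```

Each update therefore touches all three indices, which is exactly the workload that would generate many random writes in a B-tree but collapses into one sequential write in an LSM engine.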
