December 22, 2019

412 words 2 mins read

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Talk Title ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics
Speakers Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Conference Strata Data Conference
Conf Tag Big Data Expo
Location San Francisco, California
Date March 26-28, 2019
URL Talk Page
Slides Talk Slides

Most traditional big data systems store data in a columnar format and rely on sequential scans for query processing. This approach is efficient and cheap, but typical queries are slow and ill-suited to online applications. Search systems such as Elasticsearch instead build inverted indices on the data, which makes simple queries fast; however, they struggle with complex queries, and cost becomes a concern for larger datasets. Igor Canadi and Dhruba Borthakur offer an overview of converged indexing with ROCKSET, a system that combines columnar and search indices to serve queries that are both low latency and complex.

Two ideas make converged indexing performant and cost-effective: using log-structured merge (LSM) trees as the underlying storage engine and exploiting the elasticity and storage hierarchy of the cloud environment. Igor and Dhruba explain how Rockset maps a semistructured document into individual keys and values stored in RocksDB-Cloud's LSM tree. In a B-tree-based system, performance degrades significantly as the number of indices grows: if one database update must be reflected in many indices, it causes many random writes to storage. An LSM engine instead buffers random writes in memory and flushes them as one large sequential write, making it efficient to maintain multiple indices. This enables ROCKSET to efficiently store both columnar and search indices for all columns of a data record.

The second component that makes the architecture economically feasible is the elasticity of the cloud environment. ROCKSET uses an open source technology called RocksDB-Cloud, which stores data durably in S3 but can replicate that data onto local SSDs and into RAM under load. Igor and Dhruba describe the custom-built ROCKSET scheduler, which runs as part of Kubernetes, manages the placement of data across available resources, and requests new resources from the cloud provider when needed.
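The core idea of converged indexing is that one document fans out into key-value entries for several indices, all living in the same LSM tree. A minimal sketch of that mapping (the key layouts and names here are illustrative assumptions, not Rockset's actual encoding):

```python
def converged_index_entries(doc_id, doc):
    """Map one semistructured document into key-value pairs for three indices.

    Key prefixes are hypothetical; they only illustrate how key ordering
    serves different access patterns within a single sorted LSM tree.
    """
    entries = []
    for field, value in doc.items():
        # Row (document) index: doc id first, so one seek reconstructs the document.
        entries.append((("row", doc_id, field), value))
        # Columnar index: field first, so a scan reads one column sequentially.
        entries.append((("col", field, doc_id), value))
        # Inverted index: value first, so a point lookup finds matching documents.
        entries.append((("inv", field, value, doc_id), None))
    return entries

# One document becomes three entries per field; the LSM engine buffers them
# in memory and writes them out as a single sequential batch.
entries = converged_index_entries(7, {"city": "SF", "temp": 62})
```

Each update therefore touches all three indices, which is exactly the workload that would generate many random writes in a B-tree but collapses into one sequential write in an LSM engine.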
