Gaining efficiency with time series in ELK

Christian Saide explains how NS1 was able to reduce infrastructure, maintenance, and operational costs while simultaneously increasing throughput and visibility of key metrics by leveraging Elasticsearch as a time series database.


Talk Title	Gaining efficiency with time series in ELK
Speakers	Christian Saide (NS1)
Conference	O’Reilly Velocity Conference
Conf Tag	Building and maintaining complex distributed systems
Location	San Jose, California
Date	June 12-14, 2018
URL	Talk Page
Slides	Talk Slides

Elasticsearch is a highly scalable NoSQL document store built on Lucene indexes, which allow for deep data introspection. It is already the de facto system for log analysis and has more recently branched out into time series data manipulation and analysis. Christian Saide explains how NS1 was able to reduce infrastructure, maintenance, and operational costs while simultaneously increasing throughput and visibility of key metrics by leveraging Elasticsearch as a time series database.

NS1 historically used a dedicated time series database for operational metrics analysis, alongside Elasticsearch for log analysis. The time series database and its supporting architecture quickly grew to the point where NS1 needed dedicated team members to manage it. This, coupled with the fact that NS1 also had an Elasticsearch cluster to manage, forced the company to rethink its solution. Any replacement had to support the metrics throughput of the existing time series database, which at the time was in the range of 150–200 thousand points per second ingested. Using a small set of 10 servers running its Elasticsearch cluster, NS1 achieved 650–700 thousand documents per second indexed, which proved that NS1 could, and more importantly should, combine the two systems.

The deep data introspection offered by Elasticsearch is the key differentiator when compared to classical time series databases. It gives an operator the tools to make connections that a standard time series database would not traditionally allow. These capabilities are complemented by a thriving community of plugins and support networks, which dramatically reduces operational burden. The combination of data introspection and lighter operational overhead gives operations teams more throughput and easier access to the key data they need to operate distributed infrastructure.
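To make the idea of treating Elasticsearch as a time series store concrete, the following is a minimal sketch of how metric points might be rendered into a request body for the Elasticsearch bulk API, which accepts newline-delimited JSON with an action line followed by a document line. The field names (`@timestamp`, `name`, `value`, `tags`), the daily `metrics-YYYY.MM.DD` index naming, and the `dns.qps` metric are illustrative assumptions, not NS1's actual schema, and the action line uses the typeless form accepted by recent Elasticsearch versions.

```python
import json
from datetime import datetime, timezone


def to_bulk_ndjson(points, index_prefix="metrics"):
    """Render metric points as an Elasticsearch _bulk request body (NDJSON).

    Each point is a dict with "timestamp" (Unix seconds), "name", "value",
    and optional "tags". This shape is a hypothetical example.
    """
    lines = []
    for point in points:
        ts = datetime.fromtimestamp(point["timestamp"], tz=timezone.utc)
        # Assumed convention: one index per UTC day, so old data can be
        # expired by deleting whole indices.
        action = {"index": {"_index": f"{index_prefix}-{ts:%Y.%m.%d}"}}
        doc = {
            "@timestamp": ts.isoformat(),
            "name": point["name"],
            "value": point["value"],
            "tags": point.get("tags", {}),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = [
        {"timestamp": 1528761600, "name": "dns.qps", "value": 42.0,
         "tags": {"pop": "sjc"}},
    ]
    print(to_bulk_ndjson(sample), end="")
```

Batching many points into a single bulk body, rather than indexing one document per request, is the usual way to approach high ingest rates; the batch size that works best depends on the cluster and is not specified in the talk description.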
This solution also reduces the infrastructure and maintenance costs of operating two standalone pieces of technology. Topics include:

Gaining efficiency with time series in ELK

Sharded and Federated Prometheus Servers to Monitor Distributed Databases

Collecting Operational Metrics for a Cluster with 5,000 Namespaces

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

Lessons learned while evolving Box's database infrastructure

Tooling in the age of serverless computing

Encoding 250,000 Songs a Day with batch/v1 Jobs