November 9, 2019

Big data for big data: Machine-learning models of Hadoop cluster behavior

Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.


Talk Title	Big data for big data: Machine-learning models of Hadoop cluster behavior
Speakers	Sean Suchter (Pepperdata), Shekhar Gupta (Pepperdata)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017

The performance of batch processing systems such as YARN is generally measured by throughput: the amount of workload (tasks) completed in a given time window. For a given cluster size, throughput can be increased by running as much workload as possible on each host, using all of the host's free resources. Because each node runs a complex mix of tasks and containers, the cluster's performance characteristics change constantly. As a result, there is always a danger of overcommitting host memory, which can lead to extreme swapping, or thrashing. The impact of thrashing can be severe enough to reduce throughput rather than increase it.

Sean Suchter and Shekhar Gupta explain how they used very fine-grained performance data from many Hadoop clusters to build a model that predicts excessive swapping events. To build this system, they combined hand-labeling of bad events with large-scale data processing, using Hadoop, HBase, Spark, and IPython for experimentation. Trained on five-second data from many production clusters running very different workloads, their generalized model detects the onset of thrashing within seconds of the first symptom. This detection has proven fast enough to enable effective mitigation of thrashing, allowing hosts to keep delivering high throughput.

Sean and Shekhar discuss the methods they used and share novel findings about big data cluster performance.
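
To make the idea of detecting thrashing from five-second samples more concrete, here is a minimal Python sketch. It is not the speakers' model: the `Sample` fields, the feature names, and the thresholds in `looks_like_thrashing` are all hypothetical, and a simple hand-written rule stands in for the trained classifier the talk describes.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One five-second host sample (field names are hypothetical)."""
    swap_in_kb: float     # KB swapped in during the interval
    swap_out_kb: float    # KB swapped out during the interval
    mem_used_frac: float  # fraction of physical memory in use, 0.0 to 1.0


def window_features(window):
    """Summarize a window of samples into features a classifier could use."""
    swap = [s.swap_in_kb + s.swap_out_kb for s in window]
    return {
        "mean_swap_kb": mean(swap),
        "max_mem_used": max(s.mem_used_frac for s in window),
        "swap_rising": swap[-1] > swap[0],
    }


def looks_like_thrashing(features, swap_kb_limit=50_000.0, mem_limit=0.95):
    """Placeholder rule standing in for a trained model; thresholds are invented."""
    return (
        features["mean_swap_kb"] > swap_kb_limit
        and features["max_mem_used"] > mem_limit
        and features["swap_rising"]
    )


def detect_onset(samples, window_size=3):
    """Return the index of the first sample at which the rule fires, or None."""
    window = deque(maxlen=window_size)
    for i, sample in enumerate(samples):
        window.append(sample)
        if len(window) == window_size and looks_like_thrashing(window_features(window)):
            return i
    return None
```

With five-second samples, a three-sample window covers 15 seconds of host history. In a real system, the rule would be replaced by a model trained on labeled events, and the returned index would trigger whatever mitigation the cluster supports.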

Compressed linear algebra in Apache SystemML

November 7, 2019

Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.

Real-time analytics using Kudu at petabyte scale

November 3, 2019

Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

October 31, 2019

Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.

Apache Kylin 2.0: From classic OLAP to real-time data warehouse

November 9, 2019

Apache Kylin, which started as a big data OLAP engine, is reaching v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.

Architecting a next-generation data platform

November 9, 2019

Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Big data for operational insights

November 9, 2019

GoDaddy ingests and analyzes logs, metrics, and events at a rate of 100,000 events per second (EPS). Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growing by more than 10 TB per day to gain operational insights into its cloud, using Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.