How Spark can fail or be confusing and what you can do about it
Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production and examines ways to mitigate such problems by modifying Spark. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users.
Talk Title | How Spark can fail or be confusing and what you can do about it |
Speakers | Yin Huai (Databricks) |
Conference | Strata + Hadoop World |
Conf Tag | Big Data Expo |
Location | San Jose, California |
Date | March 14-16, 2017 |
URL | Talk Page |
Slides | Talk Slides |
Video | |
Apache Spark has become one of the most popular open source projects in big data. But like any six-year-old, Spark does not always do its job correctly and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production, which include resource exhaustion and hitting internal limits within Spark. Yin shares examples of common failures to highlight recent improvements and possible future work. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users.
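Resource exhaustion, one of the failure causes the talk highlights, is often addressed through Spark's memory-related configuration. As a minimal, hedged sketch (the specific values below are illustrative assumptions, not recommendations from the talk), a `spark-defaults.conf` fragment tuning these limits might look like:

```
# Illustrative spark-defaults.conf fragment -- values are hypothetical examples
spark.driver.memory          4g    # heap available to the driver JVM
spark.executor.memory        8g    # heap available to each executor JVM
spark.memory.fraction        0.6   # share of heap reserved for execution and storage
spark.driver.maxResultSize   2g    # caps results returned to the driver (e.g., via collect())
```

Settings like `spark.driver.maxResultSize` exist precisely because collecting an oversized result to the driver is a common way jobs hit Spark's internal limits and fail.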