November 20, 2019

Breaking Spark: Top five mistakes to avoid when using Apache Spark in production


Talk Title	Breaking Spark: Top five mistakes to avoid when using Apache Spark in production
Speakers	Neelesh Salian (Stitch Fix)
Conference	Strata + Hadoop World
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	June 1-3, 2016

Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and continues to push the boundaries of the engine. Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian explores common issues observed in a cluster environment setup with Apache Spark across five main areas. Attendees can use Neelesh’s observations to improve the usability and supportability of their Apache Spark deployments and avoid such issues in the future.

Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production

October 27, 2019

Spark has been growing in deployments for the past year. Neelesh Srinivas Salian explores common issues observed in a cluster environment setup with Apache Spark and offers guidelines to help set up a real-world environment when planning an Apache Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark in their projects.

Stream analytics in the enterprise: A look at Intel's internal IoT implementation

November 17, 2019

Moty Fania shares Intel's IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform's architecture.

Building machine-learning apps with Spark: MLlib, ML Pipelines, and GraphX (Half Day)

October 27, 2019

Jayant Shekhar, Amandeep Khurana, Krishna Sankar, and Vartika Singh guide participants through techniques for building machine-learning apps using Spark MLlib and Spark ML and demonstrate the principles of graph processing with Spark GraphX.

Embeddable data transformation for real-time streams

October 25, 2019

Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.

IoT in the enterprise: A look at Intel (IoT) Inside

October 23, 2019

Moty Fania shares Intel's IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.

HopsWorks: Multitenant Hadoop as a service

November 18, 2019

Currently, multitenancy in Hadoop is limited to organizations running separate Hadoop clusters, and the secure sharing of resources is achieved using virtualization or containers. Jim Dowling describes how HopsWorks enables organizations to securely share a single Hadoop cluster using projects and a new metadata layer that enables protection domains while still allowing data sharing.