October 21, 2019

272 words 2 mins read

Scala and the JVM as a big data platform: Lessons from Apache Spark


The success of Apache Spark is bringing developers to Scala. For big data workloads, the JVM uses memory inefficiently, causing significant garbage-collection challenges. Spark's Project Tungsten addresses these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.

Talk Title Scala and the JVM as a big data platform: Lessons from Apache Spark
Speakers Dean Wampler (Anyscale)
Conference Strata + Hadoop World
Conf Tag Big Data Expo
Location San Jose, California
Date March 29-31, 2016
URL Talk Page
Slides Talk Slides

Apache Spark is implemented in Scala, and its user-facing Scala API closely resembles Scala’s own Collections API; the power and concision of this API have already brought many developers to Scala, and Scala offers many advantages over Java for this kind of work. Spark’s core abstractions have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more. Like almost all open source big data tools, Spark leverages the JVM, which is an excellent general-purpose platform for scalable computing. However, the JVM’s management of objects is suboptimal for high-performance data crunching: the way objects are organized in memory, and the garbage-collection pressure that organization creates, can be improved for the special case of big data. Hence, the Spark project has started Project Tungsten to build internal optimizations around techniques such as custom data layouts and runtime code generation, both sketched below. Dean Wampler gives an overview of Spark, explaining these ongoing improvements and what we should do to make Scala and the JVM better tools for big data.
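To make the resemblance between the two APIs concrete, here is a minimal word-count sketch written both ways. The object name and sample data are illustrative; only the RDD methods (parallelize, flatMap, map, reduceByKey, collect) are from Spark's actual API.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountComparison {
  def main(args: Array[String]): Unit = {
    val lines = Seq("the quick brown fox", "the quick red fox")

    // Scala Collections: an in-memory pipeline over a Seq.
    val localCounts: Map[String, Int] = lines
      .flatMap(_.split("\\s+"))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

    // Spark RDDs: the same pipeline shape and nearly the same method
    // names, but the computation can be distributed across a cluster.
    val sc = new SparkContext(
      new SparkConf().setAppName("wc").setMaster("local[*]"))
    val rddCounts: Array[(String, Int)] = sc.parallelize(lines)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
    sc.stop()

    println(localCounts)
    println(rddCounts.toMap)
  }
}

The near-identical shape of the two pipelines is exactly the concision that has drawn collections-literate Scala developers to Spark.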
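Tungsten's real internals are far more elaborate, but a minimal sketch can illustrate the custom-data-layout idea: pack fixed-width records into one off-heap buffer instead of allocating a boxed object per record, so the garbage collector has nothing to trace. The CompactRecords class below is hypothetical, not Spark code.

import java.nio.ByteBuffer

// Hypothetical compact layout: each record is a Long key plus a Double
// value, stored as 16 contiguous bytes. One direct (off-heap) buffer
// holds every record; there are no per-record object headers and no
// references for the GC to scan.
final class CompactRecords(capacity: Int) {
  private val RecordSize = 16
  private val buf = ByteBuffer.allocateDirect(capacity * RecordSize)

  def set(i: Int, key: Long, value: Double): Unit = {
    buf.putLong(i * RecordSize, key)
    buf.putDouble(i * RecordSize + 8, value)
  }

  def key(i: Int): Long     = buf.getLong(i * RecordSize)
  def value(i: Int): Double = buf.getDouble(i * RecordSize + 8)
}

// Contrast with the conventional JVM representation: an Array[Record]
// of case-class instances pays an object header and a pointer per
// record, scatters records across the heap, and hands the collector
// millions of small objects to trace.
case class Record(key: Long, value: Double)

Tungsten pairs layouts like this with its other pillar, runtime code generation: emitting specialized bytecode for a query instead of interpreting it row by row.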
