Scala and the JVM as a big data platform: Lessons from Apache Spark
The success of Apache Spark is bringing developers to Scala. For big data workloads, however, the JVM uses memory inefficiently, causing significant garbage-collection challenges. Spark's Project Tungsten addresses these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
| Talk Title | Scala and the JVM as a big data platform: Lessons from Apache Spark |
| --- | --- |
| Speakers | Dean Wampler (Anyscale) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 29-31, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Apache Spark is implemented in Scala, and its user-facing Scala API closely mirrors Scala’s own Collections API. The power and concision of this API have already brought many developers to Scala, and Scala offers many advantages over Java for this kind of work. The core abstractions in Spark have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more. Spark, like almost all open-source big data tools, leverages the JVM, which is an excellent general-purpose platform for scalable computing. However, its management of objects is suboptimal for high-performance data crunching: in particular, the way objects are organized in memory, and the impact that layout has on garbage collection, can be improved for the special case of big data. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to make Scala and the JVM better tools for big data. Hence, the Spark project recently started Project Tungsten to build internal optimizations using the following techniques:
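To illustrate how closely Spark's API mirrors the Scala Collections API, here is a sketch (not from the talk) of a word count written both ways; the values and the commented-out Spark variant are illustrative, and the Spark snippet assumes a `SparkContext` named `sc` is available.

```scala
// Plain Scala collections: word count over an in-memory list.
val lines = List("a b a", "b c")
val counts = lines
  .flatMap(_.split("\\s+"))                            // split lines into words
  .map(word => (word, 1))                              // pair each word with 1
  .groupBy(_._1)                                       // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts
// counts contains a -> 2, b -> 2, c -> 1

// The same computation with Spark's RDD API (hypothetical path; needs a
// SparkContext `sc`). Note the nearly identical combinator chain:
// val countsRdd = sc.textFile("hdfs://...")
//   .flatMap(_.split("\\s+"))
//   .map(word => (word, 1))
//   .reduceByKey(_ + _)
```

The main difference is that the Spark chain builds a lazy, distributed computation rather than eagerly transforming an in-memory collection.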
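To see why object layout matters, consider a `(String, Int)` record: as ordinary JVM objects it costs a `Tuple2`, a `String`, its backing character array, and possibly a boxed `Int`, each with an object header and pointer indirection. Tungsten-style "custom data layouts" instead pack records into flat byte buffers. The following is a toy sketch of that idea in plain Scala, not Spark's actual internal row format:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Pack a (word, count) record into a single flat byte array:
// [4-byte string length][UTF-8 bytes][4-byte int], with no per-field
// object headers and no boxing.
def encode(word: String, count: Int): Array[Byte] = {
  val bytes = word.getBytes(StandardCharsets.UTF_8)
  val buf = ByteBuffer.allocate(4 + bytes.length + 4)
  buf.putInt(bytes.length) // length prefix for the string field
  buf.put(bytes)           // UTF-8 payload
  buf.putInt(count)        // fixed-width, unboxed int field
  buf.array()
}

// Read the record back out of the flat buffer.
def decode(row: Array[Byte]): (String, Int) = {
  val buf = ByteBuffer.wrap(row)
  val len = buf.getInt()
  val bytes = new Array[Byte](len)
  buf.get(bytes)
  (new String(bytes, StandardCharsets.UTF_8), buf.getInt())
}
```

Because such rows are dense arrays of bytes rather than graphs of small objects, they reduce GC pressure and cache misses, which is the motivation behind Tungsten's binary layouts.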