Scala and the JVM as a big data platform: Lessons from Apache Spark
The success of Apache Spark is bringing developers to Scala. For big data workloads, however, the JVM uses memory inefficiently, causing significant garbage-collection challenges. Spark's Project Tungsten addresses these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
| Talk Title | Scala and the JVM as a big data platform: Lessons from Apache Spark |
| --- | --- |
| Speakers | Dean Wampler (Anyscale) |
| Conference | Strata + Hadoop World |
| Conf Tag | Big Data Expo |
| Location | San Jose, California |
| Date | March 29-31, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Apache Spark is implemented in Scala, and its user-facing Scala API closely mirrors Scala’s own Collections API. The power and concision of this API have already brought many developers to Scala, and Scala offers many advantages over Java for this kind of work. The core abstractions in Spark have created a flexible, extensible platform for applications like streaming, SQL queries, machine learning, and more. Spark, like almost all open-source big data tools, leverages the JVM, which is an excellent general-purpose platform for scalable computing. However, its management of objects is suboptimal for high-performance data crunching: in particular, the way objects are organized in memory, and the impact that layout has on garbage collection, can be improved for the special case of big data. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to make Scala and the JVM better tools for big data. Hence, the Spark project recently started Project Tungsten to build internal optimizations using the following techniques:
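To illustrate how closely Spark's API mirrors the Scala Collections API, here is a sketch (not from the talk) of a word count written both ways; the values and the commented-out Spark variant are illustrative, and the Spark snippet assumes a `SparkContext` named `sc` is available.

```scala
// Plain Scala collections: word count over an in-memory list.
val lines = List("a b a", "b c")
val counts = lines
  .flatMap(_.split("\\s+"))                            // split lines into words
  .map(word => (word, 1))                              // pair each word with 1
  .groupBy(_._1)                                       // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts
// counts contains a -> 2, b -> 2, c -> 1

// The same computation with Spark's RDD API (hypothetical path; needs a
// SparkContext `sc`). Note the nearly identical combinator chain:
// val countsRdd = sc.textFile("hdfs://...")
//   .flatMap(_.split("\\s+"))
//   .map(word => (word, 1))
//   .reduceByKey(_ + _)
```

The main difference is that the Spark chain builds a lazy, distributed computation rather than eagerly transforming an in-memory collection.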
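To see why object layout matters, consider a `(String, Int)` record: as ordinary JVM objects it costs a `Tuple2`, a `String`, its backing character array, and possibly a boxed `Int`, each with an object header and pointer indirection. Tungsten-style "custom data layouts" instead pack records into flat byte buffers. The following is a toy sketch of that idea in plain Scala, not Spark's actual internal row format:

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Pack a (word, count) record into a single flat byte array:
// [4-byte string length][UTF-8 bytes][4-byte int], with no per-field
// object headers and no boxing.
def encode(word: String, count: Int): Array[Byte] = {
  val bytes = word.getBytes(StandardCharsets.UTF_8)
  val buf = ByteBuffer.allocate(4 + bytes.length + 4)
  buf.putInt(bytes.length) // length prefix for the string field
  buf.put(bytes)           // UTF-8 payload
  buf.putInt(count)        // fixed-width, unboxed int field
  buf.array()
}

// Read the record back out of the flat buffer.
def decode(row: Array[Byte]): (String, Int) = {
  val buf = ByteBuffer.wrap(row)
  val len = buf.getInt()
  val bytes = new Array[Byte](len)
  buf.get(bytes)
  (new String(bytes, StandardCharsets.UTF_8), buf.getInt())
}
```

Because such rows are dense arrays of bytes rather than graphs of small objects, they reduce GC pressure and cache misses, which is the motivation behind Tungsten's binary layouts.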