November 22, 2019

345 words 2 mins read

Playing well together: Big data beyond the JVM with Spark and friends

Playing well together: Big data beyond the JVM with Spark and friends

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).


Talk Title	Playing well together: Big data beyond the JVM with Spark and friends
Speakers	Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
Conference	Strata Data Conference
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 6-8, 2018
URL	Talk Page
Slides	Talk Slides
Video

Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space. Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to Python code from data scientist and Python developers while avoiding the traditional trap of needing to rewrite everything.

kafka code spark ecosystem hadoop data science data analytics analytics big data machine learning python

comments powered by Disqus

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

November 20, 2019

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Smart agriculture: Blending IoT sensor data with visual analytics

Smart agriculture: Blending IoT sensor data with visual analytics

November 21, 2019

Mike Prorock offers an overview of mesur.io, a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. Mesur.io enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.

What's new in Hadoop 3.0

What's new in Hadoop 3.0

November 19, 2019

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Architecting an advanced analytics platform for machine learning

Architecting an advanced analytics platform for machine learning

November 18, 2019

Georgios Gkekas shares ING's advanced analytics journey to promote modern machine and deep learning techniques internally through a central, best-of-breed technical platform tailored for data science activities. The platform offers only the necessary automated tools to replace the tedious, repetitive, and error-prone steps in a typical data science pipeline.

Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

November 20, 2019

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

Working with the data of sports

Working with the data of sports

November 18, 2019

Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. Faced with detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. Thomas Miller details the challenges faced by a Major League Baseball team as it sought competitive advantage through data science and deep learning.