December 4, 2019

Building containerized Spark on a solid foundation with Quobyte and Kubernetes

Multiple challenges arise when distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.


Talk Title	Building containerized Spark on a solid foundation with Quobyte and Kubernetes
Speakers	Daniel Bäurer (inovex GmbH), Sascha Askani (inovex GmbH)
Conference	Strata Data Conference
Conf Tag	Making Data Work
Location	London, United Kingdom
Date	May 23-25, 2017
URL	Talk Page
Slides	Talk Slides
Video

There are many challenges when deploying distributed applications in containers. One of the biggest is the lack of a stable and performant distributed filesystem. HDFS works very well with legacy Hadoop installations on commodity hardware in classic IT environments, since storing large amounts of data directly on the compute nodes is very cheap (data locality). Cloud-native environments, however, do not let HDFS play to its strengths: data locality on compute nodes, for example, runs contrary to the idea behind containers and cloud infrastructure. For this reason, many cloud-first implementations fall back on object stores such as Amazon S3, Google Cloud Storage, or OpenStack Swift for persistence. Those solutions, however, lack many features of a real filesystem and suffer from low performance due to overhead. Daniel Bäurer and Sascha Askani share a solution that runs Spark on Kubernetes with Quobyte as an advanced, distributed, software-defined storage system to deliver elastic and stable Spark performance in a container environment.

container google openstack performance spark hadoop infrastructure hdfs cloud kubernetes hardware

Hadoop and object stores: Can we do it better?

December 3, 2019

Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.

Modern Big Data Pipelines over Kubernetes [I]

December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions which don't necessarily benefit from the Hadoop deployme …

Running Mesos Frameworks on Kubernetes with the Open-Source Universal Resource Broker

November 29, 2019

While Kubernetes continues to gain in popularity for cloud applications, many organizations run popular frameworks deployed on Mesos. The need to support multiple orchestration frameworks can result i …

The state of Spark in the cloud

November 29, 2019

Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as a baseline.

BoFs: Data-Aware Scheduling in Kubernetes [I]

November 24, 2019

In order to provide prompt results and efficiently deal with data-intensive workloads, Big Data applications execute their jobs on compute slots across large clusters. Also, for optimal performance, t …

Virtualizing Hadoop and Spark: Architecture, performance, and best practices (sponsored by VMware)

October 31, 2019

Justin Murray outlines the benefits of virtualizing Hadoop and Spark, covering the main architectural approaches at a technical level and demonstrating how the core Hadoop architecture maps into virtual machines and how those relate to physical servers. You'll gain a set of design approaches and best practices to make your application infrastructure fit well with the virtualization layer.