January 2, 2020


HDFS on Kubernetes: Lessons learned


There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.


Talk Title	HDFS on Kubernetes: Lessons learned
Speakers	Kimoon Kim (Pepperdata)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 26-28, 2017

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that hold the task's input data on their local disks. When Spark runs on Kubernetes but the HDFS daemons run outside Kubernetes, applications slow down because they must read the data remotely. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark. The talk explains how the Spark scheduler can still provide HDFS data locality on Kubernetes by mapping each Kubernetes container to its physical node and that node to its HDFS datanode daemon, and how to keep the critical HDFS namenode service highly available when HDFS runs in Kubernetes.
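
The locality idea in the abstract can be illustrated with a small sketch. This is not Spark's scheduler code or any Kubernetes API; the function name, the pod, node, and datanode names, and the two lookup tables are all hypothetical and exist only to show the container-to-node-to-datanode mapping. It assumes the mappings have already been discovered and are available as plain dictionaries.

```python
from typing import Dict, List, Optional


def pick_local_executor(
    block_datanodes: List[str],
    executor_pod_to_node: Dict[str, str],
    datanode_to_node: Dict[str, str],
) -> Optional[str]:
    """Return an executor pod that shares a physical node with a datanode
    holding the block, or None if no executor is node-local.

    All arguments are hypothetical, pre-discovered mappings:
      block_datanodes      - datanodes that store a replica of the block
      executor_pod_to_node - executor pod name -> physical node name
      datanode_to_node     - datanode name -> physical node name
    """
    # Physical nodes that hold at least one replica of the block.
    block_nodes = {
        datanode_to_node[d] for d in block_datanodes if d in datanode_to_node
    }
    # Iterate in sorted order so the choice is deterministic.
    for pod, node in sorted(executor_pod_to_node.items()):
        if node in block_nodes:
            return pod
    return None


# Hypothetical cluster layout for illustration only.
executors = {"spark-exec-1": "node-a", "spark-exec-2": "node-b"}
datanodes = {"datanode-0": "node-b", "datanode-1": "node-c"}

local = pick_local_executor(["datanode-0", "datanode-1"], executors, datanodes)
remote = pick_local_executor(["datanode-1"], executors, datanodes)
```

In this layout only `spark-exec-2` shares a node with a replica, so it is the node-local choice for the first block; the second block has no node-local executor and would have to be read remotely.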

container spark hdfs kubernetes


Building containerized Spark on a solid foundation with Quobyte and Kubernetes


December 4, 2019

Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Burer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.

How machine learning with open source tools helps everyone build better products


January 2, 2020

Michelle Casbon explores the machine learning and natural language processing that enables teams to build products that feel native to every user and explains how Qordoba is tackling the underserved domain of localization using open source tools, including Kubernetes, Docker, Scala, Apache Spark, Apache Cassandra, and Apache PredictionIO (incubating).

Modern Big Data Pipelines over Kubernetes [I]


December 3, 2019

Big data used to be synonymous with Hadoop, but our ecosystem has evolved over time with new database, streaming, and machine learning solutions that don't necessarily benefit from the Hadoop deployme …

BoFs: Data-Aware Scheduling in Kubernetes [I]


November 24, 2019

In order to provide prompt results and efficiently deal with data-intensive workloads, Big Data applications execute their jobs on compute slots across large clusters. Also, for optimal performance, t …

Dude, Where's My Microservice?


January 2, 2020

In this talk I will focus on Discovery Service and communication between microservices. I'll present possible methods and show strong and weak sides of them. For each method I'll provide reference imp …

Lightweight Containerization at Facebook


January 1, 2020

In Facebook's new container system we started to heavily utilize Btrfs, cgroups2 and systemd. The combination of these tools and some additional internal code allowed us to create a lightweight, fast …