January 2, 2020

202 words 1 min read

HDFS on Kubernetes: Lessons learned

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.

Talk Title HDFS on Kubernetes: Lessons learned
Speakers Kimoon Kim (Pepperdata)
Conference Strata Data Conference
Conf Tag Make Data Work
Location New York, New York
Date September 26-28, 2017
URL Talk Page
Slides Talk Slides
Video

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that hold the task's input data on their local disks. When Spark runs on Kubernetes but the HDFS daemons run outside the cluster, applications slow down because they must read the data remotely. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark. He explains how the Spark scheduler can still achieve HDFS data locality on Kubernetes by discovering how Kubernetes containers map to physical nodes and how those nodes map to HDFS datanode daemons, and how to keep the critical HDFS namenode service highly available when HDFS itself runs in Kubernetes.
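
The locality discovery described in the abstract can be pictured with a small sketch. The Python fragment below is an illustration only, not Kimoon Kim's implementation: it uses the official kubernetes Python client to build the pod-IP-to-node mapping a scheduler would need, and it assumes the HDFS datanodes run with hostNetwork so that a datanode's address matches its node's address.

# Illustrative sketch: map each pod's IP to the physical node it runs on.
# Assumes access to the cluster via a kubeconfig file.
from kubernetes import client, config

def pod_ip_to_node():
    """Return a {pod IP: node name} map for all pods in the cluster."""
    config.load_kube_config()   # use load_incluster_config() when run inside a pod
    v1 = client.CoreV1Api()
    mapping = {}
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        if pod.status.pod_ip and pod.spec.node_name:
            mapping[pod.status.pod_ip] = pod.spec.node_name
    return mapping

if __name__ == "__main__":
    # A scheduler holding this map can translate an executor pod's IP into the
    # node name that the HDFS namenode reports for its datanodes, restoring
    # node-level locality when placing tasks.
    for ip, node in pod_ip_to_node().items():
        print(ip, "->", node)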
