HDFS on Kubernetes: Lessons learned
There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
Talk Title | HDFS on Kubernetes: Lessons learned |
Speakers | Kimoon Kim (Pepperdata) |
Conference | Strata Data Conference |
Conf Tag | Make Data Work |
Location | New York, New York |
Date | September 26-28, 2017 |
URL | Talk Page |
Slides | Talk Slides |
There is growing interest in running Spark natively on Kubernetes. Spark applications often read data from HDFS, and Spark supports HDFS locality by scheduling tasks on the nodes that hold the task's input data on their local disks. When Spark runs on Kubernetes but the HDFS daemons run outside Kubernetes, applications slow down because they must read that data remotely. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark. He explains how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering which physical node each Kubernetes container runs on and which HDFS datanode daemon runs on that same node, and how to keep the critical HDFS namenode service highly available when HDFS runs in Kubernetes.
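The container-to-node-to-datanode mapping in the abstract can be illustrated with a small sketch. This is not the Spark scheduler's code or the approach shown in the talk; it is a minimal illustration in plain Python, assuming the scheduler already has two lookup tables. The function name `executors_with_local_data` and all pod, node, and datanode names are hypothetical.

```python
from typing import Dict, List


def executors_with_local_data(
    block_datanodes: List[str],
    datanode_to_node: Dict[str, str],
    executor_pod_to_node: Dict[str, str],
) -> List[str]:
    """Return executor pods that share a physical node with a replica of the block."""
    # Step 1: resolve each datanode holding a replica to its physical node.
    nodes_with_replica = {
        datanode_to_node[dn] for dn in block_datanodes if dn in datanode_to_node
    }
    # Step 2: keep only executor pods running on one of those physical nodes.
    return sorted(
        pod for pod, node in executor_pod_to_node.items() if node in nodes_with_replica
    )


# Hypothetical cluster state, invented for illustration only.
datanode_to_node = {"datanode-0": "node-a", "datanode-1": "node-b"}
executor_pod_to_node = {"spark-exec-1": "node-a", "spark-exec-2": "node-c"}

# The block has a single replica on datanode-0, which runs on node-a.
local = executors_with_local_data(["datanode-0"], datanode_to_node, executor_pod_to_node)
print(local)  # ['spark-exec-1']
```

The sketch matches on physical node names rather than pod IP addresses or pod hostnames, because a pod's network identity generally differs from that of the node hosting it. A task whose block has no replica on any executor's node gets an empty list and would have to read remotely.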