January 16, 2020

375 words 2 mins read

What's the Hadoop-la about Kubernetes?

Kubernetes (K8s), the open source container orchestration system for modern big data workloads, is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for running big data services on K8s.

Talk Title: What's the Hadoop-la about Kubernetes?
Speakers: Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)
Conference: Strata Data Conference
Conf Tag: Make Data Work
Location: New York, New York
Date: September 11-13, 2018
URL: Talk Page
Slides: Talk Slides

Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Most organizations have begun their containerization journey with stateless workloads. Stateless applications are typically microservices or containerized applications that don't "store" data; web services such as frontend UIs and simple, content-centric experiences are often great candidates, since HTTP is itself stateless. A stateless workload has no dependency on local container storage.

Stateful applications, on the other hand, are services that require backing storage: keeping state is critical to running the service. Hadoop and Spark are prime examples, as, to a lesser extent, are NoSQL and relational platforms such as Cassandra, MongoDB, Postgres, and MySQL. All of them require some form of persistent storage that survives service restarts.

Anant Chintamaneni and Nanda Vijaydev highlight the key gaps and considerations based on a real-world implementation of big data cluster orchestration on Kubernetes. Stateful, multiservice big data applications have several attributes that need to be considered. Hadoop and Spark are not exactly monolithic applications, but they come close: each comprises multiple cooperating services with dynamic APIs. Service startup and teardown ordering requirements, with different sets of services running on different hosts (nodes), result in tricky service interdependencies that affect scalability. There is also a great deal of configuration (that is, state), such as hostnames, IP addresses, ports, and service-specific settings, that must be maintained to run fault-tolerant clusters.
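To make the "keeping state" requirement concrete, Kubernetes addresses it with the StatefulSet API: a headless Service gives each pod a stable, ordinal DNS hostname, and per-pod PersistentVolumeClaims provide storage that survives restarts. Here is a minimal sketch for a ZooKeeper-style coordination service; the names, image tag, and storage size are illustrative placeholders, not from the talk itself:

```yaml
# Hypothetical example: a minimal StatefulSet for a ZooKeeper-style service.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless   # headless Service gives pods stable DNS names (zk-0, zk-1, ...)
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.8          # illustrative image
          ports:
            - containerPort: 2181
          volumeMounts:
            - name: data
              mountPath: /var/lib/zookeeper
  volumeClaimTemplates:                 # one PersistentVolumeClaim per pod; survives pod restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

The stable hostname and per-pod volume are exactly the pieces of "configuration (that is, state)" the speakers call out: a fault-tolerant cluster member must come back with the same identity and the same data after a restart.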
Anant and Nanda detail technical configurations and customizations required to run Hadoop distributions on Kubernetes and explore the gaps when comparing Hadoop on Kubernetes to the standard deployment of Hadoop on physical servers or virtual machines. Topics include:
