What's the Hadoop-la about Kubernetes?

Kubernetes (K8s), the open source container orchestration system for modern big data workloads, is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.


Talk Title	What's the Hadoop-la about Kubernetes?
Speakers	Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)
Conference	Strata Data Conference
Conf Tag	Make Data Work
Location	New York, New York
Date	September 11-13, 2018

Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers and cloud instances and across data centers. Organizations have embarked on the journey to containerization with an emphasis on stateless workloads. Stateless applications are usually microservices or containerized applications that don’t “store” data. Web services, such as frontend UIs and simple, content-centric experiences, are often great candidates for stateless applications since HTTP is stateless by nature. A stateless workload has no dependency on local container storage.

Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop, Spark, and, to a lesser extent, databases such as Cassandra, MongoDB, Postgres, and MySQL are great examples. They require some form of persistent storage that will survive service restarts.

Anant Chintamaneni and Nanda Vijaydev highlight the key gaps and considerations based on a real-world implementation of big data cluster orchestration on Kubernetes. There are several attributes of stateful, multiservice big data applications that need to be considered. Hadoop and Spark are not exactly monolithic applications, but they come close: each consists of multiple cooperating services with dynamic APIs. Service startup and teardown ordering requirements, combined with different sets of services running on different hosts (nodes), result in tricky service interdependencies that impact scalability. There is also a lot of configuration (that is, state), such as host names, IP addresses, ports, and service-specific settings, that needs to be maintained to run fault-tolerant clusters.

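
To make the startup and teardown ordering problem concrete, here is a minimal Python sketch that derives a valid start order from a dependency map using a topological sort, then reverses it for teardown. The service names and the dependencies between them are illustrative assumptions, not taken from the talk, and a real orchestrator would also have to wait for each service to become healthy before starting the next.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (an assumption for illustration): each service
# lists the services that must already be running before it can start.
DEPENDS_ON = {
    "zookeeper": set(),
    "namenode": {"zookeeper"},
    "datanode": {"namenode"},
    "resourcemanager": {"namenode"},
    "nodemanager": {"resourcemanager"},
    "spark-history-server": {"namenode", "resourcemanager"},
}


def startup_order(depends_on):
    """Return services in an order where every dependency starts first."""
    # static_order() raises graphlib.CycleError if dependencies are circular.
    return list(TopologicalSorter(depends_on).static_order())


def teardown_order(depends_on):
    """Reverse the start order so dependents stop before their dependencies."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    print("start:", " -> ".join(startup_order(DEPENDS_ON)))
    print("stop: ", " -> ".join(teardown_order(DEPENDS_ON)))
```

Several orders can satisfy the same constraints, so the exact sequence printed may vary; what the sort guarantees is only that no service starts before the services it depends on.
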
Anant and Nanda detail technical configurations and customizations required to run Hadoop distributions on Kubernetes and explore the gaps when comparing Hadoop on Kubernetes to the standard deployment of Hadoop on physical servers or virtual machines. Topics include:
