Downscaling: The Achilles heel of autoscaling Spark clusters
Autoscaling aims to give a big data application low latency while keeping resource costs down. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so idle capacity lingers and the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and a lower TCO.
| Talk Title | Downscaling: The Achilles heel of autoscaling Spark clusters |
|---|---|
| Speakers | Prakhar Jain (Microsoft), Sourabh Goyal (Qubole) |
| Conference | Strata Data Conference |
| Conf Tag | Make Data Work |
| Location | New York, New York |
| Date | September 24-26, 2019 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Adding nodes at runtime (upscaling) to an already running Spark-on-YARN cluster is fairly easy. Taking those nodes away (downscaling) later, when the workload drops, is much harder. To remove a node from a running cluster, you need to make sure it isn't being used for either compute or storage. On production workloads, many nodes can't be taken away because they're still running a few containers, even though they're far from fully utilized: containers end up fragmented across the cluster, with each node running only one or two containers or executors despite having room for more. Long-running Spark executors make this even worse. Other nodes hold shuffle data on local disk that a Spark application running on the cluster may consume later; the resource manager will never reclaim those nodes, because losing shuffle data could force costly recomputation of stages or tasks.

Prakhar Jain and Sourabh Goyal explore how to improve downscaling in Spark on YARN clusters under these constraints. They cover changes to the container-allocation strategy in the YARN scheduler and to the Spark task scheduler that together achieve better packing of containers. This defragments work onto a smaller set of nodes, so that some nodes end up running no compute at all. By being careful about how containers are assigned in the first place, you reduce the chance of one application's containers being spread across many different nodes (see the first sketch below).

They also examine enhancements to the Spark driver and the external shuffle service (ESS) that proactively delete shuffle data the driver already knows has been consumed. This ensures nodes aren't holding unnecessary shuffle data, freeing them from storage responsibilities and making them available for reclamation and faster downscaling (see the second sketch below).
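The talk doesn't publish its scheduler changes, but the packing idea can be illustrated with a small, self-contained sketch. Everything here (the `Node` class, `PackingAllocator`, the best-fit rule) is hypothetical and only meant to show why preferring already-busy nodes leaves other nodes empty and therefore safe to remove.

```scala
// Hypothetical sketch: choose a node for a new container by packing onto
// already-busy nodes first, instead of spreading containers evenly.
case class Node(id: String, capacity: Int, running: Int) {
  def free: Int = capacity - running
}

object PackingAllocator {
  // "Best fit": prefer the eligible node with the FEWEST free slots,
  // so lightly used nodes drain, become idle, and can be downscaled.
  def pickNode(nodes: Seq[Node], slotsNeeded: Int = 1): Option[Node] =
    nodes
      .filter(_.free >= slotsNeeded)
      .sortBy(_.free)          // most-packed eligible node first
      .headOption
}

object PackingDemo extends App {
  val cluster = Seq(
    Node("node-1", capacity = 4, running = 3),
    Node("node-2", capacity = 4, running = 1),
    Node("node-3", capacity = 4, running = 0)
  )
  // Best-fit packing sends the new container to node-1, leaving node-3
  // completely idle and therefore a candidate for reclamation.
  println(PackingAllocator.pickNode(cluster).map(_.id)) // Some(node-1)
}
```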
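The proactive shuffle-cleanup idea can be sketched the same way. The trait and class names below are illustrative, not Spark or ESS APIs: the sketch only models the contract described in the talk, in which the driver deletes a shuffle's files as soon as every stage that reads it has finished.

```scala
// Hypothetical sketch: driver-side bookkeeping that asks the external
// shuffle service (modeled as a trait) to drop shuffle files once no
// remaining stage needs them, so nodes stop holding stale shuffle data.
trait ExternalShuffleService {
  def removeShuffleFiles(shuffleId: Int): Unit
}

class ShuffleCleanupTracker(ess: ExternalShuffleService) {
  // shuffleId -> stages that still need to read it
  private var pendingReaders: Map[Int, Set[Int]] = Map.empty

  def registerShuffle(shuffleId: Int, readerStages: Set[Int]): Unit =
    pendingReaders += shuffleId -> readerStages

  def onStageCompleted(stageId: Int): Unit = {
    // The completed stage no longer counts as a reader of any shuffle.
    pendingReaders = pendingReaders.map { case (shuffleId, readers) =>
      shuffleId -> (readers - stageId)
    }
    // Any shuffle with no remaining readers has been fully consumed:
    // delete its files now instead of waiting for the application to end.
    val consumed = pendingReaders.collect {
      case (shuffleId, readers) if readers.isEmpty => shuffleId
    }
    consumed.foreach(ess.removeShuffleFiles)
    pendingReaders --= consumed
  }
}
```

In the real system this bookkeeping would live in the Spark driver and the deletion requests would go to the ESS instances on each node; the sketch only captures the "delete as soon as consumed" behavior that frees nodes from storage and speeds up downscaling.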