HARP: An efficient and elastic GPU-sharing system

Pengfei Fan and Lingling Jin offer an overview of an efficient and elastic GPU-sharing system for users who do research and development with TensorFlow.


Talk Title	HARP: An efficient and elastic GPU-sharing system
Speakers	Pengfei Fan (Alibaba), Lingling Jin (Alibaba)
Conference	O’Reilly TensorFlow World
Location	Santa Clara, California
Date	October 28-31, 2019

Many TensorFlow users buy GPUs to accelerate their workloads, yet GPU utilization in AI clusters is generally very low. In R&D environments, users who have requested GPU instances spend much of their time writing code without running any GPU workload on the servers, which wastes expensive GPU resources. Pengfei Fan and Lingling Jin offer an overview of an efficient and elastic GPU-sharing system that solves this problem. It detects GPU API calls, allocates GPU resources when they are needed, and automatically reclaims them when no workloads are running. Combined with Kubernetes, this scheme makes it possible to code and edit on CPU pods while debugging and executing on a remote GPU instance. This elasticity drastically improves GPU cluster utilization.

Because the system forwards GPU API calls to a remote GPU server, it adds latency to application execution. To mitigate this, Pengfei and Lingling made several optimizations so that TensorFlow runs more efficiently on the system. Since TensorFlow issues memcpy operations and GPU kernel launches asynchronously, they slightly changed the behavior of important CUDA APIs in their virtualization layer, while keeping the APIs functionally correct, so that the local CPU and the remote GPU run asynchronously. This approach hides substantial network latency and yielded a speedup of more than 2x. They also modified the TensorFlow framework to use additional CUDA streams in remote execution, which produced a larger performance gain on their system than in local execution, which also uses multiple CUDA streams. Changing the algorithm that partitions the graph between CPU and GPU nodes, so that less data moves between the CPU and the remote GPU server, also helped in some cases. Because the system uses remote storage, they also let the GPU access remote SSDs directly so that data is not copied to CPU nodes.
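
The idea of letting the local CPU run ahead of the remote GPU can be illustrated with a small simulation. The sketch below is not HARP's implementation and does not call any real CUDA API: `FakeRemoteGpu`, `AsyncForwarder`, and the call names are hypothetical, and a background thread stands in for the network link. It only shows the pattern of queuing asynchronous calls in order and blocking the caller once, at an explicit synchronization point.

```python
import queue
import threading


class FakeRemoteGpu:
    """Hypothetical stand-in for the remote GPU server: it only logs calls."""

    def __init__(self):
        self.log = []

    def execute(self, name, arg):
        self.log.append((name, arg))


class AsyncForwarder:
    """Hypothetical virtualization layer that forwards calls in the background."""

    def __init__(self, remote):
        self._remote = remote
        self._pending = queue.Queue()   # FIFO, so call order is preserved
        self.blocking_waits = 0         # how often the caller had to wait
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Background thread: plays the role of the network link to the server.
        while True:
            call = self._pending.get()
            try:
                if call is None:        # sentinel queued by close()
                    return
                self._remote.execute(*call)
            finally:
                self._pending.task_done()

    def launch(self, name, arg):
        # Asynchronous calls (kernel launch, async memcpy) return immediately.
        self._pending.put((name, arg))

    def synchronize(self):
        # The only point where the caller waits for the remote side.
        self.blocking_waits += 1
        self._pending.join()

    def close(self):
        self._pending.put(None)
        self._worker.join()


remote = FakeRemoteGpu()
forwarder = AsyncForwarder(remote)
forwarder.launch("memcpy_h2d", "batch")
forwarder.launch("kernel", "matmul")
forwarder.launch("kernel", "relu")
forwarder.launch("memcpy_d2h", "logits")
forwarder.synchronize()
forwarder.close()
```

In this sketch four calls cost the caller a single wait instead of four round trips, which is the kind of latency hiding the abstract describes. A real layer would also have to preserve the semantics of calls that return values or errors.
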
Building such an elastic GPU platform also demands a modified set of GPU monitoring and debugging software. Their system includes a powerful profiling component that collects profiling data from both local and remote servers and visualizes it in a web client. They modified the TensorFlow framework to insert tags with the NVIDIA Tools Extension (NVTX) library, so the modified framework runs both on a normal GPU machine and on their system. These tags provide useful information, such as the start and end of critical operators, and can be visualized in the web client together with other profiling data.

As the computing power of AI accelerators grows rapidly and network speeds improve, Pengfei and Lingling believe that pooling these accelerators and providing them as services over the network is the future trend. They are in the process of deploying their software in their R&D environment and plan to open source part or all of the solution so that their framework can work with any AI accelerator, not just GPUs.
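
As a rough illustration of operator tagging and merged timelines, the sketch below records named ranges from two sources and combines them into one timeline. It uses only the Python standard library as a stand-in for NVTX ranges: `RangeRecorder`, `op_range`, and `merge_timelines` are hypothetical names, not part of NVTX, TensorFlow, or HARP. Both recorders here share one process clock, whereas a real system would need to align clocks between the local and remote servers.

```python
import time
from contextlib import contextmanager


class RangeRecorder:
    """Hypothetical recorder for start/end tags around operators."""

    def __init__(self, source):
        self.source = source    # e.g. "local-cpu" or "remote-gpu"
        self.events = []

    @contextmanager
    def op_range(self, name):
        # Mark the start of the operator, run it, then store the full range.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.events.append({
                "source": self.source,
                "name": name,
                "start": start,
                "end": time.perf_counter(),
            })


def merge_timelines(*recorders):
    # Combine events from all recorders into one list ordered by start time.
    events = [event for rec in recorders for event in rec.events]
    return sorted(events, key=lambda event: event["start"])


local = RangeRecorder("local-cpu")
remote = RangeRecorder("remote-gpu")

with local.op_range("train_step"):
    with remote.op_range("MatMul"):
        pass
    with remote.op_range("Relu"):
        pass

timeline = merge_timelines(local, remote)
```

A viewer such as the web client mentioned above could then draw each event as a bar on a per-source track, using `start` and `end` as the bar's extent.
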
