February 23, 2020

527 words 3 mins read

HARP: An efficient and elastic GPU-sharing system

Pengfei Fan and Lingling Jin offer an overview of an efficient and elastic GPU-sharing system for users who do research and development with TensorFlow.

Talk Title HARP: An efficient and elastic GPU-sharing system
Speakers Pengfei Fan (Alibaba), Lingling Jin (Alibaba)
Conference O’Reilly TensorFlow World
Conf Tag
Location Santa Clara, California
Date October 28-31, 2019
URL Talk Page
Slides Talk Slides
Video

Many TensorFlow users buy GPUs to accelerate their workloads, yet GPU utilization in AI clusters is generally low for a variety of reasons. In an R&D environment, users who request GPU instances spend much of their time coding without running any GPU workloads on the servers, which wastes expensive GPU resources. Pengfei Fan and Lingling Jin offer an overview of an efficient and elastic GPU-sharing system that addresses this problem. The system detects GPU API calls, allocates GPU resources only when they're needed, and automatically reclaims them when no workloads are running. Combined with Kubernetes, this scheme lets coding and editing run on CPU pods while debugging and execution run on a remote GPU instance, drastically improving GPU cluster utilization.

Because the system forwards GPU API calls to a remote GPU server, it introduces extra latency into application execution. To mitigate this, Pengfei and Lingling made several optimizations so TensorFlow runs more efficiently on the system. Since TensorFlow already performs memcpy and GPU kernel launches asynchronously, they slightly changed the behavior of key CUDA APIs in their virtualization layer, while keeping them functionally correct, so the local CPU and the remote GPU run asynchronously. This hides a substantial amount of network latency and yielded a speedup of more than 2x. They also modified the TensorFlow framework to use additional CUDA streams in remote execution, which produced a larger performance gain on their system than in local-running mode, which also uses multiple CUDA streams. Changing the graph partitioning algorithm between CPU and GPU nodes to minimize data movement between the CPU and the remote GPU server also helped in some cases. And since their system uses remote storage, they let the GPU access remote SSDs directly so data doesn't have to be copied through CPU nodes.

Building such an elastic GPU platform also demands modified GPU monitoring and debugging software. Their system includes a profiling component that collects profiling data from both local and remote servers and visualizes it in a web client. They modified the TensorFlow framework to insert tags with the NVIDIA Tools Extension (NVTX) library, so the modified framework runs both on an ordinary GPU machine and on their system. These tags capture useful information, such as the start and end of critical operators, and can be visualized in the web client together with the other profiling data.

As AI accelerators' compute power grows rapidly and network speeds improve, Pengfei and Lingling believe that pooling accelerators and providing them as a service over the network is the future trend. They're deploying their software in their R&D environment and plan to open source part or all of the solution so the framework can work with any AI accelerator, not just GPUs.
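The talk doesn't detail how HARP intercepts GPU API calls, but the general technique is well known: a preloaded shim library catches CUDA runtime calls before they reach the driver and decides whether to serve them locally or forward them to a remote GPU server. Below is a minimal sketch of such a shim, assuming a dlsym-based LD_PRELOAD hook; cudaMalloc is a real CUDA API, but the forwarding step is left as a comment because HARP's actual protocol isn't described.

```cpp
// harp_shim_sketch.cpp -- illustrative only, not HARP's implementation.
// Build (CUDA headers required): g++ -shared -fPIC harp_shim_sketch.cpp -o libshim.so -ldl
// Use:                           LD_PRELOAD=./libshim.so python train.py
#include <dlfcn.h>
#include <cstddef>
#include <cuda_runtime.h>

static cudaError_t (*real_cudaMalloc)(void**, size_t) = nullptr;

// Intercept cudaMalloc: a remoting layer would serialize this call and send
// it to the GPU server, so a GPU is allocated only when one is actually needed.
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
    if (!real_cudaMalloc) {
        real_cudaMalloc =
            (cudaError_t (*)(void**, size_t))dlsym(RTLD_NEXT, "cudaMalloc");
    }
    // Forwarding to a remote server would happen here; this sketch simply
    // falls through to the real CUDA runtime.
    return real_cudaMalloc(devPtr, size);
}
```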
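The asynchrony they exploit is standard CUDA behavior: cudaMemcpyAsync and kernel launches on a stream return immediately, so the client can keep enqueueing work instead of paying a network round trip per call, and synchronize only once at the end. The sketch below shows that pattern on a single stream; it's plain local CUDA rather than HARP's virtualization layer, but it illustrates the ordering the remote layer has to preserve.

```cpp
// async_stream_sketch.cu -- build with: nvcc async_stream_sketch.cu -o async_stream_sketch
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* host;
    float* dev;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost(&host, n * sizeof(float));  // pinned memory, needed for true async copies
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Enqueue copy -> kernel -> copy on one stream; each call returns
    // immediately, so the CPU (and, in a remoting setup, the client side)
    // keeps issuing work instead of blocking on every call.
    cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, 2.0f);
    cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // block only once, at the end

    cudaFree(dev);
    cudaFreeHost(host);
    cudaStreamDestroy(stream);
    return 0;
}
```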
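NVTX lets an application annotate named ranges that profilers (and, in HARP's case, their web client) can display alongside other timeline data. A minimal example of tagging the start and end of an operator is shown below; the operator name and wrapper function are hypothetical, since the summary doesn't say exactly where the tags were inserted inside TensorFlow.

```cpp
// nvtx_tag_sketch.cpp -- link with -lnvToolsExt (NVTX v2); NVTX v3 is header-only.
#include <nvToolsExt.h>

// Hypothetical wrapper around one operator's GPU work; the real tags live
// inside the modified TensorFlow framework.
void RunConv2D() {
    nvtxRangePushA("Conv2D");  // tag: operator start
    // ... enqueue the operator's kernels and memcpys here ...
    nvtxRangePop();            // tag: operator end
}

int main() {
    nvtxRangePushA("training_step");  // ranges nest, so whole steps can be marked too
    RunConv2D();
    nvtxRangePop();
    return 0;
}
```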
