Deep neural network model compression and an efficient inference engine
Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Song Han explains how deep compression addresses this limitation by reducing the storage requirement of neural networks without affecting their accuracy, and he proposes an energy-efficient inference engine (EIE) that runs inference directly on the compressed model.
| Talk Title | Deep neural network model compression and an efficient inference engine |
| --- | --- |
| Speakers | Song Han (Stanford University) |
| Conference | O’Reilly Artificial Intelligence Conference |
| Conf Tag | |
| Location | New York, New York |
| Date | September 26-27, 2016 |
| URL | Talk Page |
| Slides | Talk Slides |
| Video | |
Neural networks are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Song Han explains how deep compression addresses this limitation by reducing the storage requirement of neural networks without affecting their accuracy. (On the ImageNet dataset, this method reduced the storage required by AlexNet by 35x, from 240 MB to 6.9 MB, and by VGG-16 by 49x, from 552 MB to 11.3 MB, both with no loss of accuracy.) Deep compression also makes complex neural networks practical in mobile applications where application size and download bandwidth are constrained, and it allows the model to fit in an on-chip SRAM cache rather than off-chip DRAM.

Song also proposes an energy-efficient inference engine (EIE) that performs inference directly on the compressed model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Evaluated on nine DNN benchmarks, EIE is 189x faster than a CPU implementation and 13x faster than a GPU implementation of the same DNN without compression. With a processing power of 102 GOPS at only 600 mW, EIE is also 24,000x and 3,000x more energy efficient than the CPU and GPU, respectively.
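To make the compression idea concrete, here is a minimal sketch of two of its core ingredients, magnitude pruning and weight sharing, written in plain NumPy. The threshold, cluster count, and random weight matrix are illustrative assumptions rather than values from the talk, and the full deep compression pipeline also involves retraining the pruned network and Huffman-coding the resulting indices.

```python
# Minimal sketch of magnitude pruning plus weight sharing.
# All numbers here (threshold, cluster count, matrix size) are illustrative.
import numpy as np

def prune(weights, threshold=0.1):
    """Zero out weights whose magnitude falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def share_weights(weights, mask, n_clusters=16):
    """Cluster the surviving weights (simple 1-D k-means) so each weight
    is stored as a small index into a shared codebook of centroids."""
    values = weights[mask]
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(10):
        # Assign every surviving weight to its nearest centroid.
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Recompute each centroid as the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = values[idx == k].mean()
    return centroids, idx

weights = 0.1 * np.random.randn(256, 256).astype(np.float32)
pruned, mask = prune(weights)
codebook, indices = share_weights(pruned, mask)
# Storage drops from 32 bits per weight to a 4-bit codebook index per
# surviving weight, plus the overhead of the sparse index structure.
print(f"nonzero weights: {mask.sum()} / {mask.size}")
```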
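As a rough software analogue of the operation EIE accelerates in hardware, the sketch below multiplies a sparse, weight-shared matrix (stored in CSR form with codebook indices in place of weight values) by an activation vector, skipping zero activations. The data layout and function names are assumptions for illustration only; EIE itself is a custom hardware accelerator, not a NumPy routine.

```python
# Sparse matrix-vector multiplication with a shared weight codebook.
import numpy as np

def sparse_shared_matvec(row_ptr, col_idx, weight_idx, codebook, x):
    """Compute y = W @ x where W is stored in CSR form and each nonzero
    entry is a small index into a shared codebook of weight values."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows, dtype=np.float32)
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            xj = x[col_idx[p]]
            if xj != 0.0:  # skip zero activations
                acc += codebook[weight_idx[p]] * xj
        y[i] = acc
    return y

# Tiny illustrative example: a 2x3 matrix with three nonzero entries.
codebook = np.array([0.0, -0.5, 0.25, 1.0], dtype=np.float32)
row_ptr = np.array([0, 2, 3])     # row i holds nonzeros row_ptr[i]:row_ptr[i+1]
col_idx = np.array([0, 2, 1])     # column of each nonzero
weight_idx = np.array([3, 2, 1])  # codebook index of each nonzero
x = np.array([2.0, 0.0, 4.0], dtype=np.float32)
print(sparse_shared_matvec(row_ptr, col_idx, weight_idx, codebook, x))  # [3. 0.]
```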