ABSTRACT
Deep Neural Networks expose a high degree of parallelism, making them amenable to highly data-parallel architectures. However, data-parallel architectures often accept inefficiency in individual computations for the sake of overall efficiency. We show that, on average, the activation values of convolutional layers during inference in modern Deep Convolutional Neural Networks (CNNs) contain 92% zero bits. Processing these zero bits entails ineffectual computations that could be skipped. We propose Pragmatic (PRA), a massively data-parallel architecture that eliminates most of these ineffectual computations on-the-fly, improving performance and energy efficiency compared to state-of-the-art high-performance accelerators [5]. The idea behind PRA is deceptively simple: use serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input. However, a straightforward implementation based on shift-and-add multiplication yields unacceptable area, power, and memory-access overheads compared to a conventional bit-parallel design. PRA incorporates a set of design decisions that yield a practical, area- and energy-efficient design.
Measurements demonstrate that for convolutional layers, PRA is 4.31X faster than DaDianNao [5] (DaDN) using a 16-bit fixed-point representation. While PRA requires 1.68X more area than DaDN, the performance gains yield a 1.70X increase in energy efficiency in a 65nm technology. With 8-bit quantized activations, PRA is 2.25X faster and 1.31X more energy efficient than an 8-bit version of DaDN.
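To make the core idea concrete, the following minimal Python sketch models serial-parallel shift-and-add multiplication that skips zero bits. This is our illustration of the arithmetic only, not the PRA hardware or its pipeline, and the function names are ours. Each loop iteration stands in for one bit-serial cycle, so the cycle count equals the number of non-zero activation bits: at 92% zero bits, a 16-bit activation averages roughly 1.3 such bits rather than 16.

```python
def essential_bits(activation: int) -> list[int]:
    """Return the offsets of the non-zero bits of a non-negative
    fixed-point activation. PRA-style hardware streams these offsets
    instead of the raw bit string, so zero bits never cost a cycle."""
    offsets, position = [], 0
    while activation:
        if activation & 1:
            offsets.append(position)
        activation >>= 1
        position += 1
    return offsets


def shift_and_add_multiply(weight: int, activation: int) -> int:
    """Multiply weight by activation with one shift-and-add step per
    essential bit; each iteration models one cycle of a serial lane."""
    product = 0
    for offset in essential_bits(activation):
        product += weight << offset
    return product


# 0b10001 has only two essential bits, so the product takes 2 steps
# instead of the 16 a plain 16-bit bit-serial multiplier would need.
assert shift_and_add_multiply(3, 0b10001) == 3 * 0b10001
```

A plain bit-serial design spends one cycle per activation bit regardless of its value; skipping the zero bits is the source of PRA's speedup, and the paper's design decisions address the area, power, and memory-access costs of doing this across many parallel lanes.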
- "How to Quantize Neural Networks with TensorFlow." {Online}. Available: https://www.tensorflow.org/performance/quantizationGoogle Scholar
- J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in 2016 IEEE/ACM International Conference on Computer Architecture (ISCA), 2016. Google ScholarDigital Library
- H. Alemdar, N. Caldwell, V. Leroy, A. Prost-Boucle, and F. Pétrot, "Ternary neural networks for resource-efficient AI applications," CoRR, vol. abs/1609.00222, 2016. {Online}. Available: http://arxiv.org/abs/1609.00222Google Scholar
- A. D. Booth, "A signed binary multiplication technique," The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236--240, 1951.Google ScholarCross Ref
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "Dadiannao: A machine-learning supercomputer," in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, Dec 2014, pp. 609--622. Google ScholarDigital Library
- Chen, Yu-Hsin and Krishna, Tushar and Emer, Joel and Sze, Vivienne, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers, 2016, pp. 262--263.Google Scholar
- M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training Deep Neural Networks with binary weights during propagations," ArXiv e-prints, Nov. 2015.Google Scholar
- H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, pp. 365--376. Google ScholarDigital Library
- R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013. Google ScholarDigital Library
- R. Gonzalez and M. Horowitz, "Energy dissipation in general purpose microprocessors," Solid-State Circuits, IEEE Journal of, vol. 31, no. 9, pp. 1277--1284, Sep 1996.Google ScholarCross Ref
- Google, "Low-precision matrix multiplication," https://github.com/google/gemmlowp, 2016.Google Scholar
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," arXiv:1602.01528 {cs}, Feb. 2016, arXiv: 1602.01528. {Online}. Available: http://arxiv.org/abs/1602.01528 Google ScholarDigital Library
- S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," arXiv:1510.00149 {cs}, Oct. 2015, arXiv: 1510.00149. {Online}. Available: http://arxiv.org/abs/1510.00149Google Scholar
- A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.Google Scholar
- F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size," CoRR, vol. abs/1602.07360, 2016. {Online}. Available: http://arxiv.org/abs/1602.07360Google Scholar
- P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos, "Proteus: Exploiting numerical precision variability in deep neural networks," in Workshop On Approximate Computing (WAPCO), 2016.Google Scholar
- P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, "Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets, arXiv:1511.05236v4 {cs.LG}," arXiv.org, 2015.Google Scholar
- P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, "Stripes: Bit-serial Deep Neural Network Computing," in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-49, 2016.Google Scholar
- P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-serial Deep Neural Network Computing," Computer Architecture Letters, 2016.Google Scholar
- J. Kim, K. Hwang, and W. Sung, "X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7510--7514.Google Scholar
- A. J. Martin, M. Nyström, and P. I. Pénzes, "Et2: A metric for time and energy efficiency of computation," in Power aware computing. Springer, 2002, pp. 293--315. Google ScholarDigital Library
- N. Muralimanohar and R. Balasubramonian, "Cacti 6.0: A tool to understand large caches."Google Scholar
- V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807--814. Google ScholarDigital Library
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 27--40. {Online}. Available Google ScholarDigital Library
- M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, "Destiny: A tool for modeling emerging 3d nvm and edram caches," in Design, Automation Test in Europe Conference Exhibition (DATE), 2015, March 2015, pp. 1543--1546. Google ScholarDigital Library
- B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 267--278. Google ScholarDigital Library
- Synopsys, "Design Compiler," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.Google Scholar
- C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electronic Computers, vol. 13, no. 1, pp. 14--17, 1964. {Online}. AvailableGoogle ScholarCross Ref
- P. Warden, "Low-precision matrix multiplication," https://petewarden.com, 2016.Google Scholar
- H. H. Yao and E. E. Swartzlander, "Serial-parallel multipliers," in Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, Nov. 1993, pp. 359--363 vol.1.Google Scholar
Index Terms
- Bit-pragmatic deep neural network computing