DOI: 10.1145/3123939.3123982
research-article

Bit-pragmatic deep neural network computing

Published: 14 October 2017

ABSTRACT

Deep Neural Networks expose a high degree of parallelism, making them amenable to highly data-parallel architectures. However, data-parallel architectures often accept inefficiency in individual computations for the sake of overall efficiency. We show that on average, activation values of convolutional layers during inference in modern Deep Convolutional Neural Networks (CNNs) contain 92% zero bits. Processing these zero bits entails ineffectual computations that could be skipped. We propose Pragmatic (PRA), a massively data-parallel architecture that eliminates most of the ineffectual computations on the fly, improving performance and energy efficiency compared to state-of-the-art high-performance accelerators [5]. The idea behind PRA is deceptively simple: use serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input. However, a straightforward implementation based on shift-and-add multiplication yields unacceptable area, power, and memory access overheads compared to a conventional bit-parallel design. PRA incorporates a set of design decisions to yield a practical, area- and energy-efficient design.

Measurements demonstrate that for convolutional layers, PRA is 4.31X faster than DaDianNao [5] (DaDN) using a 16-bit fixed-point representation. While PRA requires 1.68X more area than DaDN, the performance gains yield a 1.70X increase in energy efficiency in a 65nm technology. With 8-bit quantized activations, PRA is 2.25X faster and 1.31X more energy efficient than an 8-bit version of DaDN.
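The core idea from the abstract can be illustrated with a short sketch: in serial-parallel shift-and-add multiplication, the product is the sum of the weight shifted to each set-bit position of the activation, so zero activation bits contribute nothing and can be skipped. The sketch below is our own illustration in Python (the function name `shift_add_multiply` and the skipped-step counter are ours, not from the paper, whose design is a hardware accelerator, not software).

```python
def shift_add_multiply(activation: int, weight: int) -> tuple[int, int]:
    """Serial-parallel multiply: stream the activation one bit per step and
    add the (parallel) weight, shifted, only where the bit is 1.
    Returns (product, skipped), where `skipped` counts the zero-bit steps
    that a naive bit-serial design would spend doing ineffectual work."""
    product = 0
    skipped = 0
    for bit in range(max(activation.bit_length(), 1)):
        if (activation >> bit) & 1:
            product += weight << bit  # effectual term: weight * 2^bit
        else:
            skipped += 1              # ineffectual: would add zero

    return product, skipped

# Example: activation 0b01000001 has only two 1-bits, so only two of its
# shift-and-add steps are effectual; the rest could be skipped.
product, skipped = shift_add_multiply(0b01000001, 3)
assert product == 0b01000001 * 3
```

With activations averaging 92% zero bits, skipping the else-branch steps is what drives PRA's speedup over a design that processes every bit position.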

References

  1. "How to Quantize Neural Networks with TensorFlow." [Online]. Available: https://www.tensorflow.org/performance/quantization
  2. J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in 2016 IEEE/ACM International Symposium on Computer Architecture (ISCA), 2016.
  3. H. Alemdar, N. Caldwell, V. Leroy, A. Prost-Boucle, and F. Pétrot, "Ternary neural networks for resource-efficient AI applications," CoRR, vol. abs/1609.00222, 2016. [Online]. Available: http://arxiv.org/abs/1609.00222
  4. A. D. Booth, "A signed binary multiplication technique," The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236--240, 1951.
  5. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2014, pp. 609--622.
  6. Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers, 2016, pp. 262--263.
  7. M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," ArXiv e-prints, Nov. 2015.
  8. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, pp. 365--376.
  9. R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," CoRR, vol. abs/1311.2524, 2013.
  10. R. Gonzalez and M. Horowitz, "Energy dissipation in general purpose microprocessors," IEEE Journal of Solid-State Circuits, vol. 31, no. 9, pp. 1277--1284, Sep. 1996.
  11. Google, "Low-precision matrix multiplication," https://github.com/google/gemmlowp, 2016.
  12. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," arXiv:1602.01528 [cs], Feb. 2016. [Online]. Available: http://arxiv.org/abs/1602.01528
  13. S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv:1510.00149 [cs], Oct. 2015. [Online]. Available: http://arxiv.org/abs/1510.00149
  14. A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014.
  15. F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
  16. P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos, "Proteus: Exploiting numerical precision variability in deep neural networks," in Workshop On Approximate Computing (WAPCO), 2016.
  17. P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, "Reduced-precision strategies for bounded memory in deep neural nets," arXiv:1511.05236v4 [cs.LG], 2015.
  18. P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-49, 2016.
  19. P. Judd, J. Albericio, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," Computer Architecture Letters, 2016.
  20. J. Kim, K. Hwang, and W. Sung, "X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7510--7514.
  21. A. J. Martin, M. Nyström, and P. I. Pénzes, "Et2: A metric for time and energy efficiency of computation," in Power Aware Computing. Springer, 2002, pp. 293--315.
  22. N. Muralimanohar and R. Balasubramonian, "CACTI 6.0: A tool to understand large caches."
  23. V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807--814.
  24. A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA '17. New York, NY, USA: ACM, 2017, pp. 27--40.
  25. M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, "DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches," in Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2015, pp. 1543--1546.
  26. B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 267--278.
  27. Synopsys, "Design Compiler," http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.
  28. C. S. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electronic Computers, vol. 13, no. 1, pp. 14--17, 1964.
  29. P. Warden, "Low-precision matrix multiplication," https://petewarden.com, 2016.
  30. H. H. Yao and E. E. Swartzlander, "Serial-parallel multipliers," in Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, Nov. 1993, pp. 359--363, vol. 1.

Published in:

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017, 850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939

          Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)
