DOI: 10.1109/ISCA.2018.00069
Research Article

Bit fusion: bit-level dynamically composable architecture for accelerating deep neural networks

Published: 02 June 2018

Editorial Notes

A Corrected Version of Record for this paper was published in the ACM Digital Library on June 7, 2023, in keeping with an agreement with IEEE, which had consented to the addition of an author after the paper was originally published. For reference purposes, the Version of Record can be accessed via the Supplemental Material section of this page.

ABSTRACT

Hardware acceleration of Deep Neural Networks (DNNs) aims to tame their enormous compute intensity. Fully realizing the potential of acceleration in this domain requires understanding and leveraging algorithmic properties of DNNs. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent loss of accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either accommodate the worst-case bitwidth requirements, offering only limited benefits, or inevitably degrade final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that comprises an array of bit-level processing elements which dynamically fuse to match the bitwidth of individual DNN layers. This architectural flexibility minimizes computation and communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss [1] and Stripes [2]. At the same area, frequency, and process technology, Bit Fusion offers 3.9X speedup and 5.1X energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6X speedup and 3.9X energy reduction at the 45 nm node when Bit Fusion's area and frequency are set to those of Stripes. Scaled to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while consuming merely 895 milliwatts of power.
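To make the fusion idea concrete, the sketch below (not the authors' RTL; a hypothetical Python illustration whose names, such as to_bricks and fused_multiply, are invented here) shows how a single multiplication can be assembled from 2-bit x 2-bit partial products, the granularity of the paper's bit-level processing elements (BitBricks), by shifting each partial product according to the positions of its operand slices. Unsigned operands are assumed for simplicity; the actual design also handles signed two's-complement values.

    # A minimal sketch of bit-level fusion: an unsigned N-bit x M-bit
    # multiply is rebuilt from 2-bit x 2-bit partial products, each shifted
    # by the bit positions of the operand slices it came from. Function
    # names here are illustrative, not from the paper.

    def to_bricks(x: int, bitwidth: int) -> list[int]:
        """Split an unsigned value into 2-bit slices, least significant first."""
        assert bitwidth % 2 == 0 and 0 <= x < (1 << bitwidth)
        return [(x >> s) & 0b11 for s in range(0, bitwidth, 2)]

    def fused_multiply(a: int, b: int, wa: int, wb: int) -> int:
        """Compute a*b from 2-bit partial products, as fused BitBricks would."""
        acc = 0
        for i, ai in enumerate(to_bricks(a, wa)):
            for j, bj in enumerate(to_bricks(b, wb)):
                # each BitBrick yields ai*bj; the shift places it correctly
                acc += (ai * bj) << (2 * (i + j))
        return acc

    # 4-bit operands need 2x2 = 4 BitBricks; 8-bit operands fuse 4x4 = 16.
    assert fused_multiply(13, 11, 4, 4) == 13 * 11
    assert fused_multiply(200, 57, 8, 8) == 200 * 57

Under this scheme, a 2-bit layer uses each BitBrick as a standalone multiplier, a 4-bit layer fuses four of them, and an 8-bit layer fuses sixteen: the same substrate serves every bitwidth without leaving multiplier bits idle, which is the source of the speedup and energy savings reported above.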


Supplemental Material

References

  1. Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ISCA, 2016.
  2. P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in MICRO, 2016.
  3. Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," JSSC, 2017.
  4. M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in ASPLOS, 2017.
  5. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in ISCA, 2016.
  6. A. Delmas, S. Sharify, P. Judd, and A. Moshovos, "Tartan: Accelerating fully-connected and convolutional layers in deep learning networks by exploiting numerical precision variability," arXiv, 2017.
  7. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," in MICRO, 2014.
  8. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
  9. D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," in ASPLOS, 2015.
  10. Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015.
  11. D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in ISCA, 2016.
  12. B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
  13. J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in ISCA, 2016.
  14. S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in ISCA, 2016.
  15. S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-X: An accelerator for sparse neural networks," in MICRO, 2016.
  16. V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in CVPRW, 2014.
  17. J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, "A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems," in ISSCC, 2016.
  18. F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in DATE, 2015.
  19. Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, "DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family," in DAC, 2016.
  20. L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, "C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization," in DAC, 2016.
  21. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
  22. H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Misra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in MICRO, 2016.
  23. M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO, 2016.
  24. N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-S. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
  25. J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
  26. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
  27. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
  28. L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in HPCA, 2017.
  29. E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengil, M. Liu, D. Lo, S. Alkalay, M. Haselman, C. Boehn, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, T. Juhasz, R. K. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, S. Reinhardt, A. Sapek, R. Seera, B. Sridharan, L. Woods, P. Yi-Xiao, R. Zhao, and D. Burger, "Accelerating persistent neural networks at datacenter scale," in HotChips, 2017.
  30. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., "In-datacenter performance analysis of a tensor processing unit," in ISCA, 2017.
  31. "Apple A11 Bionic." https://en.wikipedia.org/wiki/Apple_A11.
  32. S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv, 2016.
  33. C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv, 2016.
  34. F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv, 2016.
  35. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," arXiv, 2016.
  36. A. K. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: Wide reduced-precision networks," arXiv, 2017.
  37. B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "DVAFS: Trading computational accuracy for energy through dynamic-voltage-accuracy-frequency-scaling," in DATE, 2017.
  38. S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, "Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks," arXiv, 2017.
  39. J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in ISSCC, 2018.
  40. A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv, 2014.
  41. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  42. A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., 2009.
  43. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
  44. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv, 2014.
  45. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
  46. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
  47. M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, 1993.
  48. S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, "CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques," in ICCAD, 2011.
  49. "NVIDIA TensorRT 4.0." https://developer.nvidia.com/tensorrt.
  50. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in ISCA, 2011.
  51. T. Rzayev, S. Moradi, D. H. Albonesi, and R. Manohar, "DeepRecon: Dynamically reconfigurable architecture for accelerating deep neural networks," in IJCNN, 2017.
  52. B. Moons and M. Verhelst, "A 0.3--2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets," in VLSI Circuits, 2016.
  53. Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
  54. R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights," arXiv, 2016.
  55. K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, et al., "BRein memory: A 13-layer 4.2K neuron/0.8M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS," in VLSI, 2017.
  56. H. Kim, J. Sim, Y. Choi, and L.-S. Kim, "A kernel decomposition architecture for binary-weight convolutional neural networks," in DAC, 2017.
  57. A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in ISCA, 2017.
  58. A. Yazdanbakhsh, H. Falahati, P. J. Wolfe, K. Samadi, H. Esmaeilzadeh, and N. S. Kim, "GANAX: A unified SIMD-MIMD acceleration for generative adversarial networks," in ISCA, 2018.
  59. V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. K. Gupta, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in ISCA, 2018.
  60. M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO, 2016.
  61. Y. Shen, M. Ferdman, and P. Milder, "Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer," in FCCM, 2017.
  62. M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv, 2016.
  63. E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, "Core fusion: Accommodating software diversity in chip multiprocessors," in ISCA, 2007.
  64. C. Kim, S. Sethumadhavan, M. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, "Composable lightweight processors," in MICRO, 2007.
