Editorial Notes
A Corrected Version of Record for this paper was published in the ACM Digital Library on June 7, 2023, in keeping with an agreement with IEEE, which had consented to the addition of an author after the paper was originally published. For reference purposes, the Version of Record can be accessed via the Supplemental Material section of this page.
ABSTRACT
Hardware acceleration of Deep Neural Networks (DNNs) aims to tame their enormous compute intensity. Fully realizing the potential of acceleration in this domain requires understanding and leveraging algorithmic properties of DNNs. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent loss of accuracy, the bitwidth varies significantly across DNNs and may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either offer limited benefits by accommodating the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that comprises an array of bit-level processing elements which dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing computation and communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss [1] and Stripes [2]. In the same area, frequency, and process technology, Bit Fusion offers 3.9X speedup and 5.1X energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6X speedup and 3.9X energy reduction at the 45 nm node when Bit Fusion's area and frequency are set to those of Stripes. Scaling to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while consuming merely 895 milliwatts of power.
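The fusion/decomposition idea in the abstract rests on a simple arithmetic identity: a multiply between low-bitwidth operands decomposes into 2-bit partial products that are shifted and summed, which is what an array of fused bit-level processing elements computes in hardware. The sketch below is a minimal Python illustration of that identity for unsigned operands (it is not the paper's Verilog microarchitecture, and the function name and parameters are invented for exposition).

```python
def decompose_multiply(a: int, b: int, wa: int = 8, wb: int = 8, brick: int = 2) -> int:
    """Multiply two unsigned integers by splitting them into `brick`-bit chunks,
    multiplying every pair of chunks, and summing the shifted partial products.
    This is the arithmetic identity behind bit-level fusion/decomposition."""
    mask = (1 << brick) - 1
    chunks_a = [(a >> (i * brick)) & mask for i in range(wa // brick)]
    chunks_b = [(b >> (j * brick)) & mask for j in range(wb // brick)]
    total = 0
    for i, ca in enumerate(chunks_a):
        for j, cb in enumerate(chunks_b):
            # Each 2-bit x 2-bit product corresponds to one bit-level
            # processing element; the shift places its contribution
            # at the right position in the full-width result.
            total += (ca * cb) << ((i + j) * brick)
    return total

# A 4-bit x 4-bit multiply needs 4 bit-level products; an 8-bit x 8-bit
# multiply fuses 16 of them. Either way the result equals a * b.
assert decompose_multiply(13, 11, wa=4, wb=4) == 13 * 11
assert decompose_multiply(200, 150, wa=8, wb=8) == 200 * 150
```

Varying `wa` and `wb` per layer mirrors, at the arithmetic level, how fused processing elements would cover a 2-, 4-, or 8-bit multiply with 1, 4, or 16 bit-level products, respectively.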
Supplemental Material
Available for Download
Version of Record for "Bit fusion: bit-level dynamically composable architecture for accelerating deep neural networks" by Sharma et al., Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA '18).
References
- Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ISCA, 2016.
- P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in MICRO, 2016.
- Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," JSSC, 2017.
- M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in ASPLOS, 2017.
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in ISCA, 2016.
- A. Delmas, S. Sharify, P. Judd, and A. Moshovos, "Tartan: Accelerating fully-connected and convolutional layers in deep learning networks by exploiting numerical precision variability," arXiv, 2017.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," in MICRO, 2014.
- T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014.
- D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Temam, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," in ASPLOS, 2015.
- Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015.
- D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in ISCA, 2016.
- B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
- J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in ISCA, 2016.
- S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in ISCA, 2016.
- S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-X: An accelerator for sparse neural networks," in MICRO, 2016.
- V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in CVPRW, 2014.
- J. Sim, J. S. Park, M. Kim, D. Bae, Y. Choi, and L. S. Kim, "14.6 A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems," in ISSCC, 2016.
- F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in DATE, 2015.
- Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, "DeepBurning: Automatic generation of FPGA-based learning accelerators for the neural network family," in DAC, 2016.
- L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, "C-Brain: A deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization," in DAC, 2016.
- C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
- H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Misra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in MICRO, 2016.
- M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO, 2016.
- N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-S. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
- J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in ISCA, 2016.
- P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in ISCA, 2016.
- L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in HPCA, 2017.
- E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengil, M. Liu, D. Lo, S. Alkalay, M. Haselman, C. Boehn, O. Firestein, A. Forin, K. S. Gatlin, M. Ghandi, S. Heil, K. Holohan, T. Juhasz, R. K. Kovvuri, S. Lanka, F. van Megen, D. Mukhortov, P. Patel, S. Reinhardt, A. Sapek, R. Seera, B. Sridharan, L. Woods, P. Yi-Xiao, R. Zhao, and D. Burger, "Accelerating persistent neural networks at datacenter scale," in HotChips, 2017.
- N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., "In-datacenter performance analysis of a tensor processing unit," in ISCA, 2017.
- "Apple A11 Bionic." https://en.wikipedia.org/wiki/Apple_A11.
- S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv, 2016.
- C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," arXiv, 2016.
- F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv, 2016.
- I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," arXiv, 2016.
- A. K. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: Wide reduced-precision networks," arXiv, 2017.
- B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "DVAFS: Trading computational accuracy for energy through dynamic-voltage-accuracy-frequency-scaling," in DATE, 2017.
- S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, "Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks," arXiv, 2017.
- J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision," in ISSCC, 2018.
- A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv, 2014.
- Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
- A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., 2009.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
- K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
- M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, "Building a large annotated corpus of English: The Penn Treebank," Computational Linguistics, 1993.
- S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, "CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques," in ICCAD, 2011.
- "NVIDIA TensorRT 4.0." https://developer.nvidia.com/tensorrt.
- H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in ISCA, 2011.
- T. Rzayev, S. Moradi, D. H. Albonesi, and R. Manohar, "DeepRecon: Dynamically reconfigurable architecture for accelerating deep neural networks," in IJCNN, 2017.
- B. Moons and M. Verhelst, "A 0.3-2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets," in VLSI Circuits, 2016.
- Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in FPGA, 2017.
- R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights," arXiv, 2016.
- K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, et al., "BRein memory: A 13-layer 4.2K neuron/0.8M synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm CMOS," in VLSI, 2017.
- H. Kim, J. Sim, Y. Choi, and L.-S. Kim, "A kernel decomposition architecture for binary-weight convolutional neural networks," in DAC, 2017.
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in ISCA, 2017.
- A. Yazdanbakhsh, H. Falahati, P. J. Wolfe, K. Samadi, H. Esmaeilzadeh, and N. S. Kim, "GANAX: A unified SIMD-MIMD acceleration for generative adversarial networks," in ISCA, 2018.
- V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. K. Gupta, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in ISCA, 2018.
- Y. Shen, M. Ferdman, and P. Milder, "Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer," in FCCM, 2017.
- M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," arXiv, 2016.
- E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, "Core fusion: Accommodating software diversity in chip multiprocessors," in ISCA, 2007.
- C. Kim, S. Sethumadhavan, M. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler, "Composable lightweight processors," in MICRO, 2007.