Research article
DOI: 10.1109/ISCA.2018.00012

A configurable cloud-scale DNN processor for real-time AI

Published: 02 June 2018

ABSTRACT

Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, also known as "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates.

This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
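As a rough, back-of-the-envelope check on these figures (the 2,688-wide matrix dimension and the 250 MHz clock below are illustrative assumptions, not values stated in the abstract):

    2,688 × 2,688 ≈ 7.2 million multiply-accumulates dispatched by a single matrix-vector instruction
    96,000 MACs × 2 flops/MAC × 250 MHz = 48 Tflop/s peak throughput

A single matrix-vector multiply over a weight matrix of a few thousand elements per side is thus enough to account for the claimed 7M+ operations per instruction, and the assumed peak comfortably brackets the reported 10 to 35+ effective teraflops at batch size 1.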

