ABSTRACT
Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not offer the performance of today's GPUs on DNNs. In this paper, we examine upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel® 14-nm Stratix® 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks). They will also have high-bandwidth memories (HBMs) and improved frequency (HyperFlex™ core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are quickly evolving. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bit) result in major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which are difficult for GPUs to handle but are a great fit for FPGAs' extreme customizability.
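To make the compact-data-type point concrete, here is a minimal C++ sketch (our own illustration, not code from the paper) of the arithmetic behind a binarized dot product. With weights and activations constrained to {+1, -1} and packed one bit per value, a 64-element dot product collapses to an XNOR and a population count, with no multipliers at all; the bit-level operations and nonstandard word widths involved are exactly the kind of custom data path an FPGA can instantiate directly.

```cpp
#include <cstdint>
#include <bit>  // std::popcount (C++20)

// Illustrative sketch: binarized dot product over 64 packed values.
// Encoding assumption: bit 1 represents +1, bit 0 represents -1.
// Each bit pair contributes +1 when the signs agree and -1 otherwise,
// so the dot product is matches - (64 - matches) = 2*matches - 64.
int binary_dot64(uint64_t activations, uint64_t weights) {
    uint64_t agree = ~(activations ^ weights);  // XNOR: 1 where signs match
    int matches = std::popcount(agree);         // count of +1 partial products
    return 2 * matches - 64;
}
```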
This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria® 10 and Stratix® 10) against the latest, highest-performance NVIDIA Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/s) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM on 2-bit weights (i.e., weights constrained to 0, +1, -1) and full-precision neurons. The Ternary ResNet accuracy is within ~1% of the full-precision ResNet, which won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA can deliver 60% better performance than the Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.
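The sparse ternary GEMM at the heart of the case study can be sketched as follows (a hypothetical illustration under our own assumptions, not the paper's accelerator template). With weights restricted to {-1, 0, +1}, the zero weights are pruned away entirely and each surviving entry is stored as a (column index, sign) pair, so every "multiply" degenerates to a floating-point add or subtract on a full-precision activation; this irregular, multiplier-free inner loop is what GPUs handle poorly and FPGAs can customize for.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: one output of a sparse ternary matrix-vector
// product. Only nonzero weights are stored, as parallel arrays of
// column indices and signs (+1 or -1), so the loop performs no
// multiplications -- just conditional adds and subtracts.
float ternary_row_dot(const std::vector<int32_t>& cols,   // indices of nonzero weights
                      const std::vector<int8_t>& signs,   // +1 or -1 per nonzero
                      const std::vector<float>& x) {      // full-precision activations
    float acc = 0.0f;
    for (size_t k = 0; k < cols.size(); ++k) {
        acc += (signs[k] > 0) ? x[cols[k]] : -x[cols[k]];
    }
    return acc;
}
```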