ABSTRACT
Current-generation Deep Neural Networks (DNNs), such as AlexNet and VGG, rely heavily on dense floating-point matrix multiplication (GEMM), which maps well to GPUs (regular parallelism, high TFLOP/s). Because of this, GPUs are widely used for accelerating DNNs. Current FPGAs offer superior energy efficiency (Ops/Watt), but they do not offer the performance of today's GPUs on DNNs. In this paper, we examine upcoming FPGA technology advances and the rapid pace of innovation in DNN algorithms, and consider whether future high-performance FPGAs will outperform GPUs for next-generation DNNs. The upcoming Intel® 14-nm Stratix® 10 FPGAs will have thousands of hard floating-point units (DSPs) and on-chip RAMs (M20K memory blocks). They will also have high-bandwidth memories (HBMs) and improved frequency (HyperFlex™ core architecture). This combination of features brings FPGA raw floating-point performance within striking distance of GPUs. Meanwhile, DNNs are quickly evolving. For example, recent innovations that exploit sparsity (e.g., pruning) and compact data types (e.g., 1-2 bit) result in major leaps in algorithmic efficiency. However, these innovations introduce irregular parallelism on custom data types, which are difficult for GPUs to handle but are a great fit for FPGAs' extreme customizability.
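To make the compact-data-type point concrete, here is a minimal C++ sketch (our own illustration, not code from the paper) of the arithmetic behind a binarized dot product. With weights and activations constrained to {+1, -1} and packed one bit per value, a 64-element dot product collapses to an XNOR and a population count, with no multipliers at all; the bit-level operations and nonstandard word widths involved are exactly the kind of custom data path an FPGA can instantiate directly.

```cpp
#include <cstdint>
#include <bit>  // std::popcount (C++20)

// Illustrative sketch: binarized dot product over 64 packed values.
// Encoding assumption: bit 1 represents +1, bit 0 represents -1.
// Each bit pair contributes +1 when the signs agree and -1 otherwise,
// so the dot product is matches - (64 - matches) = 2*matches - 64.
int binary_dot64(uint64_t activations, uint64_t weights) {
    uint64_t agree = ~(activations ^ weights);  // XNOR: 1 where signs match
    int matches = std::popcount(agree);         // count of +1 partial products
    return 2 * matches - 64;
}
```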
This paper evaluates a selection of emerging DNN algorithms on two generations of Intel FPGAs (Arria® 10 and Stratix® 10) against the latest, highest-performance NVIDIA Titan X Pascal GPU. We created a customizable DNN accelerator template for FPGAs and used it in our evaluations. First, we study various GEMM operations for next-generation DNNs. Our results show that the Stratix 10 FPGA is 10%, 50%, and 5.4x better in performance (TOP/s) than the Titan X Pascal GPU on GEMM operations for pruned, Int6, and binarized DNNs, respectively. Then, we present a detailed case study on accelerating Ternary ResNet, which relies on sparse GEMM on 2-bit weights (i.e., weights constrained to 0, +1, -1) and full-precision neurons. The Ternary ResNet accuracy is within ~1% of the full-precision ResNet, which won the 2015 ImageNet competition. On Ternary ResNet, the Stratix 10 FPGA can deliver 60% better performance than the Titan X Pascal GPU, while being 2.3x better in performance/watt. Our results indicate that FPGAs may become the platform of choice for accelerating next-generation DNNs.
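The sparse ternary GEMM at the heart of the case study can be sketched as follows (a hypothetical illustration under our own assumptions, not the paper's accelerator template). With weights restricted to {-1, 0, +1}, the zero weights are pruned away entirely and each surviving entry is stored as a (column index, sign) pair, so every "multiply" degenerates to a floating-point add or subtract on a full-precision activation; this irregular, multiplier-free inner loop is what GPUs handle poorly and FPGAs can customize for.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: one output of a sparse ternary matrix-vector
// product. Only nonzero weights are stored, as parallel arrays of
// column indices and signs (+1 or -1), so the loop performs no
// multiplications -- just conditional adds and subtracts.
float ternary_row_dot(const std::vector<int32_t>& cols,   // indices of nonzero weights
                      const std::vector<int8_t>& signs,   // +1 or -1 per nonzero
                      const std::vector<float>& x) {      // full-precision activations
    float acc = 0.0f;
    for (size_t k = 0; k < cols.size(); ++k) {
        acc += (signs[k] > 0) ? x[cols[k]] : -x[cols[k]];
    }
    return acc;
}
```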