Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

ISCA '10

ABSTRACT
Recent advances in computing have led to an explosion in the amount of data being generated. The need to process this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that they contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPUs and GPUs, analyze the architectural features that contribute to the performance difference between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.
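The abstract's 2.5X average can be sanity-checked with a back-of-envelope comparison of the two processors' peak capabilities. The sketch below is not from this paper; the specification numbers are assumptions drawn from vendor datasheets for the Core i7-960 and GTX280, and simply bound how large a well-optimized speedup could plausibly be.

```python
# Back-of-envelope bound on the achievable GPU-over-CPU speedup, using
# peak single-precision throughput and peak memory bandwidth.
# All spec values below are assumptions taken from vendor datasheets.

# Intel Core i7-960: 3.2 GHz x 4 cores x 4-wide SSE x 2 ops/cycle (mul + add)
cpu_sp_gflops = 3.2 * 4 * 4 * 2        # ~102.4 GFLOP/s
cpu_bw_gbs = 32.0                      # GB/s, 3-channel DDR3

# Nvidia GTX280: 1.3 GHz x 240 scalar cores x 3 flops/cycle (dual-issue MAD + MUL)
gpu_sp_gflops = 1.3 * 240 * 3          # ~936 GFLOP/s
gpu_bw_gbs = 141.7                     # GB/s, GDDR3

compute_ratio = gpu_sp_gflops / cpu_sp_gflops    # ~9.1x
bandwidth_ratio = gpu_bw_gbs / cpu_bw_gbs        # ~4.4x

print(f"peak compute ratio:   {compute_ratio:.1f}x")
print(f"peak bandwidth ratio: {bandwidth_ratio:.1f}x")
```

Under these assumed peaks, a compute-bound kernel can gain at most roughly 9x and a bandwidth-bound kernel roughly 4.4x, so once both implementations are well optimized, claims of 100X or more imply an unoptimized CPU baseline rather than an architectural advantage; a measured average of 2.5X sits comfortably inside these bounds.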