Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Published: 19 June 2010

ABSTRACT

Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing increasingly important for emerging applications. Our analysis of a set of important throughput computing kernels shows that these kernels contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an NVIDIA GTX 280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contribute to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.


Published in

ISCA '10: Proceedings of the 37th Annual International Symposium on Computer Architecture, June 2010, 520 pages. ISBN: 9781450300537. DOI: 10.1145/1815961.

Also appears in ACM SIGARCH Computer Architecture News, Volume 38, Issue 3 (ISCA '10), June 2010, 508 pages. ISSN: 0163-5964. DOI: 10.1145/1816038.

          Copyright © 2010 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
