Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

ISCA '10

ABSTRACT
Recent advances in computing have led to an explosion in the amount of data being generated. The need to process this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that they contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPUs and GPUs, analyze the architectural features that contribute to the performance difference between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.
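The abstract's 2.5X average can be sanity-checked with a back-of-envelope comparison of the two processors' peak capabilities. The sketch below is not from this paper; the specification numbers are assumptions drawn from vendor datasheets for the Core i7-960 and GTX280, and simply bound how large a well-optimized speedup could plausibly be.

```python
# Back-of-envelope bound on the achievable GPU-over-CPU speedup, using
# peak single-precision throughput and peak memory bandwidth.
# All spec values below are assumptions taken from vendor datasheets.

# Intel Core i7-960: 3.2 GHz x 4 cores x 4-wide SSE x 2 ops/cycle (mul + add)
cpu_sp_gflops = 3.2 * 4 * 4 * 2        # ~102.4 GFLOP/s
cpu_bw_gbs = 32.0                      # GB/s, 3-channel DDR3

# Nvidia GTX280: 1.3 GHz x 240 scalar cores x 3 flops/cycle (dual-issue MAD + MUL)
gpu_sp_gflops = 1.3 * 240 * 3          # ~936 GFLOP/s
gpu_bw_gbs = 141.7                     # GB/s, GDDR3

compute_ratio = gpu_sp_gflops / cpu_sp_gflops    # ~9.1x
bandwidth_ratio = gpu_bw_gbs / cpu_bw_gbs        # ~4.4x

print(f"peak compute ratio:   {compute_ratio:.1f}x")
print(f"peak bandwidth ratio: {bandwidth_ratio:.1f}x")
```

Under these assumed peaks, a compute-bound kernel can gain at most roughly 9x and a bandwidth-bound kernel roughly 4.4x, so once both implementations are well optimized, claims of 100X or more imply an unoptimized CPU baseline rather than an architectural advantage; a measured average of 2.5X sits comfortably inside these bounds.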