Research article · DOI: 10.5555/2337159.2337210

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Published: 09 June 2012

ABSTRACT

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors and that radical new languages must therefore be invented. In this paper, we question this thinking and offer evidence that traditional programming methods remain effective in terms of performance versus programming effort: common multi-core processors and upcoming many-core architectures can deliver significant speedups, and close-to-optimal performance, for commonly used parallel computing workloads.

We first quantify the extent of the "Ninja gap", the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes, coupled with advances in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes typically require low programming effort, compared with the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that the otherwise uncontrolled growth of the Ninja gap can be contained, offering more stable and predictable performance growth across future architectures and strong evidence that radical language changes are not required.
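To make the scale of these changes concrete, here is a minimal sketch (an illustration in the spirit of the abstract, not the paper's benchmark code; the kernel and function names are hypothetical) of the kind of low-effort, traditional C change described above: a `restrict` qualifier plus a standard OpenMP pragma lets the compiler distribute the loop across cores and auto-vectorize it for SIMD, with no new language required.

```c
#include <stddef.h>

/* Naive, parallelism-unaware kernel: without aliasing information the
 * compiler must assume x and y may overlap, which inhibits SIMD code
 * generation, and nothing distributes the loop across cores. */
void scale_add_naive(long n, float a, const float *x, float *y) {
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Low-effort "traditional" rewrite: restrict asserts that the arrays do
 * not alias, so the auto-vectorizer can emit SIMD instructions, and the
 * OpenMP pragma spreads loop iterations across cores. */
void scale_add_parallel(long n, float a,
                        const float *restrict x, float *restrict y) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiled with an optimizing, OpenMP-enabled compiler (e.g., `gcc -O3 -fopenmp`), the second version is a two-line change to the first, which is the sense in which such changes "require low programming effort" relative to hand-written intrinsics or assembly.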

Published in

ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture, June 2012, 584 pages. ISBN: 9781450316422.

Also published in ACM SIGARCH Computer Architecture News, Volume 40, Issue 3 (ISCA '12), June 2012, 559 pages. ISSN: 0163-5964. DOI: 10.1145/2366231.

Publisher

IEEE Computer Society, United States


Acceptance Rates

ISCA '12 paper acceptance rate: 47 of 262 submissions (18%). Overall ISCA acceptance rate: 543 of 3,203 submissions (17%).
