ABSTRACT
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors, and that radical new languages must therefore be invented. In this paper, we question this thinking and offer evidence that traditional programming methods on common multi-core processors and upcoming many-core architectures are effective in terms of performance versus programming effort, delivering significant speedup and close-to-optimal performance for commonly used parallel computing workloads.
We first quantify the extent of the "Ninja gap": the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and the best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes, coupled with advancements in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes typically require low programming effort, in contrast to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth across future architectures, providing strong evidence that radical language changes are not required.
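To make the "low programming effort" claim concrete, the sketch below illustrates the general kind of restructuring the abstract alludes to (it is not taken from the paper's benchmarks): a naive serial loop, and the same loop rewritten with `restrict` qualifiers and an OpenMP pragma so that a modern compiler can parallelize and vectorize it. The function names and the SAXPY kernel are illustrative choices, not the paper's code.

```c
#include <assert.h>

/* Naive, parallelism-unaware loop: correct, but the compiler cannot
 * safely vectorize it unless it can prove x and y never alias. */
void saxpy_naive(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Low-effort restructuring: `restrict` asserts the arrays do not
 * overlap, and the OpenMP pragma exposes both thread-level and
 * SIMD-level parallelism to the compiler. If OpenMP is disabled,
 * the pragma is ignored and the loop still runs serially. */
void saxpy_tuned(int n, float a,
                 const float *restrict x, float *restrict y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The point of the example is that the two versions are semantically identical; only a qualifier and a pragma separate "naive" from "compiler-friendly" code, which is the kind of modest change the paper argues narrows the Ninja gap.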
Can traditional programming bridge the Ninja performance gap for parallel computing applications? In ISCA '12.