ABSTRACT
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. Some believe that traditional approaches to programming do not apply to these modern processors, and that radical new languages must therefore be invented. In this paper, we question this thinking and offer evidence that traditional programming methods on common multi-core processors and upcoming many-core architectures are effective in terms of performance versus programming effort, delivering significant speedup and close-to-optimal performance for commonly used parallel computing workloads.
We first quantify the extent of the "Ninja gap": the performance gap between naively written, parallelism-unaware (often serial) C/C++ code and the best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes, coupled with advancements in modern compiler technology, can bring the Ninja gap down to an average of just 1.3X. These changes typically require low programming effort, in contrast to the very high effort of producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth across future architectures, providing strong evidence that radical language changes are not required.
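To make the "low programming effort" claim concrete, the sketch below illustrates the general kind of restructuring the abstract alludes to (it is not taken from the paper's benchmarks): a naive serial loop, and the same loop rewritten with `restrict` qualifiers and an OpenMP pragma so that a modern compiler can parallelize and vectorize it. The function names and the SAXPY kernel are illustrative choices, not the paper's code.

```c
#include <assert.h>

/* Naive, parallelism-unaware loop: correct, but the compiler cannot
 * safely vectorize it unless it can prove x and y never alias. */
void saxpy_naive(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Low-effort restructuring: `restrict` asserts the arrays do not
 * overlap, and the OpenMP pragma exposes both thread-level and
 * SIMD-level parallelism to the compiler. If OpenMP is disabled,
 * the pragma is ignored and the loop still runs serially. */
void saxpy_tuned(int n, float a,
                 const float *restrict x, float *restrict y) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The point of the example is that the two versions are semantically identical; only a qualifier and a pragma separate "naive" from "compiler-friendly" code, which is the kind of modest change the paper argues narrows the Ninja gap.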
Can traditional programming bridge the Ninja performance gap for parallel computing applications? In ISCA '12.