Abstract
Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, programming such a heterogeneous multicore platform remains a significant challenge, because these specialized accelerators feature ISAs and functionality that differ significantly from those of the general-purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture that represents heterogeneous accelerators as ISA-based MIMD architecture resources, together with a shared-virtual-memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general-purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages. The CHI compiler extends the OpenMP pragma for heterogeneous multithreaded programming and produces a single fat binary with code sections corresponding to different instruction sets. The runtime can judiciously spread parallel computation across the heterogeneous cores to optimize performance and power.
We have prototyped the EXO architecture on a physical heterogeneous platform consisting of an Intel® Core™ 2 Duo processor and an 8-core, 32-thread Intel® Graphics Media Accelerator X3000. In addition, we have implemented the CHI integrated programming environment with the Intel® C++ Compiler, runtime toolset, and debugger. On the EXO prototype system, we have enhanced a suite of production-quality media kernels for video and image processing to utilize the accelerator through the CHI programming interface, achieving significant speedup (1.41X to 10.97X) over execution on the IA32 CPU alone.
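As a point of reference for the OpenMP-based programming model mentioned above, the following is a minimal sketch of a data-parallel media-style loop written with a standard OpenMP pragma in C++. It is not CHI code: the abstract does not give the syntax of CHI's pragma extension, so none is shown here. The function name `brighten` and the kernel itself are hypothetical, chosen only to illustrate the kind of loop-level parallelism that an extended pragma could direct to an accelerator instead of the CPU cores.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from the paper): add `delta` to each 8-bit pixel,
// saturating the result to the range [0, 255].
// Uses only standard OpenMP; if OpenMP is disabled, the pragma is ignored
// and the loop runs serially with the same result.
std::vector<unsigned char> brighten(const std::vector<unsigned char>& in, int delta) {
    std::vector<unsigned char> out(in.size());
    const long n = static_cast<long>(in.size());

    // Each iteration writes a distinct element, so iterations are independent.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        int v = static_cast<int>(in[static_cast<std::size_t>(i)]) + delta;
        if (v > 255) v = 255;
        if (v < 0) v = 0;
        out[static_cast<std::size_t>(i)] = static_cast<unsigned char>(v);
    }
    return out;
}
```

With a typical compiler this builds with an OpenMP flag such as `-fopenmp`; here the parallel region runs on the host CPU threads only.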
EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system
PLDI '07: Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation