Abstract
Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both the sequential, control-intensive and the data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between CPU system memory and accelerator memory.
This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space in which CPUs can access objects in the accelerator physical memory, but not vice versa. This asymmetry permits light-weight implementations that avoid the common pitfalls of symmetric distributed shared memory systems. ADSM lets programmers assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated in the shared logical memory space, which is hosted in the accelerator physical memory yet remains transparently accessible to methods executing on CPUs.
We argue that ADSM reduces the programming effort required for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, built on top of CUDA in a GNU/Linux environment. We show that applications written for ADSM and running on GMAC achieve performance comparable to their counterparts that use programmer-managed data transfers. This paper presents the GMAC system, evaluates its design choices, and suggests additional architectural support that would likely allow GMAC to exceed the performance of the current CUDA model.