
An asymmetric distributed shared memory model for heterogeneous parallel systems

Published: 13 March 2010

Abstract

Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both the sequential, control-intensive phases and the data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.

This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows lightweight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs.
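The asymmetry described above can be illustrated with a toy model in plain Python that involves no real accelerator. This is only a sketch under stated assumptions: `ToyAccelerator`, `alloc`, `call`, and `double_kernel` are hypothetical names invented for illustration and are not the GMAC or CUDA API. A `bytearray` stands in for accelerator physical memory, and the CPU reads and writes objects hosted there directly, while the simulated accelerator refuses to touch memory that lives only on the CPU side.

```python
class ToyAccelerator:
    """Hypothetical stand-in for an accelerator with its own physical memory."""

    def __init__(self, size):
        self.memory = bytearray(size)  # simulated accelerator physical memory
        self._next = 0                 # cursor for a simple bump allocator

    def alloc(self, nbytes):
        """Host an object in accelerator memory and return a view the CPU can use."""
        if self._next + nbytes > len(self.memory):
            raise MemoryError("toy accelerator memory exhausted")
        view = memoryview(self.memory)[self._next:self._next + nbytes]
        self._next += nbytes
        return view

    def call(self, kernel, *objects):
        """Run a kernel 'on the accelerator'.

        Asymmetry: the kernel may only receive objects hosted in accelerator
        memory; CPU-only buffers are rejected.
        """
        for obj in objects:
            if not (isinstance(obj, memoryview) and obj.obj is self.memory):
                raise ValueError("accelerator cannot access CPU-only memory")
        kernel(*objects)


def double_kernel(buf):
    """Toy data-parallel kernel: doubles every byte in place (modulo 256)."""
    for i in range(len(buf)):
        buf[i] = (buf[i] * 2) % 256


acc = ToyAccelerator(64)
data = acc.alloc(4)             # object lives in (simulated) accelerator memory
data[:] = bytes([1, 2, 3, 4])   # CPU writes it directly, no explicit transfer call
acc.call(double_kernel, data)   # accelerator works on the same object
result = list(data)             # CPU reads the results directly
```

In this sketch the program never issues a copy between two address spaces; a real implementation would have to keep the CPU-visible view and the accelerator copy coherent behind the scenes, which the toy model sidesteps by using a single buffer.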

We argue that ADSM reduces programming effort for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.

  • Published in

    ACM SIGARCH Computer Architecture News  Volume 38, Issue 1
    ASPLOS '10
    March 2010
    399 pages
    ISSN:0163-5964
    DOI:10.1145/1735970
    • ACM Conferences
      ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
      March 2010
      422 pages
      ISBN:9781605588391
      DOI:10.1145/1736020
      • General Chair:
      • James C. Hoe,
      • Program Chair:
      • Vikram S. Adve

    Copyright © 2010 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 13 March 2010

    Qualifiers

    • research-article
