Abstract
Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both the sequential, control-intensive and the data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between CPU system memory and accelerator memory.
This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space in which CPUs can access objects in the accelerator physical memory, but not vice versa. This asymmetry permits light-weight implementations that avoid the common pitfalls of symmetric distributed shared memory systems. ADSM lets programmers assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated in the shared logical memory space, which is hosted in the accelerator physical memory yet remains transparently accessible to methods executing on CPUs.
We argue that ADSM reduces the programming effort required for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, built on top of CUDA in a GNU/Linux environment. We show that applications written for ADSM and running on GMAC achieve performance comparable to their counterparts that use programmer-managed data transfers. This paper presents the GMAC system, evaluates its design choices, and suggests additional architectural support that would likely allow GMAC to exceed the performance of the current CUDA model.