X10 and APGAS at Petascale

Authors:
Olivier Tardieu (IBM T.J. Watson Research Center, NY, USA)
Benjamin Herta (IBM T.J. Watson Research Center, NY, USA)
David Cunningham (IBM T.J. Watson Research Center, NY, USA)
David Grove (IBM T.J. Watson Research Center, NY, USA)
Prabhanjan Kambadur (IBM T.J. Watson Research Center, NY, USA)
Vijay Saraswat (IBM T.J. Watson Research Center, NY, USA)
Avraham Shinnar (IBM T.J. Watson Research Center, NY, USA)
Mikio Takeuchi (IBM Research - Tokyo, Japan)
Mandana Vaziri (IBM T.J. Watson Research Center, NY, USA)
Wei Zhang (IBM T.J. Watson Research Center, NY, USA)

ACM Transactions on Parallel Computing, Volume 2, Issue 4, Article No. 25, pp. 1–32. https://doi.org/10.1145/2894746

Published: 15 March 2016

Abstract

X10 is a high-performance, high-productivity programming language aimed at large-scale distributed and shared-memory parallel applications. It is based on the Asynchronous Partitioned Global Address Space (APGAS) programming model, supporting the same fine-grained concurrency mechanisms within and across shared-memory nodes.

We demonstrate that X10 delivers solid performance at petascale by running eight application kernels, under weak scaling, on an IBM Power 775 supercomputer using up to 55,680 Power7 cores (1.7 Pflop/s of theoretical peak performance). For the four HPC Class 2 Challenge benchmarks, X10 achieves 41% to 87% of the system’s potential at scale (as measured by IBM’s HPCC Class 1 optimized runs). We also implement K-Means, Smith-Waterman, Betweenness Centrality, and Unbalanced Tree Search (UTS) for geometric trees. Our UTS implementation is the first to scale to petaflop systems.
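The theoretical peak quoted above can be checked with a short calculation. The sketch below assumes a 3.84 GHz clock and 8 double-precision floating-point operations per cycle per Power7 core; neither figure appears in the abstract, so both should be read as assumptions made for illustration. Only the core count comes from the text.

```python
# Back-of-the-envelope check of the theoretical peak quoted in the abstract.
# CORES comes from the abstract; CLOCK_HZ and FLOPS_PER_CYCLE are ASSUMED
# hardware figures for a Power7 core and are not stated in the abstract.
CORES = 55_680
CLOCK_HZ = 3.84e9        # assumption: 3.84 GHz clock
FLOPS_PER_CYCLE = 8      # assumption: 8 double-precision flops per cycle per core


def theoretical_peak_flops(cores, clock_hz, flops_per_cycle):
    """Return the theoretical peak in flop/s: cores x clock x flops per cycle."""
    return cores * clock_hz * flops_per_cycle


peak = theoretical_peak_flops(CORES, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{peak / 1e15:.2f} Pflop/s")
```

Under these assumptions the product rounds to the 1.7 Pflop/s figure given in the abstract.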

We describe the advances in distributed termination detection, distributed load balancing, and the use of high-performance interconnects that enable X10 to scale out to tens of thousands of cores. We discuss how this work is driving the evolution of the X10 language, core class libraries, and runtime systems.
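To make the termination detection problem concrete, here is a minimal single-process sketch of a finish-style scope: it waits until every task spawned inside it, directly or transitively, has completed. The names `Finish`, `spawn`, `visit`, and `count_nodes` are hypothetical and are not X10 APIs. The sketch uses one shared counter guarded by a lock, which is an assumption made for brevity; the abstract does not describe the distributed protocols the paper uses, and a single central counter is the kind of design that would not be expected to scale across many nodes.

```python
import threading


class Finish:
    """Hypothetical finish-style scope: wait() returns once all tasks
    spawned through this object, directly or transitively, are done."""

    def __init__(self):
        self._pending = 0
        self._cond = threading.Condition()

    def spawn(self, fn, *args):
        # Count the task before it starts, so the counter cannot reach
        # zero while a parent task is still creating children.
        with self._cond:
            self._pending += 1

        def run():
            try:
                fn(self, *args)
            finally:
                with self._cond:
                    self._pending -= 1
                    if self._pending == 0:
                        self._cond.notify_all()

        threading.Thread(target=run).start()

    def wait(self):
        with self._cond:
            while self._pending > 0:
                self._cond.wait()


def visit(finish, depth, counter, lock):
    """Count this node, then spawn one task per child.
    A node at depth d has d children, each at depth d - 1."""
    with lock:
        counter[0] += 1
    for _ in range(depth):
        finish.spawn(visit, depth - 1, counter, lock)


def count_nodes(depth):
    """Traverse the tree in parallel and return the number of nodes."""
    counter = [0]
    lock = threading.Lock()
    finish = Finish()
    finish.spawn(visit, depth, counter, lock)
    finish.wait()
    return counter[0]


if __name__ == "__main__":
    print(count_nodes(3))
```

The key invariant is that a child is counted before its parent finishes, so the pending count only reaches zero when the whole task tree has completed. In a distributed setting the same invariant has to be maintained without a shared counter, which is where the cost of termination detection arises.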

Index Terms

Software and its engineering → Software notations and tools → General programming languages → Language types → Distributed programming languages
Theory of computation → Design and analysis of algorithms → Distributed algorithms

Published in
ACM Transactions on Parallel Computing Volume 2, Issue 4
Special Issue on PPOPP 2014
March 2016
202 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/2888415
Editor:
Phillip B. Gibbons
Carnegie Mellon University, Pittsburgh, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 March 2016
- Accepted: 1 February 2016
- Revised: 1 January 2016
- Received: 1 December 2014
Author Tags
APGAS
X10
performance
scalability
Qualifiers
- research-article
- Research
- Refereed