
Faster topology-aware collective algorithms through non-minimal communication

Published: 25 February 2012

Abstract

Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms: no process sends or receives more than the minimum amount of data. This, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-minimal, topology-aware algorithms deliver far better performance at the cost of a very small amount of redundant communication. We develop such algorithms for Clos networks and for single- and multi-ported torus networks. Tests on a 32k-node Blue Gene/P show allgather speedups of up to 6x and reduce-scatter speedups of over 11x compared to the native IBM algorithms. Broadcast, reduce, and allreduce can be composed from allgather or reduce-scatter together with other collective operations, so our techniques also improve the performance of those operations.
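The composition mentioned at the end of the abstract follows a standard recipe: allreduce, for example, can be built from a reduce-scatter followed by an allgather, so any speedup to those two building blocks carries over to the composed operation. The MPI sketch below illustrates that recipe only; it is not the paper's topology-aware implementation, and the helper composed_allreduce and its assumption that the communicator size divides the vector length are ours.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper illustrating the standard composition:
 * allreduce = reduce-scatter + allgather.
 * Assumes the communicator size evenly divides count. */
static void composed_allreduce(const double *sendbuf, double *recvbuf,
                               int count, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int chunk = count / nprocs;           /* elements per rank */
    int *counts = malloc(nprocs * sizeof *counts);
    for (int i = 0; i < nprocs; i++)
        counts[i] = chunk;

    double *mine = malloc(chunk * sizeof *mine);

    /* Phase 1 (reduce-scatter): rank i ends up with chunk i of the
     * element-wise sum across all ranks. */
    MPI_Reduce_scatter(sendbuf, mine, counts, MPI_DOUBLE, MPI_SUM, comm);

    /* Phase 2 (allgather): every rank collects the reduced chunks from
     * all other ranks, completing the allreduce. */
    MPI_Allgather(mine, chunk, MPI_DOUBLE, recvbuf, chunk, MPI_DOUBLE, comm);

    free(mine);
    free(counts);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int count = 4 * nprocs;               /* length divisible by nprocs */
    double *in  = malloc(count * sizeof *in);
    double *out = malloc(count * sizeof *out);
    for (int i = 0; i < count; i++)
        in[i] = rank + i;

    composed_allreduce(in, out, count, MPI_COMM_WORLD);

    /* Element 0 should equal the sum of all ranks: 0 + 1 + ... + (nprocs-1). */
    if (rank == 0)
        printf("out[0] = %g (expected %g)\n",
               out[0], nprocs * (nprocs - 1) / 2.0);

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}
```

Broadcast and reduce decompose analogously (scatter followed by allgather, and reduce-scatter followed by gather, respectively), which is why improvements to allgather and reduce-scatter propagate to all three composed collectives.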



• Published in

  ACM SIGPLAN Notices, Volume 47, Issue 8 (PPoPP '12), August 2012, 334 pages.
  ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2370036

  PPoPP '12: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2012, 352 pages.
  ISBN: 9781450311601. DOI: 10.1145/2145816

        Copyright © 2012 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

