Faster topology-aware collective algorithms through non-minimal communication

Abstract
Known algorithms for two important collective communication operations, allgather and reduce-scatter, are minimal-communication algorithms: no process sends or receives more than the minimum amount of data. This property, combined with the data-ordering semantics of the operations, limits the flexibility and performance of these algorithms. Our novel non-minimal, topology-aware algorithms deliver far better performance at the cost of a small amount of redundant communication. We develop such algorithms for Clos networks and for single- and multi-ported torus networks. Tests on a 32k-node Blue Gene/P show allgather speedups of up to 6x and reduce-scatter speedups of over 11x compared to the native IBM algorithm. Because broadcast, reduce, and allreduce can be composed from allgather or reduce-scatter together with other collective operations, our techniques also improve the performance of those operations.
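To make the "minimal-communication" property concrete, the sketch below simulates a standard ring allgather among p simulated processes (this is a well-known baseline algorithm, not the paper's non-minimal one; the function name and data layout are illustrative assumptions). Each process contributes one block and, over p-1 steps, forwards one block per step to its right neighbor, so every process sends and receives exactly p-1 blocks, the minimum needed to gather all data.

```python
# Illustrative baseline (NOT the paper's algorithm): a ring allgather
# simulated with plain dictionaries. Process r holds blocks[r] initially;
# after p-1 steps every process holds all p blocks, and each process has
# sent exactly p-1 blocks -- the minimal communication volume.

def ring_allgather(blocks):
    """blocks[i] is the block contributed by simulated process i."""
    p = len(blocks)
    # Each process starts with only its own block, keyed by source rank.
    bufs = [{r: blocks[r]} for r in range(p)]
    sent = [0] * p  # count of blocks sent by each process
    for step in range(p - 1):
        for r in range(p):
            # In step s, process r forwards the block it received in the
            # previous step (initially its own block) to rank (r+1) mod p.
            src = (r - step) % p
            dst = (r + 1) % p
            bufs[dst][src] = bufs[r][src]
            sent[r] += 1
    return bufs, sent
```

Running this with four blocks shows every simulated process ends up with all four blocks while sending exactly three, which is the per-process minimum the abstract refers to.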