ABSTRACT
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem that distributed load balancing systems can address. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood, and prior work on work stealing has largely focused on shared-memory machines. In this work, we investigate the design and scalability of work stealing on modern distributed-memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
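The core work-stealing idea the abstract refers to can be conveyed with a minimal sketch: each worker owns a double-ended queue, pops tasks from its own end, and, when idle, steals from the opposite end of a randomly chosen victim. The paper targets distributed memory (one-sided operations across nodes); the shared-memory Python version below only illustrates the scheduling idea, and all names in it are hypothetical, not the authors' implementation.

```python
import collections
import random
import threading

def run_work_stealing(tasks, num_workers=4):
    """Minimal work-stealing sketch: each worker owns a deque, pops work
    from its own tail (LIFO), and steals from a victim's head (FIFO)."""
    deques = [collections.deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]
    for i, t in enumerate(tasks):  # round-robin initial distribution
        deques[i % num_workers].append(t)

    results = []
    res_lock = threading.Lock()

    def worker(rank):
        while True:
            task = None
            with locks[rank]:
                if deques[rank]:
                    task = deques[rank].pop()          # local work, own tail
            if task is None:                           # local deque empty: steal
                victim = random.randrange(num_workers)
                with locks[victim]:
                    if deques[victim]:
                        task = deques[victim].popleft()  # steal from victim's head
            if task is None:
                # Crude termination check; sufficient here because tasks
                # do not spawn new tasks, so deques only ever shrink.
                if not any(deques):
                    return
                continue
            out = task()                               # execute outside all locks
            with res_lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the head while the owner pops from the tail keeps contention between thief and victim low; on distributed memory the same split lets steals proceed via remote one-sided accesses without interrupting the victim's computation.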