ABSTRACT
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem that distributed load balancing systems can address. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood, and prior work on work stealing has largely focused on shared-memory machines. In this work, we investigate the design and scalability of work stealing on modern distributed-memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
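The core work-stealing idea the abstract refers to can be conveyed with a minimal sketch: each worker owns a double-ended queue, pops tasks from its own end, and, when idle, steals from the opposite end of a randomly chosen victim. The paper targets distributed memory (one-sided operations across nodes); the shared-memory Python version below only illustrates the scheduling idea, and all names in it are hypothetical, not the authors' implementation.

```python
import collections
import random
import threading

def run_work_stealing(tasks, num_workers=4):
    """Minimal work-stealing sketch: each worker owns a deque, pops work
    from its own tail (LIFO), and steals from a victim's head (FIFO)."""
    deques = [collections.deque() for _ in range(num_workers)]
    locks = [threading.Lock() for _ in range(num_workers)]
    for i, t in enumerate(tasks):  # round-robin initial distribution
        deques[i % num_workers].append(t)

    results = []
    res_lock = threading.Lock()

    def worker(rank):
        while True:
            task = None
            with locks[rank]:
                if deques[rank]:
                    task = deques[rank].pop()          # local work, own tail
            if task is None:                           # local deque empty: steal
                victim = random.randrange(num_workers)
                with locks[victim]:
                    if deques[victim]:
                        task = deques[victim].popleft()  # steal from victim's head
            if task is None:
                # Crude termination check; sufficient here because tasks
                # do not spawn new tasks, so deques only ever shrink.
                if not any(deques):
                    return
                continue
            out = task()                               # execute outside all locks
            with res_lock:
                results.append(out)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the head while the owner pops from the tail keeps contention between thief and victim low; on distributed memory the same split lets steals proceed via remote one-sided accesses without interrupting the victim's computation.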