Abstract
Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach to doing this is to decompose a program into parallel "tasks" and allow an underlying software layer to schedule these tasks to different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads.
We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks for which software task schedulers achieve only limited parallel speedups.
We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon has relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to some highly tuned software task schedulers for a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, 68% faster on a set of loop-parallel benchmarks, and 109% faster on a set of task-parallel benchmarks.
Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture.