Abstract
Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach to doing this is to decompose a program into parallel "tasks" and allow an underlying software layer to schedule these tasks to different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads.
We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks for which software task schedulers achieve only limited parallel speedups.
We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon has relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to some highly tuned software task schedulers for a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, 68% faster on a set of loop-parallel benchmarks, and 109% faster on a set of task-parallel benchmarks.
Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture.