
Carbon: architectural support for fine-grained parallelism on chip multiprocessors

Published: 9 June 2007

Abstract

Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach to doing this is to decompose a program into parallel "tasks" and allow an underlying software layer to schedule these tasks to different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads.
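The decompose-into-tasks approach described above is commonly implemented as per-worker task deques with work stealing. The toy sketch below illustrates the idea only; the function names, worker count, and task shapes are all hypothetical, and this is not the (highly tuned) schedulers the paper evaluates.

```python
# Illustrative sketch of software dynamic task scheduling with work
# stealing: each worker pops from its own deque and, when empty,
# steals from the opposite end of a random victim's deque.
import collections
import random
import threading

def run_tasks(tasks, num_workers=4):
    # Distribute the initial tasks round-robin across per-worker deques.
    deques = [collections.deque() for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        deques[i % num_workers].append(task)

    results = []
    lock = threading.Lock()

    def worker(wid):
        while True:
            try:
                task = deques[wid].popleft()      # pop from the owner end
            except IndexError:
                # Local deque empty: try to steal from a random victim.
                victim = random.randrange(num_workers)
                try:
                    task = deques[victim].pop()   # steal from the far end
                except IndexError:
                    if not any(deques):           # every deque is drained
                        return
                    continue
            r = task()                            # run outside the lock
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Example: 100 tiny "tasks" that each square a number.
out = run_tasks([(lambda i=i: i * i) for i in range(100)])
print(sorted(out) == [i * i for i in range(100)])  # True
```

Every scheduling action here (the deque operations, the failed-steal retries) is exactly the per-task software overhead that becomes significant when tasks are small.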

We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks for which software task schedulers achieve only limited parallel speedups.
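To see why small tasks cap the speedup, suppose each task carries a fixed software-scheduling overhead of h cycles on top of t cycles of useful work; the useful fraction of each core's time is then t / (t + h), bounding the speedup on P cores at roughly P * t / (t + h). The numbers below are purely illustrative, not measurements from the paper.

```python
# Back-of-the-envelope model of scheduling overhead (illustrative
# numbers): with t cycles of work and h cycles of overhead per task,
# a P-core CMP achieves at best P * t / (t + h).
def max_speedup(p_cores, task_cycles, overhead_cycles):
    return p_cores * task_cycles / (task_cycles + overhead_cycles)

# Large tasks amortize the overhead ...
print(round(max_speedup(64, 100_000, 200), 1))  # 63.9
# ... but for fine-grained tasks the same overhead eats the speedup.
print(round(max_speedup(64, 1_000, 200), 1))    # 53.3
```

Shrinking h in hardware, rather than growing t, is precisely the lever a hardware task scheduler pulls.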

We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon requires relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to highly tuned software task schedulers on a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, it is 68% faster on a set of loop-parallel benchmarks and 109% faster on a set of task-parallel benchmarks.



• Published in

  ACM SIGARCH Computer Architecture News, Volume 35, Issue 2, May 2007. 527 pages. ISSN 0163-5964. DOI 10.1145/1273440.

  ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, June 2007. 542 pages. ISBN 9781595937063. DOI 10.1145/1250662. General Chair: Dean Tullsen. Program Chair: Brad Calder.

      Copyright © 2007 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
