Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform

Published: 07 October 2004

Abstract

Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of fly-weight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response to long-latency cache miss events, and to run helper threads in the shadow of cache misses when the main thread would be otherwise stalled.

The concept of VMT is prototyped on an Itanium® 2 processor using features provided by the Processor Abstraction Layer (PAL) firmware mechanism already present in currently shipping processors. On a 4-way MP physical system equipped with VMT-enabled Itanium 2 processors, helper threading via the VMT mechanism can achieve significant performance gains for a diverse set of real-world workloads, ranging from single-threaded workstation benchmarks to heavily multithreaded large-scale decision support systems (DSS) using the IBM DB2 Universal Database. We measure a wall-clock speedup of 5.8% to 38.5% for the workstation benchmarks, and 5.0% to 12.7% on various queries in the DSS workload.
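To make the idea of running a helper "in the shadow of" a cache miss concrete, the following is a minimal toy model in Python. It is not the paper's implementation: the function `run`, the constants `MISS_LATENCY` and `HELPER_COST`, and the `lookahead` parameter are all hypothetical names and values chosen for illustration. The model also assumes, as a simplification, that a prefetched line is available by the time the main thread reaches it.

```python
from typing import List, Set

# Illustrative assumptions, not measured values from the paper.
MISS_LATENCY = 100  # cycles the main thread stalls on one cache miss
HELPER_COST = 20    # cycles the helper needs to issue one prefetch


def run(addresses: List[int], use_helper: bool, lookahead: int = 4) -> int:
    """Return the total cycles the main thread spends stalled on misses."""
    cache: Set[int] = set()
    stall_cycles = 0
    for i, addr in enumerate(addresses):
        if addr in cache:
            continue  # cache hit: no stall, no event
        # Miss event: the main thread would otherwise sit idle.
        stall_cycles += MISS_LATENCY
        cache.add(addr)
        if use_helper:
            # Switch-on-event: spend the stall time running the helper,
            # which prefetches the next few addresses the main thread needs.
            budget = MISS_LATENCY
            for future in addresses[i + 1 : i + 1 + lookahead]:
                if budget < HELPER_COST:
                    break  # the miss shadow is used up
                if future not in cache:
                    cache.add(future)
                    budget -= HELPER_COST
    return stall_cycles


if __name__ == "__main__":
    trace = list(range(10))  # ten distinct lines, so every access misses
    print(run(trace, use_helper=False))
    print(run(trace, use_helper=True))
```

In this toy trace, every access misses without the helper, while with the helper only every fifth access misses, because each miss shadow is long enough to prefetch the next four lines. Real gains depend on how accurately the helper can compute future addresses and on the cost of switching threads, neither of which this sketch models.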

      Published in

        ACM SIGPLAN Notices, Volume 39, Issue 11 (ASPLOS '04)
        November 2004
        283 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/1037187

        ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
        October 2004
        296 pages
        ISBN: 1581138040
        DOI: 10.1145/1024393

        Copyright © 2004 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States
