Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform

Authors:
Perry H. Wang (Intel Corp.)
Jamison D. Collins (Intel Corp.)
Hong Wang (Intel Corp.)
Dongkeun Kim (Intel Corp. and University of Maryland, College Park, MD)
Bill Greene (Intel Corp.)
Kai-Ming Chan (Intel Corp.)
Aamir B. Yunus (Intel Corp.)
Terry Sych (Intel Corp.)
Stephen F. Moore (Intel Corp.)
John P. Shen (Intel Corp.)

ACM SIGPLAN Notices, Volume 39, Issue 11, November 2004, pp. 144–155. https://doi.org/10.1145/1037187.1024411

Published: 07 October 2004

Abstract

Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of flyweight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response to long-latency cache miss events, and to run helper threads in the shadow of cache misses when the main thread would otherwise be stalled.

The concept of VMT is prototyped on an Itanium® 2 processor using features provided by the Processor Abstraction Layer (PAL) firmware mechanism already present in currently shipping processors. On a 4-way MP physical system equipped with VMT-enabled Itanium 2 processors, helper threading via the VMT mechanism can achieve significant performance gains for a diverse set of real-world workloads, ranging from single-threaded workstation benchmarks to heavily multithreaded large-scale decision support systems (DSS) using the IBM DB2 Universal Database. We measure a wall-clock speedup of 5.8% to 38.5% for the workstation benchmarks, and 5.0% to 12.7% on various queries in the DSS workload.
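To make the switch-on-event idea concrete, the following is a minimal toy cost model, not the authors' implementation. It assumes a main thread that touches a sequence of addresses in order, and a helper that is triggered on each long-latency miss and prefetches a few upcoming addresses while the main thread would be stalled. The latencies, the prefetch depth, and the names `run` and `speedup_percent` are all hypothetical, and the helper's own cost is idealized to zero, so the numbers it produces are illustrative only and are not meant to match the measured speedups.

```python
from typing import Set

MISS_LATENCY = 200  # assumed cycles for a long-latency cache miss (hypothetical)
HIT_LATENCY = 2     # assumed cycles for a cache hit (hypothetical)
HELPER_DEPTH = 4    # assumed number of upcoming accesses one helper instance covers


def run(num_accesses: int, use_helper: bool) -> int:
    """Return total cycles for a main thread touching addresses 0..num_accesses-1."""
    cache: Set[int] = set()
    cycles = 0
    for addr in range(num_accesses):
        if addr in cache:
            cycles += HIT_LATENCY
            continue
        # Long-latency miss event: the main thread would stall here.
        cycles += MISS_LATENCY
        cache.add(addr)
        if use_helper:
            # Switch-on-event: a helper instance runs in the shadow of the
            # miss and prefetches upcoming addresses (idealized as free).
            for future in range(addr + 1, addr + 1 + HELPER_DEPTH):
                cache.add(future)
    return cycles


def speedup_percent(baseline: int, improved: int) -> float:
    """Speedup expressed as a percentage reduction-free ratio: (old/new - 1) * 100."""
    return (baseline / improved - 1.0) * 100.0


if __name__ == "__main__":
    base = run(10, use_helper=False)
    helped = run(10, use_helper=True)
    print(base, helped, round(speedup_percent(base, helped), 1))
```

In this model every miss that the helper covers turns several later misses into hits, which is the effect the abstract describes as running helper threads in the shadow of cache misses. A real system would also have to account for the thread-switch overhead and for the helper's own execution time, both of which this sketch ignores.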

Index Terms

1. Computer systems organization
  1. Architectures
    1. Serial architectures
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published in
ACM SIGPLAN Notices Volume 39, Issue 11
ASPLOS '04
November 2004
283 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1037187
ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004
296 pages
ISBN:1581138040
DOI:10.1145/1024393
General Chair: Shubu Mukherjee, Intel Corporation
Program Chair: Kathryn S. McKinley, University of Texas at Austin
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DB2 database
PAL
cache miss prefetching
helper thread
itanium processor
multithreading
switch-on-event
Qualifiers
- article