Abstract
Helper threading is a technology to accelerate a program by exploiting a processor's multithreading capability to run "assist" threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of flyweight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response to long-latency cache miss events, and to run helper threads in the shadow of cache misses when the main thread would otherwise be stalled.

The concept of VMT is prototyped on an Itanium® 2 processor using features provided by the Processor Abstraction Layer (PAL) firmware mechanism already present in currently shipping processors. On a 4-way MP physical system equipped with VMT-enabled Itanium 2 processors, helper threading via the VMT mechanism can achieve significant performance gains for a diverse set of real-world workloads, ranging from single-threaded workstation benchmarks to heavily multithreaded large-scale decision support systems (DSS) using the IBM DB2 Universal Database. We measure a wall-clock speedup of 5.8% to 38.5% for the workstation benchmarks, and 5.0% to 12.7% on various queries in the DSS workload.
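The switch-on-event idea can be illustrated with a toy cycle-count model. This is only a sketch under strong assumptions and is not the paper's mechanism or measurement setup: the function `run`, its parameters (`miss_latency`, `hit_latency`, `helper_depth`, `switch_cost`) and all cycle values are hypothetical, and the model assumes every prefetch issued by the helper completes before the main thread needs the data. Its output is not expected to match the speedups reported above.

```python
def run(addresses, miss_latency=200, hit_latency=1, helper_depth=0, switch_cost=10):
    """Return the cycles a single main thread needs to touch every address in order.

    Toy model (all parameters are illustrative assumptions):
    - a cache miss stalls the main thread for `miss_latency` cycles;
    - if `helper_depth` > 0, the miss event triggers a helper thread that runs
      in the shadow of the stall and prefetches the next `helper_depth` addresses;
    - switching to the helper and back costs `switch_cost` cycles in total.
    """
    cache = set()
    cycles = 0
    for i, addr in enumerate(addresses):
        if addr in cache:
            cycles += hit_latency
            continue
        # Miss: the main thread would otherwise sit idle for the whole latency.
        cycles += miss_latency
        cache.add(addr)
        if helper_depth > 0:
            # Switch-on-event: use the stall to prefetch upcoming addresses.
            cycles += switch_cost
            for upcoming in addresses[i + 1 : i + 1 + helper_depth]:
                cache.add(upcoming)  # assumed to arrive before it is needed
    return cycles


if __name__ == "__main__":
    trace = list(range(100))  # 100 distinct addresses, so every access misses without help
    baseline = run(trace)
    assisted = run(trace, helper_depth=3)
    print(f"baseline cycles: {baseline}")
    print(f"with helper:     {assisted}")
```

The model charges the switch cost on every miss, which is why keeping thread switches cheap (the role the abstract assigns to compiler-directed register partitioning) matters: a larger `switch_cost` directly erodes the cycles recovered by prefetching.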
Helper threads via virtual multithreading on an experimental Itanium® 2 processor-based platform
ASPLOS XI: Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2004)