ABSTRACT
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
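The idea of favoring certain threads for fetch can be illustrated with a small sketch. The abstract does not say how "most efficiently using the processor" is measured, so the code below assumes one plausible heuristic: prefer the threads with the fewest instructions still waiting to issue. The names `ThreadState`, `in_flight`, and `pick_fetch_threads` are hypothetical and are not taken from the paper. The second function only restates the abstract's arithmetic: 5.4 instructions per cycle at a 2.5-fold improvement implies a baseline of roughly 5.4 / 2.5 instructions per cycle.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    # Hypothetical per-thread counter: instructions fetched but not yet issued.
    in_flight: int


def pick_fetch_threads(threads, num_to_fetch=2):
    """Return the ids of the threads to fetch from this cycle.

    Assumed heuristic (not stated in the abstract): threads with fewer
    in-flight instructions are treated as using the processor more
    efficiently, so they are favored. Ties are broken by thread id.
    """
    ranked = sorted(threads, key=lambda t: (t.in_flight, t.tid))
    return [t.tid for t in ranked[:num_to_fetch]]


def implied_baseline_ipc(smt_ipc=5.4, speedup=2.5):
    """Baseline throughput implied by the abstract's two reported figures."""
    return smt_ipc / speedup


threads = [
    ThreadState(tid=0, in_flight=12),
    ThreadState(tid=1, in_flight=3),
    ThreadState(tid=2, in_flight=7),
    ThreadState(tid=3, in_flight=3),
]
print(pick_fetch_threads(threads))        # [1, 3]
print(round(implied_baseline_ipc(), 2))   # 2.16
```

Re-ranking every cycle is what lets the selection follow changing thread behavior; a fixed round-robin order would ignore that information.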
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor
Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA '96)