ABSTRACT
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless the pipeline stages further downstream improve correspondingly. In particular, renaming registers for a large number of instructions per cycle is difficult, and the large instruction window needed to receive multiple basic blocks per cycle slows down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the register file into a global file and several local files, the latter holding registers local to a dynamic code sequence; and (iii) dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show that these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. It is further observed that several of the loops in the benchmarks display vector-like behavior during execution, even though the static loop bodies are likely too complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, although the average number of iterations per loop is quite small.
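Proposals (ii) and (iii) can be illustrated with a small software sketch. It is not the hardware mechanism of the paper, only a model under stated assumptions: a register is treated as local to a dynamic code sequence if it is written in the sequence before any read of it and is not live after the sequence, and the recorded renaming is looked up by a hypothetical sequence identifier `seq_id`. The names `Instr`, `split_registers`, `rename_local`, and `rename_cache` are invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: Optional[str]      # architectural destination register, or None
    srcs: Tuple[str, ...]    # architectural source registers


def split_registers(seq: List[Instr], live_out: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Classify registers of one dynamic code sequence as local or global.

    Assumption: a register is local if the sequence writes it before reading
    it and its value is not needed after the sequence (not in live_out).
    """
    written: Set[str] = set()
    live_in: Set[str] = set()
    for ins in seq:
        for reg in ins.srcs:
            if reg not in written:
                live_in.add(reg)          # value comes from outside the sequence
        if ins.dest is not None:
            written.add(ins.dest)
    local = {r for r in written if r not in live_in and r not in live_out}
    global_ = (written | live_in) - local
    return local, global_


# Hypothetical store of recorded renaming information, keyed by sequence id.
rename_cache: Dict[int, Dict[str, int]] = {}


def rename_local(seq_id: int, seq: List[Instr], live_out: Set[str]) -> Tuple[Dict[str, int], bool]:
    """Map local registers to local-file slots, reusing a recorded mapping."""
    if seq_id in rename_cache:
        return rename_cache[seq_id], True     # reuse: no renaming work needed
    local, _ = split_registers(seq, live_out)
    mapping = {reg: slot for slot, reg in enumerate(sorted(local))}
    rename_cache[seq_id] = mapping            # record for later encounters
    return mapping, False


seq = [
    Instr("r1", ("r2",)),        # r1 = f(r2)
    Instr("r3", ("r1",)),        # r3 = f(r1)
    Instr("r4", ("r3", "r2")),   # r4 = f(r3, r2)
]
local, global_ = split_registers(seq, live_out={"r4"})
first = rename_local(0x400, seq, {"r4"})
second = rename_local(0x400, seq, {"r4"})
```

In this model only the global registers (`r2` and `r4` here) would need conventional renaming each time the sequence is dispatched; the mapping for `r1` and `r3` is computed once and reused on the second encounter.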
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA '97)