Abstract
We present the PARROT concept that seeks to achieve higher performance with reduced energy consumption through gradual optimization of frequently executed code traces. The PARROT microarchitectural framework integrates trace caching, dynamic optimizations and pipeline decoupling. We employ a selective approach for applying complex mechanisms only upon the most frequently used traces to maximize the performance gain at any given power constraint, thus attaining finer control of tradeoffs between performance and power awareness. We show that the PARROT-based microarchitecture can improve the performance of aggressively designed processors by providing the means to improve the utilization of their more elaborate resources. At the same time, rigorous selection of traces prior to storage and optimization provides the key to attenuating increases in the power budget. For resource-constrained designs, PARROT-based architectures deliver better performance (up to an average 16% increase in IPC) at a comparable energy level, whereas the conventional path to a similar performance improvement consumes an average 70% more energy. Meanwhile, for those designs which can tolerate a higher power budget, PARROT gracefully scales up to use additional execution resources in a uniformly efficient manner. In particular, a PARROT-style doubly-wide machine delivers an average 45% IPC improvement while actually improving the cubic-MIPS-per-WATT power awareness metric by over 50%.
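To make the closing figures concrete, the following is a minimal sketch of the arithmetic behind a cubic-MIPS-per-Watt metric, taken here to mean MIPS cubed divided by power. It assumes the clock frequency is unchanged, so that the quoted IPC gain translates directly into a MIPS gain, and it treats the quoted averages as if they applied to a single design point, which is a simplification. The function names are hypothetical and are not taken from the paper.

```python
def cubic_mips_per_watt(mips: float, watts: float) -> float:
    """Power-awareness metric assumed here: MIPS^3 / Watt."""
    return mips ** 3 / watts


def max_power_ratio(perf_ratio: float, metric_ratio: float) -> float:
    """Largest power increase (as a ratio) that still yields the given
    improvement in MIPS^3/Watt for a given performance ratio.

    From (perf_ratio^3 / power_ratio) >= metric_ratio it follows that
    power_ratio <= perf_ratio^3 / metric_ratio.
    """
    return perf_ratio ** 3 / metric_ratio


# Assumption: a 45% IPC gain equals a 1.45x MIPS gain at fixed frequency,
# and "over 50%" is taken as exactly 1.50x for the bound.
bound = max_power_ratio(1.45, 1.50)
print(f"{bound:.2f}")  # prints 2.03
```

Under these assumptions, a 1.45x performance gain is consistent with a 1.5x improvement in the metric as long as power grows by no more than roughly 2x, which illustrates why a cubic performance term rewards designs that spend extra power on substantial speedups.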
Power Awareness through Selective Dynamically Optimized Traces
ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture