ABSTRACT
Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction was equally costly. Instruction cost can be naturally expressed through the critical path: if we could predict it at run-time, egalitarian policies could be replaced with cost-sensitive strategies that will grow increasingly effective as processors become more parallel.
This paper introduces a hardware predictor of instruction criticality and uses it to improve performance. The predictor is both effective and simple in its hardware implementation. The effectiveness at improving performance stems from using a dependence-graph model of the microarchitectural critical path that identifies execution bottlenecks by incorporating both data and machine-specific dependences. The simplicity stems from a token-passing algorithm that computes the critical path without actually building the dependence graph.
By focusing processor policies on critical instructions, our predictor enables a large class of optimizations. It can (i) give priority to critical instructions for scarce resources (functional units, ports, predictor entries); and (ii) suppress speculation on non-critical instructions, thus reducing “useless” misspeculations. We present two case studies that illustrate the potential of the two types of optimization, we show that (i) critical-path-based dynamic instruction scheduling and steering in a clustered architecture improves performance by as much as 21% (10% on average); and (ii) focusing value prediction only on critical instructions improves performance by as much as 5%, due to removing nearly half of the misspeculations.
- 1.V. Agarwal, M.S. Hrishikesh, S. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA '00), Vancouver, June 10-14 2000. Google ScholarDigital Library
- 2.A. Baniasadi and A. Moshovos. Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors. In Proceedings of the 33th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-O0), pages 337-347, December 10-13 2000. Google ScholarDigital Library
- 3.P. Barford and M. Crovella. Critical path analysis of TCP transactions. In Proceedings qf ACM SIGCOMM 2000, January 2000. Google ScholarDigital Library
- 4.D. C. Burger and T. M. Austin. The simplescalar tool set, version 2.0. Technical Report CS-TR-1997-1342, University of Wisconsin, Madison, June 1997.Google ScholarDigital Library
- 5.B. Calder, G. Reinman, and D. Tullsen. Selective value prediction. In Proceedings of the 26th Annual International Symposium on Computer Architecture (1SCA '99), pages 64-75, New York, N.Y., May 1-5 1999. Google ScholarDigital Library
- 6.K.I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic. The multicluster architecture: Reducing cycle time through partitioning. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-97), pages 149-159, Los Alamitos, December 1-3 1997. Google ScholarDigital Library
- 7.B. R. Fisk and R. I. Bahar. The non-critical buffer: Using load latency tolerance to improve data cache efficiency. In IEEE International ConJerence on Computer Design, Austin, TX, 1999. Google ScholarDigital Library
- 8.L. Gwennap. Digital 21264 sets new standard. Microprocessor Report, 10:9-15, October 1996.Google Scholar
- 9.J. K. Hollingsworth. Critical path profiling of message passing and shared-memory programs. IEEE Transactions on Parallel and Distributed Systems, 9(10), October 1998. Google ScholarDigital Library
- 10.G. Kemp and M. Franklin. PEWs: A decentralized dynamic scheduling algorithm for ILP processing. In International Conference on Parallel Processing, pages 239-246, Aug 1996.Google Scholar
- 11.M. H. Lipasti and J. P. Shen. Exceeding the dataflow limit via value prediction. In Proceedings qfthe 29th Annual lnternational Symposium on Microarchitecture, pages 226-237, Paris, France, December 2-4, 1996. Google ScholarDigital Library
- 12.S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexityeffective superscalar processors. In 24th Annual International Symposium on Computer Architecture, pages 206-218, 1997. Google ScholarDigital Library
- 13.E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace processors. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-97), pages 138-148, Los Alamitos, December 1-3 1997. Google ScholarDigital Library
- 14.A. Roth and GS. Sohi. Speculative data-driven muhithreading. In Proceedings of the Seventh International Symposium on High-PerJormance Computer Architecture, Jan 2001. Google ScholarDigital Library
- 15.M. Schlansker, V. Kathail, and S. Anik. Height reduction of control recurrences for ILP processors. In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 40-51, San Jose, California, November 30-December 2, 1994. Google ScholarDigital Library
- 16.M. Schlansker, S. Mahlke, and R. Johnson. Control CPR: A branch height reduction optimization for EPIC architectures. In Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, pages 155- 168, 1999. Google ScholarDigital Library
- 17.S. T. Srinivasan, R. Dz ching Ju, A. R. Lebeck, and C. Wilkerson. Locality vs. criticality. In Proceedings of the 28th Annual International Symposium on Computer Architecture, June 2001. Google ScholarDigital Library
- 18.S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-98), pages 148-159, Los Alamitos, November 30-December 2 1998. Google ScholarDigital Library
- 19.J. Stark, P. Racunas, and Y. N. Patt. Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-97), pages 34-45, Los Alamitos, December 1-3 1997. Google ScholarDigital Library
- 20.D. Tullsen and B. Calder. Computing along the critical path. Technical report, University of California, San Diego, Oct 1998.Google Scholar
- 21.E. Tune, D. Liang, D. M. Tullsen, and B. Calder. Dynamic prediction of critical path instructions. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, Jan 2001. Google ScholarDigital Library
- 22.K. Wang and M. Franklin. Highly accurate data value prediction using hybrid predictors. In Proceedings f'the 30th Annual IEEE/ACM International S),mposium on Microarchitecture (MICRO-97), pages 281-291, Los Alamitos, December 1- 3 1997. Google ScholarDigital Library
- 23.A. Yoaz, M. Erez, R. Ronen, and S. Jourdan. Speculative techniques for improving load related instruction scheduling. In Proceedings (of the 26th Annual International Symposium on Computer Architecture, June 1999. Google ScholarDigital Library
Index Terms
- Focusing processor policies via critical-path prediction
Recommendations
Focusing processor policies via critical-path prediction
Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction was equally costly. Instruction cost can be naturally expressed through the critical path: if we could ...
Difficult-path branch prediction using subordinate microthreads
Special Issue: Proceedings of the 29th annual international symposium on Computer architecture (ISCA '02)Branch misprediction penalties continue to increase as microprocessor cores become wider and deeper. Thus, improving branch prediction accuracy remains an important challenge. Simultaneous Subordinate Microthreading (SSMT) provides a means to improve ...
Comments