ABSTRACT
Current instruction fetch policies in SMT processors are oriented towards optimizing overall throughput and/or fairness. However, they provide no control over how individual threads are executed, which leads to performance unpredictability: the IPC of a thread depends on the workload it is executed in and on the fetch policy used.

From the point of view of the Operating System (OS), it is the job scheduler that determines how jobs are executed. However, when the OS runs on an SMT processor, the job scheduler cannot guarantee the execution time constraints of any job because of this performance unpredictability.

In this paper we propose a novel kind of collaboration between the OS and the SMT hardware that enables the OS to enforce that a high priority thread runs at a specific fraction of its full speed. We present an extensive evaluation using many different workloads, which shows that this mechanism gives the required performance in more than 97% of all cases considered, and in more than 99% of the less extreme cases. At the same time, our mechanism does not need to trade off predictability against overall throughput, as it maximizes the IPC of the remaining low priority threads, giving 94% on average (and 97.5% on average for the less extreme cases) of the throughput obtained using instruction fetch policies oriented toward throughput maximization, such as icount.
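The requirement stated in the abstract (a high priority thread running at a given fraction of its full speed) can be illustrated with a small sketch. This is not the paper's mechanism; it only shows how such a target might be expressed and checked. It assumes "full speed" means the IPC the thread reaches when it runs alone. All function and parameter names are hypothetical.

```python
def target_ipc(full_speed_ipc: float, fraction: float) -> float:
    """IPC the high priority thread must reach.

    Assumption: full_speed_ipc is the IPC of the thread when it runs alone;
    fraction is the share of that speed requested by the OS (0 < fraction <= 1).
    """
    if full_speed_ipc <= 0.0:
        raise ValueError("full_speed_ipc must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return full_speed_ipc * fraction


def meets_requirement(measured_ipc: float, full_speed_ipc: float,
                      fraction: float) -> bool:
    """True if the measured IPC reaches the requested fraction of full speed."""
    return measured_ipc >= target_ipc(full_speed_ipc, fraction)


def relative_throughput(throughput: float, baseline_throughput: float) -> float:
    """Throughput as a share of a throughput-oriented baseline policy."""
    if baseline_throughput <= 0.0:
        raise ValueError("baseline_throughput must be positive")
    return throughput / baseline_throughput
```

For example, with an illustrative stand-alone IPC of 2.0 and a requested fraction of 0.5, the target is an IPC of 1.0; a measured IPC of 1.1 would satisfy the requirement, and 0.9 would not. These numbers are made up for illustration and are not taken from the paper's evaluation.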
Index Terms
- Predictable performance in SMT processors