Tolerating memory latency is essential to achieving high performance in scalable shared-memory multiprocessors. In addition, tolerating instruction latency (from pipeline dependencies) is essential to maximizing the performance of individual processors. Multiple-context processors have been proposed as a universal mechanism for mitigating the negative effects of latency. These processors tolerate latency by switching to a concurrent thread of execution whenever one of the threads blocks on a high-latency operation. Multiple-context processors built so far, however, either incur a context-switch cost too high to tolerate short latencies (e.g., those due to pipeline dependencies) or require excessive concurrency from the software.
We propose a multiple-context architecture that combines full single-thread support with cycle-by-cycle context interleaving, providing low switch costs and the ability to tolerate short latencies. We compare the performance of our proposal with that of earlier approaches, showing that it offers substantially better performance for parallel applications. We also explore using our approach in uniprocessor workstations, an important environment for commodity microprocessors, and show that it offers much better performance for multiprogrammed uniprocessor workloads as well.
Finally, we explore the implementation issues for both our proposed and existing multiple-context architectures. One of the larger costs of a multiple-context processor lies in providing a cache capable of handling multiple outstanding requests, and we propose a lockup-free cache that provides high performance at a reasonable cost. We also show that the amount of processor state that must be replicated to support multiple contexts is modest, and that the extra complexity required to control the multiple contexts under both our proposed and existing approaches is manageable. The performance benefits and reasonable implementation cost of our approach make it a promising candidate for inclusion in future microprocessors.
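The lockup-free cache mentioned above is the standard mechanism for letting a processor continue issuing requests past a miss. The sketch below is an illustrative toy model, not the design from this work: misses are recorded in a fixed pool of miss status holding registers (MSHRs), later requests to the same line merge into the existing entry, and the processor stalls only when the MSHRs are exhausted. All names and the MSHR count are assumptions for illustration.

```python
class LockupFreeCache:
    """Toy lockup-free cache: outstanding misses live in MSHRs so the
    processor (or other contexts) can keep issuing accesses."""

    def __init__(self, n_mshrs=4):
        self.n_mshrs = n_mshrs
        self.lines = set()   # cache-line addresses currently resident
        self.mshrs = {}      # line address -> contexts waiting on the fill

    def access(self, line, context):
        if line in self.lines:
            return "hit"
        if line in self.mshrs:
            # A miss to this line is already outstanding: merge, don't
            # issue a second memory request.
            self.mshrs[line].append(context)
            return "merged"
        if len(self.mshrs) >= self.n_mshrs:
            return "stall"   # structural hazard: no free MSHR
        self.mshrs[line] = [context]
        return "miss"        # new request sent to memory

    def fill(self, line):
        """Memory returns the line: install it and release every
        context that was waiting on this MSHR."""
        self.lines.add(line)
        return self.mshrs.pop(line, [])
```

In a multiple-context processor, the contexts returned by `fill` are the ones that can be marked ready again, which is what allows several contexts to have misses in flight simultaneously.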