ABSTRACT
The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of memory accesses with other computation and memory accesses. Previous studies on relaxed models have shown that the latency of write accesses can be hidden by buffering writes and allowing reads to bypass pending writes. Hiding the latency of reads by exploiting the overlap allowed by relaxed models is inherently more difficult, however, simply because the processor depends on the return value for its future computation.
This paper explores the use of dynamically scheduled processors to exploit the overlap allowed by relaxed models for hiding the latency of reads. Our results are based on detailed simulation studies of several parallel applications. The results show that a substantial fraction of the read latency can be hidden using this technique. However, the major improvements in performance are achieved only at large instruction window sizes.
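The dependence on window size can be illustrated with a toy timing model. The sketch below is not the simulator used in the paper; it is a hypothetical illustration that assumes one instruction issues per cycle, every read misses and is independent of all other instructions, computation takes one cycle, and instructions retire in order. The names `run_cycles`, `trace`, `window`, and `miss_latency` are invented for this example.

```python
def run_cycles(trace, window, miss_latency):
    """Return the cycle at which the last instruction retires.

    trace        -- sequence of "r" (read miss) or "c" (1-cycle compute)
    window       -- max instructions in flight (hypothetical parameter)
    miss_latency -- cycles needed to service a read miss (assumed fixed)
    """
    issue, retire = [], []
    for i, kind in enumerate(trace):
        # In-order issue, at most one instruction per cycle.
        earliest = issue[i - 1] + 1 if i > 0 else 0
        # A window slot frees up only when instruction i - window retires.
        if i >= window:
            earliest = max(earliest, retire[i - window])
        latency = miss_latency if kind == "r" else 1
        complete = earliest + latency
        # In-order retirement: cannot retire before the previous instruction.
        prev_retire = retire[i - 1] if i > 0 else 0
        issue.append(earliest)
        retire.append(max(complete, prev_retire))
    return retire[-1] if retire else 0


if __name__ == "__main__":
    # Four groups of one read miss followed by three compute instructions.
    trace = ["r", "c", "c", "c"] * 4
    for w in (1, 4, 8, 16):
        print(w, run_cycles(trace, w, miss_latency=50))
```

With a window of 1 the model degenerates to a blocking in-order processor, so every miss is fully exposed. Under these simplified assumptions, misses begin to overlap only once the window spans more than one read, which is consistent in spirit with the observation that the major gains appear at large window sizes.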
Hiding memory latency using dynamic scheduling in shared-memory multiprocessors
Special Issue: Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92)