Hiding memory latency using dynamic scheduling in shared-memory multiprocessors

ISCA '92: Proceedings of the 19th annual international symposium on Computer architecture, May 1992, Pages 22–33
https://doi.org/10.1145/139669.139678

Published: 01 April 1992

ABSTRACT

The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of memory accesses with other computation and memory accesses. Previous studies on relaxed models have shown that the latency of write accesses can be hidden by buffering writes and allowing reads to bypass pending writes. Hiding the latency of reads by exploiting the overlap allowed by relaxed models is inherently more difficult, however, simply because the processor depends on the return value for its future computation.
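The contrast drawn above between write latency and read latency can be illustrated with a small cycle-count sketch. This is a toy model, not the simulator used in the paper. The trace encoding ("R", "W", "C"), the function names, the 50-cycle latency, the unbounded write buffer, and the fully pipelined memory are all assumptions made for illustration.

```python
# Toy timing model (illustrative assumptions only, not the paper's simulator).
# "W" = write, "R" = read, "C" = one cycle of computation.

def blocking_cycles(trace, mem_latency):
    """Strict model: the processor stalls for every read and every write."""
    return sum(mem_latency if op in ("R", "W") else 1 for op in trace)

def buffered_cycles(trace, mem_latency):
    """Relaxed model: writes go into an (assumed unbounded) write buffer and
    complete in the background; reads bypass pending writes but the processor
    still waits for the value each read returns."""
    now = 0               # current processor cycle
    last_write_done = 0   # cycle at which the latest buffered write completes
    for op in trace:
        if op == "W":
            now += 1                              # one cycle to enter the buffer
            last_write_done = now + mem_latency   # completes later, off the critical path
        elif op == "R":
            now += mem_latency                    # processor needs the return value
        else:
            now += 1
    # The program is finished only once buffered writes have drained.
    return max(now, last_write_done)

TRACE = ["W", "C", "C", "R", "C", "W", "C", "R"]

if __name__ == "__main__":
    print("blocking:", blocking_cycles(TRACE, 50))
    print("buffered:", buffered_cycles(TRACE, 50))
```

In this sketch buffering removes almost all of the write stall time, while the two reads still account for 100 of the remaining cycles. That residual read stall is the part the abstract describes as harder to hide.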

This paper explores the use of dynamically scheduled processors to exploit the overlap allowed by relaxed models for hiding the latency of reads. Our results are based on detailed simulation studies of several parallel applications. The results show that a substantial fraction of the read latency can be hidden using this technique. However, the major improvements in performance are achieved only at large instruction window sizes.
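Why window size matters can be sketched with a toy model of a dynamically scheduled processor. This is not the paper's simulation methodology. It assumes one instruction fetched per cycle, unlimited issue width, in-order retirement from a window of fixed size, and a hypothetical workload of independent read misses, each followed by a short chain of dependent computation. The names `run_time` and `make_trace` and all parameter values are made up for the example.

```python
# Toy model of an instruction window (illustrative assumptions only).

def run_time(latencies, deps, window):
    """Return the cycle at which the last instruction retires.

    latencies[i] is the execution latency of instruction i,
    deps[i] lists earlier instructions whose results it needs,
    window is the number of in-flight instructions allowed.
    """
    done, retire = [], []
    fetch = -1
    for i, lat in enumerate(latencies):
        fetch += 1                                   # at most one fetch per cycle
        if i >= window:
            fetch = max(fetch, retire[i - window])   # wait for a free window slot
        ready = max((done[d] for d in deps[i]), default=0)
        done.append(max(fetch, ready) + lat)         # execute once operands are ready
        retire.append(max(done[i], retire[i - 1]) if i else done[i])  # in-order retire
    return retire[-1]

def make_trace(blocks, gap, miss_latency):
    """Each block is one read miss followed by `gap` single-cycle instructions
    that form a dependence chain starting at that read."""
    latencies, deps = [], []
    for _ in range(blocks):
        read = len(latencies)
        latencies.append(miss_latency)
        deps.append([])
        for k in range(gap):
            deps.append([read] if k == 0 else [len(latencies) - 1])
            latencies.append(1)
    return latencies, deps

if __name__ == "__main__":
    lat, dep = make_trace(blocks=4, gap=9, miss_latency=50)
    for w in (1, 4, 16, 64):
        print("window", w, "cycles", run_time(lat, dep, w))
```

In this model a window smaller than the distance between consecutive read misses cannot overlap them, so the run time stays close to the fully serialized case. Only a window large enough to reach the next miss lets the misses proceed in parallel, which is consistent with the abstract's observation about large window sizes.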


Published in
ISCA '92: Proceedings of the 19th annual international symposium on Computer architecture
May 1992
439 pages
ISBN:0897915097
DOI:10.1145/139669
Chairman:
Allan Gottlieb
New York Univ., New York, NY
ACM SIGARCH Computer Architecture News Volume 20, Issue 2
Special Issue: Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92)
May 1992
429 pages
ISSN:0163-5964
DOI:10.1145/146628
Editor:
Allan Gottlieb
New York Univ., New York, NY
Copyright © 1992 Authors
Publisher
Association for Computing Machinery
New York, NY, United States

Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%