DOI: 10.1145/139669.139678

Hiding memory latency using dynamic scheduling in shared-memory multiprocessors

Published: 01 April 1992

ABSTRACT

The large latency of memory accesses is a major impediment to achieving high performance in large-scale shared-memory multiprocessors. Relaxing the memory consistency model is an attractive technique for hiding this latency by allowing the overlap of memory accesses with other computation and memory accesses. Previous studies on relaxed models have shown that the latency of write accesses can be hidden by buffering writes and allowing reads to bypass pending writes. Hiding the latency of reads by exploiting the overlap allowed by relaxed models is inherently more difficult, however, simply because the processor depends on the return value for its future computation.
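As a rough illustration of buffering writes and letting reads bypass them, the following Python sketch models a FIFO write buffer in front of memory. It is a toy under stated assumptions, not the hardware evaluated in the paper: the class name `WriteBuffer`, its methods, and the dictionary standing in for memory are all hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Toy FIFO write buffer (hypothetical; not the paper's design)."""

    def __init__(self, memory):
        self.memory = memory    # dict standing in for memory: address -> value
        self.pending = deque()  # buffered writes in program order: (address, value)

    def write(self, addr, value):
        # The processor continues immediately; the write completes later.
        self.pending.append((addr, value))

    def read(self, addr):
        # A read does not wait for the buffer to drain. If a pending write
        # targets the same address, the newest such value is forwarded.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest pending write to memory, if there is one.
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value


memory = {"x": 1, "y": 2}
buf = WriteBuffer(memory)
buf.write("x", 10)
print(buf.read("y"))  # read of another address bypasses the pending write
print(buf.read("x"))  # read of the same address sees the buffered value
```

The sketch shows only why write latency is easy to hide: nothing after the write needs its result. A read, by contrast, must return a value before dependent computation can proceed.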

This paper explores the use of dynamically scheduled processors to exploit the overlap allowed by relaxed models for hiding the latency of reads. Our results are based on detailed simulation studies of several parallel applications. The results show that a substantial fraction of the read latency can be hidden using this technique. However, the major improvements in performance are achieved only at large instruction window sizes.
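The dependence of the benefit on window size can be illustrated with a small timing model. The Python sketch below is not the simulator used in the paper; it assumes that every instruction is independent, that one instruction enters the window per cycle, and that instructions retire in program order. The function name `run_cycles`, the 50-cycle read latency, and the instruction trace are all made up for illustration.

```python
def run_cycles(latencies, window):
    """Return total cycles for a trace under a toy instruction-window model.

    latencies: completion latency of each instruction, in program order.
    window:    maximum number of unretired instructions in flight.
    Assumptions (hypothetical): no data dependences, one instruction enters
    the window per cycle, and retirement is in program order.
    """
    enter = []   # cycle at which each instruction enters the window
    retire = []  # cycle at which each instruction retires
    for i, latency in enumerate(latencies):
        start = enter[i - 1] + 1 if i > 0 else 0
        if i >= window:
            # A slot frees up only when instruction i - window retires.
            start = max(start, retire[i - window])
        done = start + latency
        # In-order retirement: cannot retire before the previous instruction.
        finish = max(done, retire[i - 1]) if i > 0 else done
        enter.append(start)
        retire.append(finish)
    return retire[-1] if retire else 0


# Two 50-cycle read misses, each followed by three 1-cycle instructions.
trace = [50, 1, 1, 1] * 2
for w in (1, 2, 4, 8):
    print(w, run_cycles(trace, w))
```

In this trace the two reads are four instructions apart, so windows of up to four entries cannot hold both at once and the misses are served one after the other. Only the larger window overlaps them. Real programs have dependences that this model ignores, so it overstates the achievable overlap.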


Published in

ISCA '92: Proceedings of the 19th annual international symposium on Computer architecture
May 1992
439 pages
ISBN: 0897915097
DOI: 10.1145/139669

ACM SIGARCH Computer Architecture News, Volume 20, Issue 2
Special Issue: Proceedings of the 19th annual international symposium on Computer architecture (ISCA '92)
May 1992
429 pages
ISSN: 0163-5964
DOI: 10.1145/146628

Copyright © 1992 Authors

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%
