research-article

Software data spreading: leveraging distributed caches to improve single thread performance

Authors:
Md Kamruzzaman, University of California - San Diego, San Diego, CA, USA
Steven Swanson, University of California - San Diego, San Diego, CA, USA
Dean M. Tullsen, University of California - San Diego, San Diego, CA, USA

ACM SIGPLAN Notices, Volume 45, Issue 6, June 2010, pp 460–470. https://doi.org/10.1145/1809028.1806648

Published: 05 June 2010

Abstract

Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages the cache capacity of extra cores and extra sockets rather than their computational resources. Software data spreading is a software-only technique that uses compiler-directed thread migration to aggregate cache capacity across cores and chips and improve performance. This paper describes an automated scheme that applies data spreading to various types of loops. Experiments with a set of SPEC2000, SPEC2006, NAS, and microbenchmark workloads show that data spreading can provide speedups of more than 2x, averaging 17% for the SPEC and NAS applications on two systems. In addition, despite using more cores for the same computation, data spreading actually saves power since it reduces access to DRAM.
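
The abstract gives no implementation details, so the following is only a rough sketch of the general idea, not the authors' compiler transformation. It assumes a loop that makes repeated passes over an array too large for one core's cache: the array is split into cache-sized chunks, each chunk is assigned to a fixed core, and the single thread migrates to that core before touching the chunk, so the chunk can stay resident in that core's private cache across passes. The names `plan_spreading`, `migrate_to`, and `spread_sum` are hypothetical, the round-robin chunk assignment is an assumption, and migration uses the Linux-only `os.sched_setaffinity` on a best-effort basis. Python will not show the cache benefit; the code illustrates the control structure only.

```python
import os


def plan_spreading(n_elems, elem_bytes, cache_bytes, cores):
    """Split n_elems into contiguous chunks that each fit in cache_bytes,
    assigning chunks to cores round-robin (an assumed policy).

    Returns a list of (start, stop, core) tuples.
    """
    per_chunk = max(1, cache_bytes // elem_bytes)
    plan = []
    for i, start in enumerate(range(0, n_elems, per_chunk)):
        stop = min(start + per_chunk, n_elems)
        plan.append((start, stop, cores[i % len(cores)]))
    return plan


def migrate_to(core):
    """Best-effort migration of the caller to `core` (Linux only).

    Returns True if the affinity change succeeded, False otherwise.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {core})
        return True
    except OSError:
        # The core may not exist or may not be permitted for this process.
        return False


def spread_sum(data, plan, passes=1):
    """Sum `data` `passes` times, visiting each chunk on its assigned core."""
    has_affinity = hasattr(os, "sched_getaffinity")
    original = os.sched_getaffinity(0) if has_affinity else None
    total = 0
    try:
        for _ in range(passes):
            for start, stop, core in plan:
                # Same chunk, same core on every pass: this is what lets the
                # chunk remain in that core's cache between passes.
                migrate_to(core)
                total += sum(data[start:stop])
    finally:
        # Restore the original affinity mask so the caller is not left pinned.
        if original is not None:
            os.sched_setaffinity(0, original)
    return total


if __name__ == "__main__":
    data = list(range(10))
    plan = plan_spreading(len(data), elem_bytes=8, cache_bytes=32, cores=[0, 1])
    print(plan)
    print(spread_sum(data, plan, passes=3))
```

In a real system the chunk size would be derived from the actual private cache size and the migration cost would have to be amortized over enough work per chunk; both are left as parameters here.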

Index Terms

Software and its engineering
  Software notations and tools
    Compilers


Published in
ACM SIGPLAN Notices, Volume 45, Issue 6 (PLDI '10), June 2010, 496 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1809028
PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2010, 514 pages
ISBN: 9781450300193
DOI: 10.1145/1806596
General Chair: Ben Zorn, Microsoft Research
Program Chair: Alex Aiken, Stanford University
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 June 2010
Author Tags
chip multiprocessors
compilers
single-thread performance
Qualifiers
- research-article
Article Metrics
- Total Citations: 25
- Total Downloads: 482
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 0