Open Access

A fast and accurate framework to analyze and optimize cache memory behavior

Published: 01 March 2004

Abstract

The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.

Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature.

This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
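As a rough illustration of the sampling idea only (not the exact algorithm of the article), the sketch below estimates the miss ratio of a single reference by evaluating a miss predicate on a random subset of the iteration space. The function estimate_miss_ratio, the predicate is_miss_at, and the toy loop nest are all hypothetical: is_miss_at stands in for testing whether the CMEs of that reference hold at a given iteration point, and the sample size comes from a standard worst-case confidence-interval bound at a user-selected accuracy, which may differ from the statistical machinery actually used in the article.

```python
import math
import random

# Z-values for common two-sided confidence levels (normal approximation).
Z_VALUES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def estimate_miss_ratio(iteration_space, is_miss_at, confidence=0.95, error=0.05, seed=0):
    """Estimate a reference's miss ratio from a random sample of iteration points.

    iteration_space : list of iteration vectors, e.g. (i, j) tuples.
    is_miss_at      : hypothetical predicate deciding whether the reference
                      misses at a point (stands in for checking the CMEs there).
    error           : desired absolute error of the estimate (the user-selected accuracy).
    """
    z = Z_VALUES[confidence]
    # Worst-case (p = 0.5) sample size for the requested accuracy; note that it
    # does not grow with the size of the iteration space.
    n = min(len(iteration_space), math.ceil(z * z * 0.25 / (error * error)))

    rng = random.Random(seed)
    sample = rng.sample(iteration_space, n)
    misses = sum(1 for point in sample if is_miss_at(point))
    p_hat = misses / n

    # Normal-approximation confidence interval around the estimate.
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

if __name__ == "__main__":
    # Toy 100x100 loop nest; the made-up reference misses once per 16-element cache line.
    space = [(i, j) for i in range(100) for j in range(100)]
    ratio, ci = estimate_miss_ratio(space, lambda p: p[1] % 16 == 0)
    print(f"estimated miss ratio = {ratio:.3f}, 95% interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions, a 5% absolute error at 95% confidence requires only a few hundred sampled points regardless of how many iterations the loop nest executes, which is the source of the accuracy/analysis-time tradeoff described in the abstract.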



          Reviews

          Eno Thereska

          This paper addresses the disparity between central processing unit (CPU) and memory speeds: memory speeds lag behind CPU speeds. Caches are often used to mitigate this problem, and programs that exhibit a high degree of locality benefit most from them. It is usually difficult, however, for compilers to optimize a given program to take advantage of these caches because, according to the authors, the optimization space is large, and compilers may need extra help from other tools, such as cache miss equations (CMEs). According to the authors, CMEs help in analyzing the cache memory behavior of a given program; the problem is that solving CMEs is NP-complete. This paper presents an approximation to the general CME problem. The user of the algorithms presented here faces a clear tradeoff between the desired accuracy and the time needed to solve a set of CMEs. It remains to be seen, however, what exact impact having only an approximate solution to a set of CMEs would have on a real application.

          Online Computing Reviews Service


          • Published in

            ACM Transactions on Programming Languages and Systems, Volume 26, Issue 2
            March 2004
            192 pages
            ISSN: 0164-0925
            EISSN: 1558-4593
            DOI: 10.1145/973097

            Copyright © 2004 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 March 2004
            Published in TOPLAS Volume 26, Issue 2
