Open Access

A fast and accurate framework to analyze and optimize cache memory behavior

Published: 01 March 2004

Abstract

The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.

Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature.

This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
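As a rough illustration of the sampling idea only (not the exact algorithm of the article), the sketch below estimates the miss ratio of a single reference by evaluating a miss predicate on a random subset of the iteration space. The function estimate_miss_ratio, the predicate is_miss_at, and the toy loop nest are all hypothetical: is_miss_at stands in for testing whether the CMEs of that reference hold at a given iteration point, and the sample size comes from a standard worst-case confidence-interval bound at a user-selected accuracy, which may differ from the statistical machinery actually used in the article.

```python
import math
import random

# Z-values for common two-sided confidence levels (normal approximation).
Z_VALUES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def estimate_miss_ratio(iteration_space, is_miss_at, confidence=0.95, error=0.05, seed=0):
    """Estimate a reference's miss ratio from a random sample of iteration points.

    iteration_space : list of iteration vectors, e.g. (i, j) tuples.
    is_miss_at      : hypothetical predicate deciding whether the reference
                      misses at a point (stands in for checking the CMEs there).
    error           : desired absolute error of the estimate (the user-selected accuracy).
    """
    z = Z_VALUES[confidence]
    # Worst-case (p = 0.5) sample size for the requested accuracy; note that it
    # does not grow with the size of the iteration space.
    n = min(len(iteration_space), math.ceil(z * z * 0.25 / (error * error)))

    rng = random.Random(seed)
    sample = rng.sample(iteration_space, n)
    misses = sum(1 for point in sample if is_miss_at(point))
    p_hat = misses / n

    # Normal-approximation confidence interval around the estimate.
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

if __name__ == "__main__":
    # Toy 100x100 loop nest; the made-up reference misses once per 16-element cache line.
    space = [(i, j) for i in range(100) for j in range(100)]
    ratio, ci = estimate_miss_ratio(space, lambda p: p[1] % 16 == 0)
    print(f"estimated miss ratio = {ratio:.3f}, 95% interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Under these assumptions, a 5% absolute error at 95% confidence requires only a few hundred sampled points regardless of how many iterations the loop nest executes, which is the source of the accuracy/analysis-time tradeoff described in the abstract.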



          Reviews

          Eno Thereska

          This paper addresses the disparity between central processing unit (CPU) and memory speeds: memory speeds lag behind CPU speeds. Caches are often used to mitigate this problem, and programs that exhibit a high degree of locality benefit most from them. It is usually difficult, however, for compilers to optimize a given program to take advantage of these caches because, according to the authors, the optimization space is large, and compilers may need extra help from other tools, such as cache miss equations (CMEs). According to the authors, CMEs help in analyzing the cache memory behavior of a given program; the problem is that solving CMEs is NP-complete. This paper presents an approximation to the general CME problem. The user of the algorithms presented here faces a clear tradeoff between the desired accuracy and the time needed to solve a set of CMEs. It remains to be seen, however, what exact impact having only an approximate solution to a set of CMEs would have on a real application.

          Online Computing Reviews Service


          • Published in

            ACM Transactions on Programming Languages and Systems, Volume 26, Issue 2
            March 2004
            192 pages
            ISSN: 0164-0925
            EISSN: 1558-4593
            DOI: 10.1145/973097

            Copyright © 2004 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 March 2004
            Published in TOPLAS Volume 26, Issue 2
