Abstract
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed.Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature.This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
- Abella, J., González, A., Llosa, J., and Vera, X. 2002. Near-optimal loop tiling by means of cache miss equations and genetic algorithms. In Proceedings of 31st International Conference on Parallel Processing (ICPP'02). Google Scholar
- Ailamaki, A., DeWitt, D. J., Hill, M. D., and Wood, D. A. 1999. DBMSs on a modern processor: where does time go? In Proceedings of the 25th VLDB Conference (Edinburgh, Scotland). Google Scholar
- Ammons, G., Ball, T., and Larus, J. 1997. Exploiting hardware performance counters with flow and context sensitive profiling. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'97). 85--96. Google Scholar
- Ayguadé, E. et al. 1995. A uniform internal representation for high-level and instruction-level transformations. Tech rep. UPC-DAC-95-02. Universitat Politècnica de Catalunya, Barcelona, Spain.Google Scholar
- Banerjee, U. 1988. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, Norwell, MA. Google Scholar
- Banerjee, U. 1993. Loop transformations for restructuring compilers: The Foundation. Kluwer Academic Publishers, Norwell, MA. Google Scholar
- Bedichek, R. 1995. Talismam: Fast and accurate multicomputer simulation. In Proceedings of ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'95). 14--24. Google Scholar
- Carr, S. and Kennedy, K. 1992. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing (SC'92). 114--124. Google Scholar
- Carr, S., McKinley, K., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the VI International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'94). 252--262. Google Scholar
- Chatterjee, S., Jain, V. V., Lebeck, A. R., Mundhra, S., and Thottethodi, M. 1999. Nonlinear array layout for hierarchical memory systems. In Proceedings of the ACM International Conference on Supercomputing (Rhodes, Greece) (ICS'99). 444--453. Google Scholar
- Chatterjee, S., Parker, E., Hanlon, P. J., and Lebeck, A. R. 2001. Exact analysis of the cache behavior of nested loops. In ACM SIGPLAN '01 Conference on Programming Language Design and Implementation (PLDI'01). 286--297. Google Scholar
- Clauss, P. 1996. Counting solutions to linear and non-linear constraints through Ehrhart polynomials. In Proceedings of ACM International Conference on Supercomputing (Philadelphia) (ICS'96). 278--285. Google Scholar
- Coleman, S. and McKinley, K. S. 1995. Tile size selection using cache organization and data layout. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'95). 279--290. Google Scholar
- DeGroot, M. 1998. Probability and Statistics. Addison-Wesley, Reading, MA.Google Scholar
- Feautrier, P. 1996. Automatic parallelization in the polytope model. In The Data Parallel Programming Model, G. R. Perrin and A. Darte, Eds. Lecture Notes in Computer Science, vol. 1132. Springer-Verlag, Berlin, Germany, 79--103. Google Scholar
- Fraguela, B. B., Doallo, R., and Zapata, E. L. 1999. Automatic analytical modeling for the estimation of cache misses. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'99). Google Scholar
- Gannon, D., Jalby, W., and Gallivan, K. 1988. Strategies for cache and local memory management by global program transformations. J. Parallel. Distrib. Comput. 5, 587--616. Google Scholar
- Gee, J., Hill, M., Pnevmatikatos, D., and Smith, A. 1993. Cache performance of the spec92 benchmark suite. IEEE Micro 13, 4 (Aug.), 17--27. Google Scholar
- Ghosh, S. 1999. Compiler analysis framework for tuning memory behavior. Ph.D. dissertation. Princeton University, Princeton, NJ. Google Scholar
- Ghosh, S., Martonosi, M., and Malik, S. 1998. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'98). 228--239. Google Scholar
- Ghosh, S., Martonosi, M., and Malik, S. 1999. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Programm. Lang. Syst. 21, 4, 703--746. Google Scholar
- Ghosh, S., Martonosi, M., and Malik, S. 2000. Automated cache optimizations using CME driven diagnosis. In Proceedings of the International Conference on Supercomputing (ICS'00). 316--326. Google Scholar
- Goldberg, A. and Hennessy, J. 1991. Performance debugging shared memory multiprocessor programs with mtool. In Proceedings of Supercomputing (SC'91). 481--490. Google Scholar
- Goldschmidt, S. and Hennessy, J. 1993. The accuracy of trace-driven simulation of multiprocessors. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'93). 146--157. Google Scholar
- Haghighat, M. R. and Polychronopoulos, C. D. 1993. Symbolic analysis: A basis for parallelization, optimization and scheduling of programs. In 1993 Workshop on Languages and Compilers for Parallel Computing (LCPC'93). Springer Verlag, Portland, Ore., 567--585. Google Scholar
- Hill, M. n.d. DineroIII: a uniprocessor cache simulator (http://www.cs.wisc.edu/∼larus/warts.html).Google Scholar
- Kandemir, M., Choudhary, A., Banerjee, P., and Ramanujam, J. 1999. A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans. Parallel Distrib. Syst. 10, 2 (Feb.), 115--135. Google Scholar
- Kennedy, K., Callahan, D., and Porterfield, A. 1990. Analyzing and visualizing performance of memory hierarchy. In Instrumentation for Visualization. ACM Press, New York, NY. Google Scholar
- Kreisel, G. and Krevine, J. L. 1967. Elements of Mathematical Logic. North-Holland, Amsterdam, The Netherlands.Google Scholar
- Lam, M., Rothberg, E. E., and Wolf, M. E. 1991. The cache performance of blocked algorithms. In Proceedings of the IV International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'91). Google Scholar
- Lebeck, A. and Wood, D. 1994. Cache profiling and the spec benchmarks: A case study. IEEE Comput. 27, 10 (Oct.), 15--26. Google Scholar
- Magnusson, P. 1993. A design for efficient simulation of a multiprocessor. In Proceedings of the Western Simulation Multiconference on International Workshop on MASCOTS-93. (La Jolla, CA). 69--78. Google Scholar
- Martonosi, M., Gupta, A., and Anderson, T. 1992. Memspy: Analyzing memory system bottlenecks in programs. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'92). 1--12. Google Scholar
- Martonosi, M., Gupta, A., and Anderson, T. 1993. Effectiveness of trace sampling for performance debugging tools. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'93). Google Scholar
- McCabe, M. 1989. Introduction to the Practice of Statistics. Freeman & Co., New York, NY.Google Scholar
- McKinley, K. S. and Temam, O. 1996. A quantitative analysis of loop nest locality. In Proceedings of the VII Int. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'96). Google Scholar
- MIPS. 1988. RISCompiler Languages Programmer's Guide. MIPS Computer Systems, Sunnyvale, CA.Google Scholar
- Mowry, T., Lam, M., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the V International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'92). 62--73. Google Scholar
- Padua, D. et al. 1994. Polaris Developer's Document. Available online at http://polaris.is.uiuc.edu/polaris/polaris--developer/polaris--developer.html.Google Scholar
- Pugh, W. 1991. The omega test: A fast and practical integer programming algorithm for dependence analysis. Proceedings of the ACM/IEEE Conference of Supercomputing (SC'91) (Albuquerque, NM). 4--13. Google Scholar
- Pugh, W. 1994. Counting solutions to presburguer formulas: How and why. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI'94). Google Scholar
- Rivera, G. and Tseng, C.-W. 1998. Data transformations for eliminating conflict misses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'98). 38--49. Google Scholar
- Rivera, G. and Tseng, C.-W. 1999. A comparison of compiler tiling algorithms. In Proceedings of the 8th International Conference on Compiler Construction (CC'99). Google Scholar
- Sánchez, F. and González, A. 1998. Fast, flexible and accurate data locality analysis. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT'98).Google Scholar
- Sánchez, F. and González, A. 2000. Modulo scheduling for a fully-distributed clustered VLIW architecture. In Proceedings of the International Symposium on Microarchitecture (MICRO-33). Google Scholar
- Sugumar, R. 1993. Multi-configuration simulation algorithms for the evaluation of computer designs. Ph.D. thesis, University of Michigan. Google Scholar
- Temam, O., Fricker, C., and Jalby, W. 1994. Cache interference phenomena. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'94). 261--271. Google Scholar
- Temam, O., Granston, E., and Jalby, W. 1993. To copy or not to copy: A compile-time technique for accessing when data copying should be used to eliminate cache conflicts. In Proceedings of Supercomputing (SC'93). 410--419. Google Scholar
- Uhlig, R. A. and Mudge, T. N. 1997. Trace-driven memory simulation: a survey. ACM Comput. Surv. 29, 3 (Sept.), 128--170. Google Scholar
- van der Deijl, E., Kanbier, G., Temam, O., and Granston, E. 1997. A cache visualization tool. IEEE Comput. 30, 7 (July), 71--78. Google Scholar
- Vera, X., Llosa, J., and González, A. 2002. Near-optimal padding for removing conflict misses. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computers (LCPC'02). Google Scholar
- Vera, X. and Xue, J. 2002. Let's study whole program cache behaviour analytically. In Proceedings of the International Symposium on High-Performance Computer Architecture (Cambridge, U.K.). (HPCA 8). Google Scholar
- Wilde, D. 1993. A library for doing polyhedral operations. Tech. rep. 785, Oregon State University.Google Scholar
- Witchel, E. and Rosenblum, M. 1996. Embra: Fast and flexible machine simulation. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'96). Google Scholar
- Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'91). 30--44. Google Scholar
Index Terms
- A fast and accurate framework to analyze and optimize cache memory behavior
Recommendations
Counter-Based Cache Replacement and Bypassing Algorithms
Recent studies have shown that in highly associative caches, the performance gap between the Least Recently Used (LRU) and the theoretical optimal replacement algorithms is large, motivating the design of alternative replacement algorithms to improve ...
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
With the ever-widening performance gap between processors and main memory, cache memory, which is used to bridge this gap, is becoming more and more significant. Caches work well for programs that exhibit sufficient locality. Other programs, however, ...
Heterogeneous way-size cache
ICS '06: Proceedings of the 20th annual international conference on SupercomputingSet-associative cache architectures are commonly used. These caches consist of a number of ways, each of the same size. We have observed that the different ways have very different utilization, which motivates the design of caches with heterogeneous way ...
Comments