Abstract
The Roofline model offers insight on how to improve the performance of software and hardware.
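As a minimal sketch of the bound the Roofline model is usually summarized by: attainable floating-point performance is the lesser of peak compute throughput and peak memory bandwidth multiplied by operational intensity (flops per byte of memory traffic). The function names and the machine numbers below are hypothetical, chosen only for illustration; they are not taken from the article.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, operational_intensity):
    """Roofline bound in GFLOP/s.

    peak_gflops: peak floating-point throughput (GFLOP/s)
    peak_bandwidth_gbs: peak memory bandwidth (GB/s)
    operational_intensity: flops performed per byte of memory traffic
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity (flops/byte) at which the two roofs meet."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine, for illustration only.
PEAK_GFLOPS = 64.0
PEAK_BANDWIDTH_GBS = 16.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
    for oi in (0.25, 1.0, 4.0, 16.0):
        bound = attainable_gflops(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, oi)
        # Below the ridge point the kernel is limited by memory bandwidth;
        # at or above it, by peak compute.
        regime = "memory-bound" if oi < ridge else "compute-bound"
        print(f"OI={oi:5.2f} flops/byte -> {bound:5.1f} GFLOP/s ({regime})")
```

A kernel whose operational intensity falls left of the ridge point gains more from reducing memory traffic than from adding compute optimizations, which is the kind of guidance the abstract refers to.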
Index Terms
- Roofline: an insightful visual performance model for multicore architectures