Roofline: an insightful visual performance model for multicore architectures

Authors:
Samuel Williams, Lawrence Berkeley National Laboratory, Berkeley, CA
Andrew Waterman, University of California, Berkeley
David Patterson, University of California, Berkeley

Communications of the ACM, Volume 52, Issue 4, April 2009, pp. 65–76. https://doi.org/10.1145/1498765.1498785

Published: 1 April 2009

Abstract

The Roofline model offers insight on how to improve the performance of software and hardware.

Supplemental Material

cacm_roofline_appendix_a.pdf (1.3 MB): Appendix associated with the Roofline article

Index Terms

Roofline: an insightful visual performance model for multicore architectures
1. Hardware
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Control structures

Published in

Communications of the ACM Volume 52, Issue 4
A Direct Path to Dependable Software
April 2009
134 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/1498765

Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
