research-article

Analyzing program flow within a many-kernel OpenCL application

Authors:
Perhaad Mistry (Northeastern University, Boston, MA)
Chris Gregg (University of Virginia, Charlottesville, VA)
Norman Rubin (Advanced Micro Devices, Boxborough, MA)
David Kaeli (Northeastern University, Boston, MA)
Kim Hazelwood (University of Virginia, Charlottesville, VA)

GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, March 2011, Article No.: 10, Pages 1–8
https://doi.org/10.1145/1964179.1964193

Published: 05 March 2011

ABSTRACT

Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities to a range of applications. Typical applications possess multiple components that can be parallelized, so developers need proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) computer vision algorithm written in OpenCL. Our profiling framework is built on OpenCL API function calls alone, without the need for an external profiler. We show that we can begin to identify performance bottlenecks and other performance issues in individual components on different hardware platforms. We demonstrate that run-time profiling based on the facilities of the OpenCL specification gives an application developer a fine-grained look at performance, and that this information can be used to tailor performance improvements to specific platforms.
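To make the idea of per-kernel profiling concrete, the following is a minimal sketch of how event timestamps might be reduced to a per-kernel summary. It is not the authors' framework. It assumes that each kernel launch has already produced a start and end timestamp in nanoseconds, which in OpenCL host code could be read with `clGetEventProfilingInfo` (`CL_PROFILING_COMMAND_START`, `CL_PROFILING_COMMAND_END`) on a command queue created with `CL_QUEUE_PROFILING_ENABLE`. The function name, the kernel names, and all timestamp values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def summarize_kernel_times(events):
    """Aggregate (kernel_name, start_ns, end_ns) records into a per-kernel report.

    Returns a list of (name, calls, total_ns, share) tuples sorted by total
    time, largest first, so the most expensive kernel appears at the top.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_ns": 0})
    for name, start_ns, end_ns in events:
        if end_ns < start_ns:
            raise ValueError(f"event for {name!r} ends before it starts")
        totals[name]["calls"] += 1
        totals[name]["total_ns"] += end_ns - start_ns

    grand_total = sum(entry["total_ns"] for entry in totals.values())
    report = []
    for name, entry in totals.items():
        share = entry["total_ns"] / grand_total if grand_total else 0.0
        report.append((name, entry["calls"], entry["total_ns"], share))
    report.sort(key=lambda row: row[2], reverse=True)
    return report


# Hypothetical timestamps (nanoseconds); real values would come from the
# profiling information attached to each OpenCL event.
sample_events = [
    ("build_integral_image", 0, 400_000),
    ("hessian_response", 500_000, 1_700_000),
    ("hessian_response", 1_800_000, 3_000_000),
    ("compute_descriptors", 3_100_000, 3_500_000),
]

if __name__ == "__main__":
    for name, calls, total_ns, share in summarize_kernel_times(sample_events):
        print(f"{name:24s} calls={calls} total={total_ns / 1e6:.3f} ms share={share:.1%}")
```

Sorting by accumulated time rather than by launch order is one possible design choice: in an application with many kernels it surfaces the kernels that dominate device time, which is the kind of fine-grained view the abstract describes.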



Published in

GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
March 2011
101 pages
ISBN:9781450305693
DOI:10.1145/1964179

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPGPU
OpenCL
SURF
computer vision
heterogeneous computing
performance tools
profiling
Acceptance Rates
Overall Acceptance Rate: 57 of 129 submissions, 44%
