DOI: 10.1145/1964179.1964193

Analyzing program flow within a many-kernel OpenCL application

Published: 5 March 2011

ABSTRACT

Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities for a range of applications. Typical applications possess multiple components that can be parallelized, so developers need proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded-Up Robust Features (SURF) computer vision algorithm written in OpenCL. Our profiling framework is built on OpenCL API function calls, without the need for an external profiler. We show that this approach can begin to identify performance bottlenecks and performance issues in individual components on different hardware platforms. We demonstrate that run-time profiling based on the OpenCL specification gives an application developer a fine-grained look at performance, and that this information can be used to tailor performance improvements for specific platforms.
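The kind of per-kernel breakdown the abstract describes can be sketched as follows. This is a minimal illustration and not the paper's framework: it assumes each kernel launch has already produced a start and end timestamp in nanoseconds, as OpenCL reports through `clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on a command queue created with `CL_QUEUE_PROFILING_ENABLE`. The function name `summarize_kernel_times`, the kernel names, and the timestamps are all hypothetical.

```python
from collections import defaultdict


def summarize_kernel_times(records):
    """Aggregate (kernel_name, start_ns, end_ns) records into per-kernel totals.

    Returns a list of (kernel_name, call_count, total_ms) tuples sorted by
    total time, largest first, so the likeliest bottleneck is listed first.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for name, start_ns, end_ns in records:
        if end_ns < start_ns:
            raise ValueError(f"end before start for kernel {name!r}")
        entry = totals[name]
        entry["calls"] += 1
        # Timestamps are in nanoseconds; convert the duration to milliseconds.
        entry["total_ms"] += (end_ns - start_ns) / 1e6
    return sorted(
        ((name, e["calls"], e["total_ms"]) for name, e in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )


# Hypothetical timestamps standing in for values read from OpenCL events.
sample_records = [
    ("integral_image", 0, 2_000_000),
    ("hessian", 2_000_000, 7_000_000),
    ("hessian", 7_500_000, 11_500_000),
    ("descriptors", 12_000_000, 15_000_000),
]

if __name__ == "__main__":
    for name, calls, total_ms in summarize_kernel_times(sample_records):
        print(f"{name:16s} calls={calls} total={total_ms:.3f} ms")
```

Sorting by accumulated time is one simple design choice for a many-kernel application: a kernel that is cheap per launch but launched often can still dominate the total, which a per-launch view alone would hide.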


Published in

GPGPU-4: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
March 2011
101 pages
ISBN: 9781450305693
DOI: 10.1145/1964179

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

research-article

Acceptance Rates

Overall Acceptance Rate: 57 of 129 submissions, 44%
