ABSTRACT
Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities for a range of applications. Typical applications possess multiple components that can be parallelized, so developers need proper performance tools to analyze program flow and identify application bottlenecks. In this paper, we analyze and profile the components of the Speeded Up Robust Features (SURF) computer vision algorithm written in OpenCL. Our profiling framework is built on OpenCL API function calls, without the need for an external profiler. We show that this approach can begin to identify performance bottlenecks and performance issues in individual components on different hardware platforms. We demonstrate that run-time profiling based on the OpenCL specification gives an application developer a fine-grained look at performance, and that this information can be used to tailor performance improvements to specific platforms.
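As a rough illustration of the kind of per-kernel accounting that event-based profiling allows, the sketch below aggregates command timestamps into a small profile. It is not the paper's framework. It assumes the four timestamps per command have already been read on the host with `clGetEventProfilingInfo` (the `CL_PROFILING_COMMAND_QUEUED`, `_SUBMIT`, `_START`, and `_END` queries) from a command queue created with `CL_QUEUE_PROFILING_ENABLE`. The names `EventTimestamps`, `summarize`, and `hottest_kernel`, the kernel names, and all timestamp values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTimestamps:
    """Device timestamps for one enqueued command, in nanoseconds.

    Assumption: in a real host program these values would come from
    clGetEventProfilingInfo on a profiling-enabled command queue.
    """
    kernel: str
    queued: int
    submit: int
    start: int
    end: int

    @property
    def exec_ns(self) -> int:
        # Time the command spent executing on the device.
        return self.end - self.start

    @property
    def wait_ns(self) -> int:
        # Time between enqueueing on the host and starting on the device.
        return self.start - self.queued


def summarize(events):
    """Aggregate call count, execution time, and wait time per kernel name."""
    totals = defaultdict(lambda: {"calls": 0, "exec_ns": 0, "wait_ns": 0})
    for ev in events:
        row = totals[ev.kernel]
        row["calls"] += 1
        row["exec_ns"] += ev.exec_ns
        row["wait_ns"] += ev.wait_ns
    return dict(totals)


def hottest_kernel(summary):
    """Return the kernel name with the largest total execution time."""
    return max(summary, key=lambda name: summary[name]["exec_ns"])


# Hypothetical timestamps; real values depend on the device and workload.
SAMPLE_EVENTS = [
    EventTimestamps("integral_image", 0, 1_000, 2_000, 52_000),
    EventTimestamps("build_response", 10_000, 53_000, 54_000, 254_000),
    EventTimestamps("build_response", 20_000, 255_000, 256_000, 436_000),
    EventTimestamps("descriptor", 30_000, 437_000, 438_000, 518_000),
]

if __name__ == "__main__":
    summary = summarize(SAMPLE_EVENTS)
    for name, row in summary.items():
        print(f"{name}: calls={row['calls']} "
              f"exec_ms={row['exec_ns'] / 1e6:.3f} "
              f"wait_ms={row['wait_ns'] / 1e6:.3f}")
    print("largest total execution time:", hottest_kernel(summary))
```

Separating execution time from wait time is one way to distinguish a kernel that is slow on a given device from one that is merely delayed behind earlier commands, which is the sort of platform-specific distinction the abstract refers to.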
Index Terms
- Analyzing program flow within a many-kernel OpenCL application