skip to main content
research-article
Open Access

Halide: decoupling algorithms from schedules for high-performance image processing

Published:27 December 2017Publication History
Skip Abstract Section

Abstract

Writing high-performance code on modern machines requires not just locally optimizing inner loops, but globally reorganizing computations to exploit parallelism and locality---doing things such as tiling and blocking whole pipelines to fit in cache. This is especially true for image processing pipelines, where individual stages do much too little work to amortize the cost of loading and storing results to and from off-chip memory. As a result, the performance difference between a naive implementation of a pipeline and one globally optimized for parallelism and locality is often an order of magnitude. However, using existing programming tools, writing high-performance image processing code requires sacrificing simplicity, portability, and modularity. We argue that this is because traditional programming models conflate the computations defining the algorithm with decisions about intermediate storage and the order of computation, which we call the schedule.

We propose a new programming language for image processing pipelines, called Halide, that separates the algorithm from its schedule. Programmers can change the schedule to express many possible organizations of a single algorithm. The Halide compiler then synthesizes a globally combined loop nest for an entire algorithm, given a schedule. Halide models a space of schedules which is expressive enough to describe organizations that match or outperform state-of-the-art hand-written implementations of many computational photography and computer vision algorithms. Its model is simple enough to do so often in only a few lines of code, and small changes generate efficient implementations for x86, ARM, Graphics Processors (GPUs), and specialized image processors, all from a single algorithm.

Halide has been public and open source for over four years, during which it has been used by hundreds of programmers to deploy code to tens of thousands of servers and hundreds of millions of phones, processing billions of images every day.

References

  1. Adams, A., Talvala, E., Park, S.H., Jacobs, D.E., Ajdin, B., Gelfand, N., Dolson, J., Vaquero, D., Baek, J., Tico, M., Lensch, H.P.A., Matusik, W., Pulli, K., Horowitz, M., Levoy, M. The Frankencamera: An experimental platform for computational photography. ACM Trans. Graph. 29, 4 (2010), 29:1--29:12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Aubry, M., Paris, S., Hasinoff, S.W., Kautz, J., Durand, F. Fast local Laplacian filters: Theory and applications. ACM Trans. Graph. 33, 5 (2014), 167. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bacon, D.F., Graham, S.L., Sharp, O.J. Compiler transformations for high-performance computing. ACM Comput Surv. 26, 4 (Dec. 1994). Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Blythe, D. The Direct3D 10 system. ACM Trans. Graph. 25, (2006), 724--734. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Buck, I. GPU computing: Programming a massively parallel processor. In Proceedings of the International Symposium on Code Generation and Optimization (Tessellations Publishing, Phoenix, Arizona, 2007). Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Chamberlain, B., Callahan, D., Zima, H. Parallel programmability and the Chapel language. Int J High Perform Comput Appl. 21, (2007), 291--312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chen, J., Paris, S., Durand, F. Real-time edge-aware image processing with the bilateral grid. ACM Trans. Graph. 26, 3 (2007), 103:1--103:9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Elliott, C. Functional image synthesis. In Proceedings of Bridges 2001, Mathematical Connections in Art, Music, and Science (IEEE Computer Society, Washington, DC, USA, 2001).Google ScholarGoogle Scholar
  9. Fatahalian, K., Horn, D.R., Knight, T.J., Leem, L., Houston, M., Park, J.Y., Erez, M., Ren, M., Aiken, A., Dally, W.J., Hanrahan, P. Sequoia: Programming the memory hierarchy. In ACM/IEEE conference on Supercomputing (ACM, New York, NY, 2006). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Feautrier, P. Dataflow analysis of array and scalar references. Int J Parallel Program. 20, 1 (1991), 23--53.Google ScholarGoogle ScholarCross RefCross Ref
  11. Frigo, M., Johnson, S.G. The design and implementation of FFTW3. Proc IEEE 93, 2 (2005).Google ScholarGoogle ScholarCross RefCross Ref
  12. Gordon, M.I., Thies, W., Karczmarek, M., Lin, J., Meli, A.S., Leger, C., Lamb, A.A., Wong, J., Hoffman, H., Maze, D.Z., Amarasinghe, S. A stream compiler for communication-exposed architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ACM, New York, NY, 2002). Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J. High performance discrete Fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE (Washington, DC, January 2008). Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Halide source repository. http://github.com/halide/Halide.Google ScholarGoogle Scholar
  15. Hasinoff, S.W., Sharlet, D., Geiss, R., Adams, A., Barron, J.T., Kainz, F., Chen, J., Levoy, M. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Trans. Graph. 35, 6 (2016). Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Holzmann, G. Beyond Photography: The Digital Darkroom. Prentice Hall, Englewood Cliffs, NJ, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Mullapudi, R.T., Adams, A., Sharlet, D., Ragan-Kelley, J., Fatahalian, K. Automatically scheduling halide image processing pipelines. ACM Trans. Graph. 35, 4 (2016). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Mullapudi, R.T., Vasista, V., Bondhugula, U. PolyMage: Automatic optimization for image processing pipelines. In ACM SIGPLAN Notices (ACM, New York, NY, 2015), volume 50, 429--443. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. The OpenCL specification, version 1.2. http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf, 2011.Google ScholarGoogle Scholar
  20. Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93, 2 (2005), 232--275.Google ScholarGoogle ScholarCross RefCross Ref
  21. Ragan-Kelley, J. Decoupling algorithms from the organization of computation for high performance image processing. PhD thesis, Massachusetts Institute of Technology (2014).Google ScholarGoogle Scholar
  22. Ragan-Kelley, J., Adams, A., Paris, S., Levoy, M., Amarasinghe, S., Durand, F. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph. 31, 4 (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (ACM, New York, NY, 2013). Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Rudy, G., Khan, M.M., Hall, M., Chen, C., Chame, J. A programming language interface to describe transformations and code generation. In Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing LCPC'10, (Springer-Verlag, Berlin, Heidelberg, 2011), 136--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Suriana, P., Adams, A., Kamil, S. Parallel associative reductions in halide. In Proceedings of the 2017 International Symposium on Code Generation and Optimization (ACM, New York, NY, 2017). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Halide: decoupling algorithms from schedules for high-performance image processing

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Communications of the ACM
          Communications of the ACM  Volume 61, Issue 1
          January 2018
          110 pages
          ISSN:0001-0782
          EISSN:1557-7317
          DOI:10.1145/3176926
          Issue’s Table of Contents

          Copyright © 2017 Owner/Author

          This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 27 December 2017

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format