Abstract
Writing high-performance code on modern machines requires not just locally optimizing inner loops, but globally reorganizing computations to exploit parallelism and locality, for example by tiling and blocking whole pipelines to fit in cache. This is especially true for image processing pipelines, where individual stages do far too little work to amortize the cost of loading inputs from, and storing results to, off-chip memory. As a result, the performance difference between a naive implementation of a pipeline and one globally optimized for parallelism and locality is often an order of magnitude. However, with existing programming tools, writing high-performance image processing code requires sacrificing simplicity, portability, and modularity. We argue that this is because traditional programming models conflate the computations defining the algorithm with decisions about intermediate storage and the order of computation, which we call the schedule.
We propose a new programming language for image processing pipelines, called Halide, that separates the algorithm from its schedule. Programmers can change the schedule to express many possible organizations of a single algorithm. Given a schedule, the Halide compiler synthesizes a globally combined loop nest for the entire algorithm. Halide models a space of schedules that is expressive enough to describe organizations that match or outperform state-of-the-art hand-written implementations of many computational photography and computer vision algorithms. Its model is also simple enough that this often takes only a few lines of code, and small changes to the schedule generate efficient implementations for x86, ARM, graphics processors (GPUs), and specialized image processors, all from a single algorithm.
Halide has been public and open source for over four years, during which it has been used by hundreds of programmers to deploy code to tens of thousands of servers and hundreds of millions of phones, processing billions of images every day.
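To make the distinction between algorithm and schedule concrete, the following is a minimal sketch in plain Python. It is not Halide code and does not use the Halide API. The two-stage 3x3 box blur, the function names (`blur_x_at`, `schedule_breadth_first`, `schedule_fused_strips`), and the `strip` parameter are all assumptions chosen for illustration. Both functions compute the same values; they differ only in the order of computation and in how much intermediate storage they use, which is the kind of choice the abstract calls the schedule.

```python
def blur_x_at(img, x, y):
    """Algorithm, stage 1: horizontal 3-tap sum at (x, y)."""
    return img[y][x - 1] + img[y][x] + img[y][x + 1]


def schedule_breadth_first(img):
    """Schedule A: compute all of stage 1, store it, then compute stage 2.

    Simple, but the full intermediate image must be written out and
    read back, which costs memory traffic on large images.
    """
    h, w = len(img), len(img[0])
    bx = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(1, w - 1):
            bx[y][x] = blur_x_at(img, x, y)
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Algorithm, stage 2: vertical 3-tap sum of stage 1.
            out[y][x] = bx[y - 1][x] + bx[y][x] + bx[y + 1][x]
    return out


def schedule_fused_strips(img, strip=4):
    """Schedule B: process the output in strips of rows (hypothetical
    strip height), computing stage 1 only for the rows each strip needs.

    The scratch buffer is small enough to stay in cache, at the price of
    recomputing two stage-1 rows at every strip boundary.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y0 in range(1, h - 1, strip):
        y1 = min(y0 + strip, h - 1)
        # Stage 1 for rows y0-1 .. y1 inclusive (one halo row each side).
        scratch = [[0] * w for _ in range(y1 - y0 + 2)]
        for sy, y in enumerate(range(y0 - 1, y1 + 1)):
            for x in range(1, w - 1):
                scratch[sy][x] = blur_x_at(img, x, y)
        # Stage 2 for output rows y0 .. y1-1, reading only the scratch.
        for y in range(y0, y1):
            sy = y - (y0 - 1)
            for x in range(1, w - 1):
                out[y][x] = scratch[sy - 1][x] + scratch[sy][x] + scratch[sy + 1][x]
    return out
```

Calling both functions on the same integer image should give identical results for any positive `strip`. In a hand-written implementation, switching between these organizations means restructuring the loops; in the model the abstract describes, the algorithm would stay fixed and only the schedule would change.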