ABSTRACT
Most automatic parallelizers are based on detection of independent computations, and most of them cannot do anything if there is a true dependence between computations. However, this can be surmounted for programs that perform prefix computations (scans). We present a method for automatically parallelizing such "inherently sequential" programs. Our method, which handles arbitrarily nested loops, identifies situations where the computation performed by the loop body is equivalent to a matrix vector product over a semi-ring. We also deal with mutually dependent variables in the loop. Our method is implemented in a polyhedral program transformation and code generation system and generates OpenMP code. We also present strategies to improve the performance of the generated code, an analytical performance model for the expected speedup, as well as a method to choose the parallelization parameters optimally. We show experimentally that the scan parallelizations performed by our system are effective, yielding linear (iso-efficient) speedup in situations where no other parallelism is available.
- U. Bondhugula and J. Ramanujam, "Pluto: A practical and fully automatic polyhedral program optimization system," in Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation (PLDI 08), 2008. Google ScholarDigital Library
- "Par4all," http://www.par4all.org/.Google Scholar
- L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, and P. Sadayappan, Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework. New Orleans, LA: IEEE Computer Society Press, 2010.Google Scholar
- U. of Illinois at Urbana-Champaign, "Polaris: Automatic parallelization of conventional fortran programs," http://polaris.cs.uiuc.edu/polaris/polaris-old.html.Google Scholar
- P. University, "Cetus: A source-to-source compiler infrastructure for c programs," http://cetus.ecn.purdue.edu/.Google Scholar
- G. E. Blelloch, "Scans as primitive parallel operations," IEEE Trans. Comput., vol. 38(11), pp. 1526--1538, November 1989. Google ScholarDigital Library
- X. Redon and P. Feautrier, "Scheduling reductions," in Proceedings of the 8th international conference on Supercomputing, ser. ICS '94. New York, NY, USA: ACM, 1994, pp. 117--125. Google ScholarDigital Library
- P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. 22(8), pp. 786--793, August 1973. Google ScholarDigital Library
- K. Matsuzaki, Z. Hu, and M. Takeichi, "Towards automatic parallelization of tree reductions in dynamic programming," in Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures, ser. SPAA '06. New York, NY, USA: ACM, 2006, pp. 39--48. Google ScholarDigital Library
- S.-C. K. D. N. Xu and Z. Hu, "Ptype system: A featherweight parallelizability detector," in proceedings of 2d asian symposium on programming languages and systems (APLAS 2004), LNCS 3302. Springer, LNCS, 2004, pp. 197--212.Google Scholar
- A. L. Fisher and A. M. Ghuloum, "Parallelizing complex scans and reductions," in Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, ser. PLDI '94. New York, NY, USA: ACM, 1994, pp. 135--146. Google ScholarDigital Library
- S. Sato and H. Iwasaki, "Automatic parallelization via matrix multiplication," in Proc. 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI 2011), to appear, 2011. Google ScholarDigital Library
- X. Redon and P. Feautrier, "Detection of recurrences in sequential programs with loops," in Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe, ser. PARLE '93. London, UK: Springer-Verlag, 1993, pp. 132--145. Google ScholarDigital Library
- P. Feautrier, "Automatic parallelization in the polytope model," The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications, pp. 79--103, 1996. Google ScholarDigital Library
- C. Bastoul, "Code generation in the polyhedral model is easier than you think," in Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 7--16. Google ScholarDigital Library
- C. S. University, "Alphaz," https://www.cs.colostate.edu/AlphaZ/wiki/doku.php.Google Scholar
- P. Feautrier, "Dataflow analysis of array and scalar references," International Journal of Parallel Programming, vol. 20, 1991.Google ScholarCross Ref
- J. Bentley, Programming pearls, ser. ACM Press Series. Addison-Wesley, 2000.Google Scholar
- S. Sato, "Automatic parallelization via matrix multiplication," Master's thesis, The University of Electro-Communications, Japan, 2011.Google Scholar
- D. Merrill and A. Grimshaw, "Parallel scan for stream architectures," Technical Report CS2009-14, pp. 373--381, 2009.Google Scholar
- S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for gpu computing," Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, pp. 97--106, 2007. Google ScholarDigital Library
- P. Clauss and V. Loechner, "Parametric analysis of polyhedral iteration spaces," J. VLSI Signal Process. Syst., vol. 19, pp. 179--194, July 1998. Google ScholarDigital Library
- T. J. Ypma, "Historical development of the newton-raphson method," SIAM Rev., vol. 37, pp. 531--551, December 1995. Google ScholarDigital Library
- R. M. Karp, R. E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," J. ACM, vol. 14(3), pp. 563--590, July 1967. Google ScholarDigital Library
- G. Fettweis, L. Thiele, and G. Meyr, "Algorithm transformations for unlimited parallelism," in Circuits and Systems, 1990., IEEE International Symposium on, may 1990, pp. 1756--1759 vol.3.Google Scholar
- G. Fettweis and H. Meyr, "High-speed parallel viterbi decoding: algorithm and vlsi-architecture," Communications Magazine, IEEE, vol. 29, no. 5, pp. 46--55, may 1991. Google ScholarDigital Library
- R. E. Ladner and M. J. Fischer, "Parallel prefix computation," J. ACM, vol. 27(4), pp. 831--838, October 1980. Google ScholarDigital Library
- A. L. Fisher and A. M. Ghuloum, "Parallelizing complex scans and reductions," in Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, ser. PLDI '94. New York, NY, USA: ACM, 1994, pp. 135--146. Google ScholarDigital Library
- W. L. L. Han and J. M. Tuck, "Speculative parallelization of partial reduction variables," pp. 141--150, 2010. Google ScholarDigital Library
Index Terms
- Scan detection and parallelization in "inherently sequential" nested loop programs
Comments