Abstract
Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. This paper proposes a three-step approach, called single-dimension software pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level that has a rectangular iteration space and contains no sibling inner loops in it. The first step identifies the most profitable loop level for software pipelining in terms of initiation rate, data reuse potential, or any other optimization criteria. The second step simplifies the multidimensional data-dependence graph (DDG) of the selected loop level into a one-dimensional DDG and constructs a one-dimensional (1D) schedule. Based on the one-dimensional schedule, the third step derives a simple mapping function that specifies the schedule time for the operation instances in the multidimensional loop. The classical modulo scheduling is subsumed by SSP as a special case. SSP is also closely related to hyperplane scheduling, and, in fact, extends it to be resource constrained. We prove that SSP schedules are correct and at least as efficient as those schedules generated by traditional modulo scheduling methods. We extend SSP to schedule imperfect loop nests, which are most common at the instruction level. Multiple initiation intervals are naturally allowed to improve execution efficiency. Feasibility and correctness of our approach are verified by a prototype implementation in the ORC compiler for the IA-64 architecture, tested with loop nests from Livermore and SPEC2000 floating-point benchmarks. Preliminary experimental results reveal that, compared to modulo scheduling, software pipelining at an appropriate loop level results in significant performance improvement. Software pipelining is beneficial even with prior loop transformations.
- Aiken, A. and Nicolau, A. 1990. Fine-grain parallelization and the wavefront method. In Selected Papers of the 2nd Workshop on Languages and Compilers for Parallel Computing. Pitman Publishing, London. 1--16. Google ScholarDigital Library
- Allan, V. H., Jones, R. B., Lee, R. M., and Allan, S. J. 1995. Software pipelining. ACM Computing Surveys 27, 3 (Sept.), 367--432. Google ScholarDigital Library
- Allen, J. R., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. In POPL '83: Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. ACM Press, New York. 177--189. Google ScholarDigital Library
- Banerjee, U. K. 1993. Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publ., Norwell, MA. Google ScholarDigital Library
- Carr, S. and Kennedy, K. 1994. Improving the ratio of memory operations to floating-point operations in loops. ACM Trans. on Prog. Lang. and Systems 16, 6 (Nov.), 1768--1810. Google ScholarDigital Library
- Carr, S., McKinley, K. S., and Tseng, C.-W. 1994. Compiler optimizations for improving data locality. In ASPLOS-VI: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, New York. 252--262. Google ScholarDigital Library
- Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In HICSS'96: Proceedings of the 29th Hawaii International Conference on System Sciences (HICSS'96) Volume 1: Software Technology and Architecture. IEEE Computer Society, Washington, D.C. 183. Google ScholarDigital Library
- Darte, A. and Robert, Y. 1994. Constructive methods for scheduling uniform loop nests. IEEE Transactions on Parallel and Distributed Systems 5, 8 (Aug.), 814--822. Google ScholarDigital Library
- Darte, A., Schreiber, R., Rau, B. R., and Vivien, F. 2002. Constructing and exploiting linear schedules with prescribed parallelism. ACM Trans. Des. Autom. Electron. Syst. 7, 1, 159--172. Google ScholarDigital Library
- Feautrier, P. 1996. Automatic parallelization in the polytope model. Lecture Notes in Computer Science 1132, 79--103. Google ScholarDigital Library
- Gao, G. R., Ning, Q., and Van Dongen, V. 1993. Software pipelining for nested loops. ACAPS Tech Memo 53, School of Computer Science, McGill Univ., Montréal, Québec.Google Scholar
- Ghosh, S., Martonosi, M., and Malik, S. 1999. Cache miss equations: a compiler framework for analyzing and tuning memory behavior. ACM Transactions on Prog. Lang. and Syst. 21, 4, 703--746. Google ScholarDigital Library
- Govindarajan, R., Altman, E. R., and Gao, G. R. 1996. A framework for resource-constrained rate-optimal software pipelining. IEEE Trans. on Parallel and Distributed Systems 7, 11 (Nov.), 1133--1149. Google ScholarDigital Library
- Huff, R. A. 1993. Lifetime-sensitive modulo scheduling. In PLDI'93: Proc. of the ACM SIGPLAN 1993 Conf. on Programming Language Design and Implementation. ACM Press, New York. 258--267. Google ScholarDigital Library
- Intel. 2001. Intel IA-64 Architecture Software Developer's Manual, Vol. 1: IA-64 Application Architecture. Intel Corporation, Santa Clara, CA.Google Scholar
- Kennedy, K. and McKinley, K. S. 1992. Optimizing for parallelism and data locality. In ICS'92: Proceedings of the 6th International Conference on Supercomputing. ACM Press, New York. 323--334. Google ScholarDigital Library
- Lam, M. 1988. Software pipelining: an effective scheduling technique for vliw machines. In PLDI '88: Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation. ACM Press, New York. 318--328. Google ScholarDigital Library
- Lamport, L. 1974. The parallel execution of DO loops. Communications of the ACM 17, 2 (Feb.), 83--93. Google ScholarDigital Library
- Moon, S.-M. and Ebcioğlu, K. 1997. Parallelizing nonnumerical code with selective scheduling and software pipelining. ACM Transactions on Programming Languages and Systems 19, 6 (Nov.), 853--898. Google ScholarDigital Library
- Muthukumar, K. and Doshi, G. 2001. Software pipelining of nested loops. Lecture Notes in Computer Science 2027, 165--181. Google ScholarDigital Library
- Passos, N. L. and Sha, E. H.-M. 1996. Achieving full parallelism using multidimensional retiming. IEEE Trans. Parallel Distrib. Syst. 7, 11, 1150--1163. Google ScholarDigital Library
- Petkov, D., Harr, R., and Amarasinghe, S. 2002. Efficient pipelining of nested loops: unroll-and-squash. In 16th Intl. Parallel and Distributed Processing Symposium (IPDPS '02). IEEE, Washigton, D.C. Google ScholarDigital Library
- Ramanujam, J. 1994. Optimal software pipelining of nested loops. In Proceedings of the 8th International Symposium on Parallel Processing. IEEE Computer Society, Washington, D.C. 335--342. Google ScholarDigital Library
- Rau, B. R. 1994. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture. ACM Press, New York. 63--74. Google ScholarDigital Library
- Rau, B. R. and Fisher, J. A. 1993. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing 7, 9--50. Google ScholarDigital Library
- Rong, H., Douillet, A., Govindarajan, R., and Gao, G. R. 2004a. Code generation for single-dimension software pipelining of multi-dimensional loops. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, Washington, D.C. 175--186. Google ScholarDigital Library
- Rong, H., Tang, Z., Govindarajan, R., Douillet, A., and Gao, G. R. 2004b. Single-dimension software pipelining for multi-dimensional loops. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, Washington, D.C. 163--174. Google ScholarDigital Library
- Rong, H., Douillet, A., and R.Gao, G. 2005. Register allocation for software pipelined multi-dimensional loops. In PLDI'05: Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation. ACM Press, New York. Google ScholarDigital Library
- Rong, H., Tang, Z., Govindarajan, R., Douillet, A., and Gao, G. R. 2007. Single-dimension software pipelining for multi-dimensional loops. CAPSL technical memo, Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware. January. In ftp://ftp.capsl.udel.edu/pub/doc/memos/memo049.ps.gz.Google Scholar
- Wang, J. and Gao, G. R. 1996. Pipelining-dovetailing: A transformation to enhance software pipelining for nested loops. In CC '96: Proceedings of the 6th International Conference on Compiler Construction. Springer-Verlag, New York. 1--17. Google ScholarDigital Library
- Wolf, M. E. and Lam, M. S. 1991. A data locality optimizing algorithm. In PLDI'91: Proc. of the ACM SIGPLAN 1991 Conf. on Prog. Lang. Design and Implementation. ACM Press, New York. 30--44. Google ScholarDigital Library
- Wolf, M. E., Maydan, D. E., and Chen, D.-K. 1996. Combining loop transformations considering caches and scheduling. In MICRO 29: Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, Washington, D.C. 274--286. Google ScholarDigital Library
Index Terms
- Single-dimension software pipelining for multidimensional loops
Recommendations
Modulo scheduling of loops in control-intensive non-numeric programs
MICRO 29: Proceedings of the 29th annual ACM/IEEE international symposium on MicroarchitectureMuch of the previous work on modulo scheduling has targeted numeric programs, in which, often, the majority of the loops are well-behaved loop-counter-based loops without early exits. In control-intensive non-numeric programs, the loops frequently have ...
Single-Dimension Software Pipelining for Multi-Dimensional Loops
CGO '04: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimizationTraditionally, software pipelining is applied either to theinnermost loop of a given loop nest or from the innermostloop to outer loops. In this paper, we propose a three-stepapproach, called Single-dimension Software Pipelining(SSP), to software ...
Software Pipelining Irregular Loops On the TMS320C6000 VLIW DSP Architecture
LCTES '01: Proceedings of the ACM SIGPLAN workshop on Languages, compilers and tools for embedded systemsThe TMS320C6000 architecture is a leading family of Digital Signal Processors (DSPs). To achieve peak performance, this VLIW architecture relies heavily on software pipelining. Traditionally, software pipelining has been restricted to regular (FOR) ...
Comments