ABSTRACT
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also let PEs react efficiently to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, each of which requires a distinct hardware mechanism in a traditional sequential architecture.
Our analysis shows that a triggered-instruction-based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that, relative to a program-counter-style spatial baseline, triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively, resulting in a speedup of 2.0X.
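The control idea in the abstract can be illustrated with a small software sketch. It is not the hardware design from the paper: the class and function names (`TriggeredInstruction`, `PE`, `merge_streams`), the `END` sentinel that marks the end of a stream, and the fixed-priority scheduler are all assumptions made for illustration. The sketch only shows that a PE can merge two sorted input streams with no program counter and no branch instructions: each instruction carries a trigger (a guard over channel status), and on every step the PE fires one instruction whose trigger holds.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

END = None  # hypothetical end-of-stream marker (an assumption, not from the paper)


@dataclass
class TriggeredInstruction:
    name: str
    trigger: Callable[["PE"], bool]  # guard over PE state and channel status
    action: Callable[["PE"], None]   # operation performed when the guard holds


class PE:
    """Toy processing element with two input channels and one output."""

    def __init__(self, a, b):
        self.in_a = deque(list(a) + [END])
        self.in_b = deque(list(b) + [END])
        self.out = []
        self.fired = []  # names of fired instructions, for inspection only

    def heads_ready(self):
        # Both input channels currently hold at least one item.
        return bool(self.in_a) and bool(self.in_b)

    def step(self, program):
        # No program counter: scan the instructions and fire the first
        # one whose trigger is true (fixed priority is an assumption).
        for instr in program:
            if instr.trigger(self):
                instr.action(self)
                self.fired.append(instr.name)
                return True
        return False  # nothing is triggered, so the PE is idle


def take_a_ready(pe):
    if not pe.heads_ready():
        return False
    a, b = pe.in_a[0], pe.in_b[0]
    return a is not END and (b is END or a <= b)


def take_b_ready(pe):
    if not pe.heads_ready():
        return False
    a, b = pe.in_a[0], pe.in_b[0]
    return b is not END and (a is END or b < a)


def both_ended(pe):
    return pe.heads_ready() and pe.in_a[0] is END and pe.in_b[0] is END


def consume_end_markers(pe):
    pe.in_a.popleft()
    pe.in_b.popleft()


MERGE_PROGRAM = [
    TriggeredInstruction("take_a", take_a_ready,
                         lambda pe: pe.out.append(pe.in_a.popleft())),
    TriggeredInstruction("take_b", take_b_ready,
                         lambda pe: pe.out.append(pe.in_b.popleft())),
    TriggeredInstruction("done", both_ended, consume_end_markers),
]


def merge_streams(a, b):
    """Run the PE until no instruction is triggered; return output and trace."""
    pe = PE(a, b)
    while pe.step(MERGE_PROGRAM):
        pass
    return pe.out, pe.fired


if __name__ == "__main__":
    merged, trace = merge_streams([1, 4, 9], [2, 3, 10])
    print(merged)
    print(trace)
```

In a program-counter version of the same merge, the comparison, the branch on its outcome, and the end-of-stream checks would be separate sequential instructions. Here each of those decisions is folded into the guard of the instruction that does the useful work, which is the kind of instruction-count reduction the abstract refers to.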
Triggered instructions: a control paradigm for spatially-programmed architectures (ISCA '13)