ABSTRACT
We present a dynamic optimization technique, thread warping, that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on FPGAs (field-programmable gate arrays). Building on dynamic synthesis for single-processor, single-thread systems, known as warp processing, thread warping improves the performance of multiprocessor systems by speeding up individual threads and by allowing more threads to execute concurrently. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGAs, an advantage not shared by static compilation/synthesis approaches. We introduce a framework consisting of an architecture, CAD tools, and an operating system that together support thread warping. Experiments on an extensive architectural simulation framework that we developed show application speedups of 4x to 502x, averaging 130x, across eight benchmark applications, relative to a multiprocessor system with four ARM11 microprocessors. Even compared to a 64-processor system, thread warping achieves an 11x speedup.
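The two sources of speedup named above (faster individual threads and more threads running concurrently) can be illustrated with a toy scheduling model. This is only a sketch under simplifying assumptions, not the simulation framework used in the paper: the function names, the identical-thread workload, the greedy scheduling policy, and all parameters are hypothetical, and the model ignores synthesis time, FPGA area limits, and memory contention.

```python
import math


def makespan_software(num_threads, work_per_thread, num_cpus):
    """Time to finish identical threads on CPUs only, run in batches."""
    batches = math.ceil(num_threads / num_cpus)
    return batches * work_per_thread


def makespan_warped(num_threads, work_per_thread, num_cpus,
                    num_accelerators, accel_speedup):
    """Greedy schedule over CPUs plus FPGA accelerators (hypothetical model).

    Each thread is placed on the unit that would finish it earliest.
    An accelerator runs one thread accel_speedup times faster than a CPU.
    """
    # Per-thread run time on each execution unit: CPUs first, then accelerators.
    durations = ([work_per_thread] * num_cpus
                 + [work_per_thread / accel_speedup] * num_accelerators)
    free_at = [0.0] * len(durations)  # time at which each unit becomes idle
    for _ in range(num_threads):
        unit = min(range(len(durations)),
                   key=lambda k: free_at[k] + durations[k])
        free_at[unit] += durations[unit]
    return max(free_at)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    base = makespan_software(16, 100.0, 4)
    warped = makespan_warped(16, 100.0, 4, 4, 10.0)
    print(f"software-only: {base}, with accelerators: {warped}, "
          f"speedup: {base / warped:.1f}x")
```

In this model, adding accelerators shortens the schedule both because each accelerated thread finishes sooner and because more execution units are available at once; with zero accelerators it reduces to the software-only case.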