Abstract
In this article, we present our experiences implementing a general Parallel Discrete Event Simulation (PDES) accelerator on a Field Programmable Gate Array (FPGA). The accelerator can be specialized to a particular simulation model by defining the object states and the event handling code, which are then synthesized into a custom accelerator for that model. The accelerator consists of several event processors that process events in parallel while maintaining the dependencies between them. Events are automatically sorted by a self-sorting event queue. The accelerator supports optimistic simulation by automatically keeping track of event history and supporting rollbacks. Within a single accelerator, scalability is limited by the communication and port bandwidth of the different structures; however, the architecture is designed to allow multiple accelerators to be connected to scale up the simulation. We evaluate the design and explore several design trade-offs and optimizations. We show that the accelerator scales up to 64 concurrent event processors, with speedup measured relative to a single event processor. At this point, scalability becomes limited by contention on the shared structures within the datapath. To alleviate this bottleneck, we also develop a new version of the datapath that partitions the state and event space of the simulation but allows these partitions to share the event processors. The new design substantially reduces contention and improves the speedup with 64 processors from 49x to 62x relative to a single-processor design. We went through two iterations of the PDES-A design, first using Verilog and then using Chisel (for the partitioned version of the design). We report some observations on the differences between prototyping accelerators in these two languages. PDES-A outperforms the ROSS simulator running on a 12-core Intel Xeon machine by a factor of 3.2x with less than 15% of the power consumption.
Our future work includes building multiple interconnected PDES-A cores.
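To make the optimistic-simulation idea in the abstract concrete, the following is a minimal software sketch of rollback for a single simulation object. It is not the PDES-A hardware design: the class `LogicalProcess`, the handler `handle_event`, and the helper `demo` are hypothetical names chosen for illustration, a binary heap stands in for the self-sorting event queue, and the cancellation of events generated by rolled-back events (anti-messages in Time Warp) is deliberately left out.

```python
import heapq


def handle_event(state, payload):
    # Hypothetical, order-sensitive event handler, so that a wrong
    # processing order produces a visibly different state.
    return state * 2 + payload


class LogicalProcess:
    """One simulation object processed optimistically (illustrative only)."""

    def __init__(self, initial_state=0):
        self.state = initial_state
        self.lvt = 0          # local virtual time: timestamp of the last processed event
        self.pending = []     # min-heap of (timestamp, payload); stands in for the event queue
        self.history = []     # processed events: (timestamp, payload, state_before, lvt_before)
        self.rollbacks = 0

    def schedule(self, timestamp, payload):
        # A straggler is an event older than something already processed.
        if timestamp < self.lvt:
            self._rollback(timestamp)
        heapq.heappush(self.pending, (timestamp, payload))

    def _rollback(self, timestamp):
        # Undo every processed event that is later than the straggler,
        # restore the saved state, and put the undone event back in the queue.
        while self.history and self.history[-1][0] > timestamp:
            ts, payload, state_before, lvt_before = self.history.pop()
            self.state = state_before
            self.lvt = lvt_before
            heapq.heappush(self.pending, (ts, payload))
        self.rollbacks += 1

    def process_next(self):
        if not self.pending:
            return False
        ts, payload = heapq.heappop(self.pending)
        # Save enough history to undo this event later if needed.
        self.history.append((ts, payload, self.state, self.lvt))
        self.state = handle_event(self.state, payload)
        self.lvt = ts
        return True


def run_in_order(events, initial_state=0):
    # Reference result: process the same events strictly in timestamp order.
    state = initial_state
    for _, payload in sorted(events):
        state = handle_event(state, payload)
    return state


def demo():
    lp = LogicalProcess()
    lp.schedule(10, 1)
    lp.schedule(30, 3)
    while lp.process_next():
        pass
    lp.schedule(20, 2)        # straggler: arrives after timestamp 30 was processed
    while lp.process_next():
        pass
    return lp


if __name__ == "__main__":
    lp = demo()
    print(lp.state, lp.rollbacks)
```

In `demo`, the event with timestamp 20 arrives after the event with timestamp 30 has already been processed, so the object restores the state it saved before timestamp 30, re-enqueues that event, and then processes both in timestamp order. The final state matches what `run_in_order` computes for the same three events.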
Index Terms
- PDES-A: Accelerators for Parallel Discrete Event Simulation Implemented on FPGAs