research-article
Open Access

PDES-A: Accelerators for Parallel Discrete Event Simulation Implemented on FPGAs

Published: 18 April 2019

Abstract

In this article, we present our experience implementing a general Parallel Discrete Event Simulation (PDES) accelerator on a Field Programmable Gate Array (FPGA). The accelerator can be specialized to a particular simulation model by defining the object states and the event handling code, which are then synthesized into a custom accelerator for that model. The accelerator consists of several event processors that process events in parallel while maintaining the dependencies between them. Events are automatically sorted by a self-sorting event queue. The accelerator supports optimistic simulation by automatically keeping track of event history and supporting rollbacks. Locally, the scalability of the architecture is limited by the communication and port bandwidth of the different structures. However, it is designed to allow multiple accelerators to be connected to scale up the simulation. We evaluate the design and explore several design trade-offs and optimizations. We show that the accelerator scales up to 64 concurrent event processors, measured relative to the performance of a single event processor. Beyond this point, scalability is limited by contention on the shared structures within the datapath. To alleviate this bottleneck, we also develop a new version of the datapath that partitions the state and event space of the simulation but allows these partitions to share the event processors. The new design substantially reduces contention and improves the speedup with 64 processors from 49x to 62x relative to a single-processor design. We went through two iterations of the design of PDES-A, first using Verilog and then using Chisel (for the partitioned version of the design). We report in this article some observations on the differences between prototyping accelerators in these two languages. PDES-A outperforms the ROSS simulator running on a 12-core Intel Xeon machine by a factor of 3.2x with less than 15% of the power consumption.
Our future work includes building multiple interconnected PDES-A cores.
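
The optimistic execution with event history and rollback described in the abstract can be illustrated in software. The following is a minimal sketch, not the PDES-A hardware design: it assumes a single simulation object, uses Python's `heapq` as a stand-in for the self-sorting event queue, and omits anti-messages and global virtual time. The names `Event`, `OptimisticObject`, `run_optimistic`, and `run_sequential` are hypothetical and chosen for illustration only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    timestamp: int                       # events are ordered by timestamp only
    delta: int = field(compare=False)    # hypothetical payload for the handler


class OptimisticObject:
    """One simulation object with a state value and a rollback history."""

    def __init__(self):
        self.state = 0
        self.history = []  # (event, state_before), in processing order

    def process(self, event):
        # Save the pre-event state so the event can be undone later.
        self.history.append((event, self.state))
        # Order-sensitive handler: the result depends on event order.
        self.state = self.state * 2 + event.delta

    def rollback(self, timestamp):
        """Undo all processed events later than `timestamp`; return them."""
        undone = []
        while self.history and self.history[-1][0].timestamp > timestamp:
            event, state_before = self.history.pop()
            self.state = state_before
            undone.append(event)
        return undone


def run_optimistic(arrivals):
    """Process events as they arrive, rolling back when a straggler shows up."""
    obj = OptimisticObject()
    pending = []      # min-heap keyed on timestamp
    rollbacks = 0
    for event in arrivals:
        heapq.heappush(pending, event)
        while pending:
            nxt = heapq.heappop(pending)
            undone = obj.rollback(nxt.timestamp)
            if undone:
                rollbacks += 1
            for e in undone:             # undone events are re-executed later
                heapq.heappush(pending, e)
            obj.process(nxt)
    return obj.state, rollbacks


def run_sequential(events):
    """Reference result: process all events in timestamp order."""
    obj = OptimisticObject()
    for e in sorted(events):
        obj.process(e)
    return obj.state


arrivals = [Event(10, 1), Event(30, 2), Event(20, 3), Event(40, 4)]
print(run_optimistic(arrivals))   # (28, 1)
print(run_sequential(arrivals))   # 28
```

In this example the event with timestamp 20 arrives after the event with timestamp 30 has already been processed, so one rollback occurs, and the final state matches the sequential timestamp-ordered execution.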

      • Published in

        ACM Transactions on Modeling and Computer Simulation, Volume 29, Issue 2
        Special Issue on PADS 2017
        April 2019
        105 pages
        ISSN:1049-3301
        EISSN:1558-1195
        DOI:10.1145/3320014
        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 April 2019
        • Accepted: 1 December 2018
        • Revised: 1 September 2018
        • Received: 1 December 2017

        Qualifiers

        • research-article
        • Research
        • Refereed
