ABSTRACT
The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by communication latencies and overheads. The emergence of multi-core processors, and their expected evolution into many-cores, offers the promise of low-latency communication and tight memory integration between cores; these properties should significantly improve the performance of PDES in such environments. However, on clusters of multi-cores (CMs), the latency and processing overheads incurred when communicating between different machines (nodes) far outweigh those between cores on the same chip, especially when commodity networking fabrics and communication software are used. It is unclear whether the low latency among cores on the same node yields any benefit, given that communication links across nodes are significantly slower. In this study, we examine the performance of a multi-threaded implementation of PDES on CMs. We demonstrate that inter-node communication costs impose a substantial bottleneck on PDES and that, without optimizations addressing these long latencies, multi-threaded PDES does not significantly outperform the multi-process version, despite direct communication through shared memory on the individual nodes. We then propose three optimizations: message consolidation and routing, infrequent polling, and latency-sensitive model partitioning. We show that with these optimizations in place, the threaded implementation of PDES significantly outperforms the process-based implementation even on CMs.
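To make the first of the three optimizations more concrete, the following is a minimal sketch of what message consolidation could look like, not the paper's implementation. It assumes that events bound for the same remote node are buffered and shipped together once a threshold is reached; the names `ConsolidatingSender`, `batch_size`, and `transport` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ConsolidatingSender:
    """Buffers remote events per destination node and ships them in batches.

    Illustrative sketch only: batch_size and transport are hypothetical
    parameters, not taken from the paper.
    """

    def __init__(self, batch_size, transport):
        self.batch_size = batch_size      # flush threshold (assumed policy)
        self.transport = transport        # callable(node_id, list_of_events)
        self.buffers = defaultdict(list)  # node_id -> events not yet sent

    def send(self, node_id, event):
        """Queue an event; send one network message once the batch is full."""
        buf = self.buffers[node_id]
        buf.append(event)
        if len(buf) >= self.batch_size:
            self.flush(node_id)

    def flush(self, node_id):
        """Ship whatever is pending for one node as a single message."""
        buf = self.buffers.pop(node_id, [])
        if buf:
            self.transport(node_id, buf)

    def flush_all(self):
        """Ship every partial batch, e.g. before a synchronization point."""
        for node_id in list(self.buffers):
            self.flush(node_id)


def demo():
    wire = []  # records each (node_id, batch) that would cross the network
    sender = ConsolidatingSender(
        batch_size=4,
        transport=lambda node, batch: wire.append((node, list(batch))),
    )
    for ts in range(10):
        sender.send(node_id=ts % 2, event=("evt", ts))
    sender.flush_all()
    return wire


if __name__ == "__main__":
    for node, batch in demo():
        print(node, batch)
```

In this toy run, ten events are delivered with four network messages instead of ten. The trade-off in any such scheme is that buffered events arrive later, so the threshold would need tuning against the simulation's sensitivity to delayed messages.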
Index Terms
- Can PDES scale in environments with heterogeneous delays?