Abstract
Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results.
In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these are important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.
- A. Adileh, C. Kaynak, P. Lotfi-Kamran, and S. Volos. 2012. CloudSuite on Flexus. Retrieved July 22, 2014, from http://parsa.epfl.ch/simflex/doc/CloudSuite-on-Flexus-isca12.pdf.Google Scholar
- E. K. Ardestani and J. Renau. 2013. ESESC: A fast multicore simulator using time-based sampling. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 448--459. Google ScholarDigital Library
- B. Beckmann, N. Binkert, A. Saidi, J. Hestness, G. Black, K. Sewell, and D. Hower. 2011. The gem5 Simulator. Retrieved July 22, 2014, from http://www.gem5.org/dist/tutorials/isca_pres_2011.pdf.Google Scholar
- N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. 2011. The gem5 simulator. SIGARCH Computer Architecture News 39, 2, 1--7. Google ScholarDigital Library
- N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. 2006. The M5 simulator: Modeling networked systems. IEEE Micro 26, 52--60. Google ScholarDigital Library
- T. E. Carlson, W. Heirman, K. V. Craeynest, and L. Eeckhout. 2014. BarrierPoint: Sampled simulation of multi-threaded applications. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2--12.Google Scholar
- T. E. Carlson, W. Heirman, and L. Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 52:1--52:12. Google ScholarDigital Library
- T. E. Carlson, W. Heirman, and L. Eeckhout. 2013. Sampled simulation of multi-threaded applications. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 2--12.Google Scholar
- J. Chen, L. K. Dabbiru, D. Wong, M. Annavaram, and M. Dubois. 2010. Adaptive and speculative slack simulations of CMPs on CMPs. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 523--534. Google ScholarDigital Library
- X. E. Chen and T. M. Aamodt. 2011. Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. ACM Transactions on Architecture and Code Optimization 8, 3, 10:1--10:28. Google ScholarDigital Library
- D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, D. E. Johnson, J. Keefe, and H. Angepat. 2007. FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 249--261. Google ScholarDigital Library
- Y. Chou, B. Fahs, and S. Abraham. 2004. Microarchitecture optimizations for exploiting memory-level parallelism. In Proceedings of the International Symposium on Computer Architecture (ISCA). 76--87. Google ScholarDigital Library
- E. S. Chung, E. Nurvitadhi, J. C. Hoe, B. Falsafi, and K. Mai. 2008. A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA). 77--86. Google ScholarDigital Library
- L. Eeckhout, R. H. Bell Jr, B. Stougie, K. De Bosschere, and L. K. John. 2004. Control flow modeling in statistical simulation for accurate and efficient processor design studies. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 350--361. Google ScholarDigital Library
- L. Eeckhout, S. Nussbaum, J. E. Smith, and K. De Bosschere. 2003. Statistical simulation: Adding efficiency to the computer designer’s toolbox. IEEE Micro 23, 5, 26--38. Google ScholarDigital Library
- J. Emer, P. Ahuja, E. Borch, A. Klauser, C.-K. Luk, S. Manne, S. Mukherjee, H. Patil, S. Wallace, N. Binkert, R. Espasa, and T. Juan. 2002. Asim: A performance model framework. Computer 35, 2, 68--76. Google ScholarDigital Library
- J. S. Emer and D. W. Clark. 1984. A characterization of processor performance in the VAX-11/780. In Proceedings of the 11th Annual International Symposium on Computer Architecture (ISCA). 301--310. Google ScholarDigital Library
- P. G. Emma. 1997. Understanding some simple processor-performance limits. IBM Journal of Research and Development 41, 3, 215--232. Google ScholarDigital Library
- S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. 2006. A performance counter architecture for computing accurate CPI components. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 175--184. Google ScholarDigital Library
- S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. 2009. A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems 27, 2, 42--53. Google ScholarDigital Library
- A. Fog. 2013. Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Retrieved July 22, 2014, from http://www.agner.org/optimize/instruction_tables.pdf.Google Scholar
- D. Genbrugge, S. Eyerman, and L. Eeckhout. 2010. Interval simulation: Raising the level of abstraction in architectural simulation. In Proceedings of the 16th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 307--318.Google Scholar
- K. Ghose, A. Patel, F. Afram, H. Zheng, and J. Tringali. 2012. MARSS: Micro Architectural Systems Simulator. Retrieved July 22, 2014, from http://cloud.github.com/downloads/avadhpatel/marss/Marss_ISCA_2012_tutorial.pdf.Google Scholar
- A. Glew. 1998. MLP yes! ILP no! In Proceedings of the ASPLOS Wild and Crazy Idea Session.Google Scholar
- P. Greenhalgh. 2011. big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM white paper.Google Scholar
- N. Hardavellas, S. Somogyi, T. F. Wenisch, R. E. Wunderlich, S. Chen, J. Kim, B. Falsafi, J. C. Hoe, and A. G. Nowatzyk. 2004. SimFlex: A fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture. SIGMETRICS Performance Evaluation Review 31, 4, 31--34. Google ScholarDigital Library
- A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob. 2008. CMP$im: A pin-based on-the-fly multi-core cache simulator. In Proceedings of the 4th Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), co-located with ISCA 2008. 28--36.Google Scholar
- T. Karkhanis and J. E. Smith. 2004. A first-order superscalar processor model. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 338--349. Google ScholarDigital Library
- A. Krasnov, A. Schultz, J. Wawrzynek, G. Gibeling, and P.-Y. Droz. 2007. RAMP Blue: A message-passing manycore system in FPGAs. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL). 54--61.Google ScholarCross Ref
- J. D. Little. 1961. A proof for the queuing formula: L = λ W. Operations Research 9, 3, 383--387. Google ScholarDigital Library
- G. Loh, S. Subramaniam, and Y. Xie. 2009. Zesto: A cycle-level simulator for highly detailed microarchitecture exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’09). 53--64.Google Scholar
- C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 190--200. Google ScholarDigital Library
- J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald III, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal. 2010. Graphite: A distributed parallel simulator for multicores. In Proceedings of the 16th IEEE International Symposium on High-Performance Computer Architecture (HPCA). 1--12.Google Scholar
- S. S. Mukherjee, S. K. Reinhardt, B. Falsafi, M. Litzkow, M. D. Hill, D. A. Wood, S. Huss-Lederman, and J. R. Larus. 2000. Wisconsin wind tunnel II: A fast, portable parallel architecture simulator. IEEE Concurrency 8, 4, 12--20. Google ScholarDigital Library
- S. Nussbaum and J. E. Smith. 2001. Modeling superscalar processors via statistical simulation. In Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques (PACT). 15--24. Google ScholarDigital Library
- M. Oskin, F. Chong, and M. Farrens. 2000. HLS: Combining statistical and symbolic simulation to guide microprocessor designs. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA). 71--82. Google ScholarDigital Library
- A. Patel, F. Afram, S. Chen, and K. Ghose. 2011. MARSS×86: A full system simulator for ×86 CPUs. In Proceedings of the Design Automation Conference (DAC). 1050--1055. Google ScholarDigital Library
- M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer. 2011. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 406--417. Google ScholarDigital Library
- D. Sanchez and C. Kozyrakis. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA). 475--486. Google ScholarDigital Library
- T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. 2002. Automatically characterizing large scale program behavior. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 45--57. Google ScholarDigital Library
- T. Taha and D. Wills. 2008. An instruction throughput model of superscalar processors. IEEE Transactions on Computers 57, 3, 389--403. Google ScholarDigital Library
- V. Uzelac and A. Milenkovic. 2009. Experiment flows and microbenchmarks for reverse engineering of branch predictor structures. In Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 207--217.Google Scholar
- S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture (ISCA). 24--36. Google ScholarDigital Library
- R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. 2003. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA). 84--95. Google ScholarDigital Library
- M. Zagha, B. Larson, S. Turner, and M. Itzkowitz. 1996. Performance analysis using the MIPS R10000 performance counters. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (SC). Article No. 16. Google ScholarDigital Library
Index Terms
- An Evaluation of High-Level Mechanistic Core Models
Recommendations
Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and AnalysisTwo major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable ...
Power-aware multi-core simulation for early design stage hardware/software co-optimization
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniquesStringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel ...
Mars: Accelerating MapReduce with Graphics Processors
We design and implement Mars, a MapReduce runtime system accelerated with graphics processing units (GPUs). MapReduce is a simple and flexible parallel programming paradigm originally proposed by Google, for the ease of large-scale data processing on ...
Comments