An Evaluation of High-Level Mechanistic Core Models

Authors:
Trevor E. Carlson (Ghent University, Gent, Belgium)
Wim Heirman (Intel, ExaScience Lab, Leuven, Belgium)
Stijn Eyerman (Ghent University, Gent, Belgium)
Ibrahim Hur (Intel, ExaScience Lab, Leuven, Belgium)
Lieven Eeckhout (Ghent University, Gent, Belgium)

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 3, Article No. 28, pp. 1–25. https://doi.org/10.1145/2629677

Published: 25 August 2014

Abstract

Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results.

In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation that improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, such as a one-IPC core model, can lead to misleading or incorrect conclusions in practical design studies. Validation against real hardware shows good accuracy: the IW-centric model has an average single-core error of 11.1% and a maximum of 18.8%, at a 1.5× slowdown compared to interval simulation.
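
To make the contrast between a one-IPC model and an interval-style model more concrete, here is a minimal sketch. It is not the article's implementation. It assumes a first-order view in which an interval-style estimate is steady-state dispatch at the machine width plus a fixed penalty per miss event, and a deliberately simplified one-IPC estimate that charges one cycle per instruction and omits any memory-stall handling such models may include. The names `MissEvent`, `one_ipc_cycles`, `interval_cycles`, and `relative_error` are hypothetical, as are any numbers used with them.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MissEvent:
    """Hypothetical miss event (e.g. branch mispredict, cache miss)
    and the number of cycles it is assumed to stall dispatch."""
    kind: str
    penalty_cycles: int


def one_ipc_cycles(num_instructions: int) -> int:
    """Simplified one-IPC estimate: one cycle per instruction."""
    return num_instructions


def interval_cycles(num_instructions: int,
                    dispatch_width: int,
                    miss_events: Sequence[MissEvent]) -> int:
    """First-order interval-style estimate (an assumption of this sketch):
    dispatch at full width, plus the penalty of every miss event."""
    base = -(-num_instructions // dispatch_width)  # ceiling division
    return base + sum(event.penalty_cycles for event in miss_events)


def relative_error(predicted: float, measured: float) -> float:
    """Absolute relative error of a prediction against a measurement."""
    return abs(predicted - measured) / measured
```

With made-up inputs of 1,000 instructions, a dispatch width of 4, one 15-cycle branch misprediction, and one 200-cycle last-level cache miss, `interval_cycles` returns 465 cycles while `one_ipc_cycles` returns 1,000. Against a hypothetical hardware measurement of 500 cycles, `relative_error` gives 0.07 for the interval-style estimate and 1.0 for the one-IPC estimate, which illustrates how a naive core model can distort a design study.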


Published in
ACM Transactions on Architecture and Code Optimization Volume 11, Issue 3
October 2014
298 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2658949
Editor:
Koen De Bosschere
Ghent University
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 25 August 2014
- Accepted: 1 May 2014
- Revised: 1 March 2014
- Received: 1 December 2013
Author Tags
Interval simulation
design space exploration
interval model
multicore processor
performance modeling
Qualifiers
- research-article
- Research
- Refereed