research-article

Fast and Cycle-Accurate Emulation of Large-Scale Networks-on-Chip Using a Single FPGA

Published: 13 December 2017

Abstract

Modeling and simulation/emulation play a major role in research and development of novel Networks-on-Chip (NoCs). However, conventional software simulators are so slow that studying NoCs for emerging many-core systems with hundreds to thousands of cores is challenging. State-of-the-art FPGA-based NoC emulators have shown great potential in speeding up NoC simulation, but they cannot emulate large-scale NoCs due to FPGA capacity constraints. Moreover, emulating large-scale NoCs under synthetic workloads on FPGAs typically requires a large amount of memory and thus involves the use of off-chip memory, which makes the overall design much more complicated and may substantially degrade the emulation speed. This article presents methods for fast and cycle-accurate emulation of NoCs with up to thousands of nodes using a single FPGA. We first describe how to emulate a NoC under a synthetic workload using only FPGA on-chip memory (BRAMs). We next present a novel use of time-division multiplexing in which BRAMs are used to emulate an entire network with only a small number of physical nodes, thereby overcoming the FPGA capacity constraints. We propose methods for emulating both direct and indirect networks, focusing on the commonly used meshes and fat-trees (k-ary n-trees). This is different from prior work that considers only direct networks. Using the proposed methods, we build a NoC emulator, called FNoC, and demonstrate the emulation of some mesh-based and fat-tree-based NoCs with canonical router architectures.
Our evaluation results show that (1) the size of the largest NoC that can be emulated depends only on the FPGA on-chip memory capacity; (2) a mesh-based NoC with 16,384 nodes (128×128 NoC) and a fat-tree-based NoC with 6,144 switch nodes and 4,096 terminal nodes (4-ary 6-tree NoC) can be emulated using a single Virtex-7 FPGA; and (3) when emulating these two NoCs, we achieve, respectively, 5,047× and 232× speedups over BookSim, one of the most widely used software-based NoC simulators, while maintaining the same level of accuracy.
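
The node counts quoted in result (2) can be checked against the standard topology definitions: a rows × cols mesh has rows·cols nodes, and a k-ary n-tree has n·k^(n−1) switches and k^n terminals. The short sketch below only reproduces that arithmetic. The function names are illustrative assumptions and are not part of FNoC or BookSim.

```python
from typing import Tuple


def mesh_nodes(rows: int, cols: int) -> int:
    """Number of nodes in a rows x cols 2D mesh."""
    return rows * cols


def fat_tree_nodes(k: int, n: int) -> Tuple[int, int]:
    """Return (switch nodes, terminal nodes) of a k-ary n-tree.

    A k-ary n-tree has n levels of k**(n-1) switches each
    and k**n terminal nodes attached to the lowest level.
    """
    switches = n * k ** (n - 1)
    terminals = k ** n
    return switches, terminals


if __name__ == "__main__":
    # 128x128 mesh from the abstract
    print(mesh_nodes(128, 128))
    # 4-ary 6-tree from the abstract
    print(fat_tree_nodes(4, 6))
```

Running it gives 16,384 mesh nodes and 6,144 switch nodes plus 4,096 terminal nodes for the fat-tree, matching the figures stated above.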

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 10, Issue 4
December 2017
119 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/3166118
Editor: Steve Wilton

Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 13 December 2017
• Accepted: 1 July 2017
• Revised: 1 April 2017
• Received: 1 October 2016

Qualifiers

• research-article
• Research
• Refereed
