DOI: 10.1145/3317550.3321424
research-article
Open Access

Project PBerry: FPGA Acceleration for Remote Memory

Published: 13 May 2019

ABSTRACT

Recent research efforts propose remote memory systems that pool memory from multiple hosts. These systems rely on the virtual memory subsystem to track application memory accesses and transparently offer remote memory to applications. We outline several limitations of this approach, such as page fault overheads and dirty data amplification. Instead, we argue for a fundamentally different approach: leverage the local host's cache coherence traffic to track application memory accesses at cache line granularity. Our approach uses emerging cache-coherent FPGAs to expose cache coherence events to the operating system. This approach not only accelerates remote memory systems by reducing dirty data amplification and by eliminating page faults, but also enables other use cases, such as live virtual machine migration, unified virtual memory, security and code analysis. All of these use cases open up many promising research directions.
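The dirty data amplification described in the abstract can be quantified with a small calculation. The sketch below is illustrative only and is not the authors' implementation. It assumes 4 KiB pages and 64-byte cache lines (common x86 values, not parameters given here), models each write by its starting address, and uses hypothetical function names. It compares the bytes written back when dirty state is tracked per page, as with virtual-memory dirty bits, against tracking per cache line, as coherence-based tracking would allow.

```python
# Illustrative sketch, assuming 4 KiB pages and 64-byte cache lines.
# Each write is modeled by its starting address and is assumed not to
# straddle a cache-line boundary. All names here are hypothetical.

PAGE_SIZE = 4096
LINE_SIZE = 64


def dirty_units(write_addrs, unit_size):
    """Return the set of unit indices (pages or lines) touched by the writes."""
    return {addr // unit_size for addr in write_addrs}


def writeback_bytes(write_addrs, unit_size):
    """Bytes transferred if every dirty unit is written back in full."""
    return len(dirty_units(write_addrs, unit_size)) * unit_size


def amplification(write_addrs):
    """Ratio of page-granularity to line-granularity writeback volume."""
    page_bytes = writeback_bytes(write_addrs, PAGE_SIZE)
    line_bytes = writeback_bytes(write_addrs, LINE_SIZE)
    return page_bytes / line_bytes


if __name__ == "__main__":
    # Hypothetical workload: one small store at offset 128 in each of 100 pages.
    writes = [page * PAGE_SIZE + 128 for page in range(100)]
    print(writeback_bytes(writes, PAGE_SIZE))  # 409600
    print(writeback_bytes(writes, LINE_SIZE))  # 6400
    print(amplification(writes))               # 64.0
```

In this worst case, one small store per page forces a full 4 KiB writeback under page-granularity tracking, a 64x amplification over line-granularity tracking; a workload that dirties most lines of each page would see a ratio closer to 1.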

Published in

HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems
May 2019
227 pages
ISBN: 9781450367271
DOI: 10.1145/3317550

Copyright © 2019 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited