ABSTRACT
Recent research efforts propose remote memory systems that pool memory from multiple hosts. These systems rely on the virtual memory subsystem to track application memory accesses and transparently offer remote memory to applications. We outline several limitations of this approach, such as page fault overheads and dirty data amplification. Instead, we argue for a fundamentally different approach: leveraging the local host's cache coherence traffic to track application memory accesses at cache line granularity. Our approach uses emerging cache-coherent FPGAs to expose cache coherence events to the operating system. It not only accelerates remote memory systems by reducing dirty data amplification and eliminating page faults, but also enables other use cases, such as live virtual machine migration, unified virtual memory, security, and code analysis. These use cases open up many promising research directions.
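To make the notion of dirty data amplification concrete, the following is a minimal sketch, not the authors' implementation. It compares how many bytes must be written back when modifications are tracked per page versus per cache line. The 4 KiB page size, the 64-byte cache line size, the function names, and the workload are all illustrative assumptions that do not come from the paper.

```python
# Illustrative sketch only: sizes and names below are assumptions.
PAGE_SIZE = 4096       # bytes; assumed typical page size
CACHE_LINE_SIZE = 64   # bytes; assumed typical cache line size


def dirty_bytes(write_addresses, granularity):
    """Bytes written back if dirtiness is tracked in units of `granularity` bytes.

    Each address is treated as a single written byte; stores that straddle
    a unit boundary are not modeled.
    """
    dirty_units = {addr // granularity for addr in write_addresses}
    return len(dirty_units) * granularity


def amplification(write_addresses):
    """Ratio of page-granularity to cache-line-granularity write-back volume."""
    line = dirty_bytes(write_addresses, CACHE_LINE_SIZE)
    if line == 0:
        return 1.0  # nothing was written, so nothing is amplified
    return dirty_bytes(write_addresses, PAGE_SIZE) / line


# Hypothetical workload: one small store at the start of each of 100 pages.
writes = [i * PAGE_SIZE for i in range(100)]
page_bytes = dirty_bytes(writes, PAGE_SIZE)
line_bytes = dirty_bytes(writes, CACHE_LINE_SIZE)
ratio = amplification(writes)
```

Under these assumptions, a sparse write pattern that touches one cache line per page transfers 64 times more data with page-granularity tracking than with cache-line-granularity tracking. Dense writes that cover whole pages show no difference.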
Index Terms
- Project PBerry: FPGA Acceleration for Remote Memory