ABSTRACT
With the rapid development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face memory hierarchy design choices aimed at avoiding swapping to disk, which slows application execution drastically. Leveraging free remote memory through Memory Collaboration has been shown to be cost-effective compared to overprovisioning for peak load requirements. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, but do not explore dynamic memory collaboration in detail. Dynamic collaboration is important given the run-time fluctuations in memory usage in clustered systems.
In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provide QoS measures for nodes participating in the system. We implement a prototype of the proposed ACMS, experiment with a wide range of real-world applications, and show up to 3x speedup over a non-collaborative memory system, without perceivable performance impact on nodes that provide memory. Based on our experiments, we conduct a detailed analysis of the remote memory access overhead and provide insights for future optimizations.
- A. Agarwal. Facebook: Science and the social graph. http://www.infoq.com/presentations/Facebook-Software-Stack, 2009. Presented at QCon San Francisco.
- Apache. Hadoop. http://hadoop.apache.org/, 2011.
- M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches. In HPCA '09: 2009 IEEE 15th Intl. Symp. on High Performance Computer Architecture, 2009.
- A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schuepbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. In SOSP '09: 22nd ACM symposium on Operating systems principles, New York, NY, USA, 2009. ACM Press.
- B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In MICRO 39: 39th IEEE/ACM Intl. Symp. on Microarchitecture, 2006.
- J. Chang and G. S. Sohi. Cooperative Caching for Chip Multiprocessors. In Computer Architecture, 2006. ISCA '06. 33rd Intl. Symp. on, 2006.
- H. Chen, Y. Luo, X. Wang, B. Zhang, Y. Sun, and Z. Wang. A transparent remote paging model for virtual machines, 2008.
- Z. Chishti, M. D. Powell, and T. N. Vijaykumar. Optimizing Replication, Communication and Capacity Allocation in CMPs. In the 32nd ISCA, June 2005.
- Intel Corp. Chip shot: Intel outlines low-power micro server strategy, 2011.
- G. Dhiman, R. Ayoub, and T. Rosing. PDRAM: a hybrid PRAM and DRAM main memory system. In 46th Design Automation Conf., DAC '09, pages 664--669, New York, NY, USA, 2009. ACM.
- Fedora Project. Intel Core i7-800 Processor Series. http://fedoraproject.org/, 2010.
- Intel Corp. Thunderbolt Technology. http://www.intel.com/technology/io/thunderbolt/index.htm, 2011.
- Intel Microarchitecture. Intel Core i7-800 Processor Series. http://download.intel.com/products/processor/corei7/319724.pdf, 2010.
- S. Liang, R. Noronha, and D. Panda. Swapping to remote memory over InfiniBand: An approach using a high performance network block device. In Cluster Computing, 2005. IEEE Intl., pages 1--10, 2005.
- K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated memory for expansion and sharing in blade servers. In 36th annual international symposium on Computer architecture, ISCA '09, pages 267--278, New York, NY, USA, 2009. ACM.
- E. P. Markatos and G. Dramitinos. Implementation of a reliable remote memory pager. In USENIX Technical Conf., pages 177--190, 1996.
- E. P. Markatos and G. Dramitinos. Adding flexibility to a remote memory pager, 1996.
- M. L. Massie, B. N. Chun, and D. E. Culler. The ganglia distributed monitoring system: Design, implementation and experience, 2004.
- C. R. R. Maule. iwarp ethernet: key to driving ethernet into high performance environments. In 2006 ACM/IEEE conference on Supercomputing, SC '06, New York, NY, USA, 2006. ACM.
- H. Midorikawa, M. Kurokawa, R. Himeno, and M. Sato. DLM: A distributed large memory system using remote memory swapping over cluster nodes. In Cluster Computing, 2008 IEEE Intl. Conf. on, pages 268--273, 2008.
- Network Block Device TCP version. NBD. http://nbd.sourceforge.net/, 2011.
- T. Newhall, S. Finney, K. Ganchev, and M. Spiegel. Nswap: A network swapping module for linux clusters, 2003.
- J. K. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, M. Rosenblum, S. M. Rumble, E. Stratmann, and R. Stutsman. The case for ramclouds: Scalable high-performance storage entirely in DRAM. In SIGOPS OSR, 2009.
- M. Qureshi. Adaptive Spill-Receive for Robust High-Performance Caching in CMPs. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th Intl. Symp. on, 2009.
- M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable high performance main memory system using phase-change memory technology. In 36th annual international symposium on Computer architecture, ISCA '09, pages 24--33, New York, NY, USA, 2009. ACM.
- N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural support for operating system-driven CMP cache management. In PACT '06: 15th international conference on Parallel architectures and compilation techniques, 2006.
- L. E. Ramos, E. Gorbatov, and R. Bianchini. Page placement in hybrid memory systems. In international conference on Supercomputing, ICS '11, pages 85--95, New York, NY, USA, 2011. ACM.
- A. Rao. Seamicro technology overview, 2010.
- A. Romanow and S. Bailey. An overview of RDMA over IP. In 1st Intl. Workshop on Protocols for Fast Long-Distance Networks (PFLDnet), 2003.
- A. Samih, A. Krishna, and Y. Solihin. Understanding the limits of capacity sharing in CMP Private Caches. In CMP-MSI, 2009.
- A. Samih, A. Krishna, and Y. Solihin. Evaluating Placement Policies for Managing Capacity Sharing in CMP Architectures with Private Caches. ACM Trans. on Architecture and Code Optimization (TACO), 8(3), 2011.
- M. Schlansker, N. Chitlur, E. Oertli, P. M. Stillwell, Jr, L. Rankin, D. Bradford, R. J. Carter, J. Mudigonda, N. Binkert, and N. P. Jouppi. High-performance ethernet-based communications for future multi-core processors. In 2007 ACM/IEEE conference on Supercomputing, SC '07, pages 37:1--37:12, New York, NY, USA, 2007. ACM.
- Standard Performance Evaluation Corporation. http://www.specbench.org, 2006.
- D. K. Tam, R. Azimi, L. B. Soares, and M. Stumm. RapidMRC: Approximating L2 Miss Rate Curves on Commodity Systems for Online Optimizations. SIGPLAN Not., 44(3), 2009.
- A. S. Tanenbaum and R. Van Renesse. Distributed operating systems. ACM Comput. Surv., 17:419--470, 1985.
- Transaction Processing Performance Council. TPC-H 2.14.2. http://www.tpc.org/tpch/, 2011.
- vmware. experience game-changing virtual machine mobility. http://www.vmware.com/products/vmotion/overview.html, 2011.
- N. Wang, X. Liu, J. He, J. Han, L. Zhang, and Z. Xu. Collaborative memory pool in cluster system. In Parallel Processing, 2007. ICPP 2007. Intl. Conf. on, page 17, 2007.
- M. Zhang and K. Asanovic. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In ISCA '05: 32nd annual international symposium on Computer Architecture, 2005.
- A collaborative memory system for high-performance and cost-effective clustered architectures