Efficient Virtual Memory for Big Memory Servers

ABSTRACT
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory: they consume as much as 10% of execution cycles on TLB misses, even when using large pages. At the same time, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory.
To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware---base, limit and offset registers per core---to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed.
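The translation rule a direct segment implements can be sketched in a few lines. This is a minimal functional model of the assumed semantics (register names and the page-walk fallback are illustrative, not the paper's exact hardware definition): a virtual address inside [base, limit) translates by simple addition and can never miss in the TLB; everything else falls back to conventional paging.

```python
PAGE_SIZE = 4096

class DirectSegment:
    """Models the per-core base, limit, and offset registers."""

    def __init__(self, base, limit, offset):
        # Segment bounds are assumed page-aligned for illustration.
        assert base % PAGE_SIZE == 0 and limit % PAGE_SIZE == 0
        self.base, self.limit, self.offset = base, limit, offset

    def translate(self, va, page_walk):
        if self.base <= va < self.limit:
            return va + self.offset   # direct mapping: no TLB lookup, cannot miss
        return page_walk(va)          # outside the segment: ordinary paged translation

# Example: map virtual [1 GiB, 5 GiB) onto physical [2 GiB, 6 GiB),
# i.e. offset = 1 GiB, as a buffer pool or key-value store might be.
GiB = 1 << 30
seg = DirectSegment(base=1 * GiB, limit=5 * GiB, offset=1 * GiB)
pa = seg.translate(1 * GiB + 123, page_walk=lambda va: None)  # hits the segment
```

In hardware this comparison and addition happen in parallel with the TLB lookup, which is why the scheme adds only base, limit, and offset registers per core rather than new translation structures.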
We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.