Efficient Virtual Memory for Big Memory Servers

ABSTRACT
Our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory: they consume as much as 10% of execution cycles on TLB misses, even when using large pages. At the same time, we find that these workloads use read-write permission on most pages, are provisioned not to swap, and rarely benefit from the full flexibility of page-based virtual memory.
To remove the TLB miss overhead for big-memory workloads, we propose mapping part of a process's linear virtual address space with a direct segment, while page mapping the rest of the virtual address space. Direct segments use minimal hardware---base, limit and offset registers per core---to map contiguous virtual memory regions directly to contiguous physical memory. They eliminate the possibility of TLB misses for key data structures such as database buffer pools and in-memory key-value stores. Memory mapped by a direct segment may be converted back to paging when needed.
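The translation rule a direct segment implements can be sketched in a few lines. This is a minimal functional model of the assumed semantics (register names and the page-walk fallback are illustrative, not the paper's exact hardware definition): a virtual address inside [base, limit) translates by simple addition and can never miss in the TLB; everything else falls back to conventional paging.

```python
PAGE_SIZE = 4096

class DirectSegment:
    """Models the per-core base, limit, and offset registers."""

    def __init__(self, base, limit, offset):
        # Segment bounds are assumed page-aligned for illustration.
        assert base % PAGE_SIZE == 0 and limit % PAGE_SIZE == 0
        self.base, self.limit, self.offset = base, limit, offset

    def translate(self, va, page_walk):
        if self.base <= va < self.limit:
            return va + self.offset   # direct mapping: no TLB lookup, cannot miss
        return page_walk(va)          # outside the segment: ordinary paged translation

# Example: map virtual [1 GiB, 5 GiB) onto physical [2 GiB, 6 GiB),
# i.e. offset = 1 GiB, as a buffer pool or key-value store might be.
GiB = 1 << 30
seg = DirectSegment(base=1 * GiB, limit=5 * GiB, offset=1 * GiB)
pa = seg.translate(1 * GiB + 123, page_walk=lambda va: None)  # hits the segment
```

In hardware this comparison and addition happen in parallel with the TLB lookup, which is why the scheme adds only base, limit, and offset registers per core rather than new translation structures.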
We prototype direct-segment software support for x86-64 in Linux and emulate direct-segment hardware. For our workloads, direct segments eliminate almost all TLB misses and reduce the execution time wasted on TLB misses to less than 0.5%.