ABSTRACT
Placing the DRAM in the same package as a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs optimize mainly for cache hit latency and do not consider bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM.
We propose a new DRAM cache design, Banshee, that optimizes for both in-package and off-package DRAM bandwidth efficiency without degrading access latency. Banshee is based on two key ideas. First, it eliminates the tag lookup overhead by tracking the contents of the DRAM cache using TLBs and page table entries, which is efficiently enabled by a new lightweight TLB coherence protocol we introduce. Second, it reduces unnecessary DRAM cache replacement traffic with a new bandwidth-aware frequency-based replacement policy. Our evaluations show that Banshee significantly improves performance (15% on average) and reduces DRAM traffic (35.8% on average) over the best previous latency-optimized DRAM cache design.
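The second idea, bandwidth-aware frequency-based replacement, can be illustrated with a minimal sketch. The sketch below assumes a direct-mapped page-granularity cache, sampled counter updates, and a replacement threshold; the class name, parameters, and return values are illustrative, not the paper's exact mechanism. The key point it shows is that a miss only triggers a replacement (and its associated fill/writeback traffic) when the candidate page is measurably hotter than the cached victim; otherwise the miss bypasses the cache.

```python
import random


class FrequencyBasedReplacement:
    """Sketch of a bandwidth-aware frequency-based replacement policy
    for a direct-mapped, page-granularity DRAM cache. Illustrative only."""

    def __init__(self, sample_rate=0.1, threshold=2):
        self.sample_rate = sample_rate  # fraction of accesses that update a counter
        self.threshold = threshold      # margin required to trigger replacement
        self.cached = {}                # set index -> (page, frequency counter)
        self.counters = {}              # candidate page -> frequency counter

    def access(self, page, set_index):
        entry = self.cached.get(set_index)
        if entry and entry[0] == page:
            # Hit: probabilistically bump the cached page's counter.
            # Sampling keeps counter-update traffic low.
            if random.random() < self.sample_rate:
                self.cached[set_index] = (page, entry[1] + 1)
            return "hit"
        # Miss: bump the candidate's counter (sampled).
        if random.random() < self.sample_rate:
            self.counters[page] = self.counters.get(page, 0) + 1
        candidate_count = self.counters.get(page, 0)
        victim_count = entry[1] if entry else -self.threshold
        if candidate_count > victim_count + self.threshold:
            # Replace only when the candidate is clearly hotter than the
            # victim, avoiding wasteful fill and writeback bandwidth.
            self.counters.pop(page, None)
            self.cached[set_index] = (page, candidate_count)
            return "miss-replace"
        # Otherwise serve the miss from off-package DRAM without caching it.
        return "miss-bypass"
```

With `sample_rate=1.0` (every access counted), a newly accessed page fills an empty set immediately, but a competing page must accumulate enough accesses to exceed the resident page's count by the threshold before it displaces it, which is the traffic-saving behavior the abstract describes.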