ABSTRACT
Placing the DRAM in the same package as a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs optimize mainly for cache hit latency and do not consider bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM.
We propose a new DRAM cache design, Banshee, that optimizes for both in-package and off-package DRAM bandwidth efficiency without degrading access latency. Banshee is based on two key ideas. First, it eliminates the tag lookup overhead by tracking the contents of the DRAM cache using TLBs and page table entries, which is efficiently enabled by a new lightweight TLB coherence protocol we introduce. Second, it reduces unnecessary DRAM cache replacement traffic with a new bandwidth-aware frequency-based replacement policy. Our evaluations show that Banshee significantly improves performance (15% on average) and reduces DRAM traffic (35.8% on average) over the best previous latency-optimized DRAM cache design.
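The second idea, bandwidth-aware frequency-based replacement, can be illustrated with a minimal sketch. The sketch below assumes a direct-mapped page-granularity cache, sampled counter updates, and a replacement threshold; the class name, parameters, and return values are illustrative, not the paper's exact mechanism. The key point it shows is that a miss only triggers a replacement (and its associated fill/writeback traffic) when the candidate page is measurably hotter than the cached victim; otherwise the miss bypasses the cache.

```python
import random


class FrequencyBasedReplacement:
    """Sketch of a bandwidth-aware frequency-based replacement policy
    for a direct-mapped, page-granularity DRAM cache. Illustrative only."""

    def __init__(self, sample_rate=0.1, threshold=2):
        self.sample_rate = sample_rate  # fraction of accesses that update a counter
        self.threshold = threshold      # margin required to trigger replacement
        self.cached = {}                # set index -> (page, frequency counter)
        self.counters = {}              # candidate page -> frequency counter

    def access(self, page, set_index):
        entry = self.cached.get(set_index)
        if entry and entry[0] == page:
            # Hit: probabilistically bump the cached page's counter.
            # Sampling keeps counter-update traffic low.
            if random.random() < self.sample_rate:
                self.cached[set_index] = (page, entry[1] + 1)
            return "hit"
        # Miss: bump the candidate's counter (sampled).
        if random.random() < self.sample_rate:
            self.counters[page] = self.counters.get(page, 0) + 1
        candidate_count = self.counters.get(page, 0)
        victim_count = entry[1] if entry else -self.threshold
        if candidate_count > victim_count + self.threshold:
            # Replace only when the candidate is clearly hotter than the
            # victim, avoiding wasteful fill and writeback bandwidth.
            self.counters.pop(page, None)
            self.cached[set_index] = (page, candidate_count)
            return "miss-replace"
        # Otherwise serve the miss from off-package DRAM without caching it.
        return "miss-bypass"
```

With `sample_rate=1.0` (every access counted), a newly accessed page fills an empty set immediately, but a competing page must accumulate enough accesses to exceed the resident page's count by the threshold before it displaces it, which is the traffic-saving behavior the abstract describes.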