DOI: 10.1145/3123939.3124555
Research article

Banshee: bandwidth-efficient DRAM caching via software/hardware cooperation

Published: 14 October 2017

ABSTRACT

Placing the DRAM in the same package as a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-package DRAM is as a large cache. Unfortunately, most previous DRAM cache designs optimize mainly for cache hit latency and do not consider bandwidth efficiency as a first-class design constraint. Hence, as we show in this paper, these designs are suboptimal for use with in-package DRAM.

We propose a new DRAM cache design, Banshee, that optimizes for both in-package and off-package DRAM bandwidth efficiency without degrading access latency. Banshee is based on two key ideas. First, it eliminates the tag lookup overhead by tracking the contents of the DRAM cache using TLBs and page table entries, which is efficiently enabled by a new lightweight TLB coherence protocol we introduce. Second, it reduces unnecessary DRAM cache replacement traffic with a new bandwidth-aware frequency-based replacement policy. Our evaluations show that Banshee significantly improves performance (15% on average) and reduces DRAM traffic (35.8% on average) over the best previous latency-optimized DRAM cache design.
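The frequency-based replacement idea can be illustrated with a rough sketch: each page carries a small saturating counter, counters are updated only on a sample of accesses to limit metadata traffic, and a page is admitted into the cache only when it is clearly hotter than the page it would evict. This is a simplified illustration under assumed parameters (counter width, sampling rate, single-set model), not the paper's exact mechanism:

```python
import random

class FreqReplacementSet:
    """Sketch of one DRAM-cache set with frequency-based replacement.

    Hypothetical simplification of a bandwidth-aware policy: each page
    has a small saturating counter, and a missing page replaces the
    current victim only when its counter exceeds the victim's. This
    avoids pages ping-ponging in and out of the cache, which is what
    wastes DRAM bandwidth in naive always-install policies.
    """

    COUNTER_MAX = 31   # assumed 5-bit saturating counter
    SAMPLE_RATE = 0.1  # update counters on only a sample of accesses

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.cached = {}     # page -> counter, for pages in the cache
        self.candidate = {}  # page -> counter, for uncached pages

    def access(self, page):
        hit = page in self.cached
        # Sampling keeps counter-update (and replacement) traffic low.
        if random.random() < self.SAMPLE_RATE:
            table = self.cached if hit else self.candidate
            table[page] = min(table.get(page, 0) + 1, self.COUNTER_MAX)
            if not hit and self.maybe_replace(page):
                return False  # miss; page is being installed now
        return hit

    def maybe_replace(self, page):
        if len(self.cached) < self.num_ways:
            self.cached[page] = self.candidate.pop(page)
            return True
        victim = min(self.cached, key=self.cached.get)
        # Replace only if the newcomer is strictly hotter than the victim.
        if self.candidate[page] > self.cached[victim]:
            self.candidate[victim] = self.cached.pop(victim)
            self.cached[page] = self.candidate.pop(page)
            return True
        return False
```

With a direct-mapped set (`num_ways=1`) and sampling forced to 1.0, a newly touched page must be accessed more often than the resident page before it displaces it, so a one-off streaming access never evicts a hot page.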


Published in

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017, 850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions, 22%
