HoPE: Hot-Cacheline Prediction for Dynamic Early Decompression in Compressed LLCs

Abstract
Data compression plays a pivotal role in improving system performance and reducing energy consumption, because it increases the effective capacity of a compressed memory system without physically increasing the memory size. However, data compression comes at a cost: non-negligible compression and decompression overhead. This overhead becomes more severe when compression is applied to caches, where access latency is critical. In this article, we aim to minimize the read-hit decompression penalty in compressed Last-Level Caches (LLCs) by speculatively decompressing frequently used cachelines. To this end, we propose a Hot-cacheline Prediction and Early decompression (HoPE) mechanism that consists of three synergistic techniques: Hot-cacheline Prediction (HP), Early Decompression (ED), and Hit-history-based Insertion (HBI). HP and HBI efficiently identify hot compressed cachelines, while ED selectively decompresses hot cachelines based on their size information. Unlike previous approaches, the HoPE framework considers the tradeoff between increased effective cache capacity and the decompression penalty. To evaluate the effectiveness of the proposed HoPE mechanism, we run extensive simulations on memory traces obtained from multi-threaded benchmarks running on a full-system simulation framework. We observe significant performance improvements over compressed cache schemes employing the conventional Least-Recently Used (LRU) replacement policy, the Dynamic Re-Reference Interval Prediction (DRRIP) scheme, and the Effective Capacity Maximizer (ECM) compressed cache management mechanism. Specifically, HoPE improves system performance by approximately 11%, on average, over LRU, 8% over DRRIP, and 7% over ECM, by reducing the read-hit decompression penalty by around 65% over a wide range of applications.
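The core idea in the abstract — track per-line hit history and speculatively store frequently hit ("hot") lines in decompressed form so later read hits skip the decompression penalty, subject to a size-aware space check — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the hit threshold, decompression latency, and the free-space test are assumed values chosen for the example.

```python
# Illustrative sketch of the HoPE idea: hit-history tracking (HP/HBI input)
# plus size-aware early decompression (ED). All parameters are assumptions.

DECOMPRESS_LATENCY = 5  # assumed decompression penalty, in cycles
HOT_HIT_THRESHOLD = 4   # assumed: hits before a line is considered "hot"

class CacheLine:
    def __init__(self, tag, compressed_size, full_size=64):
        self.tag = tag
        self.compressed_size = compressed_size  # bytes after compression
        self.full_size = full_size              # uncompressed line size
        self.hits = 0                           # per-line hit history
        self.decompressed = False               # stored in expanded form?

def on_read_hit(line, set_free_bytes):
    """Return the decompression penalty paid by this read hit, then
    update the hit history and, if the line has become hot and the set
    has room for its expanded form, store it decompressed so future
    hits pay no penalty (early decompression)."""
    line.hits += 1
    penalty = 0 if line.decompressed else DECOMPRESS_LATENCY
    if (not line.decompressed
            and line.hits >= HOT_HIT_THRESHOLD
            and set_free_bytes >= line.full_size - line.compressed_size):
        line.decompressed = True  # ED: trade capacity for latency
    return penalty

# Usage: a line compressed to 24 B is hit repeatedly; once it crosses the
# hot threshold (and the set has 40 B of slack), later hits are penalty-free.
line = CacheLine(tag=0x1A, compressed_size=24)
penalties = [on_read_hit(line, set_free_bytes=64) for _ in range(5)]
```

The sketch captures the tradeoff the abstract emphasizes: decompressing a hot line consumes some of the effective capacity gained by compression, so the decision is gated on both hit history and available space.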