ABSTRACT
Graph processing has become important for various applications in today's big data era. However, most graph processing applications suffer from large memory overhead due to random memory accesses. Such random memory access pattern provides little temporal and spatial locality which cannot be accelerated by the conventional hierarchical memory system. In this work, we propose GAS, a heterogeneous memory architecture, to accelerate graph applications implemented in message-based vertex program model, which is widely used in various graph processing systems. GAS utilizes the specialized content-addressable memory (CAM) to store random data, and determine exact access patterns by a series of associative search. Thus, GAS not only removes the inefficiency of random accesses but also reduces the memory access latency by accurate prefetching. We test the efficiency of GAS with three important graph processing kernels on five well-known graphs. Our experimental results show that GAS can significantly reduce cache miss rate and improve the bandwidth utilization as compared to a conventional system with a state-of-the-art graph-specific prefetching mechanism. These enhancements result in 34% and 27% reduction in energy consumption and execution time, respectively.
- Y. Low et al., "Distributed graphlab: a framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716--727, 2012. Google ScholarDigital Library
- G. Malewicz et al., "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 135--146, ACM, 2010. Google ScholarDigital Library
- S. Beamer et al., "Locality exists in graph processing: Workload characterization on an ivy bridge server," in IEEE IISWC, pp. 56--65, IEEE, 2015. Google ScholarDigital Library
- T. J. Ham et al., "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in IEEE/ACM Micro, pp. 1--13, IEEE, 2016. Google ScholarDigital Library
- M. M. Ozdal et al., "Energy efficient architecture for graph analytics accelerators," in ACM/IEEE ISCA, pp. 166--177, IEEE, 2016. Google ScholarDigital Library
- N. Sundaram et al., "Graphmat: High performance graph analytics made productive," Proceedings of the VLDB Endowment, vol. 8, no. 11, pp. 1214--1225, 2015. Google ScholarDigital Library
- M. Imani et al., "Ultra-efficient processing in-memory for data intensive applications," in ACM DAC, p. 6, ACM, 2017. Google ScholarDigital Library
- M. Imani et al., "Mpim: Multi-purpose in-memory processing using configurable resistive memory," in IEEE ASP-DAC, pp. 757--763, IEEE, 2017.Google Scholar
- M. Imani et al., "Nvquery: Efficient query processing in non-volatile memory," IEEE TCAD, 2018.Google Scholar
- M. Imani et al., "Efficient query processing in crossbar memory," in IEEE ISLPED, pp. 1--6, IEEE, 2017.Google Scholar
- J. Sim et al., "Lupis:latch-up based ultra efficient processing in-memory system," in IEEE ISQED, pp. 1--6, IEEE, 2018.Google Scholar
- S. Salehi et al., "Mitigating process variability for non-volatile cache resilience and yield," IEEE Transactions on Emerging Topics in Computing, 2018.Google ScholarCross Ref
- N. Khoshavi et al., "Variation-immune resistive non-volatile memory using self-organized sub-bank circuit designs," in IEEE ISQED, pp. 52--57, IEEE, 2017.Google Scholar
- L. Nai et al., "Graphpim: Enabling instruction-level pim offloading in graph computing frameworks," in IEEE HPCA, pp. 457--468, IEEE, 2017.Google Scholar
- L. Song et al., "Graphr: Accelerating graph processing using reram," arXiv preprint arXiv:1708.06248, 2017.Google Scholar
- S. Ainsworth et al., "Graph prefetching using data structure knowledge," in Proceedings of the 2016 International Conference on Supercomputing, p. 39, ACM, 2016. Google ScholarDigital Library
- X. Yu et al., "Imp: Indirect memory prefetcher," in Proceedings of the 48th International Symposium on Microarchitecture, pp. 178--190, ACM, 2015. Google ScholarDigital Library
- M. S. Riazi et al., "Camsure: Secure content-addressable memory for approximate search," ACM TECS, vol. 16, no. 5s, p. 136, 2017. Google ScholarDigital Library
- M. Sharad et al., "Ultra low power associative computing with spin neurons and resistive crossbar memory," in ACM DAC, p. 107, ACM, 2013. Google ScholarDigital Library
- M. Imani et al., "Exploring hyperdimensional associative memory," in IEEE HPCA, pp. 445--456, IEEE, 2017.Google Scholar
- M. Imani et al., "Masc: Ultra-low energy multiple-access single-charge tcam for approximate computing," in IEEE/ACM DATE, pp. 373--378, EDA Consortium, 2016. Google ScholarDigital Library
- M. Imani et al., "Approximate computing using multiple-access single-charge associative memory," IEEE TETC, 2016.Google Scholar
- T. E. Carlson et al., "An evaluation of high-level mechanistic core models," ACM TACO, vol. 11, no. 3, p. 28, 2014. Google ScholarDigital Library
- C.-K. Luk et al., "Pin: building customized program analysis tools with dynamic instrumentation," in Acm sigplan notices, vol. 40, pp. 190--200, ACM, 2005. Google ScholarDigital Library
- S. Li et al., "Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures," in Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pp. 469--480, IEEE, 2009. Google ScholarDigital Library
- W. Heirman et al., "Power-aware multi-core simulation for early design stage hardware/software co-optimization," in IEEE PACT, pp. 3--12, IEEE, 2012. Google ScholarDigital Library
- P. Shivakumar et al., "Cacti 3.0: An integrated cache timing, power, and area model," 2001.Google Scholar
- S. Kvatinsky et al., "Vteam: A general model for voltage-controlled memristors," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786--790, 2015.Google ScholarCross Ref
- X. Dong et al., "Nvsim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Emerging Memory Technologies, pp. 15--50, Springer, 2014.Google Scholar
- P. Boldi et al., "Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks," in ACM WWW. Google ScholarDigital Library
- P. Boldi et al., "The WebGraph framework I: Compression techniques," in ACM WWW, (Manhattan, USA), pp. 595--601, ACM Press, 2004. Google ScholarDigital Library
- S. Beamer et al., "The gap benchmark suite," arXiv preprint arXiv:1508.03619, 2015.Google Scholar
- "Graph500 benchmark." http://graph500.org/.Google Scholar
Index Terms
- GAS: A Heterogeneous Memory Architecture for Graph Processing
Recommendations
Optimizing Program Locality Through CMEs and GAs
PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation TechniquesCaches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data ...
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited ...
Banshee: bandwidth-efficient DRAM caching via software/hardware cooperation
MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on MicroarchitecturePlacing the DRAM in the same package as a processor enables several times higher memory bandwidth than conventional off-package DRAM. Yet, the latency of in-package DRAM is not appreciably lower than that of off-package DRAM. A promising use of in-...
Comments