ABSTRACT
Modern applications such as graph and data analytics, when operating on real-world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is growing much more slowly than CPU core counts, while DRAM latency has been virtually stagnant. Parallel applications bound by memory bandwidth fail to scale, while applications bound by memory latency use only a small fraction of the much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.
In this paper, we introduce milk, a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, the compiler transforms random indirect memory references into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP.
We evaluate the milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.
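The milk annotations themselves are not shown in this excerpt. As a rough illustration of the underlying idea, the following standalone C++ sketch (all names hypothetical, not from the paper) performs an indirect scatter-add two ways: directly, where each iteration makes a random access into a large array, and batched, where updates are first deferred into per-partition buffers and then drained so that each batch touches only a cache-sized slice of the destination array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Direct version: counts[idx[i]] is a random access that misses cache
// whenever counts[] is much larger than the last-level cache.
void scatter_add_direct(const std::vector<uint32_t>& idx,
                        std::vector<uint64_t>& counts) {
    for (uint32_t d : idx) counts[d] += 1;
}

// Batched version, sketching by hand the kind of transformation a
// clustering compiler can apply: pass 1 writes deferred indices
// sequentially into per-partition buffers; pass 2 drains each buffer,
// so all updates in a batch land in one small slice of counts[].
void scatter_add_batched(const std::vector<uint32_t>& idx,
                         std::vector<uint64_t>& counts,
                         std::size_t partitions) {
    std::size_t range = (counts.size() + partitions - 1) / partitions;
    std::vector<std::vector<uint32_t>> buckets(partitions);
    for (uint32_t d : idx)                    // pass 1: sequential appends
        buckets[d / range].push_back(d);
    for (const auto& b : buckets)             // pass 2: locality-friendly
        for (uint32_t d : b) counts[d] += 1;  // drain, one slice at a time
}
```

Both versions compute identical results; the batched one trades extra sequential traffic (writing and re-reading the buffers) for the elimination of random DRAM accesses, which is the trade-off the paper's intermediate data structures are designed to win.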
Optimizing Indirect Memory References with milk