skip to main content
10.1145/2967938.2967948acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article
Open Access

Optimizing Indirect Memory References with milk

Published:11 September 2016Publication History

ABSTRACT

Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.

In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP.

We evaluate the MILK compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.

References

  1. Intel 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html.Google ScholarGoogle Scholar
  2. Intel Core i7--4790K Processor (8M Cache, up to 4.40 GHz). http://ark.intel.com/products/80807/Intel-Core-i7-4790K-Processor-8M-Cache-up-to-4_40-GHz.Google ScholarGoogle Scholar
  3. OpenMP Application Program Interface 4.0. http://openmp.org/wp/openmp-specifications/, 2013.Google ScholarGoogle Scholar
  4. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Available at tensorflow.org.Google ScholarGoogle Scholar
  5. M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proc. VLDB Endow., 5(10):1064--1075, June 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In Proceedings of the 4th Annual ACM Web Science Conference, WebSci '12, pages 33--42, New York, NY, USA, 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. A. Bader, G. Cong, and J. Feo. On the architectural requirements for efficient execution of graph algorithms. In Parallel Processing, 2005. ICPP 2005. International Conference on, pages 547--556, June 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on modern processor architectures. IEEE Trans. Knowl. Data Eng., 27(7):1754--1766, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Beamer, K. Asanović, and D. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 12:1--12:10, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Beamer, K. Asanović, and D. A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.Google ScholarGoogle Scholar
  11. S. Beamer, K. Asanović, and D. A. Patterson. Locality exists in graph processing: Workload characterization on an Ivy Bridge server. In 2015 IEEE International Symposium on Workload Characterization, IISWC 2015, Atlanta, GA, USA, October 4-6, 2015, pages 56--65, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun. Internally deterministic parallel algorithms can be fast. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25--29, 2012, pages 181--192, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, pages 442--446, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  14. A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. Proceedings of the VLDB Endowment, 8(12):1804--1815, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. Coffman, S. Greenblatt, and S. Marcus. Graph-based technologies for intelligence analysis. Commun. ACM, 47(3):45--47, Mar. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Ding and K. Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In ACM SIGPLAN Notices, volume 34, pages 229--241. ACM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. W. Ding, M. Kandemir, D. Guttman, A. Jog, C. R. Das, and P. Yedlapalli. Trading cache hit rate for memory performance. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 357--368, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Garg and Y. Sabharwal. Optimizing the HPCC RandomAccess benchmark on Blue Gene/L supercomputer. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '06/Performance '06, pages 369--370, New York, NY, USA, 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Graph500. Graph 500 benchmark. http://www.graph500.org/specifications.Google ScholarGoogle Scholar
  22. H. Han and C.-W. Tseng. Improving compiler and run-time support for irregular reductions using local writes. In Languages and Compilers for Parallel Computing, pages 181--196. Springer, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems, 17(7):606--618, July 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. JEDEC. DDR3 SDRAM Standard. http://www.jedec.org/standards-documents/docs/jesd-79--3d.Google ScholarGoogle Scholar
  25. W. Jung, J. Park, and J. Lee. Versatile and scalable parallel histogram construction. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 127--138, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. Proc. VLDB Endow., 2(2):1378--1389, Aug. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and transformation. CGO '04, San Jose, CA, USA, Mar 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. F. Liljeros, C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Åberg. The web of human sexual contacts. Nature, 411(6840):907--908, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  29. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716--727, Apr. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Manegold, P. Boncz, and M. Kersten. Optimizing main-memory join on modern hardware. IEEE Trans. on Knowl. and Data Eng., 14(4):709--730, July 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. U. Meyer and P. Sanders. Δ-stepping: A parallelizable shortest path algorithm. J. Algorithms, 49(1):114--152, Oct. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. N. Mitchell, L. Carter, and J. Ferrante. Localizing non-affine array references. In Parallel Architectures and Compilation Techniques, 1999. Proceedings. 1999 International Conference on, pages 192--202. IEEE, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. MLC. Intel Memory Latency Checker v3.0. https://software.intel.com/en-us/articles/intelr-memory-latency-checker.Google ScholarGoogle Scholar
  34. J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-tolerant software distributed shared memory. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '15, pages 291--305, Berkeley, CA, USA, 2015. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. E. J. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter and Complex Systems, 38(2):321--330, 2004.Google ScholarGoogle Scholar
  36. D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.Google ScholarGoogle Scholar
  38. I. Papazian, S. Kottapalli, J. Baxter, J. Chamberlain, G. Vedaraman, and B. Morris. Ivy Bridge server: A converged design. Micro, IEEE, 35(2):16--25, Mar 2015.Google ScholarGoogle ScholarCross RefCross Ref
  39. K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Saltz, K. Crowley, R. Michandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303--312, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Samsung. DDR3L SDRAM 240pin Registered DIMM M393B2G70BH0 datasheet:. http://www.samsung.com/global/business/semiconductor/file/product/ds_ddr3_4gb_b-die_based_1_35v_rdimm_rev16-5.pdf, 2012.Google ScholarGoogle Scholar
  42. Samsung. DDR4 SDRAM 288pin Registered DIMM M393A2G40DB1 datasheet. http://www.samsung.com/semiconductor/global/file/product/DS_8GB_DDR4_4Gb_D_die_RegisteredDIMM_Rev15.pdf, 2015.Google ScholarGoogle Scholar
  43. N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 979--990, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic operations on modern architectures. San Francisco, CA, 2015. Parallel Architectures and Compilation Techniques (PACT'15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 510--521, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Journal of Algorithms, 3(1), 1982.Google ScholarGoogle ScholarCross RefCross Ref
  47. J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 135--146, New York, NY, USA, 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. J. Su and K. Yelick. Automatic support for irregular computations in a high-level language. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01, IPDPS '05, Washington, DC, USA, 2005. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. J. Teubner, G. Alonso, C. Balkesen, and M. T. Ozsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Proceedings of the 2013 IEEE International Conference on Data Engineering, ICDE '13, pages 362--373, Washington, DC, USA, 2013. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. J. Wassenberg and P. Sanders. Engineering a multi-core radix sort. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par'11, pages 160--169, Berlin, Heidelberg, 2011. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimizing Indirect Memory References with milk

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
            September 2016
            474 pages
            ISBN:9781450341219
            DOI:10.1145/2967938

            Copyright © 2016 Owner/Author

            Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 11 September 2016

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            PACT '16 Paper Acceptance Rate31of119submissions,26%Overall Acceptance Rate121of471submissions,26%

            Upcoming Conference

            PACT '24
            International Conference on Parallel Architectures and Compilation Techniques
            October 14 - 16, 2024
            Southern California , CA , USA

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader