ABSTRACT
Modern applications such as graph and data analytics, when operating on real-world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is growing much more slowly than CPU core counts, while DRAM latency has been virtually stagnant. Parallel applications bound by memory bandwidth fail to scale, while applications bound by memory latency use only a small fraction of the much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.
In this paper, we introduce milk, a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, the compiler transforms random indirect memory references into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP.
We evaluate the milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.
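The milk annotations themselves are not shown in this excerpt. As a rough illustration of the underlying idea, the following standalone C++ sketch (all names hypothetical, not from the paper) performs an indirect scatter-add two ways: directly, where each iteration makes a random access into a large array, and batched, where updates are first deferred into per-partition buffers and then drained so that each batch touches only a cache-sized slice of the destination array.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Direct version: counts[idx[i]] is a random access that misses cache
// whenever counts[] is much larger than the last-level cache.
void scatter_add_direct(const std::vector<uint32_t>& idx,
                        std::vector<uint64_t>& counts) {
    for (uint32_t d : idx) counts[d] += 1;
}

// Batched version, sketching by hand the kind of transformation a
// clustering compiler can apply: pass 1 writes deferred indices
// sequentially into per-partition buffers; pass 2 drains each buffer,
// so all updates in a batch land in one small slice of counts[].
void scatter_add_batched(const std::vector<uint32_t>& idx,
                         std::vector<uint64_t>& counts,
                         std::size_t partitions) {
    std::size_t range = (counts.size() + partitions - 1) / partitions;
    std::vector<std::vector<uint32_t>> buckets(partitions);
    for (uint32_t d : idx)                    // pass 1: sequential appends
        buckets[d / range].push_back(d);
    for (const auto& b : buckets)             // pass 2: locality-friendly
        for (uint32_t d : b) counts[d] += 1;  // drain, one slice at a time
}
```

Both versions compute identical results; the batched one trades extra sequential traffic (writing and re-reading the buffers) for the elimination of random DRAM accesses, which is the trade-off the paper's intermediate data structures are designed to win.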
Optimizing Indirect Memory References with milk