skip to main content
10.1145/2815400.2815410acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article

Arabesque: a system for distributed graph mining

Published:04 October 2015Publication History

ABSTRACT

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.

In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Skip Supplemental Material Section

Supplemental Material

p425.mp4

mp4

1.9 GB

References

  1. Aggarwal, C. C., and Wang, H. Managing and mining graph data. Springer, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Anchuri, P., Zaki, M. J., Barkol, O., Golan, S., and Shamy, M. Approximate graph mining with label costs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Avery, C. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit (2011).Google ScholarGoogle Scholar
  4. Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment 5, 5 (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bhuiyan, M. A., and Hasan, M. An iterative MapReduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 3 (March 2015).Google ScholarGoogle ScholarCross RefCross Ref
  6. Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1 (1998). Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bringmann, B., and Nijssen, S. What is frequent in a single graph? In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2008), Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bron, C., and Kerbosch, J. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973). Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cheng, J., Zhu, L., Ke, Y., and Chu, S. Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Cheng, X., Dale, C., and Liu, J. Dataset for "statistics and social network of youtube videos". http://netsg.cs.sfu.ca/youtubedata/.Google ScholarGoogle Scholar
  11. Cordella, L., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (2004). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Dahlhaus, E., and Karpinski, M. A fast parallel algorithm for computing all maximal cliques in a graph and the related problems. In Proceedings of SWAT, Lecture Notes in Computer Science. Springer, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Di Fatta, G., and Berthold, M. R. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems 17, 8 (2006). Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Eppstein, D., and Strash, D. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms. Springer, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Garey, M. R., and Johnson, D. S. Computers and intractability, vol. 29. WH Freeman, 2002.Google ScholarGoogle Scholar
  17. Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. H., H. B., Jaffe, A. B., and Trajtenberg, M. The NBER patent citation data file: Lessons, insights and methodological tools. http://www.nber.org/patents/, 2001.Google ScholarGoogle Scholar
  19. Hill, S., Srichandan, B., and Sunderraman, R. An iterative MapReduce approach to frequent subgraph mining in biological datasets. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Junttila, T., and Kaski, P. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (2007). Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Kessl, R., Talukder, N., Anchuri, P., and Zaki, M. J. Parallel graph mining with GPUs. In Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (2014).Google ScholarGoogle Scholar
  22. Kuramochi, M., and Karypis, G. Finding frequent patterns in a large sparse graph. Data mining and knowledge discovery 11, 3 (2005). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Lin, W., Xiao, X., and Ghinita, G. Large-scale frequent subgraph mining in MapReduce. In Proceedings of the IEEE International Conference on Data Engineering (2014).Google ScholarGoogle ScholarCross RefCross Ref
  24. Lu, W., Chen, G., Tung, A. K., and Zhao, F. Efficiently extracting frequent subgraphs using MapReduce. In Proceedings of the IEEE International Conference on Big Data (2013).Google ScholarGoogle ScholarCross RefCross Ref
  25. Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2010). Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. McCune, R. R., Weninger, T., and Madey, G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. arXiv:1507.04405 (2015).Google ScholarGoogle Scholar
  27. McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Mejova, Y., Haddadi, H., Noulas, A., and Weber, I. #Foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM conference on Digital Health 2015 (2015). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Oliveira Aparicio, D., Pinto Ribeiro, P. M., and Da Silva, F. M. A. Parallel subgraph counting for multicore architectures. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007). Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Ribeiro, P., and Silva, F. G-Tries: A data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H. A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69, 4 (2009). Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., and Xu, N. Parallel subgraph listing in a large-scale graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Slota, G. M., and Madduri, K. Complex network analysis using parallel approximate motif counting. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. Arabesque: A system for distributed graph mining - Extended version. Technical Report, Qatar Computing Research Institute, 2015.Google ScholarGoogle Scholar
  36. Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment 7, 3 (2013). Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Uno, T. MACE: Maximal clique enumerator, ver 2.0. http://research.nii.ac.jp/~uno/code/mace.html.Google ScholarGoogle Scholar
  38. Valiant, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Wang, C., and Parthasarathy, S. Parallel algorithms for mining frequent structural motifs in scientific data. In Proceedings of the Annual International Conference on Supercomputing (2004). Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Wu, B., Yang, S., Zhao, H., and Wang, B. A distributed algorithm to enumerate all maximal cliques in MapReduce. In Proceedings of the International Conference on Frontier of Computer Science and Technology (2009). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Xiang, J., Guo, C., and Aboulnaga, A. Scalable maximum clique computation using MapReduce. In Proceedings of the IEEE International Coference on Data Engineering (2013). Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Yan, D., Cheng, J., Lu, Y., and Ng, W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014). Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Yan, X., and Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (2002). Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., and Marathe, M. V. Sahad: Subgraph analysis in massive networks using Hadoop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2012). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Arabesque: a system for distributed graph mining

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
                October 2015
                499 pages
                ISBN:9781450338349
                DOI:10.1145/2815400

                Copyright © 2015 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 4 October 2015

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • research-article

                Acceptance Rates

                SOSP '15 Paper Acceptance Rate30of181submissions,17%Overall Acceptance Rate131of716submissions,18%

                Upcoming Conference

                SOSP '24

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader