research-article

Arabesque: a system for distributed graph mining

Authors:
Carlos H. C. Teixeira

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

,
Alexandre J. Fonseca

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

,
Marco Serafini

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

,
Georgos Siganos

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

,
Mohammed J. Zaki

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

,
Ashraf Aboulnaga

Qatar Computing Research Institute - HBKU, Qatar

Qatar Computing Research Institute - HBKU, Qatar
View Profile

SOSP '15: Proceedings of the 25th Symposium on Operating Systems PrinciplesOctober 2015Pages 425–440https://doi.org/10.1145/2815400.2815410

Published:04 October 2015Publication History

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Pages 425–440

ABSTRACT

Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.

In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.

Supplemental Material

p425.mp4

mp4

1.9 GB

Download

References

Aggarwal, C. C., and Wang, H. Managing and mining graph data. Springer, 2010. Google ScholarDigital Library
Anchuri, P., Zaki, M. J., Barkol, O., Golan, S., and Shamy, M. Approximate graph mining with label costs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013). Google ScholarDigital Library
Avery, C. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit (2011).Google Scholar
Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment 5, 5 (2012). Google ScholarDigital Library
Bhuiyan, M. A., and Hasan, M. An iterative MapReduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 3 (March 2015).Google ScholarCross Ref
Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1 (1998). Google ScholarDigital Library
Bringmann, B., and Nijssen, S. What is frequent in a single graph? In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2008), Springer-Verlag. Google ScholarDigital Library
Bron, C., and Kerbosch, J. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973). Google ScholarDigital Library
Cheng, J., Zhu, L., Ke, Y., and Chu, S. Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (2012). Google ScholarDigital Library
Cheng, X., Dale, C., and Liu, J. Dataset for "statistics and social network of youtube videos". http://netsg.cs.sfu.ca/youtubedata/.Google Scholar
Cordella, L., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (2004). Google ScholarDigital Library
Dahlhaus, E., and Karpinski, M. A fast parallel algorithm for computing all maximal cliques in a graph and the related problems. In Proceedings of SWAT, Lecture Notes in Computer Science. Springer, 1988. Google ScholarDigital Library
Di Fatta, G., and Berthold, M. R. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems 17, 8 (2006). Google ScholarDigital Library
Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment (2014). Google ScholarDigital Library
Eppstein, D., and Strash, D. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms. Springer, 2011. Google ScholarDigital Library
Garey, M. R., and Johnson, D. S. Computers and intractability, vol. 29. WH Freeman, 2002.Google Scholar
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012). Google ScholarDigital Library
H., H. B., Jaffe, A. B., and Trajtenberg, M. The NBER patent citation data file: Lessons, insights and methodological tools. http://www.nber.org/patents/, 2001.Google Scholar
Hill, S., Srichandan, B., and Sunderraman, R. An iterative MapReduce approach to frequent subgraph mining in biological datasets. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (2012). Google ScholarDigital Library
Junttila, T., and Kaski, P. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (2007). Google ScholarDigital Library
Kessl, R., Talukder, N., Anchuri, P., and Zaki, M. J. Parallel graph mining with GPUs. In Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (2014).Google Scholar
Kuramochi, M., and Karypis, G. Finding frequent patterns in a large sparse graph. Data mining and knowledge discovery 11, 3 (2005). Google ScholarDigital Library
Lin, W., Xiao, X., and Ghinita, G. Large-scale frequent subgraph mining in MapReduce. In Proceedings of the IEEE International Conference on Data Engineering (2014).Google ScholarCross Ref
Lu, W., Chen, G., Tung, A. K., and Zhao, F. Efficiently extracting frequent subgraphs using MapReduce. In Proceedings of the IEEE International Conference on Big Data (2013).Google ScholarCross Ref
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2010). Google ScholarDigital Library
McCune, R. R., Weninger, T., and Madey, G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. arXiv:1507.04405 (2015).Google Scholar
McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (2015). Google ScholarDigital Library
Mejova, Y., Haddadi, H., Noulas, A., and Weber, I. #Foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM conference on Digital Health 2015 (2015). Google ScholarDigital Library
Oliveira Aparicio, D., Pinto Ribeiro, P. M., and Da Silva, F. M. A. Parallel subgraph counting for multicore architectures. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (2014). Google ScholarDigital Library
Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007). Google ScholarDigital Library
Ribeiro, P., and Silva, F. G-Tries: A data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014). Google ScholarDigital Library
Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H. A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69, 4 (2009). Google ScholarDigital Library
Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., and Xu, N. Parallel subgraph listing in a large-scale graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2014). Google ScholarDigital Library
Slota, G. M., and Madduri, K. Complex network analysis using parallel approximate motif counting. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014). Google ScholarDigital Library
Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. Arabesque: A system for distributed graph mining - Extended version. Technical Report, Qatar Computing Research Institute, 2015.Google Scholar
Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment 7, 3 (2013). Google ScholarDigital Library
Uno, T. MACE: Maximal clique enumerator, ver 2.0. http://research.nii.ac.jp/~uno/code/mace.html.Google Scholar
Valiant, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990). Google ScholarDigital Library
Wang, C., and Parthasarathy, S. Parallel algorithms for mining frequent structural motifs in scientific data. In Proceedings of the Annual International Conference on Supercomputing (2004). Google ScholarDigital Library
Wu, B., Yang, S., Zhao, H., and Wang, B. A distributed algorithm to enumerate all maximal cliques in MapReduce. In Proceedings of the International Conference on Frontier of Computer Science and Technology (2009). Google ScholarDigital Library
Xiang, J., Guo, C., and Aboulnaga, A. Scalable maximum clique computation using MapReduce. In Proceedings of the IEEE International Coference on Data Engineering (2013). Google ScholarDigital Library
Yan, D., Cheng, J., Lu, Y., and Ng, W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014). Google ScholarDigital Library
Yan, X., and Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (2002). Google ScholarDigital Library
Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., and Marathe, M. V. Sahad: Subgraph analysis in massive networks using Hadoop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2012). Google ScholarDigital Library

Index Terms

Arabesque: a system for distributed graph mining

Recommendations

AutoMine: harmonizing high-level abstraction and high performance for graph mining
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles

Graph mining algorithms that aim at identifying structural patterns of graphs are typically more complex than graph computation algorithms such as breadth first search. Researchers have implemented several systems with high-level and flexible interfaces ...
Read More
Peregrine: a pattern-aware graph mining system
EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems

Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined ...
Read More
Fractal: A General-Purpose Graph Pattern Mining System
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data

In this paper we propose Fractal, a high performance and high productivity system for supporting distributed graph pattern mining (GPM) applications. Fractal employs a dynamic (auto-tuned) load-balancing based on a hierarchical and locality-aware work ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN:9781450338349
DOI:10.1145/2815400
General Chair:
Ethan Miller
UC Santa Cruz
,
Program Chair:
Steven Hand
Google
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 October 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
SOSP '15 Paper Acceptance Rate30of181submissions,17%Overall Acceptance Rate131of716submissions,18%
More
Upcoming Conference
SOSP '24

Sponsor:

sigops

ACM SIGOPS 29th Symposium on Operating Systems Principles

November 5 - 8, 2024

Austin , TX , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 167
  Total Citations
  View Citations
- 2,243
  Total Downloads
- Downloads (Last 12 months)109
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Arabesque: a system for distributed graph mining

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

AutoMine: harmonizing high-level abstraction and high performance for graph mining

Peregrine: a pattern-aware graph mining system

Fractal: A General-Purpose Graph Pattern Mining System

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Arabesque: a system for distributed graph mining

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

AutoMine: harmonizing high-level abstraction and high performance for graph mining

Peregrine: a pattern-aware graph mining system

Fractal: A General-Purpose Graph Pattern Mining System

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media