ABSTRACT
Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms do not represent a good match for distributed graph mining problems, as for example finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics.
In this paper, we present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.
Supplemental Material
- Aggarwal, C. C., and Wang, H. Managing and mining graph data. Springer, 2010. Google ScholarDigital Library
- Anchuri, P., Zaki, M. J., Barkol, O., Golan, S., and Shamy, M. Approximate graph mining with label costs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2013). Google ScholarDigital Library
- Avery, C. Giraph: Large-scale graph processing infrastructure on Hadoop. Proceedings of the Hadoop Summit (2011).Google Scholar
- Bahmani, B., Kumar, R., and Vassilvitskii, S. Densest subgraph in streaming and MapReduce. Proceedings of the VLDB Endowment 5, 5 (2012). Google ScholarDigital Library
- Bhuiyan, M. A., and Hasan, M. An iterative MapReduce based frequent subgraph mining algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 3 (March 2015).Google ScholarCross Ref
- Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 30, 1 (1998). Google ScholarDigital Library
- Bringmann, B., and Nijssen, S. What is frequent in a single graph? In Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2008), Springer-Verlag. Google ScholarDigital Library
- Bron, C., and Kerbosch, J. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16, 9 (1973). Google ScholarDigital Library
- Cheng, J., Zhu, L., Ke, Y., and Chu, S. Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (2012). Google ScholarDigital Library
- Cheng, X., Dale, C., and Liu, J. Dataset for "statistics and social network of youtube videos". http://netsg.cs.sfu.ca/youtubedata/.Google Scholar
- Cordella, L., Foggia, P., Sansone, C., and Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 10 (2004). Google ScholarDigital Library
- Dahlhaus, E., and Karpinski, M. A fast parallel algorithm for computing all maximal cliques in a graph and the related problems. In Proceedings of SWAT, Lecture Notes in Computer Science. Springer, 1988. Google ScholarDigital Library
- Di Fatta, G., and Berthold, M. R. Dynamic load balancing for the distributed mining of molecular structures. IEEE Transactions on Parallel and Distributed Systems 17, 8 (2006). Google ScholarDigital Library
- Elseidy, M., Abdelhamid, E., Skiadopoulos, S., and Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment (2014). Google ScholarDigital Library
- Eppstein, D., and Strash, D. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms. Springer, 2011. Google ScholarDigital Library
- Garey, M. R., and Johnson, D. S. Computers and intractability, vol. 29. WH Freeman, 2002.Google Scholar
- Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (2012). Google ScholarDigital Library
- H., H. B., Jaffe, A. B., and Trajtenberg, M. The NBER patent citation data file: Lessons, insights and methodological tools. http://www.nber.org/patents/, 2001.Google Scholar
- Hill, S., Srichandan, B., and Sunderraman, R. An iterative MapReduce approach to frequent subgraph mining in biological datasets. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (2012). Google ScholarDigital Library
- Junttila, T., and Kaski, P. Engineering an efficient canonical labeling tool for large and sparse graphs. In Proceedings of the SIAM Workshop on Algorithm Engineering and Experiments (2007). Google ScholarDigital Library
- Kessl, R., Talukder, N., Anchuri, P., and Zaki, M. J. Parallel graph mining with GPUs. In Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (2014).Google Scholar
- Kuramochi, M., and Karypis, G. Finding frequent patterns in a large sparse graph. Data mining and knowledge discovery 11, 3 (2005). Google ScholarDigital Library
- Lin, W., Xiao, X., and Ghinita, G. Large-scale frequent subgraph mining in MapReduce. In Proceedings of the IEEE International Conference on Data Engineering (2014).Google ScholarCross Ref
- Lu, W., Chen, G., Tung, A. K., and Zhao, F. Efficiently extracting frequent subgraphs using MapReduce. In Proceedings of the IEEE International Conference on Big Data (2013).Google ScholarCross Ref
- Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2010). Google ScholarDigital Library
- McCune, R. R., Weninger, T., and Madey, G. Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. arXiv:1507.04405 (2015).Google Scholar
- McSherry, F., Isard, M., and Murray, D. G. Scalability! but at what cost. In Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (2015). Google ScholarDigital Library
- Mejova, Y., Haddadi, H., Noulas, A., and Weber, I. #Foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM conference on Digital Health 2015 (2015). Google ScholarDigital Library
- Oliveira Aparicio, D., Pinto Ribeiro, P. M., and Da Silva, F. M. A. Parallel subgraph counting for multicore architectures. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications (2014). Google ScholarDigital Library
- Pržulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23, 2 (2007). Google ScholarDigital Library
- Ribeiro, P., and Silva, F. G-Tries: A data structure for storing and finding subgraphs. Data Mining and Knowledge Discovery 28, 2 (2014). Google ScholarDigital Library
- Schmidt, M. C., Samatova, N. F., Thomas, K., and Park, B.-H. A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69, 4 (2009). Google ScholarDigital Library
- Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., and Xu, N. Parallel subgraph listing in a large-scale graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data (2014). Google ScholarDigital Library
- Slota, G. M., and Madduri, K. Complex network analysis using parallel approximate motif counting. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2014). Google ScholarDigital Library
- Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., and Aboulnaga, A. Arabesque: A system for distributed graph mining - Extended version. Technical Report, Qatar Computing Research Institute, 2015.Google Scholar
- Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., and McPherson, J. From "think like a vertex" to "think like a graph". Proceedings of the VLDB Endowment 7, 3 (2013). Google ScholarDigital Library
- Uno, T. MACE: Maximal clique enumerator, ver 2.0. http://research.nii.ac.jp/~uno/code/mace.html.Google Scholar
- Valiant, L. G. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990). Google ScholarDigital Library
- Wang, C., and Parthasarathy, S. Parallel algorithms for mining frequent structural motifs in scientific data. In Proceedings of the Annual International Conference on Supercomputing (2004). Google ScholarDigital Library
- Wu, B., Yang, S., Zhao, H., and Wang, B. A distributed algorithm to enumerate all maximal cliques in MapReduce. In Proceedings of the International Conference on Frontier of Computer Science and Technology (2009). Google ScholarDigital Library
- Xiang, J., Guo, C., and Aboulnaga, A. Scalable maximum clique computation using MapReduce. In Proceedings of the IEEE International Coference on Data Engineering (2013). Google ScholarDigital Library
- Yan, D., Cheng, J., Lu, Y., and Ng, W. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014). Google ScholarDigital Library
- Yan, X., and Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining (2002). Google ScholarDigital Library
- Zhao, Z., Wang, G., Butt, A. R., Khan, M., Kumar, V. A., and Marathe, M. V. Sahad: Subgraph analysis in massive networks using Hadoop. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (2012). Google ScholarDigital Library
Index Terms
- Arabesque: a system for distributed graph mining
Recommendations
AutoMine: harmonizing high-level abstraction and high performance for graph mining
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems PrinciplesGraph mining algorithms that aim at identifying structural patterns of graphs are typically more complex than graph computation algorithms such as breadth first search. Researchers have implemented several systems with high-level and flexible interfaces ...
Peregrine: a pattern-aware graph mining system
EuroSys '20: Proceedings of the Fifteenth European Conference on Computer SystemsGraph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined ...
Fractal: A General-Purpose Graph Pattern Mining System
SIGMOD '19: Proceedings of the 2019 International Conference on Management of DataIn this paper we propose Fractal, a high performance and high productivity system for supporting distributed graph pattern mining (GPM) applications. Fractal employs a dynamic (auto-tuned) load-balancing based on a hierarchical and locality-aware work ...
Comments