ABSTRACT
Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on traditional MapReduce architecture for large scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees, and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.
- Stanford large network dataset collection. http://snap.stanford.edu/data/index.html.Google Scholar
- Temporal evolution of the uk web. http://law.di.unimi.it/webdata/uk-2007-05/.Google Scholar
- F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, 2011. Google ScholarDigital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In In Fourth SIAM International Conference on Data Mining (SDM), 2004.Google ScholarCross Ref
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI'06: Seventh Symposium on Operating System Design and Implementation, 2006. Google ScholarDigital Library
- A. Ching and C. Kunz. Giraph: Large-scale graph processing on hadoop. In Hadoop Summit, 2010.Google Scholar
- J. Cohen. Graph Twiddling in a MapReduce World. Computing in Science and Engineering, 11(4):29--41, July 2009. Google ScholarDigital Library
- E. Dahlhaus. Parallel algorithms for hierarchical clustering and applications to split decomposition and parity graph recognition. J. Algorithms, 36(2). Google ScholarDigital Library
- C. Doll, T. Hartmann, and D. Wagner. Fully-dynamic hierarchical graph clustering using cut trees. In WADS, 2011. Google ScholarDigital Library
- H. Gazit. An optimal randomized parallel algorithm for finding connected components in a graph. SIAM J. Comput., 20(6):1046--1067, 1991. Google ScholarDigital Library
- M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the mapreduce framework. In T. Asano, S.-I. Nakano, Y. Okamoto, and O. Watanabe, editors, ISAAC, volume 7074 of Lecture Notes in Computer Science, pages 374--383. Springer, 2011. Google ScholarDigital Library
- H. il Koo and D. H. Kim. Scene text detection via connected component clustering and nontext filtering. IEEE Transactions on Image Processing, 22(6), 2013.Google Scholar
- U. Kang and C. Faloutsos. Big graph mining: algorithms and discoveries. SIGKDD Explorations, 14(2), 2012. Google ScholarDigital Library
- U. Kang, M. McGlohon, L. Akoglu, and C. Faloutsos. Patterns on the connected components of terabyte-scale graphs. In ICDM, 2010. Google ScholarDigital Library
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System- Implementation and Observations. 2009.Google Scholar
- D. R. Karger, N. Nisan, and M. Parnas. Fast connected components algorithms for the erew pram. SIAM J. Comput., 28(3):1021--1034, 1999. Google ScholarDigital Library
- H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, 2010. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web (WWW) Conference, 2010. Google ScholarDigital Library
- A. Kyrola, G. E. Blelloch, and C. Guestrin. Graphchi: Large-scale graph computation on just a pc. In OSDI, pages 31--46, 2012. Google ScholarDigital Library
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716--727, Apr. 2012. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
- C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21, 1995. Google ScholarDigital Library
- S. J. Plimpton and K. D. Devine. MapReduce in MPI for Large-scale Graph Algorithms. Special issue of Parallel Computing, 2011. Google ScholarDigital Library
- S. Rajasekaran. Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst., 16(6). Google ScholarDigital Library
- V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. http://www.cs.duke.edu/ ashwin/pubs/cc-icde13-full.pdf, 2012.Google Scholar
- V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In ICDE, 2013.Google ScholarDigital Library
- J. Reif. Optimal parallel algorithms for interger sorting and graph connectivity. In Technical report, 1985.Google Scholar
- T. Seidl, B. Boden, and S. Fries. Cc-mr - finding connected components in huge graphs with mapreduce. In ECML/PKDD, 2012.Google ScholarCross Ref
- Y. Shiloach and U. Vishkin. An O(logn) parallel connectivity algorithm. Journal of Algorithms, 3:57--67, 1982.Google ScholarCross Ref
- S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarDigital Library
Index Terms
- Connected Components in MapReduce and Beyond
Recommendations
Connected components of big graphs in fixed MapReduce rounds
In this paper a fast distributed approach based on MapReduce is introduced to find connected components of big graphs. Unlike previous approaches in which the number of rounds is dependent on graph topology (especially graph's diameter), our method is ...
Estimating the number of connected components in sublinear time
We present an algorithm that estimates the number of connected components of a graph over n vertices within an additive error of @en in (sublinear) time O(1@e^2log(1@e)). A connected component C is the maximal subset of nodes of an undirected graph G ...
Small Components of the 5-Subgraph of a Contraction-critically 5-Connected Graph
The subgraph of a 5-connected graph G induced by the set of degree 5 vertices is said to be the 5-subgraph of G. An edge of a 5-connected graph is said to be 5-contractible if the contraction of the edge results in a 5-connected graph. A 5-connected ...
Comments