skip to main content
10.1145/2670979.2670997acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
tutorial
Open Access

Connected Components in MapReduce and Beyond

Published:03 November 2014Publication History

ABSTRACT

Computing connected components of a graph lies at the core of many data mining algorithms, and is a fundamental subroutine in graph clustering. This problem is well studied, yet many of the algorithms with good theoretical guarantees perform poorly in practice, especially when faced with graphs with hundreds of billions of edges. In this paper, we design improved algorithms based on traditional MapReduce architecture for large scale data analysis. We also explore the effect of augmenting MapReduce with a distributed hash table (DHT) service. We show that these algorithms have provable theoretical guarantees, and easily outperform previously studied algorithms, sometimes by more than an order of magnitude. In particular, our iterative MapReduce algorithms run 3 to 15 times faster than the best previously studied algorithms, and the MapReduce implementation using a DHT is 10 to 30 times faster than the best previously studied algorithms. These are the fastest algorithms that easily scale to graphs with hundreds of billions of edges.

References

  1. Stanford large network dataset collection. http://snap.stanford.edu/data/index.html.Google ScholarGoogle Scholar
  2. Temporal evolution of the uk web. http://law.di.unimi.it/webdata/uk-2007-05/.Google ScholarGoogle Scholar
  3. F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In In Fourth SIAM International Conference on Data Mining (SDM), 2004.Google ScholarGoogle ScholarCross RefCross Ref
  5. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI'06: Seventh Symposium on Operating System Design and Implementation, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Ching and C. Kunz. Giraph: Large-scale graph processing on hadoop. In Hadoop Summit, 2010.Google ScholarGoogle Scholar
  7. J. Cohen. Graph Twiddling in a MapReduce World. Computing in Science and Engineering, 11(4):29--41, July 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. Dahlhaus. Parallel algorithms for hierarchical clustering and applications to split decomposition and parity graph recognition. J. Algorithms, 36(2). Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. Doll, T. Hartmann, and D. Wagner. Fully-dynamic hierarchical graph clustering using cut trees. In WADS, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. H. Gazit. An optimal randomized parallel algorithm for finding connected components in a graph. SIAM J. Comput., 20(6):1046--1067, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the mapreduce framework. In T. Asano, S.-I. Nakano, Y. Okamoto, and O. Watanabe, editors, ISAAC, volume 7074 of Lecture Notes in Computer Science, pages 374--383. Springer, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. H. il Koo and D. H. Kim. Scene text detection via connected component clustering and nontext filtering. IEEE Transactions on Image Processing, 22(6), 2013.Google ScholarGoogle Scholar
  13. U. Kang and C. Faloutsos. Big graph mining: algorithms and discoveries. SIGKDD Explorations, 14(2), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. U. Kang, M. McGlohon, L. Akoglu, and C. Faloutsos. Patterns on the connected components of terabyte-scale graphs. In ICDM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System- Implementation and Observations. 2009.Google ScholarGoogle Scholar
  16. D. R. Karger, N. Nisan, and M. Parnas. Fast connected components algorithms for the erew pram. SIAM J. Comput., 28(3):1021--1034, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web (WWW) Conference, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Kyrola, G. E. Blelloch, and C. Guestrin. Graphchi: Large-scale graph computation on just a pc. In OSDI, pages 31--46, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716--727, Apr. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. J. Plimpton and K. D. Devine. MapReduce in MPI for Large-scale Graph Algorithms. Special issue of Parallel Computing, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Rajasekaran. Efficient parallel hierarchical clustering algorithms. IEEE Trans. Parallel Distrib. Syst., 16(6). Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. http://www.cs.duke.edu/ ashwin/pubs/cc-icde13-full.pdf, 2012.Google ScholarGoogle Scholar
  26. V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In ICDE, 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Reif. Optimal parallel algorithms for interger sorting and graph connectivity. In Technical report, 1985.Google ScholarGoogle Scholar
  28. T. Seidl, B. Boden, and S. Fries. Cc-mr - finding connected components in huge graphs with mapreduce. In ECML/PKDD, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  29. Y. Shiloach and U. Vishkin. An O(logn) parallel connectivity algorithm. Journal of Algorithms, 3:57--67, 1982.Google ScholarGoogle ScholarCross RefCross Ref
  30. S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Connected Components in MapReduce and Beyond

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SOCC '14: Proceedings of the ACM Symposium on Cloud Computing
      November 2014
      383 pages
      ISBN:9781450332521
      DOI:10.1145/2670979

      Copyright © 2014 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 November 2014

      Check for updates

      Qualifiers

      • tutorial
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate169of722submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader