skip to main content
10.1145/1989323.1989399acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Local graph sparsification for scalable clustering

Published:12 June 2011Publication History

ABSTRACT

In this paper we look at how to sparsify a graph i.e. how to reduce the edgeset while keeping the nodes intact, so as to enable faster graph clustering without sacrificing quality. The main idea behind our approach is to preferentially retain the edges that are likely to be part of the same cluster. We propose to rank edges using a simple similarity-based heuristic that we efficiently compute by comparing the minhash signatures of the nodes incident to the edge. For each node, we select the top few edges to be retained in the sparsified graph. Extensive empirical results on several real networks and using four state-of-the-art graph clustering and community discovery algorithms reveal that our proposed approach realizes excellent speedups (often in the range 10-50), with little or no deterioration in the quality of the resulting clusters. In fact, for at least two of the four clustering algorithms, our sparsification consistently enables higher clustering accuracies.

References

  1. D. Achlioptas and F. McSherry. Fast computation of low rank matrix approximations. In STOC '01, pages 611--618, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Arora, E. Hazan, and S. Kale. A fast random sampling algorithm for sparsifying matrices. APPROX-RANDOM '06, pages 272--279, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD '08, pages 16--24, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in O(n2) time. In STOC '96, pages 47--55, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. Bohman, C. Coopery, and A. Friezez. Min-Wise independent linear permutations. the electronic journal of combinatorics, 7(R26):2, 2000.Google ScholarGoogle Scholar
  6. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC '98, pages 327--336, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM '08, pages 95--106, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in empirical data. SIAM review, 51(4):661--703, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Costanzo, A. Baryshnikova, J. Bellay, Y. Kim, E. Spear, C. Sevier, H. Ding, J. Koh, K. Toufighi, S. Mostafavi, et al. The genetic landscape of a cell. Science's STKE, 327(5964):425, 2010.Google ScholarGoogle Scholar
  10. B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. PVLDB, 1(1):1189--1204, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. I. S. Dhillon, Y. Guan, and B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Trans. PAMI, 29(11):1944--1957, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Fortunato. Community detection in graphs. Phys. Reports, 486:75--174, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  13. D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB '05, pages 721--732, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Glattfelder and S. Battiston. Backbone of complex networks of corporations: The flow of control. Phys. Rev. E, 80(3):36104, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  15. R. C. K. and S. Sudarshan. Graph clustering for keyword search. In COMAD '09, 2009.Google ScholarGoogle Scholar
  16. D. R. Karger. Random sampling in cut, flow, and network design problems. In STOC '94, pages 648--657, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, J. Cui, and A. Percus. Reducing large internet topologies for faster simulations. NETWORKING '05, pages 328--341, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10, pages 591--600, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78(4):46110, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  21. K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. LNCS '04, pages 325--337, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  22. J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD '06, pages 631--636, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Statistical properties of community structure in large social and information networks. In WWW '08, pages 695--704, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research, 8:935--983, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. S. Maiya and T. Y. Berger-Wolf. Sampling community structure. In WWW '10, pages 701--710, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In IMC '07, pages 29--42, October 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, Feb 2004.Google ScholarGoogle ScholarCross RefCross Ref
  28. F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. DMKD, 3(2):131--169, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Pu, J. Wong, B. Turner, E. Cho, and S. Wodak. Up-to-date catalogues of yeast protein complexes. Nucleic acids research, 37(3):825--831, 2009.Google ScholarGoogle Scholar
  30. A. Ruepp et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Research, 36(suppl 1):D646, 2008.Google ScholarGoogle Scholar
  31. V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In KDD '09, pages 737--746, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. V. Satuluri and S. Parthasarathy. Symmetrizations for clustering directed graphs. In EDBT '11, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. Technical Report OSU-CISRC-11/10-TR25, The Ohio State University.Google ScholarGoogle Scholar
  34. J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Trans. PAMI, 22(8), 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In STOC '08, pages 563--568, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. D. A. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04, pages 81--90, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. M. Tumminello, T. Aste, T. Di Matteo, and R. Mantegna. A tool for filtering information in complex systems. PNAS, 102(30):10421, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  38. S. Van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.Google ScholarGoogle Scholar

Index Terms

  1. Local graph sparsification for scalable clustering

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
        June 2011
        1364 pages
        ISBN:9781450306614
        DOI:10.1145/1989323

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 June 2011

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader