ABSTRACT
In this paper we look at how to sparsify a graph i.e. how to reduce the edgeset while keeping the nodes intact, so as to enable faster graph clustering without sacrificing quality. The main idea behind our approach is to preferentially retain the edges that are likely to be part of the same cluster. We propose to rank edges using a simple similarity-based heuristic that we efficiently compute by comparing the minhash signatures of the nodes incident to the edge. For each node, we select the top few edges to be retained in the sparsified graph. Extensive empirical results on several real networks and using four state-of-the-art graph clustering and community discovery algorithms reveal that our proposed approach realizes excellent speedups (often in the range 10-50), with little or no deterioration in the quality of the resulting clusters. In fact, for at least two of the four clustering algorithms, our sparsification consistently enables higher clustering accuracies.
- D. Achlioptas and F. McSherry. Fast computation of low rank matrix approximations. In STOC '01, pages 611--618, 2001. Google ScholarDigital Library
- S. Arora, E. Hazan, and S. Kale. A fast random sampling algorithm for sparsifying matrices. APPROX-RANDOM '06, pages 272--279, 2006. Google ScholarDigital Library
- L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD '08, pages 16--24, 2008. Google ScholarDigital Library
- A. A. Benczúr and D. R. Karger. Approximating s-t minimum cuts in O(n2) time. In STOC '96, pages 47--55, 1996. Google ScholarDigital Library
- T. Bohman, C. Coopery, and A. Friezez. Min-Wise independent linear permutations. the electronic journal of combinatorics, 7(R26):2, 2000.Google Scholar
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC '98, pages 327--336, 1998. Google ScholarDigital Library
- G. Buehrer and K. Chellapilla. A scalable pattern mining approach to web graph compression with communities. In WSDM '08, pages 95--106, 2008. Google ScholarDigital Library
- A. Clauset, C. Shalizi, and M. Newman. Power-law distributions in empirical data. SIAM review, 51(4):661--703, 2009. Google ScholarDigital Library
- M. Costanzo, A. Baryshnikova, J. Bellay, Y. Kim, E. Spear, C. Sevier, H. Ding, J. Koh, K. Toufighi, S. Mostafavi, et al. The genetic landscape of a cell. Science's STKE, 327(5964):425, 2010.Google Scholar
- B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. PVLDB, 1(1):1189--1204, 2008. Google ScholarDigital Library
- I. S. Dhillon, Y. Guan, and B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Trans. PAMI, 29(11):1944--1957, 2007. Google ScholarDigital Library
- S. Fortunato. Community detection in graphs. Phys. Reports, 486:75--174, 2010.Google ScholarCross Ref
- D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB '05, pages 721--732, 2005. Google ScholarDigital Library
- J. Glattfelder and S. Battiston. Backbone of complex networks of corporations: The flow of control. Phys. Rev. E, 80(3):36104, 2009.Google ScholarCross Ref
- R. C. K. and S. Sudarshan. Graph clustering for keyword search. In COMAD '09, 2009.Google Scholar
- D. R. Karger. Random sampling in cut, flow, and network design problems. In STOC '94, pages 648--657, 1994. Google ScholarDigital Library
- G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359, 1999. Google ScholarDigital Library
- V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, J. Cui, and A. Percus. Reducing large internet topologies for faster simulations. NETWORKING '05, pages 328--341, 2005. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10, pages 591--600, 2010. Google ScholarDigital Library
- A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Phys. Rev. E, 78(4):46110, 2008.Google ScholarCross Ref
- K. Lang and S. Rao. A flow-based method for improving the expansion or conductance of graph cuts. LNCS '04, pages 325--337, 2004.Google ScholarCross Ref
- J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD '06, pages 631--636, 2006. Google ScholarDigital Library
- J. Leskovec, K. Lang, A. Dasgupta, and M. Mahoney. Statistical properties of community structure in large social and information networks. In WWW '08, pages 695--704, 2008. Google ScholarDigital Library
- S. Macskassy and F. Provost. Classification in networked data: A toolkit and a univariate case study. The Journal of Machine Learning Research, 8:935--983, 2007. Google ScholarDigital Library
- A. S. Maiya and T. Y. Berger-Wolf. Sampling community structure. In WWW '10, pages 701--710, 2010. Google ScholarDigital Library
- A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In IMC '07, pages 29--42, October 2007. Google ScholarDigital Library
- M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69(2):026113, Feb 2004.Google ScholarCross Ref
- F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. DMKD, 3(2):131--169, 1999. Google ScholarDigital Library
- S. Pu, J. Wong, B. Turner, E. Cho, and S. Wodak. Up-to-date catalogues of yeast protein complexes. Nucleic acids research, 37(3):825--831, 2009.Google Scholar
- A. Ruepp et al. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Research, 36(suppl 1):D646, 2008.Google Scholar
- V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In KDD '09, pages 737--746, 2009. Google ScholarDigital Library
- V. Satuluri and S. Parthasarathy. Symmetrizations for clustering directed graphs. In EDBT '11, 2011. Google ScholarDigital Library
- V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. Technical Report OSU-CISRC-11/10-TR25, The Ohio State University.Google Scholar
- J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Trans. PAMI, 22(8), 2000. Google ScholarDigital Library
- D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In STOC '08, pages 563--568, 2008. Google ScholarDigital Library
- D. A. Spielman and S. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04, pages 81--90, 2004. Google ScholarDigital Library
- M. Tumminello, T. Aste, T. Di Matteo, and R. Mantegna. A tool for filtering information in complex systems. PNAS, 102(30):10421, 2005.Google ScholarCross Ref
- S. Van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.Google Scholar
Index Terms
- Local graph sparsification for scalable clustering
Recommendations
gSparsify: Graph Motif Based Sparsification for Graph Clustering
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge ManagementGraph clustering is a fundamental problem that partitions vertices of a graph into clusters with an objective to optimize the intuitive notions of intra-cluster density and intercluster sparsity. In many real-world applications, however, the sheer sizes ...
Distributed Graph Clustering and Sparsification
Special Issue on SPAA 2017Graph clustering is a fundamental computational problem with a number of applications in algorithm design, machine learning, data mining, and analysis of social networks. Over the past decades, researchers have proposed a number of algorithmic design ...
Parallel Edge Contraction for Large Nonplanar Graph Clustering
BDIOT '18: Proceedings of the 2018 2nd International Conference on Big Data and Internet of ThingsWith the flowering of graph mining and computation technology, the field of graph clustering has become popular. Particularly, to cluster increasingly massive data represented as graph become common today. There have been many research efforts on graph ...
Comments