skip to main content
10.1145/2488388.2488483acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Efficient community detection in large networks using content and links

Published:13 May 2013Publication History

ABSTRACT

In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks.

Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient.

We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering.

Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.

References

  1. C. Aggarwal, Y. Zhao, and P. Yu. Outlier detection in graph streams. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 399--409. IEEE, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3:993--1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC?98, pages 327--336. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC'02, pages 380--388. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Cohn and T. Hofmann. The missing link-a probabilistic model of document content and hypertext connectivity. In NIPS'01, volume 13, page 430. The MIT Press, 2001.Google ScholarGoogle Scholar
  6. I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In SIGKDD'04, pages 551--556. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 101(Suppl 1):5220, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  8. M. Ester, R. Ge, B. Gao, Z. Hu, and B. Ben-Moshe. Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In SDM'06, pages 25--46, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  9. S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75--174, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  10. E. Fox and J. Shaw. Combination of multiple searches. NIST SPECIAL PUBLICATION SP, pages 243--243, 1994.Google ScholarGoogle Scholar
  11. T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  12. S. Günnemann, B. Boden, and T. Seidl. Db-csc: a density-based approach for subspace clustering in graphs with feature vectors. In PKDD 2011, pages 565--580. Springer-Verlag, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(suppl 1):S145--S154, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  14. D. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383--413, 1999. Google ScholarGoogle ScholarCross RefCross Ref
  15. G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359--392, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web, pages 631--640. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Maiya and T. Berger-Wolf. Sampling community structure. In WWW 2010, pages 701--710. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. B. Mohar. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2:871--898, 1991.Google ScholarGoogle Scholar
  19. M. Montague and J. Aslam. Relevance score normalization for metasearch. In CIKM'01, pages 427--433. ACM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In SIGKDD'08, pages 542--550. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In SIGKDD'09, pages 737--746. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In SIGMOD 2011, pages 721--732. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. A. Strehl and J. Ghosh. Cluster ensembles?a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583--617, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. W. Tang, Z. Lu, and I. Dhillon. Clustering with multiple graphs. In ICDM'09, pages 1016--1021. IEEE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In SIGMOD'08, pages 567--580, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC Systems Biology, 1(1):8, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  27. S. van Dongen. Graph clustering by flow simulation. PhD Thesis, 2000.Google ScholarGoogle Scholar
  28. T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD'09, pages 927--936. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW'06, pages 173--182. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Y. Zhou, H. Cheng, and J. Yu. Clustering large attributed graphs: An efficient incremental approach. In ICDM 2010, pages 689--698. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR'07, pages 487--494. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient community detection in large networks using content and links

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      WWW '13: Proceedings of the 22nd international conference on World Wide Web
      May 2013
      1628 pages
      ISBN:9781450320351
      DOI:10.1145/2488388

      Copyright © 2013 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 May 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      WWW '13 Paper Acceptance Rate125of831submissions,15%Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader