ABSTRACT
In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks.
Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient.
We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering.
Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.
- C. Aggarwal, Y. Zhao, and P. Yu. Outlier detection in graph streams. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 399--409. IEEE, 2011. Google ScholarDigital Library
- D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3:993--1022, 2003. Google ScholarDigital Library
- A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC?98, pages 327--336. ACM, 1998. Google ScholarDigital Library
- M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC'02, pages 380--388. ACM, 2002. Google ScholarDigital Library
- D. Cohn and T. Hofmann. The missing link-a probabilistic model of document content and hypertext connectivity. In NIPS'01, volume 13, page 430. The MIT Press, 2001.Google Scholar
- I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In SIGKDD'04, pages 551--556. ACM, 2004. Google ScholarDigital Library
- E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 101(Suppl 1):5220, 2004.Google ScholarCross Ref
- M. Ester, R. Ge, B. Gao, Z. Hu, and B. Ben-Moshe. Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In SDM'06, pages 25--46, 2006.Google ScholarCross Ref
- S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75--174, 2010.Google ScholarCross Ref
- E. Fox and J. Shaw. Combination of multiple searches. NIST SPECIAL PUBLICATION SP, pages 243--243, 1994.Google Scholar
- T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228, 2004.Google ScholarCross Ref
- S. Günnemann, B. Boden, and T. Seidl. Db-csc: a density-based approach for subspace clustering in graphs with feature vectors. In PKDD 2011, pages 565--580. Springer-Verlag, 2011. Google ScholarDigital Library
- D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(suppl 1):S145--S154, 2002.Google ScholarCross Ref
- D. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383--413, 1999. Google ScholarCross Ref
- G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359--392, 1998. Google ScholarDigital Library
- J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web, pages 631--640. ACM, 2010. Google ScholarDigital Library
- A. Maiya and T. Berger-Wolf. Sampling community structure. In WWW 2010, pages 701--710. ACM, 2010. Google ScholarDigital Library
- B. Mohar. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2:871--898, 1991.Google Scholar
- M. Montague and J. Aslam. Relevance score normalization for metasearch. In CIKM'01, pages 427--433. ACM, 2001. Google ScholarDigital Library
- R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In SIGKDD'08, pages 542--550. ACM, 2008. Google ScholarDigital Library
- V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In SIGKDD'09, pages 737--746. ACM, 2009. Google ScholarDigital Library
- V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In SIGMOD 2011, pages 721--732. ACM, 2011. Google ScholarDigital Library
- A. Strehl and J. Ghosh. Cluster ensembles?a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583--617, 2003. Google ScholarDigital Library
- W. Tang, Z. Lu, and I. Dhillon. Clustering with multiple graphs. In ICDM'09, pages 1016--1021. IEEE, 2009. Google ScholarDigital Library
- Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In SIGMOD'08, pages 567--580, 2008. Google ScholarDigital Library
- I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC Systems Biology, 1(1):8, 2007.Google ScholarCross Ref
- S. van Dongen. Graph clustering by flow simulation. PhD Thesis, 2000.Google Scholar
- T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD'09, pages 927--936. ACM, 2009. Google ScholarDigital Library
- D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW'06, pages 173--182. ACM, 2006. Google ScholarDigital Library
- Y. Zhou, H. Cheng, and J. Yu. Clustering large attributed graphs: An efficient incremental approach. In ICDM 2010, pages 689--698. IEEE, 2010. Google ScholarDigital Library
- S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR'07, pages 487--494. ACM, 2007. Google ScholarDigital Library
Index Terms
- Efficient community detection in large networks using content and links
Recommendations
Combining link and content for community detection: a discriminative approach
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data miningIn this paper, we consider the problem of combining link and content analysis for community detection from networked data, such as paper citation networks and Word Wide Web. Most existing approaches combine link and content information by a generative ...
A classification for community discovery methods in complex networks
Many real-world networks are intimately organized according to a community structure. Much research effort has been devoted to develop methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature ...
Detection of web communities from community cores
WISS'10: Proceedings of the 2010 international conference on Web information systems engineeringA Web community, as a significant pattern of the Web, formed by a group of pages focusing on a common topic. Web communities are able to be oriented by complete bipartite graphs (CBG for short, and also known as community cores). Investigations have ...
Comments