research-article

Efficient community detection in large networks using content and links

Authors:
Yiye Ruan

The Ohio State University, Columbus, OH, USA

The Ohio State University, Columbus, OH, USA
View Profile

,
David Fuhry

The Ohio State University, Columbus, OH, USA

The Ohio State University, Columbus, OH, USA
View Profile

,
Srinivasan Parthasarathy

The Ohio State University, Columbus, OH, USA

The Ohio State University, Columbus, OH, USA
View Profile

WWW '13: Proceedings of the 22nd international conference on World Wide WebMay 2013Pages 1089–1098https://doi.org/10.1145/2488388.2488483

Published:13 May 2013Publication History

WWW '13: Proceedings of the 22nd international conference on World Wide Web

Pages 1089–1098

ABSTRACT

In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks.

Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient.

We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering.

Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.

References

C. Aggarwal, Y. Zhao, and P. Yu. Outlier detection in graph streams. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 399--409. IEEE, 2011. Google ScholarDigital Library
D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. JMLR, 3:993--1022, 2003. Google ScholarDigital Library
A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC?98, pages 327--336. ACM, 1998. Google ScholarDigital Library
M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC'02, pages 380--388. ACM, 2002. Google ScholarDigital Library
D. Cohn and T. Hofmann. The missing link-a probabilistic model of document content and hypertext connectivity. In NIPS'01, volume 13, page 430. The MIT Press, 2001.Google Scholar
I. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In SIGKDD'04, pages 551--556. ACM, 2004. Google ScholarDigital Library
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 101(Suppl 1):5220, 2004.Google ScholarCross Ref
M. Ester, R. Ge, B. Gao, Z. Hu, and B. Ben-Moshe. Joint cluster analysis of attribute data and relationship data: the connected k-center problem. In SDM'06, pages 25--46, 2006.Google ScholarCross Ref
S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75--174, 2010.Google ScholarCross Ref
E. Fox and J. Shaw. Combination of multiple searches. NIST SPECIAL PUBLICATION SP, pages 243--243, 1994.Google Scholar
T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(Suppl 1):5228, 2004.Google ScholarCross Ref
S. Günnemann, B. Boden, and T. Seidl. Db-csc: a density-based approach for subspace clustering in graphs with feature vectors. In PKDD 2011, pages 565--580. Springer-Verlag, 2011. Google ScholarDigital Library
D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(suppl 1):S145--S154, 2002.Google ScholarCross Ref
D. Karger. Random sampling in cut, flow, and network design problems. Mathematics of Operations Research, 24(2):383--413, 1999. Google ScholarCross Ref
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359--392, 1998. Google ScholarDigital Library
J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th international conference on World wide web, pages 631--640. ACM, 2010. Google ScholarDigital Library
A. Maiya and T. Berger-Wolf. Sampling community structure. In WWW 2010, pages 701--710. ACM, 2010. Google ScholarDigital Library
B. Mohar. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2:871--898, 1991.Google Scholar
M. Montague and J. Aslam. Relevance score normalization for metasearch. In CIKM'01, pages 427--433. ACM, 2001. Google ScholarDigital Library
R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In SIGKDD'08, pages 542--550. ACM, 2008. Google ScholarDigital Library
V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In SIGKDD'09, pages 737--746. ACM, 2009. Google ScholarDigital Library
V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. In SIGMOD 2011, pages 721--732. ACM, 2011. Google ScholarDigital Library
A. Strehl and J. Ghosh. Cluster ensembles?a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583--617, 2003. Google ScholarDigital Library
W. Tang, Z. Lu, and I. Dhillon. Clustering with multiple graphs. In ICDM'09, pages 1016--1021. IEEE, 2009. Google ScholarDigital Library
Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In SIGMOD'08, pages 567--580, 2008. Google ScholarDigital Library
I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC Systems Biology, 1(1):8, 2007.Google ScholarCross Ref
S. van Dongen. Graph clustering by flow simulation. PhD Thesis, 2000.Google Scholar
T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD'09, pages 927--936. ACM, 2009. Google ScholarDigital Library
D. Zhou, E. Manavoglu, J. Li, C. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW'06, pages 173--182. ACM, 2006. Google ScholarDigital Library
Y. Zhou, H. Cheng, and J. Yu. Clustering large attributed graphs: An efficient incremental approach. In ICDM 2010, pages 689--698. IEEE, 2010. Google ScholarDigital Library
S. Zhu, K. Yu, Y. Chi, and Y. Gong. Combining content and link for classification using matrix factorization. In SIGIR'07, pages 487--494. ACM, 2007. Google ScholarDigital Library

Index Terms

Efficient community detection in large networks using content and links
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Combining link and content for community detection: a discriminative approach
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

In this paper, we consider the problem of combining link and content analysis for community detection from networked data, such as paper citation networks and Word Wide Web. Most existing approaches combine link and content information by a generative ...
Read More
A classification for community discovery methods in complex networks

Many real-world networks are intimately organized according to a community structure. Much research effort has been devoted to develop methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature ...
Read More
Detection of web communities from community cores
WISS'10: Proceedings of the 2010 international conference on Web information systems engineering

A Web community, as a significant pattern of the Web, formed by a group of pages focusing on a common topic. Web communities are able to be oriented by complete bipartite graphs (CBG for short, and also known as community cores). Investigations have ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN:9781450320351
DOI:10.1145/2488388
General Chairs:
Daniel Schwabe
PUC-Rio - Brazil
,
Virgílio Almeida
UFMG - Brazil
,
Hartmut Glaser
CGI.br - Brazil
,
Program Chairs:
Ricardo Baeza-Yates
Yahoo! Labs - Spain & Chile
,
Sue Moon
KAIST - South Korea
Copyright © 2013 Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 May 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
content analysis
graph clustering
web mining
Qualifiers
- research-article
Conference

Acceptance Rates
WWW '13 Paper Acceptance Rate125of831submissions,15%Overall Acceptance Rate1,899of8,196submissions,23%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 219
  Total Citations
  View Citations
- 1,563
  Total Downloads
- Downloads (Last 12 months)82
- Downloads (Last 6 weeks)15
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient community detection in large networks using content and links

WWW '13: Proceedings of the 22nd international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Combining link and content for community detection: a discriminative approach

A classification for community discovery methods in complex networks

Detection of web communities from community cores