Article

Enhancing clustering blog documents by utilizing author/reader comments

Authors:
Beibei Li, University of Kentucky, Lexington, KY
Shuting Xu, Virginia State University, Petersburg, VA
Jun Zhang, University of Kentucky, Lexington, KY
ACM-SE 45: Proceedings of the 45th annual southeast regional conference, March 2007, Pages 94–99
https://doi.org/10.1145/1233341.1233359

Published: 23 March 2007

ABSTRACT

Blogs are a new internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. Blog files differ from standard web files and may need specialized mining strategies. We propose including the title, body, and comments of blog pages in the datasets used to cluster blog documents. In particular, we argue that the author/reader comments of blog pages may have a stronger discriminating effect in clustering blog documents. We constructed a word-page matrix from blog pages downloaded from a well-known website and experimented with a k-means clustering algorithm, assigning different weights to the title, body, and comment parts. Our experimental results show that assigning a larger weight to the blog comments helps the k-means algorithm produce better clustering solutions. These results confirm our hypothesis that the author/reader comments of blog files are very useful in discriminating between blog files.
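
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy pages, the weight values (1 for title and body, 3 for comments), the unit-length normalization, the `weighted_vector` and `kmeans` helper names, and the deterministic choice of the first k pages as initial centroids are all assumptions made for illustration. Each page becomes one column of a word-page matrix whose entries are term counts scaled by the weight of the part in which the term appears, and a plain k-means loop then groups the columns.

```python
from collections import Counter
import math

def weighted_vector(page, weights, vocab):
    """Build one word-page column: term counts scaled by the part's weight."""
    counts = Counter()
    for part, weight in weights.items():
        for word in page.get(part, "").lower().split():
            counts[word] += weight
    vec = [counts[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
    return [x / norm for x in vec]

def kmeans(vectors, k, iters=20):
    """Plain k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        # Assign every vector to its nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy pages: titles and bodies are nearly topic-neutral,
# while the comments carry the topical vocabulary.
pages = [
    {"title": "morning notes", "body": "thoughts on the news today",
     "comments": "election vote senate election"},
    {"title": "morning notes", "body": "thoughts on the game today",
     "comments": "football goal team football"},
    {"title": "weekend post", "body": "some thoughts today",
     "comments": "senate vote election"},
    {"title": "weekend post", "body": "some thoughts today",
     "comments": "team goal football"},
]

weights = {"title": 1, "body": 1, "comments": 3}  # assumed weight values
vocab = sorted({w for p in pages for part in p.values() for w in part.lower().split()})
vectors = [weighted_vector(p, weights, vocab) for p in pages]
labels = kmeans(vectors, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

In this toy data the two pages with political comments share one cluster and the two pages with sports comments share the other, although the titles alone would pair the pages differently. The weights used in the paper's experiments are not given in the abstract, so the values here are placeholders.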


Index Terms

1. Information systems
  1. Information retrieval


Published in
ACM-SE 45: Proceedings of the 45th annual southeast regional conference
March 2007
574 pages
ISBN: 9781595936295
DOI: 10.1145/1233341
Conference Chairs:
David John, Wake Forest University
Sandria Kerr, Winston-Salem State University
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
blog
blogosphere
clustering
comment
data mining

Acceptance Rates
Overall Acceptance Rate: 178 of 377 submissions, 47%
