ABSTRACT
Blogs are a new form of internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. Blog files differ from standard web files and may need specialized mining strategies. We propose to include the title, body, and comments of blog pages when clustering datasets built from blog documents. In particular, we argue that the author/reader comments on blog pages may have a stronger discriminating effect in clustering blog documents. We constructed a word-page matrix by downloading blog pages from a well-known website and experimented with a k-means clustering algorithm, assigning different weights to the title, body, and comment parts. Our experimental results show that assigning a larger weight to the blog comments helps the k-means algorithm produce better clustering solutions. The experimental results confirm our hypothesis that author/reader comments are very useful in discriminating blog files.
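The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenization (lowercased whitespace split), the unit-length normalization, the deterministic farthest-point initialization, the weight values, and the toy pages are all assumptions made for illustration. The sketch builds one weighted term vector per page from its title, body, and comments, stacks the vectors into a word-page matrix, and runs a plain k-means on the rows.

```python
import math
from collections import Counter


def weighted_term_vector(title, body, comments,
                         w_title=1.0, w_body=1.0, w_comment=1.0):
    """Sum per-part term counts, scaling each part by its weight."""
    vec = Counter()
    parts = ((title, w_title), (body, w_body), (comments, w_comment))
    for text, weight in parts:
        # Assumption: naive lowercased whitespace tokenization.
        for term, count in Counter(text.lower().split()).items():
            vec[term] += weight * count
    return vec


def build_matrix(pages, **weights):
    """Return (vocabulary, rows); each row is a unit-length page vector."""
    vectors = [
        weighted_term_vector(p["title"], p["body"], p["comments"], **weights)
        for p in pages
    ]
    vocab = sorted({term for v in vectors for term in v})
    rows = []
    for v in vectors:
        row = [v.get(term, 0.0) for term in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iters=20):
    """Plain k-means; returns one cluster label per row."""
    # Assumption: deterministic farthest-point seeding, starting at row 0.
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid for every row.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy pages: titles and bodies are generic, comments carry the topic.
pages = [
    {"title": "weekend notes", "body": "some thoughts today",
     "comments": "great goal football match"},
    {"title": "my diary", "body": "some thoughts today",
     "comments": "football match great team"},
    {"title": "weekend notes", "body": "random thoughts today",
     "comments": "election vote policy debate"},
    {"title": "my diary", "body": "random thoughts today",
     "comments": "vote election candidate debate"},
]

# Comments weighted three times as heavily as title and body (illustrative value).
vocab, rows = build_matrix(pages, w_title=1.0, w_body=1.0, w_comment=3.0)
labels = kmeans(rows, k=2)
print(labels)
```

Raising `w_comment` scales the comment terms before normalization, so pages whose comments share vocabulary move closer together in the vector space. The weights used in the actual experiments are not given in the abstract, so the value 3.0 here is only a placeholder.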
Index Terms
- Enhancing clustering blog documents by utilizing author/reader comments