DOI: 10.1145/1233341.1233359
Article

Enhancing clustering blog documents by utilizing author/reader comments

Published: 23 March 2007

ABSTRACT

Blogs are a new internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. Blog files differ from standard web files and may need specialized mining strategies. We propose to include the title, body, and comments of blog pages when clustering datasets of blog documents. In particular, we argue that the author/reader comments of blog pages may have a more discriminating effect in clustering blog documents. We constructed a word-page matrix by downloading blog pages from a well-known website and experimented with a k-means clustering algorithm, assigning different weights to the title, body, and comment parts. Our experimental results show that assigning a larger weight to the blog comments helps the k-means algorithm produce better clustering solutions, confirming our hypothesis that author/reader comments are very useful in discriminating blog files.
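The weighting idea described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not give the tokenizer, the term weighting formula, the weight values, or the k-means variant, so the function names (`tokenize`, `build_matrix`, `kmeans`), the toy pages, the weights 1/1/5, the unit-length normalization, and the "first k rows" initialization are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that keeps purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

def build_matrix(pages, weights):
    """Build a word-page matrix; each part's term counts are scaled by its weight."""
    vocab = sorted({t for p in pages for part in weights
                    for t in tokenize(p.get(part, ""))})
    rows = []
    for page in pages:
        vec = Counter()
        for part, w in weights.items():
            for term, count in Counter(tokenize(page.get(part, ""))).items():
                vec[term] += w * count
        row = [float(vec[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])  # unit length (an assumption)
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, max_iters=20):
    """Plain k-means; centroids start at the first k rows (assumed, deterministic)."""
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy pages: titles and bodies are near-identical, comments differ.
pages = [
    {"title": "daily post", "body": "some thoughts today",
     "comments": "football match goal"},
    {"title": "daily post", "body": "some thoughts today",
     "comments": "election vote senate"},
    {"title": "daily note", "body": "more thoughts today",
     "comments": "football goal team"},
    {"title": "daily note", "body": "more thoughts today",
     "comments": "election senate vote"},
]

# Hypothetical weights that emphasize the comment part.
weights = {"title": 1.0, "body": 1.0, "comments": 5.0}
vocab, rows = build_matrix(pages, weights)
labels = kmeans(rows, k=2)
print(labels)
```

With the comment part weighted more heavily, the two pages whose comments discuss the same topic fall into the same cluster even though their titles and bodies differ. Changing the entries of `weights` is the only thing needed to compare weighting schemes on the same pages.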


    • Published in

      ACM-SE 45: Proceedings of the 45th annual southeast regional conference
      March 2007
      574 pages
      ISBN: 9781595936295
      DOI: 10.1145/1233341

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate: 178 of 377 submissions, 47%
