Article, DOI: 10.1145/956863.956938

Combining link-based and content-based methods for web document classification

Published: 3 November 2003

ABSTRACT

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four measures of subject similarity derived from the Web link structure and determine how accurately they predict document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F1 compared with a traditional content-based classifier. Combining it with content-based methods can improve the results further, but may also introduce too much noise, since the text of Web pages is a much less reliable source of information. This work provides insight into which link-derived measures are most appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
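
The abstract does not name its four similarity measures or spell out the Bayesian network, so the following is only a minimal sketch under stated assumptions. It uses two classical link-derived measures, co-citation and bibliographic coupling, each normalized here as a Jaccard overlap, and a noisy-OR rule as one simple way to merge a link-based score with a content-based score. The toy graph `OUT_LINKS`, the function names, and the choice of normalization and combination rule are all illustrative assumptions, not the paper's definitions.

```python
from typing import Dict, Set

# Hypothetical toy link graph (an assumption for illustration):
# each page maps to the set of pages it links to.
OUT_LINKS: Dict[str, Set[str]] = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"z"},
    "p": {"a", "b"},
    "q": {"a", "b"},
    "r": {"b", "c"},
}


def in_links(graph: Dict[str, Set[str]], page: str) -> Set[str]:
    """Return the set of pages that link to `page`."""
    return {src for src, targets in graph.items() if page in targets}


def jaccard(s: Set[str], t: Set[str]) -> float:
    """Overlap of two sets divided by the size of their union (0.0 if both empty)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0


def cocitation(graph: Dict[str, Set[str]], d1: str, d2: str) -> float:
    """Two pages are similar if the same pages link to both of them."""
    return jaccard(in_links(graph, d1), in_links(graph, d2))


def bibliographic_coupling(graph: Dict[str, Set[str]], d1: str, d2: str) -> float:
    """Two pages are similar if they link to the same pages."""
    return jaccard(graph.get(d1, set()), graph.get(d2, set()))


def noisy_or(link_score: float, content_score: float) -> float:
    """Combine two scores in [0, 1]; either source of evidence can raise the result.

    This is an assumed combination rule, not the model from the paper.
    """
    return 1.0 - (1.0 - link_score) * (1.0 - content_score)


if __name__ == "__main__":
    print(cocitation(OUT_LINKS, "a", "b"))
    print(bibliographic_coupling(OUT_LINKS, "b", "c"))
    print(noisy_or(0.5, 0.5))
```

In this toy graph, pages `a` and `b` share two of their three distinct in-linking pages, so their co-citation score is 2/3, while `a` and `c` share none and score 0. A noisy-OR combination never lowers the stronger of its two inputs, which is one reason an unreliable content score can add noise rather than signal.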

Published in

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
November 2003
592 pages
ISBN: 1581137230
DOI: 10.1145/956863

Copyright © 2003 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 1,861 of 8,427 submissions, 22%
