ABSTRACT
This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurately they predict document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F1 over a traditional content-based classifier. Combining it with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides insight into which link-derived measures are more appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
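To make the idea of a link-derived similarity measure concrete, the following is a minimal sketch of one such measure, normalized co-citation: two pages are considered similar when many of the same pages link to both. The abstract does not name the four measures evaluated, so the choice of co-citation, the normalization by the union of in-linking pages, and every name here (`cocitation`, `in_links`, the page identifiers) are assumptions made for illustration, not the paper's implementation.

```python
def cocitation(in_links, a, b):
    """Normalized co-citation similarity between pages a and b.

    in_links maps a page id to the set of pages that link to it
    (a hypothetical data layout chosen for this sketch).
    The score is the number of pages linking to both a and b,
    divided by the number of pages linking to either of them.
    """
    parents_a = in_links.get(a, set())
    parents_b = in_links.get(b, set())
    union = parents_a | parents_b
    if not union:
        # Neither page has in-links, so no link evidence is available.
        return 0.0
    return len(parents_a & parents_b) / len(union)


# Toy link graph (illustrative data only).
in_links = {
    "page_a": {"hub1", "hub2", "hub3"},
    "page_b": {"hub2", "hub3", "hub4"},
    "page_c": {"hub5"},
}

sim_ab = cocitation(in_links, "page_a", "page_b")  # 2 shared of 4 total
sim_ac = cocitation(in_links, "page_a", "page_c")  # no shared in-links
```

A score like this could feed a nearest-neighbor style classifier that assigns a page the category of the pages most similar to it; how the paper actually turns similarities into category predictions is not stated in the abstract.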