Article

Combining link-based and content-based methods for web document classification

Authors:
Pável Calado, Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Marco Cristo, Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Edleno Moura, Fed. Univ. of Amazonas, Manaus, Brazil
Nivio Ziviani, Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Berthier Ribeiro-Neto, Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
Marcos André Gonçalves, Virginia Tech, VA

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management, November 2003, Pages 394–401
https://doi.org/10.1145/956863.956938

Published: 03 November 2003

ABSTRACT

This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that the best results are achieved when links from pages outside the directory are considered. Link information alone yields gains of up to 46 points in F₁ when compared to a traditional content-based classifier. Combining it with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides important insight into which link-derived measures are more appropriate for comparing Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
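To make the idea of a link-derived similarity measure concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the four measures, so the sketch assumes two classic bibliometric ones, co-citation (shared in-links) and bibliographic coupling (shared out-links). The Jaccard-style normalization, the function names, the toy link graph, and the simple similarity-weighted vote used for classification are all assumptions made for illustration; the paper itself combines evidence with a Bayesian network model, which is not reproduced here.

```python
from collections import defaultdict


def invert(out_links):
    """Build a map from each page to the set of pages that link to it."""
    in_links = defaultdict(set)
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].add(src)
    return in_links


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cocitation(in_links, d1, d2):
    """Similarity based on the pages that link to both d1 and d2."""
    return jaccard(in_links.get(d1, set()), in_links.get(d2, set()))


def coupling(out_links, d1, d2):
    """Similarity based on the pages that both d1 and d2 link to."""
    return jaccard(out_links.get(d1, set()), out_links.get(d2, set()))


def classify(doc, labeled, similarity):
    """Assign doc the category with the largest summed similarity to labeled pages.

    Returns None when doc has no link evidence connecting it to any labeled page.
    """
    scores = defaultdict(float)
    for other, category in labeled.items():
        if other != doc:
            scores[category] += similarity(doc, other)
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)


# Hypothetical toy graph: three hub pages linking to four target pages.
out_links = {
    "hub1": {"a", "b"},
    "hub2": {"a", "b", "c"},
    "hub3": {"c", "d"},
}
in_links = invert(out_links)

# Hypothetical training labels for two of the target pages.
labeled = {"b": "sports", "d": "news"}


def sim(x, y):
    return cocitation(in_links, x, y)


print(classify("a", labeled, sim))  # "a" shares both of its parents with "b"
print(classify("c", labeled, sim))  # "c" is closer to "d" than to "b"
```

In this toy graph, page "a" is labeled "sports" because it has exactly the same parents as "b", while "c" is labeled "news" because its overlap with "d" (1/2) exceeds its overlap with "b" (1/3). A page with no in-links in common with any labeled page gets no prediction, which is one reason a link-based classifier is usually paired with a content-based one.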


Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval

Published in
CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
November 2003
592 pages
ISBN: 1581137230
DOI: 10.1145/956863
General Chair: Donald Kraft (Louisiana State University)
Program Chairs: Ophir Frieder (Illinois Institute of Technology), Joachim Hammer (University of Florida), Sajda Qureshi (University of Nebraska, Omaha), Len Seligman (The MITRE Corporation)
Copyright © 2003 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Bayesian networks
classification
link analysis
web
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%