Enhanced hypertext categorization using hyperlinks

Authors:
Soumen Chakrabarti

IBM Almaden

IBM Almaden
View Profile

,
Byron Dom

IBM Almaden

IBM Almaden
View Profile

,
Piotr Indyk

Stanford University

Stanford University
View Profile

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of dataJune 1998Pages 307–318https://doi.org/10.1145/276304.276332

Published:01 June 1998Publication History

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Pages 307–318

ABSTRACT

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo!¹ and the US Patent Database². In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.

References

1.K, All. Learning relational rules from relational data. Data Mining Seminar, IBM Almaden, San Jose, CA, Sept. 1997.Google Scholar
2.C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. A CM Transactions on information Systems, 1994. IBM Research Report RC18879. Google ScholarDigital Library
3.L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole, 1984. ISBN: 0-534-98054-6.Google Scholar
4.S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases, in VLDB, Athens, Greece, Aug. 1997. Invlted submission to VLDB Journal, 1998. Google ScholarDigital Library
5.C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Sixth World Wide Web Oor~ferer, ce, San Jose, CA, 1996.Google Scholar
6.R. Chellappa and A. fain. Markov Random Fields: Theory and Applications. Academic Press, 1993.Google Scholar
7.W. B. Croft and H. R. Turtle. A retrieval model for incorporating hypertext links. In A OM Hypertext, pages 213-224. ACM, 1989. Google ScholarDigital Library
8.W. B. Croft and H. R. Turtle. Retrieval strategies for hypertext. Information Processing and Management, 29(3):313-324, 1993. Google ScholarDigital Library
9.w. Deng and S. S. Iyengar. A new probabilistic relaxation scheme and its application to edge detection. IEEE Transactions of Pattern Analysis and Machine Intelligence, 18(4):432-437, Apr. 1996. Google ScholarDigital Library
10.R. Duds and P. Hart. Pattern Classification and Scene Analysis. Wiley. New York 1973.Google Scholar
11.D. Bppsteln. Fmd,ng the k shortest paths. In Symposium on the Foundations of Computer Science. IEEE, 1994.Google Scholar
12.D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In VLDB, Athens, Greece, 1997. Google ScholarDigital Library
13.H. P. Frei and D. Steiger. Making use of hypertext links when retrieving information. In A CM European Conference on Hypertezt (ECHT), pages 102-111. ACM, Nov. 1992. Google ScholarDigital Library
14.H. P. Frei and D. Steiger. The use of semantic links in hypertext information retrieval. Information Processing and Management, 31(1):1-13, 1995. Google ScholarDigital Library
15.Y. Freund and D. Pen. Learning to model sequences generated by switching distributions. In Proceedings of the Eighth Annual A CM Conference on Computational Learning Theory (COLT), pages 41-50, 1995. Google ScholarDigital Library
16.M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of A CM SIGIR 97, pages 246--255. ACM Press, 1997. Google ScholarDigital Library
17.R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IBEE Transactions of Patter~n Analysis and Machine Intelligence, PAMI-5(3):267-287, May 1983.Google Scholar
18.D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14. Morgan-Kaufmann, July 1997. To appear. Google ScholarDigital Library
19.K. L. Kwok. The use of titles and cited titles as document representations for automatic classification. Information Processing and Management, 11:201-206, 1975.Google ScholarCross Ref
20.K. L. Kwok. A document-document similarity measure based on cited titles and probability theory and its application to relevance feedback retrieval. In C. J. van Ri~sbergen, editor, Research and Development in Information Retrieval, pages 221- 232. Cambridge University Press, 1984. Google ScholarDigital Library
21.K. L. Kwok. A probabitistic theory of indexing and similarity measure based on cited and citing documents. Journal of the American Society for Information Science, 36(342-351), 1985. Google ScholarDigital Library
22.K. L. Kwok. On the use of bibliographically related titles for the enhancements of document representations. Information Processing and Management, 24(2):123-131, 1988. Google ScholarDigital Library
23.N. Lavrac and S. Dzeroski. Inductive Logic Programming - Techniques and Applications. Chichester, 1994. Google ScholarDigital Library
24.D. Lucarella and A. Zanzi. Information retrieval from hypertext' an approach using plausible inference. Information Processing and Management, 29(3):P.99-312, 1993. Google ScholarDigital Library
25.M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast sea|able classifier for data mining. In Pr0c. of the Fifth Int'I Conference on E~tending Database Technology, Avignon, France, March 1996. Google ScholarDigital Library
26.S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the Workshop on Algorithmic Learning Theory. Japanese Society for Artificial Intelligence, 1990.Google Scholar
27.M. Pazzani, C. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), Ithaca, NY, 1991. Morgan Kaufmann.Google ScholarDigital Library
28.L. Pelkowit~. A continuous relaxation labeling algorithm for markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3):709-715, May 1990.Google ScholarCross Ref
29.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5, 3:239-266, 1990. Google ScholarDigital Library
30.M. Sahami. Web classification using Bayesian nets. Personal communication, Oct. 1997.Google Scholar
31.G. Salton. Associative document retrieval techniques using bibliographic information. Journal o/the A CM, 10(4):440-457, Oct. 1963. Google ScholarDigital Library
32.G. Salton and M. J. McGill. Introduction to ModeT'n Information Retrieval. McGraw-Hill, 1983. Google ScholarDigital Library
33.G. Salton and Y. Zhang. Enhancement of text representations using related document titles. Information Processing and Management, 22(5):385-394, 1986. Google ScholarDigital Library
34.j. Savoy. A learning scheme for information retrieval in hypertext. Information Processing and Management, 30(4):515-533, 1994. Google ScholarDigital Library
35.J. Savoy. A new probabilistic scheme for information retrieval in hypertext. The New Review of I-Iype~.rned~a and Multi~'nedia, pages 107-134, 1995.Google Scholar
36.J. Savoy. An extended vector processing scheme for searching information in hypertext. Information Processing and Management, 32(2):155-170, Mar. 1996. Google ScholarDigital Library
37.J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalableparallel classifier for data mining. In Proc. of the 2Zth Int'l Conference on Very Large Databases, Bombay, India, Sept 1996. Google ScholarDigital Library
38.Y. Singer and W. W. Cohen. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996. Google ScholarDigital Library
39.J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia, 4(3):12-20, 1997. Google ScholarDigital Library
40.R. Weiss, B. Velez~ M. A. Sheldon, C. Nemprempre, P. Szilagyi, A. Duds, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In A OM Hyperte~t, Washington, DC, Mar. 1996. Google ScholarDigital Library

Index Terms

Enhanced hypertext categorization using hyperlinks

Recommendations

Enhanced hypertext categorization using hyperlinks

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and ...
Read More
Using kNN model for automatic text categorization

An investigation is conducted on two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weakness and strength of each technique, a new ...
Read More
Enhanced web document summarization using hyperlinks
HYPERTEXT '03: Proceedings of the fourteenth ACM conference on Hypertext and hypermedia

This paper addresses the issue of Web document summarization. As textual content of Web documents is often scarce or irrelevant and existing summarization techniques are based on it, many Web pages and websites cannot be suitably summarized. We consider ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
June 1998
599 pages
ISBN:0897919955
DOI:10.1145/276304
Chairmen:
Laura Haas
IBM AlmadenResearch Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editors:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
,
Michael Franklin
Univ. of Maryland, College Park
ACM SIGMOD Record Volume 27, Issue 2
June 1998
595 pages
ISSN:0163-5808
DOI:10.1145/276305
Chairmen:
Laura Haas
IBM Almaden Research Center, San Jose, CA
,
Pamela Drew
Boeing Co.
,
Editor:
Ashutosh Tiwary
Boeing Co.; and Univ. of Washington, Seattle
Issue’s Table of Contents
Copyright © 1998 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 1998
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 526
  Total Citations
  View Citations
- 2,587
  Total Downloads
- Downloads (Last 12 months)176
- Downloads (Last 6 weeks)45
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Enhanced hypertext categorization using hyperlinks

SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Enhanced hypertext categorization using hyperlinks

Using kNN model for automatic text categorization

Enhanced web document summarization using hyperlinks