ABSTRACT
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo!1 and the US Patent Database2. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.
- 1.K, All. Learning relational rules from relational data. Data Mining Seminar, IBM Almaden, San Jose, CA, Sept. 1997.Google Scholar
- 2.C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. A CM Transactions on information Systems, 1994. IBM Research Report RC18879. Google ScholarDigital Library
- 3.L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole, 1984. ISBN: 0-534-98054-6.Google Scholar
- 4.S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases, in VLDB, Athens, Greece, Aug. 1997. Invlted submission to VLDB Journal, 1998. Google ScholarDigital Library
- 5.C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Sixth World Wide Web Oor~ferer, ce, San Jose, CA, 1996.Google Scholar
- 6.R. Chellappa and A. fain. Markov Random Fields: Theory and Applications. Academic Press, 1993.Google Scholar
- 7.W. B. Croft and H. R. Turtle. A retrieval model for incorporating hypertext links. In A OM Hypertext, pages 213-224. ACM, 1989. Google ScholarDigital Library
- 8.W. B. Croft and H. R. Turtle. Retrieval strategies for hypertext. Information Processing and Management, 29(3):313-324, 1993. Google ScholarDigital Library
- 9.w. Deng and S. S. Iyengar. A new probabilistic relaxation scheme and its application to edge detection. IEEE Transactions of Pattern Analysis and Machine Intelligence, 18(4):432-437, Apr. 1996. Google ScholarDigital Library
- 10.R. Duds and P. Hart. Pattern Classification and Scene Analysis. Wiley. New York 1973.Google Scholar
- 11.D. Bppsteln. Fmd,ng the k shortest paths. In Symposium on the Foundations of Computer Science. IEEE, 1994.Google Scholar
- 12.D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In VLDB, Athens, Greece, 1997. Google ScholarDigital Library
- 13.H. P. Frei and D. Steiger. Making use of hypertext links when retrieving information. In A CM European Conference on Hypertezt (ECHT), pages 102-111. ACM, Nov. 1992. Google ScholarDigital Library
- 14.H. P. Frei and D. Steiger. The use of semantic links in hypertext information retrieval. Information Processing and Management, 31(1):1-13, 1995. Google ScholarDigital Library
- 15.Y. Freund and D. Pen. Learning to model sequences generated by switching distributions. In Proceedings of the Eighth Annual A CM Conference on Computational Learning Theory (COLT), pages 41-50, 1995. Google ScholarDigital Library
- 16.M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of A CM SIGIR 97, pages 246--255. ACM Press, 1997. Google ScholarDigital Library
- 17.R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IBEE Transactions of Patter~n Analysis and Machine Intelligence, PAMI-5(3):267-287, May 1983.Google Scholar
- 18.D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14. Morgan-Kaufmann, July 1997. To appear. Google ScholarDigital Library
- 19.K. L. Kwok. The use of titles and cited titles as document representations for automatic classification. Information Processing and Management, 11:201-206, 1975.Google ScholarCross Ref
- 20.K. L. Kwok. A document-document similarity measure based on cited titles and probability theory and its application to relevance feedback retrieval. In C. J. van Ri~sbergen, editor, Research and Development in Information Retrieval, pages 221- 232. Cambridge University Press, 1984. Google ScholarDigital Library
- 21.K. L. Kwok. A probabitistic theory of indexing and similarity measure based on cited and citing documents. Journal of the American Society for Information Science, 36(342-351), 1985. Google ScholarDigital Library
- 22.K. L. Kwok. On the use of bibliographically related titles for the enhancements of document representations. Information Processing and Management, 24(2):123-131, 1988. Google ScholarDigital Library
- 23.N. Lavrac and S. Dzeroski. Inductive Logic Programming - Techniques and Applications. Chichester, 1994. Google ScholarDigital Library
- 24.D. Lucarella and A. Zanzi. Information retrieval from hypertext' an approach using plausible inference. Information Processing and Management, 29(3):P.99-312, 1993. Google ScholarDigital Library
- 25.M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast sea|able classifier for data mining. In Pr0c. of the Fifth Int'I Conference on E~tending Database Technology, Avignon, France, March 1996. Google ScholarDigital Library
- 26.S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the Workshop on Algorithmic Learning Theory. Japanese Society for Artificial Intelligence, 1990.Google Scholar
- 27.M. Pazzani, C. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), Ithaca, NY, 1991. Morgan Kaufmann.Google ScholarDigital Library
- 28.L. Pelkowit~. A continuous relaxation labeling algorithm for markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3):709-715, May 1990.Google ScholarCross Ref
- 29.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5, 3:239-266, 1990. Google ScholarDigital Library
- 30.M. Sahami. Web classification using Bayesian nets. Personal communication, Oct. 1997.Google Scholar
- 31.G. Salton. Associative document retrieval techniques using bibliographic information. Journal o/the A CM, 10(4):440-457, Oct. 1963. Google ScholarDigital Library
- 32.G. Salton and M. J. McGill. Introduction to ModeT'n Information Retrieval. McGraw-Hill, 1983. Google ScholarDigital Library
- 33.G. Salton and Y. Zhang. Enhancement of text representations using related document titles. Information Processing and Management, 22(5):385-394, 1986. Google ScholarDigital Library
- 34.j. Savoy. A learning scheme for information retrieval in hypertext. Information Processing and Management, 30(4):515-533, 1994. Google ScholarDigital Library
- 35.J. Savoy. A new probabilistic scheme for information retrieval in hypertext. The New Review of I-Iype~.rned~a and Multi~'nedia, pages 107-134, 1995.Google Scholar
- 36.J. Savoy. An extended vector processing scheme for searching information in hypertext. Information Processing and Management, 32(2):155-170, Mar. 1996. Google ScholarDigital Library
- 37.J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalableparallel classifier for data mining. In Proc. of the 2Zth Int'l Conference on Very Large Databases, Bombay, India, Sept 1996. Google ScholarDigital Library
- 38.Y. Singer and W. W. Cohen. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996. Google ScholarDigital Library
- 39.J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia, 4(3):12-20, 1997. Google ScholarDigital Library
- 40.R. Weiss, B. Velez~ M. A. Sheldon, C. Nemprempre, P. Szilagyi, A. Duds, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In A OM Hyperte~t, Washington, DC, Mar. 1996. Google ScholarDigital Library
Index Terms
- Enhanced hypertext categorization using hyperlinks
Recommendations
Enhanced hypertext categorization using hyperlinks
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and ...
Using kNN model for automatic text categorization
An investigation is conducted on two well-known similarity-based learning approaches to text categorization: the k-nearest neighbors (kNN) classifier and the Rocchio classifier. After identifying the weakness and strength of each technique, a new ...
Enhanced web document summarization using hyperlinks
HYPERTEXT '03: Proceedings of the fourteenth ACM conference on Hypertext and hypermediaThis paper addresses the issue of Web document summarization. As textual content of Web documents is often scarce or irrelevant and existing summarization techniques are based on it, many Web pages and websites cannot be suitably summarized. We consider ...
Comments