skip to main content
10.1145/276304.276332acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free Access

Enhanced hypertext categorization using hyperlinks

Published:01 June 1998Publication History

ABSTRACT

A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented with pre-classified samples from Yahoo!1 and the US Patent Database2. In previous work, we developed a text classifier that misclassified only 13% of the documents in the well-known Reuters benchmark; this was comparable to the best results ever obtained. This classifier misclassified 36% of the patents, indicating that classifying hypertext can be more difficult than classifying text. Naively using terms in neighboring documents increased error to 38%; our hypertext classifier reduced it to 21%. Results with the Yahoo! sample were more dramatic: the text classifier showed 68% error, whereas our hypertext classifier reduced this to only 21%.

References

  1. 1.K, All. Learning relational rules from relational data. Data Mining Seminar, IBM Almaden, San Jose, CA, Sept. 1997.Google ScholarGoogle Scholar
  2. 2.C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. A CM Transactions on information Systems, 1994. IBM Research Report RC18879. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole, 1984. ISBN: 0-534-98054-6.Google ScholarGoogle Scholar
  4. 4.S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases, in VLDB, Athens, Greece, Aug. 1997. Invlted submission to VLDB Journal, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.C. Chekuri, M. Goldwasser, P. Raghavan, and E. Upfal. Web search using automatic classification. In Sixth World Wide Web Oor~ferer, ce, San Jose, CA, 1996.Google ScholarGoogle Scholar
  6. 6.R. Chellappa and A. fain. Markov Random Fields: Theory and Applications. Academic Press, 1993.Google ScholarGoogle Scholar
  7. 7.W. B. Croft and H. R. Turtle. A retrieval model for incorporating hypertext links. In A OM Hypertext, pages 213-224. ACM, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. 8.W. B. Croft and H. R. Turtle. Retrieval strategies for hypertext. Information Processing and Management, 29(3):313-324, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.w. Deng and S. S. Iyengar. A new probabilistic relaxation scheme and its application to edge detection. IEEE Transactions of Pattern Analysis and Machine Intelligence, 18(4):432-437, Apr. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.R. Duds and P. Hart. Pattern Classification and Scene Analysis. Wiley. New York 1973.Google ScholarGoogle Scholar
  11. 11.D. Bppsteln. Fmd,ng the k shortest paths. In Symposium on the Foundations of Computer Science. IEEE, 1994.Google ScholarGoogle Scholar
  12. 12.D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In VLDB, Athens, Greece, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.H. P. Frei and D. Steiger. Making use of hypertext links when retrieving information. In A CM European Conference on Hypertezt (ECHT), pages 102-111. ACM, Nov. 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. 14.H. P. Frei and D. Steiger. The use of semantic links in hypertext information retrieval. Information Processing and Management, 31(1):1-13, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.Y. Freund and D. Pen. Learning to model sequences generated by switching distributions. In Proceedings of the Eighth Annual A CM Conference on Computational Learning Theory (COLT), pages 41-50, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16.M. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proceedings of A CM SIGIR 97, pages 246--255. ACM Press, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17.R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IBEE Transactions of Patter~n Analysis and Machine Intelligence, PAMI-5(3):267-287, May 1983.Google ScholarGoogle Scholar
  18. 18.D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In International Conference on Machine Learning, volume 14. Morgan-Kaufmann, July 1997. To appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.K. L. Kwok. The use of titles and cited titles as document representations for automatic classification. Information Processing and Management, 11:201-206, 1975.Google ScholarGoogle ScholarCross RefCross Ref
  20. 20.K. L. Kwok. A document-document similarity measure based on cited titles and probability theory and its application to relevance feedback retrieval. In C. J. van Ri~sbergen, editor, Research and Development in Information Retrieval, pages 221- 232. Cambridge University Press, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21.K. L. Kwok. A probabitistic theory of indexing and similarity measure based on cited and citing documents. Journal of the American Society for Information Science, 36(342-351), 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. 22.K. L. Kwok. On the use of bibliographically related titles for the enhancements of document representations. Information Processing and Management, 24(2):123-131, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. 23.N. Lavrac and S. Dzeroski. Inductive Logic Programming - Techniques and Applications. Chichester, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. 24.D. Lucarella and A. Zanzi. Information retrieval from hypertext' an approach using plausible inference. Information Processing and Management, 29(3):P.99-312, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. 25.M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast sea|able classifier for data mining. In Pr0c. of the Fifth Int'I Conference on E~tending Database Technology, Avignon, France, March 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. 26.S. Muggleton and C. Feng. Efficient induction of logic programs. In Proceedings of the Workshop on Algorithmic Learning Theory. Japanese Society for Artificial Intelligence, 1990.Google ScholarGoogle Scholar
  27. 27.M. Pazzani, C. Brunk, and G. Silverstein. A knowledge-intensive approach to learning relational concepts. In Machine Learning: Proceedings of the Eighth International Workshop (ML91), Ithaca, NY, 1991. Morgan Kaufmann.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. 28.L. Pelkowit~. A continuous relaxation labeling algorithm for markov random fields. IEEE Transactions on Systems, Man and Cybernetics, 20(3):709-715, May 1990.Google ScholarGoogle ScholarCross RefCross Ref
  29. 29.R. Quinlan. Learning logical definitions from relations. Machine Learning, 5, 3:239-266, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. 30.M. Sahami. Web classification using Bayesian nets. Personal communication, Oct. 1997.Google ScholarGoogle Scholar
  31. 31.G. Salton. Associative document retrieval techniques using bibliographic information. Journal o/the A CM, 10(4):440-457, Oct. 1963. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. 32.G. Salton and M. J. McGill. Introduction to ModeT'n Information Retrieval. McGraw-Hill, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. 33.G. Salton and Y. Zhang. Enhancement of text representations using related document titles. Information Processing and Management, 22(5):385-394, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. 34.j. Savoy. A learning scheme for information retrieval in hypertext. Information Processing and Management, 30(4):515-533, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. 35.J. Savoy. A new probabilistic scheme for information retrieval in hypertext. The New Review of I-Iype~.rned~a and Multi~'nedia, pages 107-134, 1995.Google ScholarGoogle Scholar
  36. 36.J. Savoy. An extended vector processing scheme for searching information in hypertext. Information Processing and Management, 32(2):155-170, Mar. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. 37.J. Sharer, R. Agrawal, and M. Mehta. SPRINT: A scalableparallel classifier for data mining. In Proc. of the 2Zth Int'l Conference on Very Large Databases, Bombay, India, Sept 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. 38.Y. Singer and W. W. Cohen. Context-sensitive learning methods for text categorization. In SIGIR. ACM, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. 39.J. R. Smith and S.-F. Chang. Visually searching the web for content. IEEE Multimedia, 4(3):12-20, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. 40.R. Weiss, B. Velez~ M. A. Sheldon, C. Nemprempre, P. Szilagyi, A. Duds, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. In A OM Hyperte~t, Washington, DC, Mar. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Enhanced hypertext categorization using hyperlinks

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  SIGMOD '98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data
                  June 1998
                  599 pages
                  ISBN:0897919955
                  DOI:10.1145/276304

                  Copyright © 1998 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 1 June 1998

                  Permissions

                  Request permissions about this article.

                  Request Permissions

                  Check for updates

                  Qualifiers

                  • Article

                  Acceptance Rates

                  Overall Acceptance Rate785of4,003submissions,20%

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader