DOI: 10.1145/900051.900070
Article

Untangling compound documents on the web

Published: 26 August 2003

ABSTRACT

Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes.

    • Published in

      HYPERTEXT '03: Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
      August 2003
      232 pages
      ISBN:1581137044
      DOI:10.1145/900051

      Copyright © 2003 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      HYPERTEXT '03 Paper Acceptance Rate: 36 of 136 submissions, 26%
      Overall Acceptance Rate: 378 of 1,158 submissions, 33%
