ABSTRACT
Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often deploy a document as a collection of URLs, which we call a "compound document". In this paper we present new techniques for identifying and working with such compound documents, along with the results of some large-scale studies of such web documents. The primary motivation for this work is that information retrieval techniques are better suited to working on documents than on individual hypertext nodes.
Index Terms
- Untangling compound documents on the web