Untangling compound documents on the web

Authors:
Nadav Eiron, IBM Almaden Research Center, San Jose, CA
Kevin S. McCurley, IBM Almaden Research Center, San Jose, CA

HYPERTEXT '03: Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 2003, Pages 85–94
https://doi.org/10.1145/900051.900070

Published: 26 August 2003

ABSTRACT

Most text analysis is designed to deal with the concept of a "document", namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of "document" and "web node" are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call "compound documents". In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes.

Index Terms

1. Information systems
  1. Information retrieval

Published in
HYPERTEXT '03: Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 2003, 232 pages
ISBN: 1581137044
DOI: 10.1145/900051
General Chairs: Helen Ashman (University of Nottingham, UK), Tim Brailsford (University of Nottingham, UK)
Program Chairs: Les Carr (University of Southampton, UK), Lynda Hardman (CWI & Technical University of Eindhoven, The Netherlands)
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
composites
hypertext
information retrieval
semantic web
wasted space
Qualifiers
- Article
Conference

Acceptance Rates
HYPERTEXT '03 Paper Acceptance Rate: 36 of 136 submissions, 26%
Overall Acceptance Rate: 378 of 1,158 submissions, 33%