This dissertation investigates the role of contextual information in the automated retrieval and display of full-text documents, using robust natural language processing algorithms to automatically detect structure in texts and assign topic labels to them. Many long texts have a complex structure of topics and subtopics, a fact ignored by existing information access methods. I present two algorithms that detect such structure, and two visual display paradigms that use the results of these algorithms to show the interactions of multiple main topics, multiple subtopics, and the relations between main topics and subtopics.
The first algorithm, called TextTiling, recognizes the subtopic structure of texts as dictated by their content. It uses domain-independent lexical frequency and distribution information to partition texts into multi-paragraph passages. The results are found to correspond well to reader judgments of major subtopic boundaries. The second algorithm assigns multiple main topic labels to each text, where the labels are chosen from pre-defined, intuitive category sets; the algorithm is trained on unlabeled text.
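The idea of partitioning a text by lexical distribution can be illustrated with a deliberately simplified sketch. It is not the dissertation's TextTiling implementation: it assumes that paragraphs are the comparison unit, that adjacent paragraphs are compared by cosine similarity of raw term counts, and that a boundary is placed wherever the similarity falls below a fixed threshold. The function names and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(paragraphs):
    """Lexical similarity across each gap between adjacent paragraphs."""
    bags = [Counter(re.findall(r"[a-z]+", p.lower())) for p in paragraphs]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]


def boundaries(paragraphs, threshold=0.1):
    """Indices i where a subtopic boundary is placed after paragraph i.

    Assumption: a fixed similarity threshold stands in for whatever
    boundary criterion the actual algorithm uses.
    """
    return [i for i, s in enumerate(gap_scores(paragraphs)) if s < threshold]


paragraphs = [
    "the reactor core and the reactor vessel",
    "reactor cooling and the core",
    "oil spills along a coast",
    "coast cleanup after oil spills",
]
print(boundaries(paragraphs))
```

In this toy input the vocabulary changes completely between the second and third paragraphs, so the only boundary falls there; real text would need stopword handling and a less brittle criterion.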
A new iconic representation, called TileBars, uses TextTiles to simultaneously and compactly display query term frequency, query term distribution, and relative document length. This representation provides an informative alternative to ranking long texts according to their overall similarity to a query. For example, a user can choose to view those documents that have an extended discussion of one set of terms and a brief but overlapping discussion of a second set of terms. This representation also allows for relevance feedback on patterns of term distribution.
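As a rough sketch of the data behind such a display (not the TileBars implementation itself), the following assumes a document has already been split into tiles and computes one row per query term set, with one cell per tile holding the hit count. The row length reflects document length in tiles, and the text rendering with shade characters is only a stand-in for the graphical icon. All names here are hypothetical.

```python
import re


def tilebar_rows(tiles, term_sets):
    """One row per query term set; each cell is the hit count in that tile."""
    rows = []
    for terms in term_sets:
        wanted = {t.lower() for t in terms}
        rows.append([
            sum(1 for w in re.findall(r"[a-z]+", tile.lower()) if w in wanted)
            for tile in tiles
        ])
    return rows


def render(rows, shades=" .:#"):
    """Map counts to characters; later characters mean more hits."""
    return ["".join(shades[min(c, len(shades) - 1)] for c in row) for row in rows]


tiles = ["oil spills and oil tankers", "coast cleanup", "reactor safety and oil"]
term_sets = [["oil"], ["reactor", "safety"]]
rows = tilebar_rows(tiles, term_sets)
for line in render(rows):
    print("|" + line + "|")
```

Reading the rows side by side shows where the two term sets overlap: here only the last tile mentions both.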
TileBars display documents only in terms of the words supplied in the user query. If the query words do not correspond to the main topics of a retrieved text, the user cannot discern the context in which the query terms were used. For example, a query on contaminants may retrieve documents whose main topics relate to nuclear power, food, or oil spills. To address this issue, I describe a graphical interface, called Cougar, that displays retrieved documents in terms of interactions among their automatically assigned main topics, thus allowing users to familiarize themselves with the topics and terminology of a text collection.
Cited By
- Caracciolo C and de Rijke M Generating and retrieving text segments for focused access to scientific documents Proceedings of the 28th European conference on Advances in Information Retrieval, (350-361)
- Karoui L, Aufaure M and Bennacer N Context-based Hierarchical Clustering for the Ontology Learning Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, (420-427)
- Fragkou P, Petridis V and Kehagias A (2004). A Dynamic Programming Algorithm for Linear Text Segmentation, Journal of Intelligent Information Systems, 23:2, (179-197), Online publication date: 1-Sep-2004.
- Jingbo Z and Tianshun Y A knowledge-based approach to text classification Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18, (1-5)
- Blei D and Moreno P Topic segmentation with an aspect hidden Markov model Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, (343-348)
- Church K Empirical estimates of adaptation Proceedings of the 18th conference on Computational linguistics - Volume 1, (180-186)
- Oard D (1997). The State of the Art in Text Filtering, User Modeling and User-Adapted Interaction, 7:3, (141-178), Online publication date: 1-Mar-1997.
- Marshall C and Shipman F Spatial hypertext and the practice of information triage Proceedings of the eighth ACM conference on Hypertext, (124-133)
- Hearst M TileBars Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, (59-66)
Index Terms
- Context and structure in automated full-text information access