Robust automated topic identification
Publisher:
  • University of Southern California
  • Computer Science Dept. 200 University Park Los Angeles, CA
  • United States
ISBN:978-0-591-67266-4
Order Number:AAI9816048
Pages: 229
Abstract

As the amount of on-line text keeps growing, it becomes increasingly difficult for humans to process the deluge of information in the time available. We need automatic text processing systems to help us scan through huge volumes of text, route texts to relevant parties, filter them into prespecified categories, or even summarize them. One crucial step toward this goal is to identify the major topics of the texts, since summarization, text routing, and similar tasks all depend on knowing the topics. In this research, we investigated several topic identification methods and developed three major results:

1. We extended existing word-based frequency counting methods to form a new concept-based frequency method based on the assumption 'the more a concept is mentioned in a text, the more important it is.' We used the knowledge base WordNet to generalize words into concepts and showed how to select concepts of the appropriate degree of generalization.

2. We studied patterns of word co-occurrence (topic signatures) consisting of sets of keywords that uniquely identify the topics of interest. We showed how to acquire keywords from texts pre-classified for each topic, using the tf*idf measure. We also demonstrated how to identify topics using topic signatures, introduced confusion sets and multi-level topic signatures, and discussed the problems associated with multiple topics in a text.

3. We described, implemented, and evaluated a method to learn the Optimal Position Policy (OPP) for finding topic-rich sentences in texts. This work is based on the Position Hypothesis: in genres with fixed discourse structure, the (ordinal) position of a sentence is related to its importance in a text. We showed how to verify the Position Hypothesis using topic keywords, empirically identify important sentence positions in a genre or domain, and quantitatively evaluate the results with various measures.
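
As a rough illustration of results 2 and 3, the sketch below shows one way such computations could look. It is not the dissertation's implementation: the function names `topic_signatures` and `rank_positions`, the whitespace tokenization, the pooling of each topic's texts into a single document for the idf term, and the keyword-coverage score used to rank sentence positions are all assumptions made for this example.

```python
import math
from collections import Counter


def topic_signatures(docs_by_topic, k=3):
    """Return the k highest-scoring tf*idf terms for each topic.

    Assumption: all texts pre-classified under a topic are pooled into one
    'document', so idf is computed over topics rather than single texts.
    """
    # Term frequency per topic, pooled over that topic's texts.
    tf = {
        topic: Counter(w for text in texts for w in text.lower().split())
        for topic, texts in docs_by_topic.items()
    }
    n_topics = len(tf)
    # Document frequency: the number of topics in which each term occurs.
    df = Counter(term for counts in tf.values() for term in counts)

    signatures = {}
    for topic, counts in tf.items():
        scores = {t: c * math.log(n_topics / df[t]) for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        signatures[topic] = ranked[:k]
    return signatures


def rank_positions(docs, keywords_per_doc):
    """Rank ordinal sentence positions by average topic-keyword coverage.

    Assumption: `docs` is a list of documents, each a list of sentences, and
    `keywords_per_doc` holds one non-empty set of topic keywords per document.
    """
    totals, counts = Counter(), Counter()
    for sentences, keywords in zip(docs, keywords_per_doc):
        for pos, sentence in enumerate(sentences, start=1):
            words = set(sentence.lower().split())
            # Fraction of the document's keywords found in this sentence.
            totals[pos] += len(words & keywords) / len(keywords)
            counts[pos] += 1
    average = {pos: totals[pos] / counts[pos] for pos in totals}
    # Best position first; ties are broken by the earlier position.
    return sorted(average, key=lambda p: (-average[p], p))
```

In this sketch a term that occurs under every topic receives an idf of zero and therefore drops out of the signatures, which is the usual motivation for weighting raw frequency by idf. The position ranking only orders positions by how many keywords they tend to contain; a learned policy for a real genre would need a much larger pre-classified collection and a proper evaluation.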

Contributors
  • University of Southern California
