Robust automated topic identification

January 1997

Author: Chin-Yew Lin

Publisher: University of Southern California, Computer Science Dept., 200 University Park, Los Angeles, CA, United States

ISBN: 978-0-591-67266-4

Order Number: AAI9816048

Pages: 229

Abstract

As the amount of on-line text keeps growing, it becomes increasingly difficult for humans to process the deluge of information in the time available. We need automatic text processing systems to help us scan through huge volumes of text, route them to relevant parties, filter them into prespecified categories, or even summarize them. To achieve this, one crucial step is to identify the major topics of the texts, since summarization, text routing, and similar tasks all depend on knowing the topics. In this research, we investigated several topic identification methods and developed three major results:

1. We extended existing word-based frequency counting methods to form a new concept-based frequency method based on the assumption 'the more a concept is mentioned in a text, the more important it is.' We used the knowledge base WordNet to generalize words into concepts and showed how to select concepts of the appropriate degree of generalization.

2. We studied patterns of word co-occurrence (topic signatures) consisting of sets of keywords that uniquely identify the topics of interest. We showed how to acquire keywords from texts pre-classified for each topic, using the $tf \ast idf$ measure. We also demonstrated how to identify topics using topic signatures, introduced confusion sets and multi-level topic signatures, and discussed the problems associated with multiple topics in a text.

3. We described, implemented, and evaluated a method to learn the Optimal Position Policy (OPP) for finding topic-rich sentences in texts. This work is based on the Position Hypothesis: in genres with fixed discourse structure, the (ordinal) position of a sentence is related to its importance in a text. We showed how to verify the Position Hypothesis using topic keywords, empirically identify important sentence positions in a genre or domain, and quantitatively evaluate the results with various measures.
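
The concept-based counting idea in result 1 can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the `PARENT` dictionary is a hypothetical toy taxonomy standing in for WordNet, and the dominance-ratio rule in `select_concept` (with its `ratio_threshold` parameter) is one simple heuristic for choosing a level of generalization, not necessarily the procedure used in the thesis.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet hypernym links.
PARENT = {
    "apple": "fruit",
    "banana": "fruit",
    "pear": "fruit",
    "fruit": "food",
    "bread": "food",
    "food": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def concept_weights(words):
    """Credit each word occurrence to the word itself and to all of its ancestors."""
    weights = Counter()
    for word in words:
        node = word
        while node is not None:
            weights[node] += 1
            node = PARENT.get(node)
    return weights


def children(concept):
    """Return the direct children of a concept in the toy taxonomy."""
    return [child for child, parent in PARENT.items() if parent == concept]


def select_concept(weights, root="entity", ratio_threshold=0.7):
    """Descend from the root while one child holds most of the children's weight.

    Assumed heuristic: stop at the first concept whose weight is spread
    across several children rather than concentrated in a single one.
    """
    node = root
    while True:
        kids = [k for k in children(node) if weights[k] > 0]
        if not kids:
            return node  # reached a leaf (or a concept with no mentioned children)
        total = sum(weights[k] for k in kids)
        best = max(kids, key=lambda k: weights[k])
        if weights[best] / total < ratio_threshold:
            return node  # no single child dominates: generalize to this concept
        node = best


if __name__ == "__main__":
    text = ["apple", "banana", "pear", "apple", "car"]
    print(select_concept(concept_weights(text)))  # prints "fruit"
```

In this toy input no single fruit dominates, so the sketch settles on the more general concept "fruit" rather than on "apple" or on the overly broad "food".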
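
The keyword acquisition step in result 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's implementation: all texts of a topic are pooled into one bag of words, `tf` is the pooled count, `idf` is `log(number of topics / number of topics containing the term)`, and the function names `build_signatures` and `identify_topic` and the `top_k` cutoff are hypothetical.

```python
import math
from collections import Counter


def build_signatures(topic_texts, top_k=3):
    """Build a keyword signature per topic from pre-classified, tokenized texts.

    topic_texts maps a topic name to a list of token lists.
    Returns a dict: topic -> {term: tf*idf weight}.
    """
    # Term frequency per topic, pooling all texts labelled with that topic.
    tf = {
        topic: Counter(token for text in texts for token in text)
        for topic, texts in topic_texts.items()
    }
    n_topics = len(tf)
    # Document frequency: in how many topics does each term occur?
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    signatures = {}
    for topic, counts in tf.items():
        scored = {w: c * math.log(n_topics / df[w]) for w, c in counts.items()}
        # Highest weight first; ties broken alphabetically for determinism.
        top = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        # Terms found in every topic get weight 0 and are dropped.
        signatures[topic] = {w: s for w, s in top if s > 0}
    return signatures


def identify_topic(tokens, signatures):
    """Return the topic whose signature terms carry the most weight in the text."""
    counts = Counter(tokens)
    scores = {
        topic: sum(weight * counts[term] for term, weight in sig.items())
        for topic, sig in signatures.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    training = {
        "finance": [["bank", "loan", "rate", "the"], ["bank", "rate", "market", "the"]],
        "sports": [["team", "goal", "match", "the"], ["team", "score", "the"]],
    }
    sigs = build_signatures(training)
    print(identify_topic(["the", "bank", "raised", "the", "rate"], sigs))  # prints "finance"
```

A word such as "the" occurs under every topic, so its `idf` is zero and it never enters a signature. A text that matches several signatures about equally is the multiple-topic case the abstract mentions, which this sketch does not handle.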
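
The position-based idea in result 3 can be sketched in the same spirit. The scoring below is an assumption made for illustration: each sentence position is scored by the average fraction of a document's topic keywords that the sentence at that position contains, and positions are then ranked by that score. The names `position_yields` and `optimal_position_policy` are hypothetical, and the thesis's actual measures may differ.

```python
from collections import defaultdict


def position_yields(documents):
    """Average keyword coverage per ordinal sentence position.

    documents is a list of (sentences, keywords) pairs, where sentences is a
    list of token lists and keywords is the document's topic keyword list.
    Returns a dict: 1-based position -> mean fraction of keywords covered.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sentences, keywords in documents:
        keyword_set = set(keywords)
        if not keyword_set:
            continue  # a document without keywords gives no evidence
        for position, sentence in enumerate(sentences, start=1):
            covered = len(keyword_set & set(sentence))
            totals[position] += covered / len(keyword_set)
            counts[position] += 1
    return {p: totals[p] / counts[p] for p in totals}


def optimal_position_policy(documents):
    """Rank sentence positions from most to least topic-rich."""
    yields = position_yields(documents)
    # Highest yield first; earlier positions win ties.
    return sorted(yields, key=lambda p: (-yields[p], p))


if __name__ == "__main__":
    corpus = [
        (
            [["stocks", "fell", "sharply"],
             ["traders", "were", "calm"],
             ["stocks", "and", "bonds", "fell"]],
            ["stocks", "fell"],
        ),
        (
            [["rain", "flooded", "roads"],
             ["officials", "closed", "roads"]],
            ["rain", "roads"],
        ),
    ]
    print(optimal_position_policy(corpus))  # prints [1, 3, 2]
```

Positions that occur in only a few documents (position 3 here appears once) produce noisy averages, so a realistic corpus would need many documents from the same genre before the ranking means much.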

Recommendations

Automated topic naming

Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). ...
Topic sentiment change analysis
MLDM'11: Proceedings of the 7th international conference on Machine learning and data mining in pattern recognition

Public opinions on a topic may change over time. Topic Sentiment change analysis is a new research problem consisting of two main components: (a) mining opinions on a certain topic, and (b) detect significant changes of sentiment of the opinions on the ...
Topic analysis for topic-focused multi-document summarization
CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management

Topic-focused multi-document summarization has been a challenging task because the created summary is required to be biased to the given topic or query. Existing methods consider the given topic as a single coarse unit and then directly incorporate the ...
