As the amount of on-line text keeps growing, it becomes increasingly difficult for humans to process the deluge of information in the time available. We need automatic text processing systems to help us scan through huge volumes of text, route them to relevant parties, filter them into prespecified categories, or even summarize them. To achieve this, one crucial step is to identify the major topics of the texts, since summarization, text routing, and similar tasks centrally require knowing the topics. In this research, we investigated several topic identification methods and obtained three major results:
1. We extended existing word-based frequency counting methods to form a new concept-based frequency method based on the assumption 'the more a concept is mentioned in a text, the more important it is.' We used the knowledge base WordNet to generalize words into concepts and showed how to select concepts of the appropriate degree of generalization.
2. We studied patterns of word co-occurrence (topic signatures) consisting of sets of keywords that uniquely identify the topics of interest. We showed how to acquire keywords from texts pre-classified for each topic, using the $tf \ast idf$ measure. We also demonstrated how to identify topics using topic signatures, introduced confusion sets and multi-level topic signatures, and discussed the problems associated with multiple topics in a text.
3. We described, implemented, and evaluated a method to learn the Optimal Position Policy (OPP) for finding topic-rich sentences in texts. This work is based on the Position Hypothesis: in genres with fixed discourse structure, the (ordinal) position of a sentence is related to its importance in a text. We showed how to verify the Position Hypothesis using topic keywords, empirically identify important sentence positions in a genre or domain, and quantitatively evaluate the results with various measures.
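To make the concept-based counting in result 1 concrete, the following is a minimal sketch, not the implementation used in this research. It assumes a tiny hand-written hypernym map (`HYPERNYM`, hypothetical) in place of WordNet, and it credits every word occurrence to the word and to each of its ancestors. It does not attempt to select the appropriate degree of generalization, which the abstract mentions but does not specify.

```python
from collections import Counter

# Hypothetical toy hypernym map (child -> parent); a stand-in for WordNet.
HYPERNYM = {
    "apple": "fruit",
    "pear": "fruit",
    "fruit": "food",
    "bread": "food",
}

def concept_counts(words):
    """Credit each word occurrence to the word itself and to all its ancestors."""
    counts = Counter()
    for word in words:
        node = word
        while node is not None:
            counts[node] += 1
            node = HYPERNYM.get(node)  # None once the top of the hierarchy is reached
    return counts

counts = concept_counts(["apple", "pear", "apple", "bread"])
```

In this toy input no single word occurs more than twice, yet the concept "fruit" accumulates three mentions and "food" four, which is the effect that concept-level counting is meant to capture.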
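The keyword acquisition step in result 2 can be illustrated with a short sketch under stated assumptions. Here term frequency is counted over all texts pooled per topic, and inverse document frequency is taken as the logarithm of the number of topics divided by the number of topics containing the term. This is one common variant of the measure; the exact weighting used in this research is not given in the abstract. The function name `topic_signatures` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def topic_signatures(docs_by_topic, top_k=3):
    """Return the top_k highest-scoring terms for each topic."""
    # Term frequency: pool all texts of a topic into one bag of words.
    tf = {topic: Counter(w for doc in docs for w in doc.lower().split())
          for topic, docs in docs_by_topic.items()}
    # Document frequency: number of topics in which a term appears.
    df = Counter(term for counts in tf.values() for term in counts)
    n_topics = len(tf)
    signatures = {}
    for topic, counts in tf.items():
        scores = {t: c * math.log(n_topics / df[t]) for t, c in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        signatures[topic] = ranked[:top_k]
    return signatures

corpus = {
    "finance": ["the bank raised rates", "the bank cut rates"],
    "sports": ["the team won the match", "the team lost"],
}
sigs = topic_signatures(corpus, top_k=2)
```

Terms that occur in every topic, such as "the", receive a score of zero and drop out of the signatures, while terms concentrated in one topic rise to the top.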
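The idea behind result 3 can be sketched as follows, again as an illustration and not the evaluated method. The sketch assumes that the importance of a sentence position is estimated by the average fraction of a document's topic keywords found in the sentence at that position, and that each document comes with a non-empty keyword set. The function name `learn_position_policy` and the sample documents are hypothetical.

```python
from collections import defaultdict

def learn_position_policy(documents):
    """Rank sentence positions by average topic-keyword yield.

    documents: list of (sentences, keywords) pairs, where sentences is a
    list of strings and keywords is a non-empty set of lowercase keywords.
    """
    totals = defaultdict(float)
    seen = defaultdict(int)
    for sentences, keywords in documents:
        for position, sentence in enumerate(sentences, start=1):
            words = set(sentence.lower().split())
            # Yield: fraction of the document's keywords covered by this sentence.
            totals[position] += len(words & keywords) / len(keywords)
            seen[position] += 1
    average = {p: totals[p] / seen[p] for p in totals}
    # Best positions first; ties are broken by the earlier position.
    return sorted(average, key=lambda p: (-average[p], p))

docs = [
    (["stocks fell sharply today", "analysts were surprised", "trading resumes monday"],
     {"stocks", "fell"}),
    (["rain flooded the valley", "the rain continues", "officials met"],
     {"rain", "flooded"}),
]
policy = learn_position_policy(docs)
```

The resulting ordering of positions plays the role of a position policy for the genre: sentences would be selected by walking down the list until enough text has been collected.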