ABSTRACT
This paper presents a new approach to term extraction that uses minimal resources. For term candidate extraction, the proposed algorithm identifies features of term delimiters, which are relatively stable and domain-independent, rather than features of the terms themselves. For term verification, a link-analysis-based method calculates the relevance between term candidates and the sentences in the domain-specific corpus from which the candidates are extracted. The approach requires no prior domain knowledge, no general corpora, and no full segmentation, and it needs only minimal adaptation for new domains. Consequently, the method can be applied to any domain corpus and is especially useful for resource-limited domains. Evaluations of Chinese term extraction on two different domains show significant improvements over existing techniques and also confirm the efficiency and relative domain independence of the approach. Experiments on new term extraction further indicate that the approach is effective at identifying new terms in a domain, which makes it useful for updating domain knowledge.
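The two steps described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm, whose details the abstract does not give. It assumes that candidates are simply the character strings left between delimiters, and that "link analysis" means a HITS-style mutual reinforcement between candidates and the sentences containing them. The function names, the delimiter list, and the toy sentences are all hypothetical.

```python
import math


def extract_candidates(sentence, delimiters):
    """Split a sentence on delimiter strings; the pieces left over are term candidates."""
    pieces = [sentence]
    # Try longer delimiters first so a short delimiter does not break up a longer one.
    for d in sorted(delimiters, key=len, reverse=True):
        pieces = [part for p in pieces for part in p.split(d)]
    return [p for p in pieces if p]  # drop empty pieces


def score_candidates(sentences, delimiters, iterations=20):
    """HITS-style scoring (an assumption): candidates and sentences reinforce each other."""
    cands_per_sentence = [set(extract_candidates(s, delimiters)) for s in sentences]
    candidates = sorted(set().union(*cands_per_sentence))
    cand_score = {c: 1.0 for c in candidates}
    for _ in range(iterations):
        # A sentence scores highly if it contains highly scored candidates.
        sent_score = [sum(cand_score[c] for c in cs) for cs in cands_per_sentence]
        norm = math.sqrt(sum(x * x for x in sent_score)) or 1.0
        sent_score = [x / norm for x in sent_score]
        # A candidate scores highly if it appears in highly scored sentences.
        updated = {c: 0.0 for c in candidates}
        for cs, s in zip(cands_per_sentence, sent_score):
            for c in cs:
                updated[c] += s
        norm = math.sqrt(sum(v * v for v in updated.values())) or 1.0
        cand_score = {c: v / norm for c, v in updated.items()}
    return cand_score


# Hypothetical delimiter list and toy corpus, for illustration only.
DELIMITERS = ["的", "是", "和"]
SENTENCES = ["术语抽取是信息检索的基础", "信息检索和术语抽取", "基础的方法"]

scores = score_candidates(SENTENCES, DELIMITERS)
for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(value, 3))
```

In this toy corpus, candidates that occur in several well-connected sentences rank above a candidate that occurs only once. A real system would have to learn or curate the delimiter list and filter the candidates, neither of which the sketch attempts.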