ABSTRACT
This paper addresses two remaining challenges in Chinese word segmentation. The challenge in human language technology (HLT) is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CB's) into either word-boundaries (WB's) or non-word-boundaries. In Chinese, CB's are clearly delimited: each one lies between two adjacent characters. Hence we can use the distributional properties of CB's among the background character strings to predict which CB's are WB's.
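The reformulation above can be made concrete with a minimal Python sketch. It only illustrates the representation, not the classifier: a string of n characters has n - 1 CB's, and a segmentation is a binary label on each one. The function names, the example sentence, and the `window` parameter are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple


def words_to_boundaries(words: List[str]) -> Tuple[str, List[bool]]:
    """Turn a segmented sentence into (text, labels).

    labels[i] is True when the character-boundary between text[i] and
    text[i + 1] is a word-boundary, and False otherwise.
    """
    words = [w for w in words if w]          # ignore empty tokens
    text = "".join(words)
    labels = [False] * max(len(text) - 1, 0)
    pos = 0
    for word in words[:-1]:                  # no boundary after the last word
        pos += len(word)
        labels[pos - 1] = True
    return text, labels


def boundaries_to_words(text: str, labels: List[bool]) -> List[str]:
    """Inverse mapping: rebuild the word list from per-boundary labels."""
    if len(labels) != max(len(text) - 1, 0):
        raise ValueError("expected exactly one label per character-boundary")
    words, start = [], 0
    for i, is_wb in enumerate(labels):
        if is_wb:
            words.append(text[start:i + 1])
            start = i + 1
    if text:
        words.append(text[start:])
    return words


def boundary_contexts(text: str, window: int = 1) -> List[Tuple[str, str]]:
    """Characters to the left and right of each character-boundary.

    Hypothetical feature extractor: a classifier could use these contexts
    as distributional evidence for or against a word-boundary.
    """
    return [
        (text[max(0, i - window + 1):i + 1], text[i + 1:i + 1 + window])
        for i in range(len(text) - 1)
    ]


if __name__ == "__main__":
    text, labels = words_to_boundaries(["我们", "喜欢", "语言学"])
    print(text, labels)
    print(boundary_contexts(text))
    print(boundaries_to_words(text, labels))
```

Under this representation, any binary classifier that scores each CB from its surrounding characters yields a segmenter, and no lexicon lookup is needed at prediction time.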
- Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification