ABSTRACT
This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.
- Haodi Feng, Kang Chen, Xiaotie Deng, and Weimin Zheng. 2004. Accessor variety criteria for Chinese word extraction. Comput. Linguist., 30:75--93. Google ScholarDigital Library
- Wenbin Jiang, Liang Huang, and Qun Liu. 2009. Automatic adaptation of annotation standards: Chinese word segmentation and pos tagging -- a case study. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 522--530. Association for Computational Linguistics, Suntec, Singapore. Google ScholarDigital Library
- Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, pages 595--603. Association for Computational Linguistics, Columbus, Ohio.Google Scholar
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282--289. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Google ScholarDigital Library
- Zhongguo Li and Maosong Sun. 2009. Punctuation as implicit annotations for Chinese word segmentation. Comput. Linguist., 35:505--512. Google ScholarDigital Library
- Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Daniel Marcu Susan Dumais and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 337--342. Association for Computational Linguistics, Boston, Massachusetts, USA.Google Scholar
- Naoaki Okazaki. 2007. Crfsuite: a fast implementation of conditional random fields (crfs).Google Scholar
- Valentin I. Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from mark-up: Hypertext annotations for guided parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1278--1287. Association for Computational Linguistics, Uppsala, Sweden. Google ScholarDigital Library
- Weiwei Sun. 2010. Word-based and character-based word segmentation models: Comparison and combination. In Coling 2010: Posters, pages 1211--1219. Coling 2010 Organizing Committee, Beijing, China. Google ScholarDigital Library
- Weiwei Sun. 2011. A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of the ACL 2011 Conference. Association for Computational Linguistics, Portland, Oregon, United States. Google ScholarDigital Library
- Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. 2009. A discriminative latent variable Chinese segmenter with hybrid word/character information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 56--64. Association for Computational Linguistics, Boulder, Colorado. Google ScholarDigital Library
- Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter. In In Fourth SIGHAN Workshop on Chinese Language Processing.Google Scholar
- Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384--394. Association for Computational Linguistics, Uppsala, Sweden. Google ScholarDigital Library
- Jia Xu, Jianfeng Gao, Kristina Toutanova, and Hermann Ney. 2008. Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1017--1024. Coling 2008 Organizing Committee, Manchester, UK. Google ScholarDigital Library
- Nianwen Xue. 2003. Chinese word segmentation as character tagging. In International Journal of Computational Linguistics and Chinese Language Processing.Google Scholar
- Enhancing Chinese word segmentation using unlabeled data
Recommendations
Chinese word sense disambiguation using hownet
ICNC'05: Proceedings of the First international conference on Advances in Natural Computation - Volume Part IWord sense disambiguation plays an important role in natural language processing, such as information retrieval, text summarization, machine translation etc. This paper proposes a corpus-based Chinese word sense disambiguation approach using HowNet. The ...
Splitting-merging model of Chinese word tokenization and segmentation
Currently, word tokenization and segmentation are still a hot topic in natural language processing, especially for languages like Chinese in which there is no blank space for word delimitation. Three major problems are faced: (1) tokenizing direction ...
Subword-based tagging for confidence-dependent Chinese word segmentation
COLING-ACL '06: Proceedings of the COLING/ACL on Main conference poster sessionsWe proposed a subword-based tagging for Chinese word segmentation to improve the existing character-based tagging. The subword-based tagging was implemented using the maximum entropy (MaxEnt) and the conditional random fields (CRF) methods. We found ...
Comments