Enhancing Chinese word segmentation using unlabeled data

Authors:
Weiwei Sun

Saarland University, and German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany

Jia Xu

German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany


EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing, July 2011, pages 970–979

Published: 27 July 2011

ABSTRACT

This paper investigates how unlabeled data can improve the accuracy of supervised word segmentation. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system's recall for out-of-vocabulary (OOV) words that appear more than once within a document. The novel features yield relative error reductions of 13.8% in F-score and 15.4% in OOV recall.
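
The abstract does not say which string statistics are used or how they enter the model, so the following Python sketch is only an illustration under stated assumptions. It assumes raw bigram counts as the statistic, a log2 bucketing to turn counts into discrete features, a B/M/E/S character-tagging scheme, and a simple "repeated within the document" flag standing in for the document-level idea. All function names and feature names are hypothetical.

```python
import math
from collections import Counter


def to_bmes(words):
    """Convert a segmented sentence (list of words) into per-character tags.

    B/M/E mark the begin/middle/end of a multi-character word, S a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def ngram_counts(texts, max_n=2):
    """Count all character n-grams (n = 1..max_n) in a collection of raw strings."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def bucket(count):
    """Discretize a raw count into a small integer bucket (assumed log2 scheme)."""
    return 0 if count == 0 else int(math.log2(count)) + 1


def char_features(sentence, i, corpus_counts, doc_counts):
    """Build string features for the character at position i.

    corpus_counts: n-gram counts from a large unlabeled corpus (assumption).
    doc_counts: n-gram counts from the document being segmented (assumption).
    """
    feats = ["c0=" + sentence[i]]
    if i + 1 < len(sentence):
        bigram = sentence[i:i + 2]
        feats.append("c0c1=" + bigram)
        # Large-scale statistic: how frequent is this bigram in unlabeled text?
        feats.append("corpus_bigram_bucket=%d" % bucket(corpus_counts[bigram]))
        # Document-level statistic: does this bigram recur in the same document?
        feats.append("doc_bigram_repeated=%s" % (doc_counts[bigram] > 1))
    return feats
```

Feature lists of this kind could be paired with the tags from `to_bmes` and passed to any discriminative sequence labeler. Which learner, statistics, and feature templates the paper actually uses is not stated in the abstract.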

Published in
EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011, 1647 pages
ISBN: 9781937284114
General Chair: Paola Merlo, University of Geneva
Program Chairs: Regina Barzilay, Massachusetts Institute of Technology; Mark Johnson, Macquarie University
Publisher: Association for Computational Linguistics, United States
Qualifier: research-article
Overall acceptance rate: 73 of 234 submissions, 31%