DOI: 10.5555/2145432.2145538
research-article

Enhancing Chinese word segmentation using unlabeled data

Published: 27 July 2011

ABSTRACT

This paper investigates how unlabeled data can improve the accuracy of supervised word segmentation. We consider both large-scale in-domain data and small-scale document text, and present a unified way to incorporate features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve recall for out-of-vocabulary (OOV) words that appear more than once within a document. The novel features yield relative error reductions of 13.8% in F-score and 15.4% in OOV recall.
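The abstract does not say which document-level features are used, so the following is only an illustrative sketch of the general idea: character strings that recur within the document being segmented are plausible word candidates, and their counts can be exposed to a character-based tagger as features. The function names `repeated_ngrams` and `recurrence_feature`, the n-gram length limit, and the count cap are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def repeated_ngrams(document: str, max_len: int = 4, min_count: int = 2) -> Counter:
    """Count character n-grams (length 2..max_len) that recur in one document.

    Hypothetical helper: n-grams are collected per line so that they never
    span a line break, and only those seen at least min_count times are kept.
    """
    counts = Counter()
    for sentence in document.splitlines():
        for n in range(2, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


def recurrence_feature(sentence: str, start: int, n: int,
                       repeated: Counter, cap: int = 3) -> int:
    """Return a capped in-document count for the n-gram starting at `start`.

    Hypothetical feature: 0 means the string does not recur in the document;
    larger values mean it recurs more often (capped to keep the feature
    space small).
    """
    gram = sentence[start:start + n]
    return min(repeated.get(gram, 0), cap)
```

Under these assumptions, a name that is missing from the training lexicon but appears in several sentences of the same document receives a non-zero recurrence value, which a discriminative model could learn to associate with word boundaries.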


Published in

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011
1647 pages
ISBN: 9781937284114

Publisher: Association for Computational Linguistics, United States

Acceptance Rates

Overall acceptance rate: 73 of 234 submissions, 31%
