DOI: 10.5555/1557769.1557791
research-article

Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification

Published: 25 June 2007

ABSTRACT

This paper addresses two remaining challenges in Chinese word segmentation. The challenge in HLT is to find a robust segmentation method that requires no prior lexical knowledge and no extensive training to adapt to new types of data. The challenge in modelling human cognition and acquisition is to segment words efficiently without using knowledge of wordhood. We propose a radical method of word segmentation to meet both challenges. The most critical concept that we introduce is that Chinese word segmentation is the classification of a string of character-boundaries (CBs) into either word-boundaries (WBs) or non-word-boundaries. In Chinese, CBs are delimited and distributed between two characters. Hence we can use the distributional properties of CBs among the background character strings to predict which CBs are WBs.
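
The boundary-classification view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`boundary_labels`, `segment`, `cb_features`), the toy background string, and the choice of unigram and bigram counts as "distributional properties" are all assumptions made for illustration, since the abstract does not specify the features used.

```python
from collections import Counter


def boundary_labels(words):
    """Return one label per character-boundary (CB): 1 = word-boundary (WB), 0 = not."""
    labels = []
    for i, word in enumerate(words):
        labels.extend([0] * (len(word) - 1))  # CBs inside a word
        if i < len(words) - 1:
            labels.append(1)                  # CB between two adjacent words
    return labels


def segment(chars, labels):
    """Rebuild the word list from a character string and its CB labels."""
    if not chars:
        return []
    words, current = [], chars[0]
    for ch, is_wb in zip(chars[1:], labels):
        if is_wb:
            words.append(current)
            current = ch
        else:
            current += ch
    words.append(current)
    return words


def cb_features(chars, i, unigrams, bigrams):
    """Hypothetical features for the CB between chars[i] and chars[i + 1]:
    count of the bigram spanning the CB, and counts of each neighbouring character."""
    left, right = chars[i], chars[i + 1]
    return (bigrams[left + right], unigrams[left], unigrams[right])


# Toy example (assumed data, not from the paper).
words = ["我们", "喜欢", "语言学"]
chars = "".join(words)
labels = boundary_labels(words)

# Counts taken from an unsegmented background string.
background = "我们喜欢语言学我们学习语言"
unigrams = Counter(background)
bigrams = Counter(background[k:k + 2] for k in range(len(background) - 1))
features = [cb_features(chars, i, unigrams, bigrams) for i in range(len(chars) - 1)]
```

A string of n characters has n - 1 CBs, so `labels` and `features` align one to one. Under these assumptions, any standard classifier could be trained on such feature and label pairs, and `segment` turns its predictions back into words.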

Published in

ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions
June 2007, 247 pages

Publisher: Association for Computational Linguistics, United States


Acceptance Rates

Overall acceptance rate: 85 of 443 submissions, 19%
