DOI: 10.3115/1118905.1118921

POS-tagger for English-Vietnamese bilingual corpus

Published: 31 May 2003

ABSTRACT

Corpus-based Natural Language Processing (NLP) tasks for well-resourced languages such as English and French have been well studied, with satisfactory results. In contrast, corpus-based NLP tasks for less-studied languages (e.g. Vietnamese) are at a deadlock because annotated training data for these languages is lacking. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved to be labor-intensive and costly. In this paper, we suggest a solution to partially overcome the shortage of annotated resources for Vietnamese by building a POS-tagger for an automatically word-aligned English-Vietnamese parallel corpus (named EVC). This POS-tagger uses the Transformation-Based Learning (TBL) method to bootstrap the POS-annotation results of the English POS-tagger by exploiting the POS information of the corresponding Vietnamese words via their word alignments in EVC. We then project the POS annotations directly from the English side to the Vietnamese side via the available word alignments. The resulting POS-annotated Vietnamese corpus will be manually corrected to serve as annotated training data for Vietnamese NLP tasks such as POS tagging, phrase chunking, parsing, and word-sense disambiguation.
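
The projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `project_pos_tags`, the representation of alignments as (English index, Vietnamese index) pairs, the Penn-style tag names, and the example sentence are all assumptions made for illustration. The sketch copies English tags unchanged, whereas a real system would probably need a mapping between the English and Vietnamese tagsets, and it leaves out the TBL correction step entirely.

```python
from typing import List, Optional, Tuple


def project_pos_tags(
    english_tags: List[str],
    vietnamese_tokens: List[str],
    alignments: List[Tuple[int, int]],
) -> List[Optional[str]]:
    """Copy each English POS tag to the Vietnamese token aligned with it.

    alignments holds (english_index, vietnamese_index) pairs (assumed format).
    Vietnamese tokens with no alignment keep the tag None, so they can be
    filled in later, for example during manual correction.
    """
    projected: List[Optional[str]] = [None] * len(vietnamese_tokens)
    for en_idx, vi_idx in alignments:
        # Keep the first tag projected onto a token; how to resolve
        # many-to-one alignments is a design choice left open here.
        if projected[vi_idx] is None:
            projected[vi_idx] = english_tags[en_idx]
    return projected


# Hypothetical sentence pair with a one-to-one word alignment.
english_tokens = ["I", "read", "books"]
english_tags = ["PRP", "VBP", "NNS"]
vietnamese_tokens = ["Tôi", "đọc", "sách"]
alignments = [(0, 0), (1, 1), (2, 2)]

tagged = list(
    zip(
        vietnamese_tokens,
        project_pos_tags(english_tags, vietnamese_tokens, alignments),
    )
)
print(tagged)
```

Unaligned Vietnamese tokens come out as `None`, which makes the gaps explicit for the manual correction pass that the abstract mentions.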

Published in: HLT-NAACL-PARALLEL '03: Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3

May 2003, 124 pages

Publisher: Association for Computational Linguistics, United States

Overall acceptance rate: 240 of 768 submissions, 31%
