ABSTRACT
Corpus-based Natural Language Processing (NLP) tasks for widely studied languages such as English and French have been well explored, with satisfactory results. In contrast, corpus-based NLP for less-studied languages (e.g. Vietnamese) is at a deadlock because annotated training data for these languages is absent. Furthermore, hand-annotation of even reasonably well-defined features such as part-of-speech (POS) tags has proved labor-intensive and costly. In this paper, we suggest a solution that partially overcomes the shortage of annotated resources for Vietnamese: a POS-tagger for an automatically word-aligned English-Vietnamese parallel Corpus (named EVC). The POS-tagger uses the Transformation-Based Learning (TBL) method to bootstrap the POS-annotation results of an English POS-tagger by exploiting the POS information of the corresponding Vietnamese words, reached through their word alignments in EVC. We then project the POS annotations directly from the English side to the Vietnamese side via the available word alignments. The resulting POS-annotated Vietnamese corpus will be manually corrected to serve as annotated training data for Vietnamese NLP tasks such as POS tagging, phrase chunking, parsing, and word-sense disambiguation.
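The direct projection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_pos_tags`, the representation of alignments as `(english_index, vietnamese_index)` pairs, the `UNK` placeholder for unaligned words, and the Penn Treebank-style tags in the example are all assumptions made for illustration.

```python
from typing import List, Tuple


def project_pos_tags(
    english_tags: List[str],
    vietnamese_words: List[str],
    alignment: List[Tuple[int, int]],
    unknown_tag: str = "UNK",
) -> List[str]:
    """Copy each English POS tag onto the Vietnamese word aligned with it.

    alignment holds (english_index, vietnamese_index) pairs (assumed format).
    Vietnamese words with no alignment link keep the placeholder unknown_tag.
    """
    projected = [unknown_tag] * len(vietnamese_words)
    for en_idx, vi_idx in alignment:
        # Keep the first tag that reaches a Vietnamese word;
        # later links to the same word do not overwrite it.
        if projected[vi_idx] == unknown_tag:
            projected[vi_idx] = english_tags[en_idx]
    return projected


# Hypothetical sentence pair with a one-to-one alignment.
english_tags = ["PRP", "VBP", "NNS"]          # I / read / books
vietnamese_words = ["Tôi", "đọc", "sách"]
alignment = [(0, 0), (1, 1), (2, 2)]

tags = project_pos_tags(english_tags, vietnamese_words, alignment)
print(list(zip(vietnamese_words, tags)))
```

Unaligned Vietnamese words stay marked as `UNK` in this sketch, which makes them easy to find during the manual correction pass that the abstract mentions.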
POS-tagger for English-Vietnamese bilingual corpus