POS-tagger for English-Vietnamese bilingual corpus

Authors:
Dinh Dien, Vietnam National University of HCMC, HCM City, Vietnam
Hoang Kiem, Vietnam National University of HCMC, HCM City

HLT-NAACL-PARALLEL '03: Proceedings of the HLT-NAACL 2003 Workshop on Building and using parallel texts: data driven machine translation and beyond - Volume 3
May 2003, Pages 88–95
https://doi.org/10.3115/1118905.1118921

Published: 31 May 2003

ABSTRACT

Corpus-based Natural Language Processing (NLP) tasks for widely studied languages such as English and French have been well researched, with satisfactory results. In contrast, corpus-based NLP for less-studied languages such as Vietnamese has stalled because annotated training data for these languages is lacking. Furthermore, hand-annotating even reasonably well-defined features such as part-of-speech (POS) tags has proved labor-intensive and costly. In this paper, we propose a way to partially overcome the shortage of annotated resources for Vietnamese by building a POS-tagger for an automatically word-aligned English-Vietnamese parallel corpus (named EVC). The POS-tagger uses the Transformation-Based Learning (TBL) method to bootstrap the output of an English POS-tagger, exploiting the POS information of the corresponding Vietnamese words through their word alignments in EVC. We then project the POS annotations directly from the English side to the Vietnamese side via the available word alignments. The resulting POS-annotated Vietnamese corpus will be manually corrected to serve as annotated training data for Vietnamese NLP tasks such as POS tagging, phrase chunking, parsing, and word-sense disambiguation.
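
The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule format `(from_tag, to_tag, trigger)`, the coarse tag labels, and the toy sentence are all assumptions made for illustration, and a real TBL system learns its rules from data rather than receiving one by hand. The sketch only shows the general shape of (1) correcting an English tag using the POS hint of the aligned Vietnamese word and (2) copying English tags across word alignments.

```python
from typing import List, Optional, Tuple

Alignment = List[Tuple[int, int]]  # (english_index, vietnamese_index) pairs


def apply_rule(en_tags: List[str],
               vi_hints: List[Optional[str]],
               alignment: Alignment,
               rule: Tuple[str, str, str]) -> List[str]:
    """Apply one hypothetical TBL-style rule: retag an English token from
    `from_tag` to `to_tag` when its aligned Vietnamese word has `trigger`."""
    from_tag, to_tag, trigger = rule
    out = list(en_tags)  # do not mutate the caller's list
    for en_i, vi_j in alignment:
        if out[en_i] == from_tag and vi_hints[vi_j] == trigger:
            out[en_i] = to_tag
    return out


def project_tags(en_tags: List[str],
                 alignment: Alignment,
                 vi_len: int) -> List[Optional[str]]:
    """Copy each English tag onto the Vietnamese word it is aligned to.
    Unaligned Vietnamese words stay None; on conflicts the first wins."""
    vi_tags: List[Optional[str]] = [None] * vi_len
    for en_i, vi_j in alignment:
        if vi_tags[vi_j] is None:
            vi_tags[vi_j] = en_tags[en_i]
    return vi_tags


# Toy example (assumed data): "She books flights" with "books" mistagged.
en_tags = ["PRON", "NOUN", "NOUN"]      # output of an English tagger
vi_hints = ["PRON", "VERB", "NOUN"]     # dictionary POS of aligned Vietnamese words
alignment = [(0, 0), (1, 1), (2, 2)]    # one-to-one for simplicity

corrected = apply_rule(en_tags, vi_hints, alignment, ("NOUN", "VERB", "VERB"))
vi_tags = project_tags(corrected, alignment, vi_len=3)
```

In this toy case the rule fixes the tag of the second English token, and projection then carries the corrected tags to the Vietnamese side. The sketch assumes both languages share one coarse tagset and ignores one-to-many alignments, both of which a real system would have to handle.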

Published in
HLT-NAACL-PARALLEL '03 proceedings, May 2003, 124 pages
Publisher: Association for Computational Linguistics, United States
