note

Translating Low-Resource Languages by Vocabulary Adaptation from Close Counterparts

Authors:
Peyman Passban (ORCID: 0000-0002-5901-2132), ADAPT Centre, School of Computing, Dublin City University, Ireland
Qun Liu, ADAPT Centre, School of Computing, Dublin City University, Ireland
Andy Way, ADAPT Centre, School of Computing, Dublin City University, Ireland
ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 4, Article No. 29, pp. 1–14. https://doi.org/10.1145/3099556

Published: 08 September 2017

Abstract

Some natural languages belong to the same family or share similar syntactic and/or semantic regularities. This property motivates researchers to share computational models across languages and to use high-quality models to improve low-performing counterparts. In this article, we follow this idea and develop statistical and neural machine translation (MT) engines that are trained on one language pair but used to translate another. First, we train a reliable model for a high-resource language; then we exploit cross-lingual similarities to adapt the model to a closely related language with almost no resources. We chose Turkish (Tr) and Azeri, also known as Azerbaijani (Az), as the language pair for our experiments. Azeri suffers from a lack of resources, as almost no bilingual corpus exists for it. With our techniques, we are able to train an engine for the Az → English (En) direction that outperforms all other existing models.
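
The abstract does not spell out how the vocabulary is adapted, so the following is only a toy illustration of the general idea, not the authors' method. It assumes that an Azeri token can be replaced by its closest Turkish vocabulary entry under a string-similarity measure, so that a Turkish-trained engine sees mostly familiar words. The vocabulary list, the function names `adapt_token` and `adapt_sentence`, and the 0.6 cutoff are all hypothetical choices made for this sketch.

```python
from difflib import get_close_matches

# Hypothetical toy Turkish vocabulary; a real one would be taken from
# the Turkish side of the Tr-En training data.
TURKISH_VOCAB = ["ben", "kitap", "okuyorum", "geliyorum", "ev", "güzel"]


def adapt_token(az_token, tr_vocab, cutoff=0.6):
    """Return the most similar Turkish vocabulary entry for an Azeri token.

    If no entry reaches the similarity cutoff, the token is returned
    unchanged (it would then be out-of-vocabulary for the Turkish model).
    """
    matches = get_close_matches(az_token, tr_vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else az_token


def adapt_sentence(az_sentence, tr_vocab, cutoff=0.6):
    """Adapt a whitespace-tokenised Azeri sentence token by token."""
    return " ".join(adapt_token(tok, tr_vocab, cutoff) for tok in az_sentence.split())


if __name__ == "__main__":
    # The adapted sentence would then be passed to a Tr -> En engine.
    print(adapt_sentence("mən kitab oxuyuram", TURKISH_VOCAB))
```

In this sketch the Azeri word "kitab" is mapped to the Turkish "kitap", while a token with no sufficiently similar Turkish entry is passed through untouched. Surface string similarity is only a stand-in here; any real adaptation scheme would need to handle systematic spelling differences and morphology between the two languages.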

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 16, Issue 4
December 2017
146 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3097269
Editor:
Nianwen Xue
Brandeis University, Waltham, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 September 2017
- Accepted: 1 May 2017
- Revised: 1 March 2017
- Received: 1 November 2016
Author Tags
Statistical machine translation
low-resource languages
neural machine translation
Qualifiers
- note
- Research
- Refereed