ABSTRACT
In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, many of the bilingual language resources available on the web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents.
The common method is to break the documents into segments using predefined anchor words and then align these segments. This approach is not error-free, and incorrect alignments may decrease the translation quality.
We present an alternative approach to extract the parallel sentences by partitioning a bilingual document into two pairs. This process is performed recursively until all the sub-pairs are short enough.
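The recursive partitioning described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how a split point is chosen, so the lexicon-overlap score (`toy_score`), the length threshold `max_len`, and the restriction to monotone splits (left part with left part, right part with right part) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

Tokens = List[str]
Pair = Tuple[Tokens, Tokens]


def toy_score(src: Tokens, tgt: Tokens, lexicon: Dict[str, Set[str]]) -> int:
    """Hypothetical score: number of source tokens that have at least one
    lexicon translation present in the target segment."""
    tgt_set = set(tgt)
    return sum(1 for s in src if lexicon.get(s, set()) & tgt_set)


def partition(src: Tokens, tgt: Tokens,
              score: Callable[[Tokens, Tokens], int],
              max_len: int = 4) -> List[Pair]:
    """Recursively split a (source, target) pair into two sub-pairs
    until every sub-pair is short enough."""
    # Stop when both sides are short enough (assumed stopping rule).
    if len(src) <= max_len and len(tgt) <= max_len:
        return [(src, tgt)]
    # A side with fewer than two tokens cannot be split any further.
    if len(src) < 2 or len(tgt) < 2:
        return [(src, tgt)]
    # Try every monotone split point and keep the best-scoring one.
    best_score, best_i, best_j = -1, 1, 1
    for i in range(1, len(src)):
        for j in range(1, len(tgt)):
            s = score(src[:i], tgt[:j]) + score(src[i:], tgt[j:])
            if s > best_score:
                best_score, best_i, best_j = s, i, j
    left = partition(src[:best_i], tgt[:best_j], score, max_len)
    right = partition(src[best_i:], tgt[best_j:], score, max_len)
    return left + right


# Toy example with a made-up lexicon.
lexicon = {"a": {"A"}, "b": {"B"}, "c": {"C"}, "d": {"D"}}
pairs = partition(["a", "b", "c", "d"], ["A", "B", "C", "D"],
                  lambda s, t: toy_score(s, t, lexicon), max_len=2)
```

Because every split leaves both sides non-empty and strictly shorter, the recursion always terminates, and concatenating the resulting sub-pairs in order reproduces the original document pair.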
In experiments on the Chinese-English FBIS data, our method produced translation results comparable to those of a state-of-the-art sentence aligner. Combining the two approaches leads to better translation performance.
Partitioning parallel documents using binary segmentation