DOI: 10.5555/1654650.1654662

Partitioning parallel documents using binary segmentation

Published: 8 June 2006

ABSTRACT

In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, many of the bilingual language resources available on the web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents.

The common method is to break the documents into segments using predefined anchor words and then align these segments. This approach is not error-free, and incorrect alignments may decrease the translation quality.

We present an alternative approach that extracts the parallel sentences by partitioning a bilingual document pair into two sub-pairs. This process is performed recursively until all the sub-pairs are short enough.
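The recursive partitioning described above can be illustrated with a minimal Python sketch. The abstract does not say how a split point is chosen, so everything specific here is an assumption made for illustration: the `lexicon_scorer` helper, the toy word-pair lexicon, the `max_len` threshold, the restriction to monotone splits (left part with left part, right part with right part), and the tie-break toward the middle of each side. It is not the authors' implementation.

```python
from typing import Callable, List, Sequence, Set, Tuple

Pair = Tuple[List[str], List[str]]
Scorer = Callable[[Sequence[str], Sequence[str]], float]


def lexicon_scorer(lexicon: Set[Tuple[str, str]]) -> Scorer:
    """Hypothetical scorer: count target tokens that have a lexicon match
    somewhere in the source segment. A real system would use a statistical
    model here; this is only a stand-in."""
    def score(src: Sequence[str], tgt: Sequence[str]) -> float:
        return float(sum(1 for t in tgt if any((s, t) in lexicon for s in src)))
    return score


def binary_segment(src: Sequence[str], tgt: Sequence[str],
                   score: Scorer, max_len: int) -> List[Pair]:
    """Recursively split (src, tgt) into two sub-pairs until both sides of
    every sub-pair have at most max_len tokens."""
    short_enough = len(src) <= max_len and len(tgt) <= max_len
    # A side with fewer than 2 tokens cannot be split, so the pair is kept
    # as-is even if the other side is still longer than max_len.
    if short_enough or len(src) < 2 or len(tgt) < 2:
        return [(list(src), list(tgt))]

    best = None
    for i in range(1, len(src)):          # split point on the source side
        for j in range(1, len(tgt)):      # split point on the target side
            s = score(src[:i], tgt[:j]) + score(src[i:], tgt[j:])
            # Assumed tie-break: prefer split points near the middle.
            centred = -abs(2 * i - len(src)) - abs(2 * j - len(tgt))
            key = (s, centred)
            if best is None or key > best[0]:
                best = (key, i, j)

    _, i, j = best
    return (binary_segment(src[:i], tgt[:j], score, max_len)
            + binary_segment(src[i:], tgt[j:], score, max_len))


if __name__ == "__main__":
    toy_lexicon = {("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")}
    pairs = binary_segment(list("abcd"), list("ABCD"),
                           lexicon_scorer(toy_lexicon), max_len=2)
    for source_part, target_part in pairs:
        print(source_part, target_part)
```

Because every split leaves non-empty, strictly shorter parts on both sides, the recursion always terminates. The exhaustive search over split points is quadratic per level, which is acceptable for a sketch but would need pruning for long documents.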

In experiments on the Chinese-English FBIS data, our method was capable of producing translation results comparable to those of a state-of-the-art sentence aligner. Using a combination of the two approaches leads to better translation performance.

Published in

StatMT '06: Proceedings of the Workshop on Statistical Machine Translation
June 2006, 183 pages

Publisher: Association for Computational Linguistics, United States

Qualifiers: research-article

Overall acceptance rate: 24 of 59 submissions, 41%
