DOI: 10.3115/981344.981366

Aligning sentences in parallel corpora

Published: 18 June 1991

ABSTRACT

In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
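
The idea of aligning on token counts alone can be illustrated with a small dynamic-programming sketch. This is not the model described in the paper, whose details are not given on this page. It is a minimal illustration under stated assumptions: the bead types and penalties in `BEADS`, the squared log-ratio in `length_cost`, and the function names are all hypothetical choices made for the example.

```python
import math

# Hypothetical bead types: (source sentences consumed, target sentences consumed)
# mapped to an illustrative fixed penalty. These values are assumptions made for
# this sketch, not parameters taken from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 2.0, (1, 2): 2.0}


def length_cost(src_tokens, tgt_tokens):
    """Assumed penalty for pairing two spans: squared log-ratio of token counts."""
    return math.log((1 + tgt_tokens) / (1 + src_tokens)) ** 2


def align(src_lens, tgt_lens):
    """Align sentences using only their token counts.

    Returns a list of (source indices, target indices) pairs, one per bead.
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]   # best[i][j]: cheapest cost to align prefixes
    back = [[None] * (m + 1) for _ in range(n + 1)]  # back[i][j]: predecessor cell on that path
    best[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = (best[i][j] + penalty
                        + length_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj])))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Token counts only; no words are inspected.
print(align([10, 20, 5], [11, 22, 6]))
print(align([10, 12, 30], [23, 31]))
```

In the second call the first two source sentences (10 and 12 tokens) are merged against a single 23-token target sentence, because that pairing fits the lengths far better than any one-to-one assignment. Anchor points, which the abstract mentions, could be handled in a sketch like this by running `align` separately on each segment between consecutive anchors.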

Published in

ACL '91: Proceedings of the 29th annual meeting on Association for Computational Linguistics
June 1991
373 pages

Publisher

Association for Computational Linguistics, United States

Acceptance Rates

Overall Acceptance Rate: 85 of 443 submissions, 19%
