ABSTRACT
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentences, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% on a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points, the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
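To make the idea of aligning by token counts alone more concrete, the following is a minimal sketch of a length-based aligner built on dynamic programming. It is not the model described in the paper: the set of bead types, the penalty values in `BEAD_PENALTY`, the spread `SIGMA`, and the function names `bead_cost` and `align` are all illustrative assumptions, and anchor points are ignored entirely. The only input is the number of tokens in each sentence.

```python
import math

# Hypothetical bead types: (source sentences, target sentences) -> penalty.
# The penalty values are illustrative assumptions, not estimated parameters.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5, (2, 1): 2.5, (1, 2): 2.5}
SIGMA = 0.4  # assumed spread of the log length ratio for aligned text


def bead_cost(src_len, tgt_len, penalty):
    """Cost of pairing src_len source tokens with tgt_len target tokens."""
    if src_len == 0 or tgt_len == 0:
        return penalty  # unmatched sentence: pay the bead penalty only
    r = math.log(tgt_len / src_len)
    return penalty + (r * r) / (2 * SIGMA * SIGMA)


def align(src_lens, tgt_lens):
    """Return the cheapest bead sequence as (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable prefix pair by one bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), penalty)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace back from the end of both texts to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align([10, 12, 30], [23, 31])` with these assumed parameters groups the first two source sentences with the first target sentence and pairs the remaining two one-to-one. Because no words are compared, the work per cell is constant, which is the property that makes a length-only approach practical on very large corpora.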
- Aligning sentences in parallel corpora