ABSTRACT
We introduce a novel scheme for the human evaluation of different machine translation systems. Our method is based on the direct comparison of two sentences at a time by human judges. These binary judgments are then used to decide between all possible rankings of the systems. The advantages of this new method are a lower dependency on extensive evaluation guidelines and a tighter focus on a typical evaluation task, namely the ranking of systems. Furthermore, we argue that machine translation evaluations, both human and automatic, should be regarded as statistical processes. We show how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without resorting to Monte Carlo estimates. We give an example of our new evaluation scheme, as well as a comparison with classical automatic and human evaluation, on data from a recent international evaluation campaign.
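To make the binary-comparison idea concrete, the following sketch shows one plausible way such judgments could be turned into a system ranking. It is not the authors' exact procedure: the system names, the judgment data, and the use of a two-sided sign test to decide each pairwise comparison are all assumptions made for illustration.

```python
# Minimal sketch (hypothetical data, not the paper's exact method): aggregate
# binary sentence-level judgments into pairwise winners and a system ranking.
import math


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test via a normal approximation; ties are assumed
    to have been discarded before counting."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    # Under the null hypothesis both systems are equally good,
    # so wins_a follows Binomial(n, 0.5).
    mean, std = 0.5 * n, 0.5 * math.sqrt(n)
    z = abs(wins_a - mean) / std
    return math.erfc(z / math.sqrt(2.0))  # P(|Z| > z) for a standard normal


def rank_systems(judgments):
    """Order systems by their number of pairwise wins.

    judgments[(a, b)] is a list with +1 where system a was preferred
    and -1 where system b was preferred, one entry per judged sentence.
    """
    wins = {}
    for (a, b), votes in judgments.items():
        wins_a = sum(1 for v in votes if v > 0)
        wins_b = sum(1 for v in votes if v < 0)
        p = sign_test_p_value(wins_a, wins_b)
        winner = a if wins_a >= wins_b else b
        print(f"{a} vs {b}: {wins_a}-{wins_b}, winner {winner}, p = {p:.3f}")
        wins[a] = wins.get(a, 0) + int(wins_a > wins_b)
        wins[b] = wins.get(b, 0) + int(wins_b > wins_a)
    return sorted(wins, key=wins.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments for three systems on a handful of sentences.
    judgments = {
        ("sys1", "sys2"): [+1, +1, -1, +1, +1, -1, +1],
        ("sys1", "sys3"): [+1, -1, +1, +1, -1, +1, +1],
        ("sys2", "sys3"): [-1, +1, -1, -1, +1, -1, -1],
    }
    print("Ranking:", rank_systems(judgments))
```

Note that the judges here only answer "which of the two translations is better", which is why this style of evaluation needs far less guideline material than absolute adequacy or fluency scoring; the statistical treatment of the counts (here a simple sign test) is what turns those answers into a ranking decision.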