DOI: 10.5555/1626355.1626368
Human evaluation of machine translation through binary system comparisons

Published: 23 June 2007

ABSTRACT

We introduce a novel evaluation scheme for the human evaluation of different machine translation systems. Our method is based on direct comparison of two sentences at a time by human judges. These binary judgments are then used to decide between all possible rankings of the systems. The advantages of this new method are the lower dependency on extensive evaluation guidelines, and a tighter focus on a typical evaluation task, namely the ranking of systems. Furthermore we argue that machine translation evaluations should be regarded as statistical processes, both for human and automatic evaluation. We show how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without having to resort to Monte Carlo estimates. We give an example of our new evaluation scheme, as well as a comparison with classical automatic and human evaluation on data from a recent international evaluation campaign.
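The core idea of the abstract — collecting binary preference judgments between pairs of systems and turning them into a ranking — can be illustrated with a minimal sketch. This is not the paper's actual procedure (which chooses among all possible rankings and draws on comparison-efficient sorting); it is a simplified round-robin tally over hypothetical system names and judgment counts, assumed here for illustration only.

```python
from itertools import combinations

def rank_systems(judgments):
    """Rank MT systems from binary pairwise human judgments.

    `judgments` maps an ordered pair (a, b) of system names to the
    number of times judges preferred a's translation over b's.
    """
    systems = sorted({s for pair in judgments for s in pair})
    wins = {s: 0 for s in systems}
    # Each unordered pair of systems is one "contest": the system
    # preferred more often by the judges wins the contest.
    for a, b in combinations(systems, 2):
        a_pref = judgments.get((a, b), 0)
        b_pref = judgments.get((b, a), 0)
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    # Order systems by the number of pairwise contests won.
    return sorted(systems, key=lambda s: wins[s], reverse=True)

# Hypothetical judgment counts for three systems.
judgments = {
    ("sys_a", "sys_b"): 60, ("sys_b", "sys_a"): 40,
    ("sys_a", "sys_c"): 55, ("sys_c", "sys_a"): 45,
    ("sys_b", "sys_c"): 30, ("sys_c", "sys_b"): 70,
}
print(rank_systems(judgments))  # -> ['sys_a', 'sys_c', 'sys_b']
```

Note that a simple win tally like this can produce ties or mask intransitive preference cycles; handling those cases is precisely where a principled scheme over all possible rankings, as the abstract describes, goes beyond this sketch.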


  • Published in

    StatMT '07: Proceedings of the Second Workshop on Statistical Machine Translation
    June 2007
    281 pages

    Publisher

    Association for Computational Linguistics

    United States


    Qualifiers

    • research-article

    Acceptance Rates

    StatMT '07 paper acceptance rate: 12 of 38 submissions, 32%. Overall acceptance rate: 24 of 59 submissions, 41%.
