DOI: 10.5555/1626355.1626368
Human evaluation of machine translation through binary system comparisons

Published: 23 June 2007

ABSTRACT

We introduce a novel evaluation scheme for the human evaluation of different machine translation systems. Our method is based on direct comparison of two sentences at a time by human judges. These binary judgments are then used to decide between all possible rankings of the systems. The advantages of this new method are the lower dependency on extensive evaluation guidelines, and a tighter focus on a typical evaluation task, namely the ranking of systems. Furthermore we argue that machine translation evaluations should be regarded as statistical processes, both for human and automatic evaluation. We show how confidence ranges for state-of-the-art evaluation measures such as WER and TER can be computed accurately and efficiently without having to resort to Monte Carlo estimates. We give an example of our new evaluation scheme, as well as a comparison with classical automatic and human evaluation on data from a recent international evaluation campaign.
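The core idea of the abstract — collecting binary preference judgments between pairs of systems and turning them into a ranking — can be illustrated with a minimal sketch. This is not the paper's actual procedure (which chooses among all possible rankings and draws on comparison-efficient sorting); it is a simplified round-robin tally over hypothetical system names and judgment counts, assumed here for illustration only.

```python
from itertools import combinations

def rank_systems(judgments):
    """Rank MT systems from binary pairwise human judgments.

    `judgments` maps an ordered pair (a, b) of system names to the
    number of times judges preferred a's translation over b's.
    """
    systems = sorted({s for pair in judgments for s in pair})
    wins = {s: 0 for s in systems}
    # Each unordered pair of systems is one "contest": the system
    # preferred more often by the judges wins the contest.
    for a, b in combinations(systems, 2):
        a_pref = judgments.get((a, b), 0)
        b_pref = judgments.get((b, a), 0)
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1
    # Order systems by the number of pairwise contests won.
    return sorted(systems, key=lambda s: wins[s], reverse=True)

# Hypothetical judgment counts for three systems.
judgments = {
    ("sys_a", "sys_b"): 60, ("sys_b", "sys_a"): 40,
    ("sys_a", "sys_c"): 55, ("sys_c", "sys_a"): 45,
    ("sys_b", "sys_c"): 30, ("sys_c", "sys_b"): 70,
}
print(rank_systems(judgments))  # -> ['sys_a', 'sys_c', 'sys_b']
```

Note that a simple win tally like this can produce ties or mask intransitive preference cycles; handling those cases is precisely where a principled scheme over all possible rankings, as the abstract describes, goes beyond this sketch.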


  • Published in

    StatMT '07: Proceedings of the Second Workshop on Statistical Machine Translation
    June 2007
    281 pages

    Publisher

    Association for Computational Linguistics

    United States


    Qualifiers

    • research-article

    Acceptance Rates

    StatMT '07 paper acceptance rate: 12 of 38 submissions, 32%. Overall acceptance rate: 24 of 59 submissions, 41%.
