DOI: 10.5555/1613715.1613751
research-article

Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks

Published: 25 October 2008

ABSTRACT

Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out with this method at a fraction of the usual expense.
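
As a rough illustration of what "agreement with gold standard labels" can mean for a categorical task, the sketch below combines several non-expert labels per item and compares the result with an expert label. The abstract does not say how labels are combined or scored, so the majority-vote rule, the function names, and the toy data here are all assumptions made for illustration, not the paper's procedure.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label in a list (ties are not handled specially)."""
    return Counter(labels).most_common(1)[0][0]


def agreement_with_gold(annotations, gold):
    """Fraction of items whose majority-vote label equals the gold label."""
    matches = sum(majority_vote(annotations[item]) == gold[item] for item in gold)
    return matches / len(gold)


# Hypothetical toy data: three non-expert labels per item, one expert label each.
annotations = {
    "pair-1": ["true", "true", "false"],
    "pair-2": ["false", "false", "false"],
    "pair-3": ["true", "false", "false"],
    "pair-4": ["true", "true", "true"],
}
gold = {"pair-1": "true", "pair-2": "false", "pair-3": "true", "pair-4": "true"}

print(agreement_with_gold(annotations, gold))  # 0.75
```

Using an odd number of labels per item on a binary task avoids ties; a task with numeric ratings, such as word similarity, would more plausibly average the ratings and report a correlation instead.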


Published in

EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing
October 2008
1129 pages

Publisher

Association for Computational Linguistics

United States


Acceptance Rates

Overall Acceptance Rate: 73 of 234 submissions, 31%
