Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks

Authors:
Rion Snow, Stanford University, Stanford, CA
Brendan O'Connor, Dolores Labs, Inc., San Francisco, CA
Daniel Jurafsky, Stanford University, Stanford, CA
Andrew Y. Ng, Stanford University, Stanford, CA

EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, October 2008, pages 254–263

Published: 25 October 2008

ABSTRACT

Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out using this method at a fraction of the usual expense.
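
To make the notion of agreement with gold standard labels concrete, the following is a minimal sketch, not the authors' procedure. It assumes that the labels from several non-expert annotators are combined by simple majority vote (the abstract does not say how labels are aggregated) and measures the fraction of items on which the combined label matches the expert label. The item identifiers, labels, and function names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label in a list of non-expert labels."""
    return Counter(labels).most_common(1)[0][0]


def agreement_with_gold(annotations, gold):
    """Fraction of items whose majority-vote label equals the gold label."""
    matches = sum(majority_vote(annotations[item]) == gold[item] for item in gold)
    return matches / len(gold)


# Hypothetical toy data: item id -> labels from three non-expert annotators.
annotations = {
    "pair1": ["yes", "yes", "no"],
    "pair2": ["no", "no", "no"],
    "pair3": ["yes", "no", "no"],
    "pair4": ["yes", "yes", "yes"],
}
# Hypothetical expert (gold standard) labels for the same items.
gold = {"pair1": "yes", "pair2": "no", "pair3": "yes", "pair4": "yes"}

accuracy = agreement_with_gold(annotations, gold)
print(f"agreement with gold: {accuracy:.2f}")
```

Using an odd number of annotators per item, as in the toy data, avoids ties in the vote. Majority voting treats every annotator as equally reliable; a bias correction of the kind the abstract mentions would weight annotators differently, which this sketch does not attempt.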

Published in
EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing
October 2008, 1129 pages
Program Chairs: Mirella Lapata (University of Edinburgh), Hwee Tou Ng (National University of Singapore)
Publisher: Association for Computational Linguistics, United States
Qualifier: research-article
Overall Acceptance Rate: 73 of 234 submissions, 31%