DOI: 10.5555/1868720.1868726

Anveshan: a framework for analysis of multiple annotators' labeling behavior

Published: 15 July 2010

ABSTRACT

Manual annotation of natural language to capture linguistic information is essential for NLP tasks involving supervised machine learning of semantic knowledge. Judgements of meaning can be more or less subjective; in such cases, instead of a single correct label, the labels assigned may vary among annotators depending on their knowledge, age, gender, intuitions, background, and so on. We introduce Anveshan, a framework for investigating annotator behavior: finding outlier annotators, clustering annotators by their labeling behavior, and identifying confusable labels. We also investigate the effectiveness of using trained annotators versus a larger number of untrained annotators on a word sense annotation task. The annotation data comes from a word sense disambiguation task for polysemous words, annotated by both trained annotators and untrained annotators from Amazon's Mechanical Turk. Our results show that Anveshan is effective in uncovering patterns in annotator behavior, and that trained annotators are superior to a larger number of untrained annotators for this task.
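
The paper itself provides no code; the following is a minimal Python sketch, under our own assumptions, of the three kinds of analysis the abstract names: pairwise inter-annotator agreement, outlier detection, clustering of annotators, and a crude confusability count for label pairs. All function names and thresholds are hypothetical, and the sketch assumes NumPy, SciPy, and scikit-learn rather than whatever Anveshan actually uses internally.

```python
# Hypothetical sketch, not the authors' implementation.
# `labels` is an array of shape (n_items, n_annotators) of categorical codes.
from collections import Counter
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels):
    """Cohen's kappa between every pair of annotators."""
    n = labels.shape[1]
    kappa = np.eye(n)  # self-agreement is 1 by convention
    for i, j in combinations(range(n), 2):
        kappa[i, j] = kappa[j, i] = cohen_kappa_score(labels[:, i], labels[:, j])
    return kappa

def find_outliers(kappa, threshold=1.5):
    """Flag annotators whose mean agreement with the others falls more than
    `threshold` standard deviations below the overall mean (our heuristic)."""
    n = kappa.shape[0]
    mean_agreement = (kappa.sum(axis=1) - 1.0) / (n - 1)  # drop self-agreement
    cutoff = mean_agreement.mean() - threshold * mean_agreement.std()
    return np.where(mean_agreement < cutoff)[0]

def cluster_annotators(kappa, n_clusters=2):
    """Hierarchically cluster annotators, treating disagreement as distance."""
    distance = 1.0 - kappa
    condensed = distance[np.triu_indices_from(distance, k=1)]  # SciPy's format
    return fcluster(linkage(condensed, method="average"),
                    n_clusters, criterion="maxclust")

def confusable_labels(labels):
    """Count how often each pair of labels is assigned to the same item by
    different annotators; frequent pairs suggest confusable senses."""
    counts = Counter()
    for row in labels:
        for a, b in combinations(sorted(set(row)), 2):
            counts[(a, b)] += 1
    return counts.most_common()
```

For example, an annotator flagged by `find_outliers` agrees with the rest of the pool unusually rarely, while a label pair at the top of `confusable_labels` is one that annotators routinely split on for the same instance.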


  • Published in

    LAW IV '10: Proceedings of the Fourth Linguistic Annotation Workshop
    July 2010, 305 pages
    ISBN: 9781932432725
    Publisher: Association for Computational Linguistics, United States

  • Qualifiers

    research-article
