ABSTRACT
Manual annotation of natural language to capture linguistic information is essential for NLP tasks that rely on supervised machine learning of semantic knowledge. Judgements of meaning can be more or less subjective; in such cases, instead of a single correct label, the labels assigned may vary across annotators depending on their knowledge, age, gender, intuitions, background, and so on. We introduce a framework, Anveshan, in which we investigate annotator behavior to find outliers, cluster annotators by behavior, and identify confusable labels. We also investigate the effectiveness of using trained annotators versus a larger number of untrained annotators on a word sense annotation task. The annotation data come from a word sense disambiguation task for polysemous words, annotated both by trained annotators and by untrained annotators from Amazon's Mechanical Turk. Our results show that Anveshan is effective in uncovering patterns in annotator behavior, and that trained annotators are superior to a larger number of untrained annotators for this task.
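The abstract does not detail Anveshan's internals, but one of the analyses it names, finding outlier annotators, can be illustrated with a minimal sketch. The sketch below (all function names and the 0.4 threshold are hypothetical, not taken from the paper) computes pairwise Cohen's kappa between annotators who labeled the same items and flags any annotator whose mean agreement with the others falls below the threshold.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's (1960) kappa for two annotators' label sequences over the same items."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under chance, from each annotator's label distribution.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

def outlier_annotators(labels, threshold=0.4):
    """Flag annotators whose mean pairwise kappa with all others is below threshold.

    `labels` maps an annotator id to that annotator's label sequence;
    all sequences cover the same items in the same order.
    """
    ids = list(labels)
    mean_kappa = {}
    for i in ids:
        ks = [cohen_kappa(labels[i], labels[j]) for j in ids if j != i]
        mean_kappa[i] = sum(ks) / len(ks)
    return [i for i in ids if mean_kappa[i] < threshold], mean_kappa

# Toy usage: three annotators agree perfectly, a fourth labels at chance.
labels = {
    "a1": ["s1", "s1", "s2", "s2", "s1", "s2"],
    "a2": ["s1", "s1", "s2", "s2", "s1", "s2"],
    "a3": ["s1", "s1", "s2", "s2", "s1", "s2"],
    "a4": ["s1", "s1", "s2", "s1", "s2", "s1"],
}
outliers, means = outlier_annotators(labels)
```

The same pairwise-kappa matrix could also serve as a similarity measure for clustering annotators by behavior, and a label-by-label confusion count across annotator pairs would surface confusable senses.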