ABSTRACT
We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work required from expert annotators, without resorting to preprocessing tools or existing annotated monolingual datasets. In line with recent work emphasizing the need for large-scale annotation efforts for textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote crowdsourcing as an effective way to reduce the cost of data collection without sacrificing quality. We show that a complex data creation task, on which even experts usually achieve low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained through a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned text-hypothesis pairs for each language combination of English, Italian, and German.
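To make the divide-and-conquer strategy concrete, the following Python sketch illustrates how such a pipeline of simple crowdsourced jobs could be organized. It is an assumption-laden illustration, not the paper's actual job design: the subtask wording, the `post_hit` helper, and the validation-by-majority step are hypothetical stand-ins for the real Mechanical Turk jobs and their aggregation. The key design choice the sketch tries to capture is that each job asks for a single, easily verifiable decision, so redundant non-expert judgments can substitute for expert annotation.

```python
# A minimal sketch (assumptions, not the authors' exact job design) of the
# divide-and-conquer idea described above: a complex corpus-creation task is
# split into simple subtasks, each routed to non-expert workers.
# `post_hit` is a hypothetical stand-in for publishing a job (HIT) on a
# crowdsourcing platform such as Amazon Mechanical Turk; here it only
# simulates worker answers so the sketch runs end to end.
from collections import Counter

def post_hit(instructions, item, n_workers=5):
    # Hypothetical placeholder: publish one simple job and collect the
    # answers of n_workers independent annotators.
    return [f"answer_{i}" for i in range(n_workers)]

def majority_vote(answers):
    # Aggregate redundant non-expert judgments into a single decision.
    return Counter(answers).most_common(1)[0][0]

def build_clte_pair(text, target_language="Italian"):
    # Subtask 1: ask a worker to write a sentence entailed by the text.
    hypothesis = post_hit("Write a sentence that must be true given this text.",
                          text, n_workers=1)[0]
    # Subtask 2: ask a worker to translate the hypothesis.
    translation = post_hit(f"Translate this sentence into {target_language}.",
                           hypothesis, n_workers=1)[0]
    # Subtask 3: ask several workers to validate entailment and translation;
    # keep the cross-lingual pair only if the majority answers "yes".
    votes = post_hit("Does the translated sentence follow from the text? (yes/no)",
                     (text, translation), n_workers=5)
    return (text, translation) if majority_vote(votes) == "yes" else None
```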