DOI: 10.5555/2145432.2145510
Research article · Free access

Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora

Published: 27 July 2011

ABSTRACT

We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent work emphasizing the need for large-scale annotation efforts in textual entailment, our work aims to: i) tackle the scarcity of data available to train and evaluate systems, and ii) promote crowdsourcing as an effective way to reduce the costs of data collection without sacrificing quality. We show that a complex data creation task, on which even experts usually achieve low agreement scores, can be effectively decomposed into simple subtasks assigned to non-expert annotators. The resulting dataset, obtained from a pipeline of different jobs routed to Amazon Mechanical Turk, contains more than 1,600 aligned pairs for each text-hypothesis language combination across English, Italian and German.
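The abstract describes two ingredients: decomposing a hard annotation task into a pipeline of simple crowd jobs, and aggregating redundant non-expert judgments so that quality is preserved. The sketch below illustrates that pattern in Python; the function names, the agreement threshold, and the stage interface are illustrative assumptions, not the authors' actual Mechanical Turk setup.

```python
from collections import Counter

def majority_label(labels, min_agreement=0.5):
    """Aggregate redundant non-expert judgments for one item.

    Returns the majority label if its share of the votes exceeds
    min_agreement, otherwise None (item is discarded or re-routed).
    Hypothetical aggregation rule, shown for illustration only.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

def run_pipeline(items, stages):
    """Route items through a sequence of simple annotation stages.

    Each stage maps an item to an aggregated judgment or None;
    only items that survive every stage reach the final corpus.
    This mirrors the divide-and-conquer idea: several easy jobs
    in sequence instead of one hard expert task.
    """
    surviving = items
    for stage in stages:
        surviving = [item for item in surviving if stage(item) is not None]
    return surviving
```

For example, one stage might collect five crowd judgments per text-hypothesis pair and keep the pair only when a clear majority agrees, before passing it to the next (equally simple) job in the pipeline.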


Published in

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011, 1647 pages
ISBN: 9781937284114

Publisher: Association for Computational Linguistics, United States

Qualifiers: research-article

Overall acceptance rate: 73 of 234 submissions, 31%
