research-article

Learning to detect English and Hungarian light verb constructions

Published: 21 June 2013

Abstract

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost it (to some degree). They are syntactically flexible, and their meaning can only be partially computed from the meanings of their parts; they therefore require special treatment in natural language processing. The first step toward such treatment is to identify light verb constructions.

In this study, we present FXTagger, our conditional random fields-based tool for identifying light verb constructions. The flexibility of the tool is demonstrated on two typologically different languages, English and Hungarian. Because earlier studies labeled different linguistic phenomena as light verb constructions, we first present a linguistics-based classification of light verb constructions and then show that FXTagger is able to identify different classes of light verb constructions in both languages.
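To make the sequence-labeling framing concrete, here is a minimal sketch of how light verb construction spans could be encoded as per-token BIO tags and paired with simple token features, which is the kind of input a conditional random field tagger typically consumes. This is not FXTagger's actual label scheme or feature set, which the abstract does not describe; the example sentence, the `B-LVC`/`I-LVC` tag names, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Encode LVC spans (start inclusive, end exclusive) as BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-LVC"
        for i in range(start + 1, end):
            tags[i] = "I-LVC"
    return tags


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode BIO tags back into (start, end) spans; stray I-LVC tags are ignored."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-LVC":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, str]:
    """Hypothetical per-token features: surface form, POS, and neighboring POS tags."""
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "suffix3": tokens[i][-3:].lower(),
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i < len(tokens) - 1 else "EOS",
    }


# Illustrative sentence (not from the paper) containing "made a decision".
tokens = ["She", "made", "a", "decision", "yesterday"]
pos_tags = ["PRP", "VBD", "DT", "NN", "NN"]
tags = spans_to_bio(len(tokens), [(1, 4)])
features = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
```

Marking the whole stretch from verb to noun as one span is a simplification: because light verb constructions are syntactically flexible, a real annotation scheme may need to handle discontinuous or reordered components.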

Different types of texts may contain different types of light verb constructions; moreover, the frequency of light verb constructions may differ from domain to domain. Hence we focus on the portability of models trained on different corpora, and we also investigate the effect of simple domain adaptation techniques in reducing the gap between domains. Our results show that, in spite of domain specificities, out-of-domain data can also contribute to successful detection of light verb constructions (LVCs) in all domains.
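The abstract does not say which domain adaptation techniques were evaluated. As one illustration of what a "simple" technique can look like, the sketch below implements feature augmentation in the style of Daumé III (2007): every feature is copied into a shared version and a domain-specific version, so a linear model can learn both general and per-domain weights. The domain names and feature strings are hypothetical, and this should not be read as the authors' method.

```python
from typing import Dict


def augment(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Duplicate each feature into a shared copy and a domain-specific copy."""
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"general:{name}"] = value   # shared across all domains
        augmented[f"{domain}:{name}"] = value  # active only for this domain
    return augmented


# Hypothetical features for the same token seen in two different domains.
base = {"word=make": 1.0, "pos=VB": 1.0}
wiki_feats = augment(base, "wiki")
legal_feats = augment(base, "legal")

# The shared copies overlap between domains; the domain-specific copies do not.
shared = set(wiki_feats) & set(legal_feats)
```

With this encoding, training on the union of in-domain and out-of-domain instances lets out-of-domain data shape the `general:` weights while the domain-prefixed weights capture domain-specific behavior.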

Published in

ACM Transactions on Speech and Language Processing, Volume 10, Issue 2
Special issue on multiword expressions: From theory to practice and use, part 1
June 2013
91 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/2483691

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 21 June 2013
• Accepted: 1 February 2013
• Revised: 1 October 2012
• Received: 1 June 2012