
Learning to detect English and Hungarian light verb constructions

Authors:
Veronika Vincze, Hungarian Academy of Sciences, Hungary
István Nagy T., University of Szeged, Hungary
János Zsibrita, University of Szeged, Hungary
ACM Transactions on Speech and Language Processing, Volume 10, Issue 2, Article No. 6, pp. 1–25. https://doi.org/10.1145/2483691.2483695

Published: 21 June 2013


Abstract

Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost its meaning to some degree. They are syntactically flexible, and their meaning can only partially be computed from the meaning of their parts; thus, they require special treatment in natural language processing. The first step toward such treatment is to identify light verb constructions.

In this study, we present FXTagger, our conditional random fields-based tool for identifying light verb constructions. The flexibility of the tool is demonstrated on two typologically different languages, English and Hungarian. Because earlier studies labeled different linguistic phenomena as light verb constructions, we first present a linguistics-based classification of light verb constructions and then show that FXTagger is able to identify different classes of light verb constructions in both languages.

Different types of texts may contain different types of light verb constructions; moreover, the frequency of light verb constructions may differ from domain to domain. Hence, we focus on the portability of models trained on different corpora, and we also investigate the effect of simple domain adaptation techniques in reducing the gap between the domains. Our results show that, in spite of domain specificities, out-of-domain data can also contribute to the successful detection of light verb constructions (LVCs) in all domains.
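As a rough illustration of how the identification step mentioned in the abstract can be cast as token-level sequence labeling, the kind of input a conditional random fields model typically consumes, the sketch below converts span annotations into BIO tags. The function name `spans_to_bio`, the tag names (`B-LVC`, `I-LVC`, `O`), the contiguous-span assumption, and the example sentence are all assumptions made for illustration; the abstract does not describe FXTagger's actual label scheme or features.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str],
                 spans: List[Tuple[int, int]],
                 label: str = "LVC") -> List[str]:
    """Convert half-open token spans [start, end) into BIO tags.

    Hypothetical helper: assumes each construction is annotated as one
    contiguous token span, which is a simplification.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if not (0 <= start < end <= len(tokens)):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the construction
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the construction
    return tags


# Invented example: "made ... decision" treated as one light verb construction.
tokens = ["She", "made", "an", "important", "decision", "yesterday"]
tags = spans_to_bio(tokens, [(1, 5)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence labeler trained on such token and tag pairs would predict one tag per token, and predicted constructions could then be read back off the `B-`/`I-` runs.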


Index Terms

1. Applied computing
  1. Arts and humanities


Published in

ACM Transactions on Speech and Language Processing Volume 10, Issue 2
Special issue on multiword expressions: From theory to practice and use, part 1
June 2013
91 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/2483691

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2013
- Accepted: 1 February 2013
- Revised: 1 October 2012
- Received: 1 June 2012
Author Tags
Conditional random fields
English
Hungarian
corpora
domain adaptation
light verb constructions
multiword expressions
Qualifiers
- research-article
- Research
- Refereed