Abstract
Light verb constructions consist of a verbal and a nominal component, where the noun preserves its original meaning while the verb has lost its original meaning to some degree. They are syntactically flexible, and their meaning can only be partially computed from the meanings of their parts; they therefore require special treatment in natural language processing. The first step toward such treatment is to identify light verb constructions.
In this study, we present FXTagger, our conditional random fields-based tool for identifying light verb constructions. The flexibility of the tool is demonstrated on two typologically different languages, English and Hungarian. Because earlier studies labeled different linguistic phenomena as light verb constructions, we first present a linguistics-based classification of light verb constructions and then show that FXTagger is able to identify different classes of light verb constructions in both languages.
Different types of texts may contain different types of light verb constructions; moreover, the frequency of light verb constructions may differ from domain to domain. Hence, we focus on the portability of models trained on different corpora, and we also investigate the effect of simple domain adaptation techniques in reducing the gap between domains. Our results show that, in spite of domain specificities, out-of-domain data can also contribute to successful light verb construction (LVC) detection in all domains.
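The abstract does not say which domain adaptation techniques were investigated. As an illustration only, the sketch below shows one widely known simple technique, feature augmentation, in which every feature is duplicated into a shared copy and a domain-specific copy so that a single model can learn both general and domain-particular weights. The function name `augment_features`, the feature strings, and the domain labels `news` and `legal` are all hypothetical and are not taken from FXTagger.

```python
from typing import Dict


def augment_features(features: Dict[str, float], domain: str) -> Dict[str, float]:
    """Duplicate each feature into a shared copy and a domain-specific copy.

    Hypothetical helper for illustration; not part of FXTagger.
    """
    augmented: Dict[str, float] = {}
    for name, value in features.items():
        augmented[f"shared::{name}"] = value    # copy seen by every domain
        augmented[f"{domain}::{name}"] = value  # copy seen only by this domain
    return augmented


# Illustrative token-level features for the verb in a candidate such as "take a decision".
token_features = {"lemma=take": 1.0, "pos=VERB": 1.0, "next_pos=NOUN": 1.0}

# The same token yields different augmented views depending on its source corpus.
source_view = augment_features(token_features, "news")
target_view = augment_features(token_features, "legal")

# Only the shared copies overlap across domains.
shared_keys = sorted(set(source_view) & set(target_view))
print(shared_keys)
```

With this representation, a sequence labeler trained on the union of corpora can use the shared copies where the domains agree and the domain-specific copies where they diverge, which is one way out-of-domain data can still help.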
Index Terms
- Learning to detect English and Hungarian light verb constructions