ABSTRACT
Using latent semantic analysis (LSA), we explore the hypothesis that local linguistic context can serve to identify multi-word expressions (MWEs) that have non-compositional meanings. We propose that the vector similarity between the distribution vectors associated with an MWE as a whole and those associated with its constituent parts can serve as a good measure of the degree to which the MWE is compositional. We present experiments showing that low (cosine) similarity does, in fact, correlate with non-compositionality.
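The idea of comparing an MWE's vector with the vectors of its parts can be illustrated with a minimal sketch. It is not the authors' pipeline: the co-occurrence counts are invented toy data, the choice of k = 4 dimensions is arbitrary, and averaging the cosines over the constituents (the helper `compositionality_score`) is an assumption made for illustration only.

```python
import numpy as np

def lsa_vectors(counts, k):
    """Project the rows of a term-by-context count matrix into a k-dimensional LSA space."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def compositionality_score(mwe_vec, part_vecs):
    """Hypothetical score: mean cosine between the MWE vector and each constituent vector."""
    return float(np.mean([cosine(mwe_vec, p) for p in part_vecs]))

# Toy data (invented): rows are terms, with each MWE treated as a single unit;
# columns are context words: ball, water, die, funeral, colour, drive.
terms = ["kick", "bucket", "kick_the_bucket", "red", "car", "red_car"]
counts = np.array([
    [5, 0, 0, 0, 0, 0],   # kick
    [0, 5, 0, 0, 0, 0],   # bucket
    [0, 0, 4, 3, 0, 0],   # kick_the_bucket (idiomatic contexts)
    [0, 0, 0, 0, 5, 1],   # red
    [0, 0, 0, 0, 1, 5],   # car
    [0, 0, 0, 0, 3, 3],   # red_car (shares contexts with its parts)
], dtype=float)

vecs = dict(zip(terms, lsa_vectors(counts, k=4)))

idiom_score = compositionality_score(
    vecs["kick_the_bucket"], [vecs["kick"], vecs["bucket"]])
literal_score = compositionality_score(
    vecs["red_car"], [vecs["red"], vecs["car"]])

print(f"kick_the_bucket: {idiom_score:.2f}")
print(f"red_car:         {literal_score:.2f}")
```

In this toy setting the idiom shares no contexts with its parts and scores low, while the literal phrase scores high, which is the pattern the hypothesis predicts. Real corpora would need far larger matrices and a weighting scheme chosen with care.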