
Word representations: a simple and general method for semi-supervised learning

Authors: Joseph Turian, Lev Ratinov, Yoshua Bengio
Published: 11 July 2010

ABSTRACT

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/
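
As a rough illustration of the recipe described above, the sketch below shows one way unsupervised word representations might be plugged into an existing feature-based sequence labeler as extra per-token features. This is not the authors' code: the embedding file format, feature names, and scaling factor are illustrative assumptions (the paper's actual features and code are linked above).

```python
# Minimal sketch (not the authors' implementation) of adding unsupervised
# word representations as extra features in a supervised sequence labeler.
# The embedding file format ("word dim1 dim2 ... dimK") and the feature
# names below are illustrative assumptions.

def load_embeddings(path):
    """Load one embedding per line: a word followed by its vector."""
    table = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) > 1:
                table[parts[0]] = [float(x) for x in parts[1:]]
    return table

def token_features(tokens, i, embeddings, scale=1.0):
    """Baseline word features, plus embedding dimensions as extra features."""
    word = tokens[i]
    feats = {
        "w=" + word.lower(): 1.0,             # current word
        "suf3=" + word[-3:]: 1.0,             # 3-character suffix
        "is_cap": float(word[:1].isupper()),  # capitalization
    }
    vec = embeddings.get(word.lower())
    if vec is not None:
        # Append each embedding dimension as a real-valued feature.
        for d, value in enumerate(vec):
            feats["emb_%d" % d] = scale * value
    return feats

# Example usage with a tiny hypothetical embedding table:
if __name__ == "__main__":
    emb = {"bank": [0.1, -0.2], "river": [0.3, 0.4]}
    print(token_features(["The", "river", "bank"], 2, emb, scale=0.5))
```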

References

  1. Ando, R., & Zhang, T. (2005). A high-performance semi-supervised learning method for text chunking. ACL.
  2. Bengio, Y. (2008). Neural net language models. Scholarpedia, 3, 3881.
  3. Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.
  4. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137--1155.
  5. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. ICML.
  6. Bengio, Y., & Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. AISTATS.
  7. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993--1022.
  8. Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467--479.
  9. Candito, M., & Crabbé, B. (2009). Improving generative statistical parsing with semi-supervised word clustering. IWPT (pp. 138--141).
  10. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML.
  11. Deschacht, K., & Moens, M.-F. (2009). Semi-supervised semantic role labeling using the Latent Words Language Model. EMNLP (pp. 21--29).
  12. Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. SIGCHI Conference on Human Factors in Computing Systems (pp. 281--285). ACM.
  13. Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781--799.
  14. Goldberg, Y., Tsarfaty, R., Adler, M., & Elhadad, M. (2009). Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. EACL.
  15. Honkela, T. (1997). Self-organizing maps of words for natural language processing applications. Proceedings of the International ICSC Symposium on Soft Computing.
  16. Honkela, T., Pulkki, V., & Kohonen, T. (1995). Contextual relations of words in Grimm tales, analyzed by self-organizing map. ICANN.
  17. Huang, F., & Yates, A. (2009). Distributional representations for handling sparsity in supervised sequence labeling. ACL.
  18. Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. IJCNN (pp. 413--418).
  19. Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. ACL (pp. 595--603).
  20. Krishnan, V., & Manning, C. D. (2006). An effective two-stage model for exploiting non-local dependencies in named entity recognition. COLING-ACL.
  21. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 259--284.
  22. Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. AAAI.
  23. Liang, P. (2005). Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
  24. Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. ACL-IJCNLP (pp. 1030--1038).
  25. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203--208.
  26. Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. Cognitive Science Proceedings, LEA (pp. 660--665).
  27. Martin, S., Liermann, J., & Ney, H. (1998). Algorithms for bigram and trigram word clustering. Speech Communication, 24, 19--37.
  28. Miller, S., Guinness, J., & Zamanian, A. (2004). Name tagging with word clusters and discriminative training. HLT-NAACL (pp. 337--342).
  29. Mnih, A., & Hinton, G. E. (2007). Three new graphical models for statistical language modelling. ICML.
  30. Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS (pp. 1081--1088).
  31. Morin, F., & Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS.
  32. Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. ACL (pp. 183--190).
  33. Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. CoNLL.
  34. Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 241--254.
  35. Sahlgren, M. (2001). Vector-based semantic analysis: Representing word meanings based on random labels. Proceedings of the Semantic Knowledge Acquisition and Categorisation Workshop, ESSLLI.
  36. Sahlgren, M. (2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).
  37. Sahlgren, M. (2006). The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Stockholm University.
  38. Sang, E. T., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. CoNLL.
  39. Schwenk, H., & Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 765--768). Orlando, Florida.
  40. Sha, F., & Pereira, F. C. N. (2003). Shallow parsing with conditional random fields. HLT-NAACL.
  41. Spitkovsky, V., Alshawi, H., & Jurafsky, D. (2010). From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. NAACL-HLT.
  42. Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. ACL-08: HLT (pp. 665--673).
  43. Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. EMNLP.
  44. Turian, J., Ratinov, L., Bengio, Y., & Roth, D. (2009). A preliminary evaluation of word representations for named-entity recognition. NIPS Workshop on Grammar Induction, Representation of Language and Language Learning.
  45. Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
  46. Ushioda, A. (1996). Hierarchical clustering of words. COLING (pp. 1159--1162).
  47. Väyrynen, J., & Honkela, T. (2005). Comparison of independent component analysis and singular value decomposition in word context analysis. AKRR'05, International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
  48. Väyrynen, J. J., & Honkela, T. (2004). Word category maps based on emergent features created by ICA. Proceedings of the STeP'2004 Cognition + Cybernetics Symposium (pp. 173--185). Finnish Artificial Intelligence Society.
  49. Väyrynen, J. J., Honkela, T., & Lindqvist, L. (2007). Towards explicit semantic features using independent component analysis. Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR). Stockholm, Sweden: Swedish Institute of Computer Science.
  50. Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. LREC.
  51. Zhang, T., & Johnson, D. (2003). A robust risk minimization based named entity recognition system. CoNLL.
  52. Zhao, H., Chen, W., Kit, C., & Zhou, G. (2009). Multilingual dependency learning: a huge feature engineering method to semantic dependency parsing. CoNLL (pp. 55--60).

Published in

ACL '10: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
July 2010, 1618 pages
Program Chair: Jan Hajič

Publisher

Association for Computational Linguistics, United States

Acceptance Rates

Overall acceptance rate: 85 of 443 submissions, 19%
