
Word representations: a simple and general method for semi-supervised learning

Authors: Joseph Turian, Lev Ratinov, Yoshua Bengio
Published: 11 July 2010

ABSTRACT

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/
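
As a rough illustration of the recipe described above, the sketch below shows one way unsupervised word representations might be plugged into an existing feature-based sequence labeler as extra per-token features. This is not the authors' code: the embedding file format, feature names, and scaling factor are illustrative assumptions (the paper's actual features and code are linked above).

```python
# Minimal sketch (not the authors' implementation) of adding unsupervised
# word representations as extra features in a supervised sequence labeler.
# The embedding file format ("word dim1 dim2 ... dimK") and the feature
# names below are illustrative assumptions.

def load_embeddings(path):
    """Load one embedding per line: a word followed by its vector."""
    table = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) > 1:
                table[parts[0]] = [float(x) for x in parts[1:]]
    return table

def token_features(tokens, i, embeddings, scale=1.0):
    """Baseline word features, plus embedding dimensions as extra features."""
    word = tokens[i]
    feats = {
        "w=" + word.lower(): 1.0,             # current word
        "suf3=" + word[-3:]: 1.0,             # 3-character suffix
        "is_cap": float(word[:1].isupper()),  # capitalization
    }
    vec = embeddings.get(word.lower())
    if vec is not None:
        # Append each embedding dimension as a real-valued feature.
        for d, value in enumerate(vec):
            feats["emb_%d" % d] = scale * value
    return feats

# Example usage with a tiny hypothetical embedding table:
if __name__ == "__main__":
    emb = {"bank": [0.1, -0.2], "river": [0.3, 0.4]}
    print(token_features(["The", "river", "bank"], 2, emb, scale=0.5))
```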

References

  1. Ando, R., & Zhang, T. (2005). A high-performance semi-supervised learning method for text chunking. ACL.
  2. Bengio, Y. (2008). Neural net language models. Scholarpedia, 3, 3881.
  3. Bengio, Y., Ducharme, R., & Vincent, P. (2001). A neural probabilistic language model. NIPS.
  4. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137--1155.
  5. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. ICML.
  6. Bengio, Y., & Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. AISTATS.
  7. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993--1022.
  8. Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., & Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467--479.
  9. Candito, M., & Crabbé, B. (2009). Improving generative statistical parsing with semi-supervised word clustering. IWPT (pp. 138--141).
  10. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML.
  11. Deschacht, K., & Moens, M.-F. (2009). Semi-supervised semantic role labeling using the Latent Words Language Model. EMNLP (pp. 21--29).
  12. Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. SIGCHI Conference on Human Factors in Computing Systems (pp. 281--285). ACM.
  13. Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781--799.
  14. Goldberg, Y., Tsarfaty, R., Adler, M., & Elhadad, M. (2009). Enhancing unlexicalized parsing performance using a wide coverage lexicon, fuzzy tag-set mapping, and EM-HMM-based lexical probabilities. EACL.
  15. Honkela, T. (1997). Self-organizing maps of words for natural language processing applications. Proceedings of the International ICSC Symposium on Soft Computing.
  16. Honkela, T., Pulkki, V., & Kohonen, T. (1995). Contextual relations of words in Grimm tales, analyzed by self-organizing map. ICANN.
  17. Huang, F., & Yates, A. (2009). Distributional representations for handling sparsity in supervised sequence labeling. ACL.
  18. Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. IJCNN (pp. 413--418).
  19. Koo, T., Carreras, X., & Collins, M. (2008). Simple semi-supervised dependency parsing. ACL (pp. 595--603).
  20. Krishnan, V., & Manning, C. D. (2006). An effective two-stage model for exploiting non-local dependencies in named entity recognition. COLING-ACL.
  21. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 259--284.
  22. Li, W., & McCallum, A. (2005). Semi-supervised sequence modeling with syntactic topic models. AAAI.
  23. Liang, P. (2005). Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology.
  24. Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. ACL-IJCNLP (pp. 1030--1038).
  25. Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 28, 203--208.
  26. Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. Cognitive Science Proceedings, LEA (pp. 660--665).
  27. Martin, S., Liermann, J., & Ney, H. (1998). Algorithms for bigram and trigram word clustering. Speech Communication, 24, 19--37.
  28. Miller, S., Guinness, J., & Zamanian, A. (2004). Name tagging with word clusters and discriminative training. HLT-NAACL (pp. 337--342).
  29. Mnih, A., & Hinton, G. E. (2007). Three new graphical models for statistical language modelling. ICML.
  30. Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS (pp. 1081--1088).
  31. Morin, F., & Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS.
  32. Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. ACL (pp. 183--190).
  33. Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. CoNLL.
  34. Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 241--254.
  35. Sahlgren, M. (2001). Vector-based semantic analysis: Representing word meanings based on random labels. Proceedings of the Semantic Knowledge Acquisition and Categorisation Workshop, ESSLLI.
  36. Sahlgren, M. (2005). An introduction to random indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).
  37. Sahlgren, M. (2006). The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Doctoral dissertation, Stockholm University.
  38. Sang, E. T., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. CoNLL.
  39. Schwenk, H., & Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 765--768). Orlando, Florida.
  40. Sha, F., & Pereira, F. C. N. (2003). Shallow parsing with conditional random fields. HLT-NAACL.
  41. Spitkovsky, V., Alshawi, H., & Jurafsky, D. (2010). From baby steps to leapfrog: How "less is more" in unsupervised dependency parsing. NAACL-HLT.
  42. Suzuki, J., & Isozaki, H. (2008). Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. ACL-08: HLT (pp. 665--673).
  43. Suzuki, J., Isozaki, H., Carreras, X., & Collins, M. (2009). An empirical study of semi-supervised structured conditional models for dependency parsing. EMNLP.
  44. Turian, J., Ratinov, L., Bengio, Y., & Roth, D. (2009). A preliminary evaluation of word representations for named-entity recognition. NIPS Workshop on Grammar Induction, Representation of Language and Language Learning.
  45. Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
  46. Ushioda, A. (1996). Hierarchical clustering of words. COLING (pp. 1159--1162).
  47. Väyrynen, J., & Honkela, T. (2005). Comparison of independent component analysis and singular value decomposition in word context analysis. AKRR'05, International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
  48. Väyrynen, J. J., & Honkela, T. (2004). Word category maps based on emergent features created by ICA. Proceedings of the STeP'2004 Cognition + Cybernetics Symposium (pp. 173--185). Finnish Artificial Intelligence Society.
  49. Väyrynen, J. J., Honkela, T., & Lindqvist, L. (2007). Towards explicit semantic features using independent component analysis. Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR). Stockholm, Sweden: Swedish Institute of Computer Science.
  50. Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. LREC.
  51. Zhang, T., & Johnson, D. (2003). A robust risk minimization based named entity recognition system. CoNLL.
  52. Zhao, H., Chen, W., Kit, C., & Zhou, G. (2009). Multilingual dependency learning: a huge feature engineering method to semantic dependency parsing. CoNLL (pp. 55--60).

Published in

ACL '10: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
July 2010, 1618 pages
Program Chair: Jan Hajič

Publisher

Association for Computational Linguistics, United States

Acceptance Rates

Overall acceptance rate: 85 of 443 submissions, 19%
