Enhancing Topic Modeling for Short Texts with Auxiliary Word Embeddings

Published: 21 August 2017

Abstract

Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics from them is a critical and fundamental task in these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is short, short texts are much sparser in terms of word co-occurrences. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, even though it seems reasonable to assume that each short text has only one topic because of its shortness, the definition of “shortness” is subjective and the length of short texts is dataset dependent; that is, the single-topic assumption may be too strong for some datasets. To address this limitation, we propose to model the number of topics per document as a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three). This model is named PDMM. Second, DMM (and also PDMM) has no access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human being interprets a piece of short text, the understanding rests not solely on its content words but also on their semantic relations. Recent advances in word embeddings offer effective learning of word semantic relations from large corpora, and such auxiliary word embeddings enable us to address this second limitation. To this end, we propose to promote semantically related words under the same topic during the sampling process, by using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts. By extending both DMM and PDMM with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM, respectively. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, as measured by topic coherence. The learned topic representations also lead to higher accuracy in a text classification task, as an indirect evaluation. Both GPU-DMM and GPU-PDMM further improve topic coherence and classification accuracy, with GPU-PDMM outperforming GPU-DMM at the price of higher computational cost.
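
To make the abstract's two ideas concrete, the sketch below illustrates them in Python. It is a minimal toy under stated assumptions, not the authors' implementation: the function names, the similarity threshold (0.5), and the promotion weight (0.3) are illustrative choices, and the random embeddings stand in for pre-trained vectors such as word2vec or GloVe.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    def sample_num_topics(lam=1.0, max_topics=3):
        """PDMM idea: draw a per-document topic count from a Poisson
        distribution, truncated to a small range (e.g., 1 to 3) by
        rejection sampling, instead of fixing it to 1 as in DMM."""
        while True:
            t = rng.poisson(lam)
            if 1 <= t <= max_topics:
                return t

    def gpu_promote(topic_word_counts, z, w, embeddings,
                    sim_threshold=0.5, weight=0.3):
        """GPU idea: assigning word w to topic z increments count(z, w)
        and also adds a fractional pseudo-count for every vocabulary
        word whose embedding is close to w's (cosine similarity), so
        semantically related words are promoted under the same topic."""
        topic_word_counts[z, w] += 1.0
        # Cosine similarity of w against the whole vocabulary.
        norms = np.linalg.norm(embeddings, axis=1)
        sims = embeddings @ embeddings[w] / (norms * norms[w])
        for u in np.flatnonzero(sims > sim_threshold):
            if u != w:
                topic_word_counts[z, u] += weight

    # Toy usage: 2 topics, a 5-word vocabulary, 8-dimensional embeddings.
    K, V, D = 2, 5, 8
    counts = np.zeros((K, V))
    emb = rng.normal(size=(V, D))
    print(sample_num_topics())      # e.g., 2
    gpu_promote(counts, z=0, w=3, embeddings=emb)
    print(counts[0])                # count(0, 3) = 1.0; any close neighbors get +0.3

In a full Gibbs sampler these pieces would run inside the per-word sampling loop, and the word-to-neighbor promotion amounts would typically be precomputed once from the embedding similarities rather than recomputed at every step.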


    • Published in

      ACM Transactions on Information Systems, Volume 36, Issue 2
      April 2018
      371 pages
      ISSN:1046-8188
      EISSN:1558-2868
      DOI:10.1145/3133943

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 21 August 2017
      • Revised: 1 April 2017
      • Accepted: 1 April 2017
      • Received: 1 December 2016
      Published in TOIS Volume 36, Issue 2


      Qualifiers

      • research-article
      • Research
      • Refereed
