Abstract
Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task for these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to the limited length of each document, short texts are far sparser in word co-occurrences. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, although the single-topic assumption seems reasonable given the shortness of the texts, the definition of “shortness” is subjective and the length of short texts is dataset dependent; the assumption may therefore be too strong for some datasets. To address this limitation, we propose to draw the number of topics per short text from a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three). This model is named PDMM. Second, neither DMM nor PDMM has access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human interprets a piece of short text, the understanding is based not only on its content words but also on the semantic relations among them. Recent advances in word embeddings offer effective learning of word semantic relations from large corpora, and such auxiliary word embeddings enable us to address this second limitation. To this end, we propose to promote semantically related words under the same topic during the sampling process, using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts. By extending DMM and PDMM with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, measured by topic coherence, and that the learned topic representations lead to higher accuracy in a text classification task, as an indirect evaluation. Both GPU-DMM and GPU-PDMM further improve topic coherence and text classification accuracy. GPU-PDMM outperforms GPU-DMM, at the price of higher computational cost.
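To make the two modeling ideas above concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation: it draws a per-document topic count from a truncated Poisson (the PDMM idea) and applies a GPU-style promotion step in which a word sampled for a topic also adds a fractional pseudo-count for its embedding neighbors under the same topic. The names `sample_topic_count`, `gpu_promote`, the neighbor map, and the promotion weight `mu` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# PDMM idea: each short text is generated by a small number of topics.
# Draw the per-document topic count from a Poisson truncated to
# {1, ..., max_topics} (e.g., one to three), instead of fixing it to 1 as in DMM.
def sample_topic_count(lam=1.0, max_topics=3):
    while True:
        k = rng.poisson(lam)
        if 1 <= k <= max_topics:
            return k

# GPU idea: when word w is assigned to topic z, also promote words that are
# semantically similar to w (e.g., by embedding cosine similarity) under
# topic z, adding a fractional pseudo-count mu for each similar word.
def gpu_promote(topic_word_counts, z, w, similar_words, mu=0.3):
    topic_word_counts[z, w] += 1.0          # the sampled word itself
    for v in similar_words.get(w, []):      # its embedding neighbors
        topic_word_counts[z, v] += mu       # fractional promotion

# Toy usage: 2 topics, vocabulary of 5 words; word 0 and word 1 are
# assumed to be embedding neighbors.
counts = np.zeros((2, 5))
neighbors = {0: [1]}
k = sample_topic_count(lam=1.0, max_topics=3)   # topic count for one text
gpu_promote(counts, z=0, w=0, similar_words=neighbors, mu=0.3)
print(k, counts[0])                             # e.g., 1 [1.  0.3 0.  0.  0. ]
```

In the full models, the promotion step runs inside collapsed Gibbs sampling, so embedding-derived knowledge biases topic-word counts without changing the sampler's overall structure.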