Abstract
Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task for these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to the limited length of each document, short texts are far sparser in word co-occurrences. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, although the single-topic assumption seems reasonable given the shortness of the texts, the definition of “shortness” is subjective and the length of short texts is dataset dependent; the assumption may therefore be too strong for some datasets. To address this limitation, we propose to draw the number of topics per short text from a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three). This model is named PDMM. Second, neither DMM nor PDMM has access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human interprets a piece of short text, the understanding is based not only on its content words but also on the semantic relations among them. Recent advances in word embeddings offer effective learning of word semantic relations from large corpora, and such auxiliary word embeddings enable us to address this second limitation. To this end, we propose to promote semantically related words under the same topic during the sampling process, using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts. By extending DMM and PDMM with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, measured by topic coherence, and that the learned topic representations lead to higher accuracy in a text classification task, as an indirect evaluation. Both GPU-DMM and GPU-PDMM further improve topic coherence and text classification accuracy. GPU-PDMM outperforms GPU-DMM, at the price of higher computational cost.
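To make the two modeling ideas above concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation: it draws a per-document topic count from a truncated Poisson (the PDMM idea) and applies a GPU-style promotion step in which a word sampled for a topic also adds a fractional pseudo-count for its embedding neighbors under the same topic. The names `sample_topic_count`, `gpu_promote`, the neighbor map, and the promotion weight `mu` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# PDMM idea: each short text is generated by a small number of topics.
# Draw the per-document topic count from a Poisson truncated to
# {1, ..., max_topics} (e.g., one to three), instead of fixing it to 1 as in DMM.
def sample_topic_count(lam=1.0, max_topics=3):
    while True:
        k = rng.poisson(lam)
        if 1 <= k <= max_topics:
            return k

# GPU idea: when word w is assigned to topic z, also promote words that are
# semantically similar to w (e.g., by embedding cosine similarity) under
# topic z, adding a fractional pseudo-count mu for each similar word.
def gpu_promote(topic_word_counts, z, w, similar_words, mu=0.3):
    topic_word_counts[z, w] += 1.0          # the sampled word itself
    for v in similar_words.get(w, []):      # its embedding neighbors
        topic_word_counts[z, v] += mu       # fractional promotion

# Toy usage: 2 topics, vocabulary of 5 words; word 0 and word 1 are
# assumed to be embedding neighbors.
counts = np.zeros((2, 5))
neighbors = {0: [1]}
k = sample_topic_count(lam=1.0, max_topics=3)   # topic count for one text
gpu_promote(counts, z=0, w=0, similar_words=neighbors, mu=0.3)
print(k, counts[0])                             # e.g., 1 [1.  0.3 0.  0.  0. ]
```

In the full models, the promotion step runs inside collapsed Gibbs sampling, so embedding-derived knowledge biases topic-word counts without changing the sampler's overall structure.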