Optimizing semantic coherence in topic models

ABSTRACT
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
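Contribution (2), the automated evaluation metric, can be made concrete with a small sketch. The following is one plausible implementation of a document co-occurrence coherence score of the kind the abstract describes: it scores a topic by how often its top words appear together in the training documents themselves, using a smoothed log ratio of co-document frequency to document frequency. The function name, corpus format, and toy data below are illustrative assumptions, not taken from the paper.

```python
import math

def topic_coherence(top_words, documents):
    """Score one topic by the co-document frequency of its top words.

    top_words: the topic's M most probable words, ordered from most
               to least probable.
    documents: iterable of token lists from the *training* corpus;
               no external reference collection is required.
    Returns a log-scale score; higher values indicate a more
    coherent topic.
    """
    # For each top word, record which documents contain it.
    doc_sets = {w: set() for w in top_words}
    for i, doc in enumerate(documents):
        tokens = set(doc)
        for w in doc_sets:
            if w in tokens:
                doc_sets[w].add(i)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_docs = len(doc_sets[top_words[m]] & doc_sets[top_words[l]])
            # A topic's top words occur in at least one training
            # document, so the denominator is nonzero; the +1 smooths
            # pairs that never co-occur and avoids log(0).
            score += math.log((co_docs + 1) / len(doc_sets[top_words[l]]))
    return score

# Toy usage: a tight biology topic should outscore a mixed one.
docs = [["gene", "cell", "expression"],
        ["gene", "protein", "cell"],
        ["stock", "market", "trading"]]
print(topic_coherence(["gene", "cell", "protein"], docs))
print(topic_coherence(["gene", "stock", "cell"], docs))
```

Because the score depends only on document frequencies already available at training time, it can be computed for every learned topic without human annotators, which is what makes it usable inside a model as well as for post-hoc evaluation.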