Optimizing semantic coherence in topic models

ABSTRACT
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
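Contribution (2), the automated evaluation metric, can be made concrete with a small sketch. The following is one plausible implementation of a document co-occurrence coherence score of the kind the abstract describes: it scores a topic by how often its top words appear together in the training documents themselves, using a smoothed log ratio of co-document frequency to document frequency. The function name, corpus format, and toy data below are illustrative assumptions, not taken from the paper.

```python
import math

def topic_coherence(top_words, documents):
    """Score one topic by the co-document frequency of its top words.

    top_words: the topic's M most probable words, ordered from most
               to least probable.
    documents: iterable of token lists from the *training* corpus;
               no external reference collection is required.
    Returns a log-scale score; higher values indicate a more
    coherent topic.
    """
    # For each top word, record which documents contain it.
    doc_sets = {w: set() for w in top_words}
    for i, doc in enumerate(documents):
        tokens = set(doc)
        for w in doc_sets:
            if w in tokens:
                doc_sets[w].add(i)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_docs = len(doc_sets[top_words[m]] & doc_sets[top_words[l]])
            # A topic's top words occur in at least one training
            # document, so the denominator is nonzero; the +1 smooths
            # pairs that never co-occur and avoids log(0).
            score += math.log((co_docs + 1) / len(doc_sets[top_words[l]]))
    return score

# Toy usage: a tight biology topic should outscore a mixed one.
docs = [["gene", "cell", "expression"],
        ["gene", "protein", "cell"],
        ["stock", "market", "trading"]]
print(topic_coherence(["gene", "cell", "protein"], docs))
print(topic_coherence(["gene", "stock", "cell"], docs))
```

Because the score depends only on document frequencies already available at training time, it can be computed for every learned topic without human annotators, which is what makes it usable inside a model as well as for post-hoc evaluation.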