
Optimizing semantic coherence in topic models

Published: 27 July 2011

ABSTRACT

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
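The metric in contribution (2) relies only on co-occurrence statistics gathered from the training documents themselves, so no human judgments and no external reference corpus are required. A minimal Python sketch of this style of score follows, assuming the commonly quoted pairwise formulation coherence(t) = sum over pairs l < m of log((D(v_m, v_l) + 1) / D(v_l)), where v_1, ..., v_M are the topic's top words ordered by probability, D(v) is the number of training documents containing v, and D(v, v') the number containing both. The function name, the +1 smoothing constant, and the toy corpus below are illustrative, not taken from the paper.

```python
from itertools import combinations
from math import log

def topic_coherence(top_words, documents):
    """Score one topic using document co-occurrence counts.

    top_words: the topic's highest-probability words, most probable
               first (ordering matters: each pair is normalized by the
               document frequency of the higher-ranked word).
    documents: token sequences from the training corpus itself; no
               external reference collection is consulted.
    """
    doc_sets = [set(doc) for doc in documents]
    # D(v): number of documents containing word v at least once.
    df = {w: sum(w in d for d in doc_sets) for w in top_words}
    score = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # l < m
        v_l, v_m = top_words[l], top_words[m]
        # D(v_m, v_l): documents containing both words; the +1 keeps
        # the logarithm finite when a pair never co-occurs.
        co_df = sum(v_l in d and v_m in d for d in doc_sets)
        if df[v_l] > 0:
            score += log((co_df + 1) / df[v_l])
    return score

# Toy usage: a topic whose top words co-occur scores higher (closer
# to zero) than one whose top words are scattered across documents.
docs = [
    ["protein", "dna", "cell"],
    ["dna", "gene", "cell"],
    ["stock", "market", "trade"],
]
print(topic_coherence(["dna", "cell", "gene"], docs))   # coherent
print(topic_coherence(["dna", "stock", "cell"], docs))  # mixed
```

Because every count comes from the corpus the model was trained on, the score can be computed during or immediately after training, which is what makes contribution (3), building the metric into the model itself, possible.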


Published in

EMNLP '11: Proceedings of the Conference on Empirical Methods in Natural Language Processing
July 2011, 1647 pages
ISBN: 9781937284114
DOI: 10.5555/2145432.2145462

Publisher

Association for Computational Linguistics, United States

Qualifiers

research-article

Acceptance Rates

Overall acceptance rate: 73 of 234 submissions, 31%
