research-article

Evaluating topic models for digital libraries

Authors:
David Newman (University of California, Irvine, Irvine, CA, USA)
Youn Noh (Yale University, New Haven, CT, USA)
Edmund Talley (NIH, Washington, DC, USA)
Sarvnaz Karimi (NICTA, Melbourne, Australia)
Timothy Baldwin (University of Melbourne, Melbourne, Australia)

JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries, June 2010, Pages 215–224
https://doi.org/10.1145/1816123.1816156

Published: 21 June 2010

ABSTRACT

Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces, through their ability to automatically learn and apply subject tags to every item in a collection and to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential and to empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how a scoring model, based on pointwise mutual information of word pairs using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries.
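
The idea of scoring a topic by the pointwise mutual information (PMI) of its word pairs can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so several choices here are assumptions rather than the authors' method: probabilities are estimated from document-level co-occurrence, pair scores are averaged, and pairs that never co-occur contribute zero. The names `pmi_topic_score` and `REFERENCE_DOCS` are hypothetical, and the tiny in-memory corpus stands in for an external source such as Wikipedia or MEDLINE.

```python
import math
from itertools import combinations

# Hypothetical toy corpus standing in for an external reference collection.
REFERENCE_DOCS = [
    "heart blood artery surgery",
    "heart blood pressure",
    "stock market trading price",
    "market price stock",
    "blood artery",
    "trading stock",
]


def pmi_topic_score(top_words, reference_docs):
    """Return the mean PMI over all pairs of a topic's top words.

    PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ), where each
    probability is the fraction of reference documents containing
    the word (or both words). Assumption: a pair that never
    co-occurs contributes 0.0 instead of negative infinity.
    """
    docs = [set(doc.lower().split()) for doc in reference_docs]
    n_docs = len(docs)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(1 for d in docs if all(w in d for w in words)) / n_docs

    pair_scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = prob(w1, w2)
        if joint == 0.0:
            pair_scores.append(0.0)  # simplifying assumption, see docstring
        else:
            pair_scores.append(math.log(joint / (prob(w1) * prob(w2))))
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0
```

On the toy corpus, a topic whose words tend to appear together (for example `["heart", "blood", "artery"]`) receives a higher score than a mixed one (for example `["heart", "stock", "price"]`). That ordering is the property an automated score would need in order to track human judgments of coherence.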

Index Terms

Evaluating topic models for digital libraries
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document collection models
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
    2. Digital libraries and archives

Published in
JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries
June 2010, 424 pages
ISBN: 9781450300858
DOI: 10.1145/1816123
General Chair: Jane Hunter (The University of Queensland, Australia)
Program Chairs: Carl Lagoze (Cornell University, USA), Lee Giles (Pennsylvania State University, USA), Yuan-Fang Li (The University of Queensland, Australia)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
evaluation
topic models
topic quality
user studies
Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%