Latent Dirichlet allocation

Authors:
David M. Blei, Computer Science Division, University of California, Berkeley, CA
Andrew Y. Ng, Computer Science Department, Stanford University, Stanford, CA
Michael I. Jordan, Computer Science Division and Department of Statistics, University of California, Berkeley, CA

The Journal of Machine Learning Research, Volume 3, pp. 993–1022

Published: 01 March 2003

Abstract

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
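The abstract does not spell out the sampling steps, so the following is only a minimal sketch of the generative process as LDA is commonly described: draw per-document topic proportions from a Dirichlet, then for each word draw a topic from those proportions and a word from that topic's word distribution. The function names, the fixed document length, and the toy values of `alpha`, `beta`, and `vocab` are illustrative assumptions and are not taken from this page.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Sample one document: topic proportions first, then a topic and a word per position.

    alpha: Dirichlet parameters over topics (length k).
    beta:  k rows, each a probability distribution over vocab.
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]  # topic assignment for this word
        w = rng.choices(vocab, weights=beta[z])[0]    # word drawn from topic z
        words.append(w)
    return theta, words


# Toy, made-up parameters (assumptions): 2 topics over a 4-word vocabulary.
vocab = ["gene", "dna", "ball", "team"]
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.4, 0.05, 0.05],  # hypothetical "biology" topic
    [0.05, 0.05, 0.5, 0.4],  # hypothetical "sports" topic
]
rng = random.Random(0)
theta, words = generate_document(alpha, beta, vocab, n_words=8, rng=rng)
print(theta, words)
```

In this sketch `theta` plays the role of the "explicit representation of a document" mentioned in the abstract: a vector of topic probabilities, one per document. Inference, which the paper approaches with variational methods, runs in the opposite direction and is not shown here.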

Index Terms

Latent Dirichlet allocation
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning settings
    2. Machine learning approaches
      1. Neural networks

Published in

The Journal of Machine Learning Research, Volume 3 (1437 pages)
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org
Published: 1 March 2003
Qualifiers: article
