LDA-based document models for ad-hoc retrieval

Authors:
Xing Wei, University of Massachusetts Amherst, Amherst, MA
W. Bruce Croft, University of Massachusetts Amherst, Amherst, MA

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 2006, Pages 178–185
https://doi.org/10.1145/1148170.1148204

Published: 06 August 2006

ABSTRACT

Search algorithms that incorporate some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language modeling framework. Latent Dirichlet Allocation (LDA), an approach to building topic models based on a formal generative model of documents, is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to use LDA efficiently to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework and evaluate it on several TREC collections. Gibbs sampling is used for approximate inference in LDA, and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
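The abstract does not give the model's equations, so the following is only an illustrative sketch of the two ingredients it names: approximate inference in LDA by Gibbs sampling, and a document language model that uses the resulting topics for query-likelihood ranking. Everything specific here is an assumption made for illustration and is not taken from the text above: the function names (gibbs_lda, collection_probs, score), the hyperparameters alpha and beta, the choice of a collapsed Gibbs sampler, and the linear interpolation (weight lam) between a Dirichlet-smoothed document model (parameter mu) and the LDA estimate.

```python
import math
import random
from collections import Counter


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.5, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative; hyperparameters are assumptions).

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic and per-topic word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # token totals per topic
    z = []                                              # topic assignment of each token
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Take the token out of the counts before resampling its topic.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


def collection_probs(docs, vocab_size):
    """Maximum-likelihood word probabilities over the whole collection."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return [counts[w] / total for w in range(vocab_size)]


def score(query, d, docs, theta, phi, coll_prob, mu=1000.0, lam=0.7):
    """Query log-likelihood under an assumed mixture of two document models."""
    tf = Counter(docs[d])
    n_d = len(docs[d])
    log_p = 0.0
    for w in query:
        # Dirichlet-smoothed document model (assumed baseline component).
        p_dir = (tf[w] + mu * coll_prob[w]) / (n_d + mu)
        # Word probability implied by the document's topic mixture.
        p_lda = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
        log_p += math.log(lam * p_dir + (1.0 - lam) * p_lda)
    return log_p
```

To rank a collection under these assumptions, one would run gibbs_lda once offline, then call score for each candidate document and sort by the returned log-likelihood. The sampler visits every token in every iteration and evaluates every topic for it, so its cost grows with the number of iterations times the number of tokens times the number of topics.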

Index Terms

Information systems
  Information retrieval
    Retrieval models and ranking

Published in
SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN: 1595933697
DOI: 10.1145/1148170
General Chair: Efthimis N. Efthimiadis (University of Washington)
Program Chairs: Susan Dumais (Microsoft Research, Redmond), David Hawking (CSIRO ICT Centre, Canberra, Australia), Kalervo Järvelin (University of Tampere, Finland)
Copyright © 2006 ACM
Publisher
Association for Computing Machinery, New York, NY, United States
Author Tags
document model
information retrieval
language model
latent dirichlet allocation (LDA)
topic model
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%