ABSTRACT
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language modeling framework. Latent Dirichlet Allocation (LDA), an approach to building topic models based on a formal generative model of documents, is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA, and its computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
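As a rough illustration of the two ingredients named above, the following sketch runs collapsed Gibbs sampling for LDA on a toy corpus and scores a query against an LDA-informed document model. It is not the paper's implementation: the function names, the hyperparameters `alpha` and `beta`, the toy data, and in particular the linear interpolation weight `lam` between the maximum-likelihood document model and the LDA model are all assumptions made for the example, since the abstract does not specify how the two are combined.

```python
import math
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Returns (theta, phi): per-document topic distributions and
    per-topic word distributions, estimated from the final sample.
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per doc
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional distribution over topics (unnormalized).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


def query_log_likelihood(query, d, docs, theta, phi, lam=0.7):
    """Log query likelihood under an interpolated document model.

    ASSUMPTION: P(w|d) = lam * P_ml(w|d) + (1 - lam) * P_lda(w|d);
    this mixing scheme is illustrative, not taken from the abstract.
    """
    doc = docs[d]
    score = 0.0
    for w in query:
        p_ml = doc.count(w) / len(doc)
        p_lda = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
        score += math.log(lam * p_ml + (1.0 - lam) * p_lda)
    return score


# Toy corpus (hypothetical): words 0-2 and words 3-5 form two themes.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 1, 2, 2], [3, 3, 4, 5, 5]]
toy_theta, toy_phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
ranking = sorted(
    range(len(toy_docs)),
    key=lambda d: query_log_likelihood([0], d, toy_docs, toy_theta, toy_phi),
    reverse=True,
)
```

Each sweep of the sampler touches every token once and evaluates every topic for it, so the cost per iteration in this sketch grows with the number of tokens times the number of topics. Because the LDA component assigns nonzero probability to every vocabulary word, a document can receive a finite score for a query term it does not contain.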
Index Terms
- LDA-based document models for ad-hoc retrieval