DOI: 10.1145/1148170.1148204

LDA-based document models for ad-hoc retrieval

Published: 06 August 2006

ABSTRACT

Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA, and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
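
The abstract does not spell out how the LDA-based document model enters the language modeling framework. As a rough illustration only, the sketch below assumes a query-likelihood ranker in which a Dirichlet-smoothed unigram estimate is linearly interpolated with the LDA estimate P_lda(w|d) = sum_z P(w|z) P(z|d). The function names, the interpolation weight `lam`, and the smoothing parameter `mu` are hypothetical choices for this example, and the topic distributions `theta_d` and `phi` are assumed to have been estimated beforehand (for instance by Gibbs sampling).

```python
import math

def lda_word_prob(word, theta_d, phi):
    """P_lda(w|d): sum over topics z of P(z|d) * P(w|z).

    theta_d: list of P(z|d) for one document (assumed already estimated).
    phi: list of dicts, one per topic, mapping word -> P(w|z).
    """
    return sum(p_z * phi_z.get(word, 0.0) for p_z, phi_z in zip(theta_d, phi))

def smoothed_word_prob(word, doc_counts, coll_prob, mu):
    """Dirichlet-smoothed unigram estimate of P(w|d).

    Assumes every query word has an entry in coll_prob.
    """
    doc_len = sum(doc_counts.values())
    return (doc_counts.get(word, 0) + mu * coll_prob[word]) / (doc_len + mu)

def query_log_likelihood(query, doc_counts, coll_prob, theta_d, phi,
                         mu=1000.0, lam=0.7):
    """Log-likelihood of the query under the interpolated document model.

    lam and mu are illustrative defaults, not values taken from the paper.
    """
    score = 0.0
    for w in query:
        p = (lam * smoothed_word_prob(w, doc_counts, coll_prob, mu)
             + (1.0 - lam) * lda_word_prob(w, theta_d, phi))
        score += math.log(p)
    return score
```

Under these assumptions, a document that never contains a query word can still receive a higher score when its topic mixture gives that word substantial probability, which is the intuition behind combining a topic model with a smoothed document model.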


    • Published in

      SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
      August 2006
      768 pages
      ISBN: 1595933697
      DOI: 10.1145/1148170

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
