skip to main content
article

A study of smoothing methods for language models applied to information retrieval

Published:01 April 2004Publication History
Skip Abstract Section

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and to then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different role---to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the setting of smoothing parameters automatically. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to---or better than---the best results achieved using a single smoothing method and exhaustive parameter search on the test data.

References

  1. Berger, A. and Lafferty, J. 1999. Information retrieval as statistical translation. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 222--229. Google ScholarGoogle Scholar
  2. Chen, S. F. and Goodman, J. 1998. An empirical study of smoothing techniques for language modeling. Tech. Rep. TR-10-98, Harvard University.Google ScholarGoogle Scholar
  3. Fuhr, N. 1992. Probabilistic models in information retrieval. Comput J. 35, 3, 243--255. Google ScholarGoogle Scholar
  4. Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, parts 3, 4, 237--264.Google ScholarGoogle Scholar
  5. Hiemstra, D. and Kraaij, W. 1999. Twenty-one at TREC-7: Ad-hoc and cross-language track. In Proceedings of 7th Text REtrieval Conference (TREC-7). 227--238.Google ScholarGoogle Scholar
  6. Jelinek, F. and Mercer, R. 1980. Interpolated estimation of markov sourceparameters from sparse data. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 381--402.Google ScholarGoogle Scholar
  7. Katz, S. M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoustics, Speech and Signal Processing (ASSP) 35 400--401.Google ScholarGoogle Scholar
  8. Kneser, R. and Ney, H. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Computer Society Press, Los Alamitos, Calif., 181--184.Google ScholarGoogle Scholar
  9. Kraaij, W., Westerveld, T., and Hiemstra, D. 2002. The importance of prior probabilities for entry page search. In Proceedings of SIGIR'02. ACM, New York, 27--34. Google ScholarGoogle Scholar
  10. Kwok, K. and Chan, M. 1998. Improving two-stage ad-hoc retrieval for short queries. In Proceedings of SIGIR'98. ACM, New York, 250--256. Google ScholarGoogle Scholar
  11. Lafferty, J. and Zhai, C. 2001. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR'01. ACM, New York, 111--119. Google ScholarGoogle Scholar
  12. Lavrenko, V. and Croft, B. 2001. Relevance-based language models. In Proceedings of SIGIR'01. ACM, New York, 120--127. Google ScholarGoogle Scholar
  13. MacKay, D. and Peto, L. 1995. A hierarchical Dirichlet language model. Nat. Lang. Eng. 1, 3, 289--307.Google ScholarGoogle Scholar
  14. Miller, D. H., Leek, T., and Schwartz, R. 1999. A hidden Markov model information retrieval system. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 214--221. Google ScholarGoogle Scholar
  15. Ney, H., Essen, U., and Kneser, R. 1994. On structuring probabilistic dependencies in stochastic language modeling. Comput. Speech Lang. 8, 1--38.Google ScholarGoogle Scholar
  16. Ney, H., Essen, U., and Kneser, R. 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. Pattern Anal. Machine Intel. 17, 12, 1202--1212. Google ScholarGoogle Scholar
  17. Ponte, J. 1998. A language modeling approach to information retrieval. Ph.D. dissertation. Univ. of Massachusetts at Amherst. Google ScholarGoogle Scholar
  18. Ponte, J. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the ACM SIGIR'98. ACM, New York, 275--281. Google ScholarGoogle Scholar
  19. Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. 1981. Probabilistic models of indexing and searching. In Information Retrieval Research, R. N. Oddy et al., Eds. Butterworths, 35--56. Google ScholarGoogle Scholar
  20. Robertson, S. E., Walker, S., Jones, S., M.Hancock-Beaulieu, M., and Gatford, M. 1995. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3), D. K. Harman, Ed. 109--126.Google ScholarGoogle Scholar
  21. Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Inf. Proc. Manage. 24, 513--523. Google ScholarGoogle Scholar
  22. Salton, G. and Buckley, C. 1990. Improving retrieval performance by relevance feedback. J. Amer. Soc. Inf. Sci. 44, 4, 288--297.Google ScholarGoogle Scholar
  23. Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11, 613--620. Google ScholarGoogle Scholar
  24. Singhal, A., Buckley, C., and Mitra, M. 1996. Pivoted document length normalization. In Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 21--29. Google ScholarGoogle Scholar
  25. Song, F. and Croft, B. 1999. A general language model for information retrieval. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 279--280. Google ScholarGoogle Scholar
  26. Sparck Jones, K. and Willett, P., Eds. 1997. Readings in Information Retrieval. Morgan-Kaufmann, San Mateo, Calif. Google ScholarGoogle Scholar
  27. van Rijsbergen, C. J. 1986. A non-classical logic for information retrieval. Comput. J. 29, 6, 481--485.Google ScholarGoogle Scholar
  28. Voorhees, E. and Harman, D., Eds. 2001. Proceedings of Text REtrieval Conference (TREC1-9). NIST Special Publications. http://trec.nist.gov/pubs.html. Google ScholarGoogle Scholar
  29. Wong, S. K. M. and Yao, Y. Y. 1995. On modeling information retrieval with probabilistic inference. ACM Trans. Inf. Syst. 13, 1, 69--99. Google ScholarGoogle Scholar
  30. Zhai, C. and Lafferty, J. 2001. Model-based feedback in the KL-divergence retrieval model. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM 2001). 403--410. Google ScholarGoogle Scholar

Index Terms

  1. A study of smoothing methods for language models applied to information retrieval

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader