article

A study of smoothing methods for language models applied to information retrieval

Authors:
Chengxiang Zhai

Carnegie Mellon University, Urbana, IL

Carnegie Mellon University, Urbana, IL
View Profile

,
John Lafferty

Carnegie Mellon University, Pittsburgh, PA

Carnegie Mellon University, Pittsburgh, PA
View Profile

Authors Info & Claims

ACM Transactions on Information Systems Volume 22 Issue 2pp 179–214https://doi.org/10.1145/984321.984322

Published:01 April 2004Publication History

ACM Transactions on Information Systems

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and to then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different role---to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the setting of smoothing parameters automatically. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to---or better than---the best results achieved using a single smoothing method and exhaustive parameter search on the test data.

References

Berger, A. and Lafferty, J. 1999. Information retrieval as statistical translation. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 222--229. Google Scholar
Chen, S. F. and Goodman, J. 1998. An empirical study of smoothing techniques for language modeling. Tech. Rep. TR-10-98, Harvard University.Google Scholar
Fuhr, N. 1992. Probabilistic models in information retrieval. Comput J. 35, 3, 243--255. Google Scholar
Good, I. J. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, parts 3, 4, 237--264.Google Scholar
Hiemstra, D. and Kraaij, W. 1999. Twenty-one at TREC-7: Ad-hoc and cross-language track. In Proceedings of 7th Text REtrieval Conference (TREC-7). 227--238.Google Scholar
Jelinek, F. and Mercer, R. 1980. Interpolated estimation of markov sourceparameters from sparse data. In Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds. 381--402.Google Scholar
Katz, S. M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoustics, Speech and Signal Processing (ASSP) 35 400--401.Google Scholar
Kneser, R. and Ney, H. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Computer Society Press, Los Alamitos, Calif., 181--184.Google Scholar
Kraaij, W., Westerveld, T., and Hiemstra, D. 2002. The importance of prior probabilities for entry page search. In Proceedings of SIGIR'02. ACM, New York, 27--34. Google Scholar
Kwok, K. and Chan, M. 1998. Improving two-stage ad-hoc retrieval for short queries. In Proceedings of SIGIR'98. ACM, New York, 250--256. Google Scholar
Lafferty, J. and Zhai, C. 2001. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR'01. ACM, New York, 111--119. Google Scholar
Lavrenko, V. and Croft, B. 2001. Relevance-based language models. In Proceedings of SIGIR'01. ACM, New York, 120--127. Google Scholar
MacKay, D. and Peto, L. 1995. A hierarchical Dirichlet language model. Nat. Lang. Eng. 1, 3, 289--307.Google Scholar
Miller, D. H., Leek, T., and Schwartz, R. 1999. A hidden Markov model information retrieval system. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 214--221. Google Scholar
Ney, H., Essen, U., and Kneser, R. 1994. On structuring probabilistic dependencies in stochastic language modeling. Comput. Speech Lang. 8, 1--38.Google Scholar
Ney, H., Essen, U., and Kneser, R. 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. Pattern Anal. Machine Intel. 17, 12, 1202--1212. Google Scholar
Ponte, J. 1998. A language modeling approach to information retrieval. Ph.D. dissertation. Univ. of Massachusetts at Amherst. Google Scholar
Ponte, J. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the ACM SIGIR'98. ACM, New York, 275--281. Google Scholar
Robertson, S. E., van Rijsbergen, C. J., and Porter, M. F. 1981. Probabilistic models of indexing and searching. In Information Retrieval Research, R. N. Oddy et al., Eds. Butterworths, 35--56. Google Scholar
Robertson, S. E., Walker, S., Jones, S., M.Hancock-Beaulieu, M., and Gatford, M. 1995. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3), D. K. Harman, Ed. 109--126.Google Scholar
Salton, G. and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Inf. Proc. Manage. 24, 513--523. Google Scholar
Salton, G. and Buckley, C. 1990. Improving retrieval performance by relevance feedback. J. Amer. Soc. Inf. Sci. 44, 4, 288--297.Google Scholar
Salton, G., Wong, A., and Yang, C. S. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11, 613--620. Google Scholar
Singhal, A., Buckley, C., and Mitra, M. 1996. Pivoted document length normalization. In Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 21--29. Google Scholar
Song, F. and Croft, B. 1999. A general language model for information retrieval. In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 279--280. Google Scholar
Sparck Jones, K. and Willett, P., Eds. 1997. Readings in Information Retrieval. Morgan-Kaufmann, San Mateo, Calif. Google Scholar
van Rijsbergen, C. J. 1986. A non-classical logic for information retrieval. Comput. J. 29, 6, 481--485.Google Scholar
Voorhees, E. and Harman, D., Eds. 2001. Proceedings of Text REtrieval Conference (TREC1-9). NIST Special Publications. http://trec.nist.gov/pubs.html. Google Scholar
Wong, S. K. M. and Yao, Y. Y. 1995. On modeling information retrieval with probabilistic inference. ACM Trans. Inf. Syst. 13, 1, 69--99. Google Scholar
Zhai, C. and Lafferty, J. 2001. Model-based feedback in the KL-divergence retrieval model. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM 2001). 403--410. Google Scholar

Index Terms

A study of smoothing methods for language models applied to information retrieval
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval models and ranking

Recommendations

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval
SIGIR Test-of-Time Awardees 1978-2001

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech ...
Read More
A study of smoothing methods for language models applied to Ad Hoc information retrieval
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech ...
Read More
Two-stage language models for information retrieval
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval

The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Information Systems Volume 22, Issue 2
April 2004
178 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/984321
Issue’s Table of Contents

Copyright © 2004 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2004
Published in tois Volume 22, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Dirichlet prior smoothing
EM algorithm
Jelinek--Mercer smoothing
Statistical language models
TF-IDF weighting
absolute discounting smoothing
backoff smoothing
interpolation smoothing
leave-one-out
risk minimization
term weighting
two-stage smoothing
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 889
  Total Citations
  View Citations
- 6,744
  Total Downloads
- Downloads (Last 12 months)150
- Downloads (Last 6 weeks)18
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

A study of smoothing methods for language models applied to Ad Hoc information retrieval

Two-stage language models for information retrieval

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

A study of smoothing methods for language models applied to information retrieval

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval

A study of smoothing methods for language models applied to Ad Hoc information retrieval

Two-stage language models for information retrieval

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media