research-article

TF-IDF uncovered: a study of theories and probabilities

Authors:
Thomas Roelleke, Queen Mary, University of London, London, United Kingdom
Jun Wang, Queen Mary, University of London, London, United Kingdom

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, July 2008, Pages 435–442. https://doi.org/10.1145/1390334.1390409

Published: 20 July 2008

ABSTRACT

Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper reviews existing interpretations and then systematically relates TF-IDF to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent terms and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy of P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral ∫ 1/x dx = log x. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
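
As a rough numerical illustration of the integral relationship named in the abstract, the sketch below checks that the integral of 1/x from a term occurrence probability up to 1 matches the usual IDF value. This is not the paper's derivation. It assumes the common definition IDF = -log(df/N) with the natural logarithm; the function names and the collection statistics are hypothetical and chosen only for this example.

```python
import math


def idf(doc_freq: int, n_docs: int) -> float:
    """IDF as the negative natural log of the term occurrence probability df/N.

    Assumption: this is the textbook definition, used here for illustration.
    """
    return -math.log(doc_freq / n_docs)


def integral_of_reciprocal(lower: float, upper: float = 1.0, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x from lower to upper."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(term_freq: int, doc_freq: int, n_docs: int) -> float:
    """Plain TF-IDF weight: raw term frequency times IDF."""
    return term_freq * idf(doc_freq, n_docs)


if __name__ == "__main__":
    # Hypothetical collection: the term occurs in 10 of 1000 documents.
    n_docs, doc_freq = 1000, 10
    p_t = doc_freq / n_docs  # term occurrence probability

    print("idf:            ", idf(doc_freq, n_docs))
    print("integral of 1/x:", integral_of_reciprocal(p_t))
    print("tf-idf (tf=3):  ", tf_idf(3, doc_freq, n_docs))
```

Because the antiderivative of 1/x is log x, the integral from P(t) to 1 equals log 1 - log P(t) = -log P(t), so the first two printed values agree up to the error of the numerical approximation.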

Index Terms

Information systems
  Information retrieval
    Retrieval models and ranking

Published in
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008, 934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334
General Chairs: Tat-Seng Chua (National University of Singapore), Mun-Kew Leong (National Library Board, Singapore)
Program Chairs: Syung Hyon Myaeng (Information and Communications University, Korea), Douglas W. Oard (University of Maryland, College Park, USA), Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Italy)
Copyright © 2008 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
TF-IDF interpretations
derivative of logarithm
document-query-independence
integral
probability theory
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%