DOI: 10.1145/1390334.1390409
research-article

TF-IDF uncovered: a study of theories and probabilities

Published: 20 July 2008

ABSTRACT

Interpretations of TF-IDF are based on binary independence retrieval, Poisson, information theory, and language modelling. This paper contributes a review of existing interpretations and then systematically relates TF-IDF to the probabilities P(q|d) and P(d|q). Two approaches are explored: a space of independent terms and a space of disjoint terms. For independent terms, an "extreme" query/non-query term assumption uncovers TF-IDF, and an analogy of P(d|q) and the probabilistic odds O(r|d, q) mirrors relevance feedback. For disjoint terms, a relationship between probability theory and TF-IDF is established through the integral $\int \frac{1}{x}\,dx = \log x$. This study uncovers components such as divergence from randomness and pivoted document length to be inherent parts of a document-query independence (DQI) measure, and interestingly, an integral of the DQI over the term occurrence probability leads to TF-IDF.
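
The link between the integral of 1/x and the logarithm in IDF can be illustrated numerically. The sketch below is not the paper's derivation: it assumes the common definition idf(t) = -log(df(t)/N), picks the integration limits [df(t)/N, 1] purely for illustration, and uses a hypothetical four-document toy corpus. The helper names `idf`, `integral_of_reciprocal`, and `tf_idf` are likewise illustrative.

```python
import math
from collections import Counter


def idf(doc_freq: int, num_docs: int) -> float:
    """Assumed idf definition: -log of the fraction of documents containing the term."""
    return -math.log(doc_freq / num_docs)


def integral_of_reciprocal(lower: float, upper: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of 1/x over [lower, upper]."""
    width = (upper - lower) / steps
    return sum(width / (lower + (i + 0.5) * width) for i in range(steps))


def tf_idf(query, doc, doc_freqs, num_docs) -> float:
    """Sum of (raw term frequency * idf) over the query terms known to the corpus."""
    return sum(
        doc.count(term) * idf(doc_freqs[term], num_docs)
        for term in query
        if term in doc_freqs
    )


# Hypothetical toy corpus; each document is a list of terms.
docs = [
    ["sailing", "boats", "sailing"],
    ["boats", "east", "coast"],
    ["sailing", "greece"],
    ["east", "coast", "east"],
]
num_docs = len(docs)

# Document frequency: the number of documents in which each term occurs.
doc_freqs = Counter(term for doc in docs for term in set(doc))

# Integrating 1/x from P(t) = df(t)/N up to 1 reproduces idf(t) = -log P(t).
p_greece = doc_freqs["greece"] / num_docs
approx = integral_of_reciprocal(p_greece, 1.0)
exact = idf(doc_freqs["greece"], num_docs)

score = tf_idf(["sailing", "greece"], docs[2], doc_freqs, num_docs)
```

Here `approx` and `exact` agree to within the numerical error of the midpoint rule. This only shows why an integral over a term occurrence probability can yield a logarithmic weight; it does not reproduce the DQI measure described in the abstract.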


Published in

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334

Copyright © 2008 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 792 of 3,983 submissions (20%)
