DOI: 10.1145/2484028.2484066
research-article

Copulas for information retrieval

Published: 28 July 2013

ABSTRACT

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.
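As a rough illustration of the idea in the abstract, the sketch below combines two relevance criteria by first mapping each raw score to its empirical CDF value and then joining the two values with a copula. Everything specific here is an assumption made for illustration: the Gumbel family, the value `theta=2.0`, the helper names (`empirical_cdf`, `gumbel_copula`, `rank`), and the toy scores are not taken from the paper, which may use different copula families and estimation procedures. The independence (product) copula is included as the non-copula baseline.

```python
import math
from bisect import bisect_right


def empirical_cdf(sample):
    """Return a function mapping a raw score to its empirical CDF value in [0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # Dividing by n + 1 keeps the value strictly below 1.
        return bisect_right(ordered, x) / (n + 1)

    return cdf


def independence_copula(u, v):
    """Product copula: treats the two criteria as independent."""
    return u * v


def gumbel_copula(u, v, theta=2.0):
    """Bivariate Gumbel copula (theta >= 1; theta == 1 reduces to the product)."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def rank(docs, copula):
    """Rank documents by the copula value of their CDF-transformed scores."""
    cdf_a = empirical_cdf([a for a, _ in docs.values()])
    cdf_b = empirical_cdf([b for _, b in docs.values()])
    scored = {d: copula(cdf_a(a), cdf_b(b)) for d, (a, b) in docs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical (topicality, second criterion) scores for four documents.
docs = {"d1": (2.1, 0.9), "d2": (3.5, 0.2), "d3": (1.0, 0.8), "d4": (3.0, 0.7)}

if __name__ == "__main__":
    print("independence:", rank(docs, independence_copula))
    print("gumbel:      ", rank(docs, gumbel_copula))
```

Working on CDF values rather than raw scores puts both criteria on a common [0, 1] scale, so the copula only has to describe how the criteria depend on each other. In practice the dependence parameter would be estimated from training data rather than fixed by hand.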


Published in

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN: 9781450320344
DOI: 10.1145/2484028

Copyright © 2013 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '13 paper acceptance rate: 73 of 366 submissions, 20%
Overall acceptance rate: 792 of 3,983 submissions, 20%
