ABSTRACT
In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which the constituent scores were estimated, or they use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce copulas, a powerful statistical framework for modeling complex multidimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusing results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements over their non-copula counterparts on several tasks, as evaluated on large-scale standard corpora. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.
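As a rough illustration of what combining multidimensional relevance estimates through a copula can look like, the following minimal sketch maps two per-document scores to uniform marginals and joins them with a Gumbel copula. The choice of the Gumbel family, the dependence parameter `theta`, the rank-based marginal transform, and all function names are assumptions made for this example; the abstract does not specify which copula families or estimators are used.

```python
import math


def empirical_cdf(scores):
    """Map raw scores to pseudo-observations in (0, 1) using rank / (n + 1).

    Ties are broken by input order, which is a simplification for this sketch.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u


def gumbel_copula(u, v, theta):
    """Gumbel copula CDF C(u, v); theta >= 1, and theta == 1 reduces to u * v."""
    if theta < 1:
        raise ValueError("theta must be >= 1")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def rank_documents(topical, quality, theta):
    """Return document indices ordered by the copula-combined score, best first."""
    u = empirical_cdf(topical)
    v = empirical_cdf(quality)
    fused = [gumbel_copula(a, b, theta) for a, b in zip(u, v)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for three documents on two relevance dimensions.
    topical = [0.9, 0.1, 0.5]
    quality = [0.8, 0.2, 0.4]
    print(rank_documents(topical, quality, theta=2.0))
```

With `theta = 1` the combined score is the plain product of the two marginals, which corresponds to assuming the dimensions are independent; larger values of `theta` model stronger positive dependence between them.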