research-article

Copulas for information retrieval

Authors:
Carsten Eickhoff

Delft University of Technology, Delft, Netherlands

Delft University of Technology, Delft, Netherlands
View Profile

,
Arjen P. de Vries

Centrum Wiskunde & Informatica, Amsterdam, Netherlands

Centrum Wiskunde & Informatica, Amsterdam, Netherlands
View Profile

,
Kevyn Collins-Thompson

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrievalJuly 2013Pages 663–672https://doi.org/10.1145/2484028.2484066

Published:28 July 2013Publication History

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

Pages 663–672

ABSTRACT

In many domains of information retrieval, system estimates of document relevance are based on multidimensional quality criteria that have to be accommodated in a unidimensional result ranking. Current solutions to this challenge are often inconsistent with the formal probabilistic framework in which constituent scores were estimated, or use sophisticated learning methods that make it difficult for humans to understand the origin of the final ranking. To address these issues, we introduce the use of copulas, a powerful statistical framework for modeling complex multi-dimensional dependencies, to information retrieval tasks. We provide a formal background to copulas and demonstrate their effectiveness on standard IR tasks such as combining multidimensional relevance estimates and fusion of results from multiple search engines. We introduce copula-based versions of standard relevance estimators and fusion methods and show that these lead to significant performance improvements on several tasks, as evaluated on large-scale standard corpora, compared to their non-copula counterparts. We also investigate criteria for understanding the likely effect of using copula models in a given retrieval scenario.

References

Alias-i. LingPipe 3.9.2. http://alias-i.com/lingpipe, 2013.Google Scholar
M. Ames and M. Naaman. Why we tag: motivations for annotation in mobile and online media. In SIGCHI 2007. ACM. Google ScholarDigital Library
TW Anderson and D.A. Darling. A test of goodness of fit. Journal of the American Statistical Association, 49, 1954.Google Scholar
Avi Arampatzis and Stephen Robertson. Modeling score distributions in information retrieval. Information Retrieval, 2011. Google ScholarDigital Library
J.A. Aslam and M. Montague. Bayes optimal metasearch: a probabilistic model for combining the results of multiple retrieval systems (poster session). In Proceedings of SIGIR 2000, pages 379--381. ACM. Google ScholarDigital Library
K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In Proceedings of SIGIR 2006, pages 43--50. ACM. Google ScholarDigital Library
G. Bordogna and G. Pasi. A model for a SOft Fusion of Information Accesses on the web. Fuzzy Sets and Systems, 148(1):105--118, 2004.Google ScholarCross Ref
P. Borlund. The concept of relevance in IR. JASIST, 2003. Google ScholarDigital Library
J.P. Bouchaud and M. Potters. Theory of financial risk and derivative pricing: from statistical physics to risk management. Cambridge University Press, 2003.Google ScholarCross Ref
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89--96. ACM, 2005. Google ScholarDigital Library
A. Charpentier, J.D. Fermanian, and O. Scaillet. The estimation of copulas: Theory and practice. Copulas: From theory to Application in Finance. Risk Publications, 2007.Google Scholar
K. Collins-Thompson, P.N. Bennett, R.W. White, S. de la Chica, and D. Sontag. Personalizing web search results by reading level. In CIKM 2011. ACM. Google ScholarDigital Library
N. Craswell, S. Robertson, H. Zaragoza, and M. Taylor. Relevance weighting for query independent evidence. In Proceedings of SIGIR 2005, pages 416--423. ACM. Google ScholarDigital Library
Ronan Cummins. Measuring the ability of score distributions to model relevance. In Information Retrieval Technology. Springer, 2011. Google ScholarDigital Library
C. da Costa Pereira, M. Dragoni, and G. Pasi. Multidimensional relevance: A new aggregation criterion. ECIR 2009. Google ScholarDigital Library
A. Druin, E. Foss, L. Hatley, E. Golub, M.L. Guha, J. Fails, and H. Hutchinson. How children search the internet with keyword interfaces. In Proceedings of the 8th International Conference on Interaction Design and Children, pages 89--96. ACM, 2009. Google ScholarDigital Library
C. Eickhoff, P. Serdyukov, and A.P. de Vries. A combined topical/non-topical approach to identifying web sites for children. In WSDM 2011. ACM. Google ScholarDigital Library
P. Embrechts, F. Lindskog, and A. McNeil. Modelling dependence with copulas and applications to risk management. Handbook of heavy tailed distributions in finance, 8(329--384):1, 2003.Google Scholar
E. Fox and J. Shaw. Combination of multiple searches. NIST Special Pub., 1994.Google Scholar
E.W. Frees and E.A. Valdez. Understanding relationships using copulas. North American actuarial journal, 2(1), 1998.Google Scholar
S. Gerani, C.X. Zhai, and F. Crestani. Score transformation in linear combination for multi-criteria relevance ranking. ECIR 2012. Google ScholarDigital Library
S.P. Harter. Psychological relevance and information science. JASIS, 43(9):602--615, 1992.Google ScholarCross Ref
W. Höffding. Scale-invariant correlation theory. Schriften des Mathematischen Instituts und des Instituts fur Angewandte Mathematik der Universitäat Berlin, 5(3):181--233, 1940.Google Scholar
X. Huang and W.B. Croft. A unified relevance model for opinion retrieval. In Proceeding of CIKM 2009, pages 947--956. ACM. Google ScholarDigital Library
Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, and Javed A Aslam. Score distribution models: assumptions, intuition, and robustness to score manipulation. In SIGIR 2010. ACM. Google ScholarDigital Library
J.M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604--632, 1999. Google ScholarDigital Library
W. Kraaij, T. Westerveld, and D. Hiemstra. The importance of prior probabilities for entry page search. In SIGIR. ACM, 2002. Google ScholarDigital Library
V. Lavrenko and W.B. Croft. Relevance based language models. In Proceedings of SIGIR 2001, pages 120--127. ACM. Google ScholarDigital Library
V. Lavrenko and W.B. Croft. Relevance models in information retrieval. Language modeling for information retrieval, pages 11--56, 2003.Google Scholar
T.Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.Google Scholar
W. Lu, S. Robertson, and A. MacFarlane. Field-weighted xml retrieval based on bm25. Advances in XML Information Retrieval and Evaluation, pages 161--171, 2006. Google ScholarDigital Library
C. Macdonald, R.L.T. Santos, I. Ounis, and I. Soboroff. Blog track research at trec. In SIGIR Forum 2010. ACM. Google ScholarDigital Library
R. Manmatha, Toni M. Rath, and Fangfang Feng. Modeling score distributions for combining the outputs of search engines. In SIGIR 2001. Google ScholarDigital Library
S. Mizzaro. Relevance: The whole history. JASIS, 1997. Google ScholarDigital Library
M. Montague and J.A. Aslam. Condorcet fusion for improved retrieval. In Proceedings of CIKM 2002, pages 538--548. ACM. Google ScholarDigital Library
M. Montague and J.A. Aslam. Relevance score normalization for metasearch. In CIKM 2001. ACM. Google ScholarDigital Library
A. Onken, S. Grünewälder, M.H.J. Munk, and K. Obermayer. Analyzing short-term noise dependencies of spike-counts in macaque prefrontal cortex using copulas and the flashlight transformation. PLoS computational biology, 5(11):e1000577, 2009.Google Scholar
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. JASIS, 47(10):749--764, 1996. Google ScholarDigital Library
J.M. Ponte and W.B. Croft. A language modeling approach to information retrieval. In Proceedings of SIGIR 1998, pages 275--281. ACM. Google ScholarDigital Library
F. Radlinski and T. Joachims. Query chains: learning to rank from implicit feedback. In SIGKDD, pages 239--248. ACM, 2005. Google ScholarDigital Library
B. Renard and M. Lang. Use of a gaussian copula for multivariate extreme value analysis: Some case studies in hydrology. Advances in Water Resources, 30(4):897--912, 2007.Google ScholarCross Ref
S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM 2004. Google ScholarDigital Library
S.E. Robertson, S. Walker, M.M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. Gaithersburgh, MD, 1994.Google Scholar
T. Saracevic. Relevance reconsidered. In Conference on Conceptions of Library and Information Science, 1996.Google Scholar
L. Schamber, M.B. Eisenberg, and M.S. Nilan. A re-examination of relevance: toward a dynamic, situational definition. IPM, 26(6):755--776, 1990. Google ScholarDigital Library
T. Schmidt. Coping with copulas. Risk Books: Copulas from Theory to Applications in Finance, 2007.Google Scholar
C. Schoelzel, P. Friederichs, et al. Multivariate non-normally distributed random variables in climate research--introduction to the copula approach. Nonlin. Processes Geophys., 15(5):761--772, 2008.Google ScholarCross Ref
A. Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8(1):11, 1959.Google Scholar
T. Tsikrika and M. Lalmas. Combining evidence for relevance criteria: a framework and experiments in web retrieval. ECIR 2007. Google ScholarDigital Library
D. Vallet and P. Castells. Personalized diversification of search results. In SIGIR 2012. ACM. Google ScholarDigital Library
C.C. Vogt and G.W. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(3):151--173, 1999. Google ScholarDigital Library
S. Wu and F. Crestani. Data fusion with estimated weights. In CIKM 2002. ACM. Google ScholarDigital Library

Index Terms

Copulas for information retrieval
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Recommendations

Modelling Term Dependence with Copulas
SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval

Many generative language and relevance models assume conditional independence between the likelihood of observing individual terms. This assumption is obviously naive, but also hard to replace or relax. There are only very few term pairs that actually ...
Read More
Modelling Complex Relevance Spaces with Copulas
CIKM '14: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

Modern relevance models consider a wide range of criteria in order to identify those documents that are expected to satisfy the user's information need. With growing dimensionality of the underlying relevance spaces the need for sophisticated score ...
Read More
Enhancing relevance models with adaptive passage retrieval
ECIR'08: Proceedings of the IR research, 30th European conference on Advances in information retrieval

Passage retrieval and pseudo relevance feedback/query expansion have been reported as two effective means for improving document retrieval in literature. Relevance models, while improving retrieval in most cases, hurts performance on some heterogeneous ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
July 2013
1188 pages
ISBN:9781450320344
DOI:10.1145/2484028
General Chairs:
Gareth J.F. Jones
Dublin City University, Ireland
,
Páraic Sheridan
Dublin City University, Ireland
,
Program Chairs:
Diane Kelly
University of North Carolina, Chapel Hill, USA
,
Maarten de Rijke
University of Amsterdam, The Netherlands
,
Tetsuya Sakai
Microsoft Research Asia, China
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 July 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data fusion
multivariate relevance
probabilistic framework
ranking
relevance models
Qualifiers
- research-article
Conference

Acceptance Rates
SIGIR '13 Paper Acceptance Rate73of366submissions,20%Overall Acceptance Rate792of3,983submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 487
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Copulas for information retrieval

SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Modelling Term Dependence with Copulas

Modelling Complex Relevance Spaces with Copulas

Enhancing relevance models with adaptive passage retrieval