ABSTRACT
We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance depends on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low-variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
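The core mechanism behind such sampling-based evaluation can be illustrated with a minimal sketch. This is not the paper's exact estimator: it is a simplified Horvitz-Thompson-style estimate of the total number of relevant documents (rather than average precision itself), with hypothetical function and variable names, where documents are sampled from a non-uniform distribution and each judged document is weighted by the inverse of its sampling probability, yielding an unbiased estimate from a small number of judgments:

```python
import random

def ht_estimate_relevant(docs, probs, judge, n_samples):
    """Unbiased estimate of the number of relevant documents in `docs`.

    docs      -- the document collection
    probs     -- sampling probability for each document (must sum to 1)
    judge     -- oracle returning 1 if a document is relevant, else 0
    n_samples -- number of relevance judgments to spend
    """
    # Draw document indices according to the sampling distribution.
    sample = random.choices(range(len(docs)), weights=probs, k=n_samples)
    total = 0.0
    for i in sample:
        rel = judge(docs[i])       # one relevance judgment
        total += rel / probs[i]    # importance weight 1 / p_i
    # Averaging the weighted judgments gives an unbiased estimate
    # of sum_i rel_i, regardless of the sampling distribution used.
    return total / n_samples
```

A distribution skewed toward highly ranked documents (where relevant documents concentrate) keeps the estimate unbiased while reducing its variance, which is the role of the sampling distribution derived in the paper.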
REFERENCES
- E. C. Anderson. Monte Carlo methods and importance sampling. Lecture Notes for Statistical Genetics, October 1999.
- J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484--491. ACM Press, November 2003.
- J. A. Aslam, V. Pavlu, and E. Yilmaz. Measure-based metasearch. In G. Marchionini, A. Moffat, J. Tait, R. Baeza-Yates, and N. Ziviani, editors, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 571--572. ACM Press, August 2005.
- J. A. Aslam, V. Pavlu, and E. Yilmaz. A sampling technique for efficiently estimating measures of query retrieval performance using incomplete judgments. In Proceedings of the 22nd ICML Workshop on Learning with Partially Classified Training Data, August 2005.
- C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25--32, New York, NY, USA, 2004. ACM Press.
- G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Croft et al. [7], pages 282--289.
- W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, Aug. 1998. ACM Press, New York.
- D. Harman. Overview of the third text REtrieval conference (TREC-3). In D. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3), pages 1--19, Gaithersburg, MD, USA, Apr. 1995. U. S. Government Printing Office, Washington D. C.
- P. Kantor, M.-H. Kim, U. Ibraev, and K. Atasoy. Estimating the number of relevant documents in enormous collections. In Proceedings of the 62nd Annual Meeting of the American Society for Information Science, volume 36, pages 507--514, 1999.
- J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.
- E. M. Voorhees and D. Harman. Overview of the seventh text retrieval conference (TREC-7). In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 1--24, 1999.
- J. Zobel. How reliable are the results of large-scale retrieval experiments? In Croft et al. [7], pages 307--314.