ABSTRACT
The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results of limited applicability. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon tests), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests overestimated their error. We also reconsider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not account for the assessor effort required to compute each measure. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
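To make the comparison concrete, the sketch below (not the paper's own code; it assumes SciPy, and the per-topic scores and relevance judgments are made-up placeholders) computes the two effectiveness measures named in the abstract, precision at rank 10 and average precision, and then applies the three paired significance tests the paper compares.

```python
# A minimal sketch of the evaluation pipeline discussed above: per-topic
# effectiveness scores for two systems, compared with the paired t-test,
# Wilcoxon signed-rank test, and sign test. All data here is illustrative.
from scipy import stats

def precision_at_k(rels, k=10):
    """P@k: fraction of the top k ranked documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """AP: mean of the precision values at the ranks of relevant documents,
    divided over all relevant documents for the topic."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Judged top-10 ranking for one hypothetical topic (1 = relevant); assume
# one further relevant document was not retrieved, so num_relevant = 5.
rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(f"P@10 = {precision_at_k(rels):.2f}, "
      f"AP = {average_precision(rels, num_relevant=5):.3f}")

# Hypothetical per-topic average precision for systems A and B on ten topics.
scores_a = [0.31, 0.45, 0.12, 0.58, 0.40, 0.22, 0.37, 0.50, 0.29, 0.44]
scores_b = [0.28, 0.49, 0.15, 0.52, 0.43, 0.20, 0.41, 0.47, 0.33, 0.39]

# Paired t-test: uses the magnitudes of the per-topic score differences.
t_stat, t_p = stats.ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank test: uses only the ranks of the differences.
w_stat, w_p = stats.wilcoxon(scores_a, scores_b)

# Sign test: uses only the directions of the differences; equivalent to a
# two-sided binomial test with p = 0.5, with tied topics dropped.
wins = sum(a > b for a, b in zip(scores_a, scores_b))
ties = sum(a == b for a, b in zip(scores_a, scores_b))
sign_p = stats.binomtest(wins, n=len(scores_a) - ties, p=0.5).pvalue

print(f"paired t-test p = {t_p:.3f}")
print(f"Wilcoxon      p = {w_p:.3f}")
print(f"sign test     p = {sign_p:.3f}")
```

The three tests consume progressively less of the paired data: the t-test uses the magnitudes of the per-topic differences, the Wilcoxon test only their ranks, and the sign test only their directions, which is one reason their sensitivity and reliability can differ.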