ABSTRACT
The effectiveness of information retrieval systems is measured by comparing performance on a common set of queries and documents. Significance tests are often used to evaluate the reliability of such comparisons. Previous work has examined such tests, but produced results of limited applicability. Other work established an alternative benchmark for significance, but the resulting test was too stringent. In this paper, we revisit the question of how such tests should be used. We find that the t-test is highly reliable (more so than the sign or Wilcoxon tests), and is far more reliable than simply showing a large percentage difference in effectiveness measures between IR systems. Our results show that past empirical work on significance tests overestimated their error. We also reconsider comparisons between the reliability of precision at rank 10 and mean average precision, arguing that past comparisons did not account for the assessor effort required to compute each measure. This investigation shows that assessor effort would be better spent building test collections with more topics, each assessed in less detail.
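To make the comparison concrete, the sketch below (not the paper's own code; it assumes SciPy, and the per-topic scores and relevance judgments are made-up placeholders) computes the two effectiveness measures named in the abstract, precision at rank 10 and average precision, and then applies the three paired significance tests the paper compares.

```python
# A minimal sketch of the evaluation pipeline discussed above: per-topic
# effectiveness scores for two systems, compared with the paired t-test,
# Wilcoxon signed-rank test, and sign test. All data here is illustrative.
from scipy import stats

def precision_at_k(rels, k=10):
    """P@k: fraction of the top k ranked documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """AP: mean of the precision values at the ranks of relevant documents,
    divided over all relevant documents for the topic."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Judged top-10 ranking for one hypothetical topic (1 = relevant); assume
# one further relevant document was not retrieved, so num_relevant = 5.
rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(f"P@10 = {precision_at_k(rels):.2f}, "
      f"AP = {average_precision(rels, num_relevant=5):.3f}")

# Hypothetical per-topic average precision for systems A and B on ten topics.
scores_a = [0.31, 0.45, 0.12, 0.58, 0.40, 0.22, 0.37, 0.50, 0.29, 0.44]
scores_b = [0.28, 0.49, 0.15, 0.52, 0.43, 0.20, 0.41, 0.47, 0.33, 0.39]

# Paired t-test: uses the magnitudes of the per-topic score differences.
t_stat, t_p = stats.ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank test: uses only the ranks of the differences.
w_stat, w_p = stats.wilcoxon(scores_a, scores_b)

# Sign test: uses only the directions of the differences; equivalent to a
# two-sided binomial test with p = 0.5, with tied topics dropped.
wins = sum(a > b for a, b in zip(scores_a, scores_b))
ties = sum(a == b for a, b in zip(scores_a, scores_b))
sign_p = stats.binomtest(wins, n=len(scores_a) - ties, p=0.5).pvalue

print(f"paired t-test p = {t_p:.3f}")
print(f"Wilcoxon      p = {w_p:.3f}")
print(f"sign test     p = {sign_p:.3f}")
```

The three tests consume progressively less of the paired data: the t-test uses the magnitudes of the per-topic differences, the Wilcoxon test only their ranks, and the sign test only their directions, which is one reason their sensitivity and reliability can differ.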