ABSTRACT
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules of thumb experimenters use, such as the rule that a good experiment needs at least 25 queries and that 50 is better, while challenging other beliefs, such as the belief that the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has roughly twice the average error rate of Average Precision. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest that researchers using Web-oriented measures such as Precision at 10 documents will need many more than 50 queries, or will have to require a very large difference in evaluation scores between two methods before concluding that the methods are actually different.
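To make the error-rate idea concrete, below is a minimal Python sketch, assuming you already have per-query scores for two retrieval methods. It is an illustrative reconstruction, not the paper's exact procedure: it estimates a measure's error rate as the fraction of random query samples on which the comparison of the two methods disagrees with the comparison over the full query set. The function names (`precision_at_k`, `average_precision`, `error_rate`) and all parameters are introduced here for illustration only.

```python
import random

def precision_at_k(rel_flags, k):
    """Precision at cutoff k: fraction of the top-k ranked documents
    that are relevant. rel_flags is a 0/1 list in rank order."""
    return sum(rel_flags[:k]) / k

def average_precision(rel_flags, num_relevant):
    """Non-interpolated Average Precision: mean of the precision
    values at the ranks of the relevant retrieved documents,
    normalized by the query's total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def error_rate(scores_a, scores_b, sample_size=50, trials=10_000, seed=0):
    """Estimate how often a random query sample of size `sample_size`
    reverses the full-set ordering of methods A and B under a given
    measure. scores_a and scores_b are per-query scores for the same
    queries in the same order."""
    assert len(scores_a) == len(scores_b) > sample_size
    rng = random.Random(seed)
    n = len(scores_a)
    full_a_wins = sum(scores_a) > sum(scores_b)  # decision on all queries
    flips = 0
    for _ in range(trials):
        idx = rng.sample(range(n), sample_size)
        sample_a_wins = (sum(scores_a[i] for i in idx)
                         > sum(scores_b[i] for i in idx))
        flips += sample_a_wins != full_a_wins  # count disagreements
    return flips / trials
```

Under this framing, computing `error_rate` once over P@30 scores and once over Average Precision scores for the same pair of runs is one way to observe the roughly two-to-one reliability gap the abstract reports.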