ABSTRACT
Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test collections where no system pooling is used. First, a collection formation technique combining manual feedback and multiple systems is adapted to work with a single retrieval system. Second, an existing method based on pooling the output of multiple manual searches is re-examined, testing a wider range of searchers and retrieval systems than has been examined before. Third, a new approach is explored in which the ranked output of a single automatic search on a single retrieval system is assessed for relevance: no pooling whatsoever. Using established techniques for evaluating the quality of relevance judgments, in all three cases, test collections are formed that are as good as those of TREC.
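One established technique for evaluating relevance-judgment quality, of the kind the abstract alludes to, is to rank a set of retrieval systems by effectiveness under the official judgments and again under the alternative judgments, then measure rank agreement with Kendall's tau. The sketch below is illustrative only; the system names and rank values are hypothetical, and real studies compute tau over per-system effectiveness scores such as mean average precision.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank position
    (1 = best). Assumes both rankings cover the same systems and
    contain no ties.
    """
    concordant = discordant = 0
    for s, t in combinations(list(rank_a), 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical example: four systems ranked under official judgments
# vs. judgments built without system pooling. One of the six pairs
# is swapped, giving tau = (5 - 1) / 6.
official = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
alternate = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(official, alternate))  # -> 0.666...
```

A tau of 1.0 means the two judgment sets rank the systems identically; values around 0.9 or above are conventionally read as the rankings being equivalent for evaluation purposes.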