ABSTRACT
Forming test collection relevance judgments from the pooled output of multiple retrieval systems has become the standard process for creating resources such as the TREC, CLEF, and NTCIR test collections. This paper presents a series of experiments examining three different ways of building test collections where no system pooling is used. First, a collection formation technique combining manual feedback and multiple systems is adapted to work with a single retrieval system. Second, an existing method based on pooling the output of multiple manual searches is re-examined, testing a wider range of searchers and retrieval systems than has been examined before. Third, a new approach is explored in which the ranked output of a single automatic search on a single retrieval system is assessed for relevance: no pooling whatsoever. Using established techniques for evaluating the quality of relevance judgments, in all three cases, test collections are formed that are as good as those of TREC.
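One established technique for evaluating relevance-judgment quality, of the kind the abstract alludes to, is to rank a set of retrieval systems by effectiveness under the official judgments and again under the alternative judgments, then measure rank agreement with Kendall's tau. The sketch below is illustrative only; the system names and rank values are hypothetical, and real studies compute tau over per-system effectiveness scores such as mean average precision.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems.

    rank_a, rank_b: dicts mapping system name -> rank position
    (1 = best). Assumes both rankings cover the same systems and
    contain no ties.
    """
    concordant = discordant = 0
    for s, t in combinations(list(rank_a), 2):
        # A pair is concordant if both rankings order s and t the same way.
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical example: four systems ranked under official judgments
# vs. judgments built without system pooling. One of the six pairs
# is swapped, giving tau = (5 - 1) / 6.
official = {"sysA": 1, "sysB": 2, "sysC": 3, "sysD": 4}
alternate = {"sysA": 1, "sysB": 3, "sysC": 2, "sysD": 4}
print(kendall_tau(official, alternate))  # -> 0.666...
```

A tau of 1.0 means the two judgment sets rank the systems identically; values around 0.9 or above are conventionally read as the rankings being equivalent for evaluation purposes.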