DOI: 10.1145/1148170.1148263

A statistical method for system evaluation using incomplete judgments

Published: 06 August 2006

ABSTRACT

We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves.

Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance is dependent on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
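
The general idea of an unbiased, sampling-based estimate can be illustrated with a small importance-sampling sketch. This is not the estimator or the sampling distribution derived in the paper: the 1/rank prior, the function names, and the toy judgments below are all assumptions made for illustration. Documents are drawn with replacement from a non-uniform distribution, only the drawn documents need to be judged, and each judgment is reweighted by the inverse of its sampling probability so that the estimate of precision is unbiased.

```python
import random


def sampling_distribution(n_docs):
    """Hypothetical prior (an assumption, not the paper's distribution):
    weight the document at rank r proportionally to 1/r, normalized to 1."""
    weights = [1.0 / (rank + 1) for rank in range(n_docs)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_sample(probs, n_samples, rng):
    """Draw document indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=n_samples)


def estimate_precision(sample, judgments, probs):
    """Importance-weighted estimate of the fraction of relevant documents.

    Each sampled document i contributes judgments[i] / probs[i]; the mean of
    these terms estimates the number of relevant documents in the list, and
    dividing by the list length gives an estimate of precision.
    Only the judgments of sampled documents are ever read.
    """
    n_docs = len(probs)
    est_relevant = sum(judgments[i] / probs[i] for i in sample) / len(sample)
    return est_relevant / n_docs


if __name__ == "__main__":
    # Toy ranked list: 1 = relevant, 0 = non-relevant (made-up data).
    judgments = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    probs = sampling_distribution(len(judgments))
    rng = random.Random(0)
    sample = draw_sample(probs, 200, rng)
    print("true precision:     ", sum(judgments) / len(judgments))
    print("estimated precision:", estimate_precision(sample, judgments, probs))
```

The estimate is unbiased for any distribution that gives every document nonzero probability, but its variance depends on how well the distribution matches where the relevant documents actually are, which is why the choice of sampling distribution matters.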

    • Published in
      SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
      August 2006
      768 pages
      ISBN: 1595933697
      DOI: 10.1145/1148170

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%