
Guilt-free data reuse

Published: 24 March 2017

Abstract

Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. Yet the practice of data analysis is an intrinsically interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused.

In this work, we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. We demonstrate new approaches for addressing the challenges of adaptivity that are based on techniques developed in privacy-preserving data analysis.
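
To make the connection concrete, here is a minimal sketch (in Python, with numpy) of the basic primitive this line of work builds on: answering a statistical query, i.e., the mean of a [0, 1]-valued function over the dataset, with Laplace noise calibrated to the query's sensitivity, as in the standard Laplace mechanism from differential privacy. The function name and the choice of epsilon are illustrative and not taken from the paper.

```python
import numpy as np

def noisy_statistical_query(data, predicate, epsilon=0.5, rng=None):
    """Answer the statistical query mean(predicate(x)) with Laplace noise.

    predicate must map each example into [0, 1], so replacing one of the
    n examples changes the mean by at most 1/n; Laplace noise of scale
    1/(epsilon * n) then makes this single answer epsilon-differentially
    private.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    exact = np.mean([predicate(x) for x in data])
    return exact + rng.laplace(scale=1.0 / (epsilon * n))
```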

As an application of our techniques, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced adaptively by a learning algorithm operating on a training set.
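
For illustration only, the following simplified sketch (in the spirit of the paper's Thresholdout procedure, with the budget accounting and exact noise distributions omitted, and with placeholder parameter values) shows how such a reusable holdout can answer one adaptive query: report the training-set estimate whenever it agrees with the holdout estimate up to a noisy threshold, and fall back to a noisy holdout estimate only when they disagree.

```python
import numpy as np

def reusable_holdout_answer(train_vals, holdout_vals,
                            threshold=0.04, noise_scale=0.01, rng=None):
    """Answer one adaptive query against a reusable holdout (sketch).

    train_vals / holdout_vals: per-example values of the quantity being
    validated (e.g., 0/1 losses of the current hypothesis) on the
    training set and on the holdout set, respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    train_est = float(np.mean(train_vals))
    holdout_est = float(np.mean(holdout_vals))
    # Noisy check: does the training estimate still track the holdout?
    if abs(train_est - holdout_est) > threshold + rng.laplace(scale=noise_scale):
        # Noticeable disagreement: answer from the holdout, adding noise so
        # the analyst learns little about individual holdout examples.
        return holdout_est + rng.laplace(scale=noise_scale)
    # Agreement: answer from the training set; the holdout is touched only
    # through the noisy comparison, preserving its validity for later queries.
    return train_est
```

A learning pipeline would call such a routine once per candidate hypothesis it wants to validate, so information about the holdout set leaks only on queries where the training set has started to overfit.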

Published in

Communications of the ACM, Volume 60, Issue 4 (April 2017), 86 pages
ISSN: 0001-0782 • EISSN: 1557-7317 • DOI: 10.1145/3069398
Editor: Moshe Y. Vardi

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

