Guilt-free data reuse

Authors:
Cynthia Dwork

Microsoft Research, Mountain View, CA

Microsoft Research, Mountain View, CA
View Profile

,
Vitaly Feldman

IBM Almaden Research Center, CA

IBM Almaden Research Center, CA
View Profile

,
Moritz Hardt

Google Research, Mountain View, CA

Google Research, Mountain View, CA
View Profile

,
Toniann Pitassi

University of Toronto, Toronto, Ontario, Canada

University of Toronto, Toronto, Ontario, Canada
View Profile

,
Omer Reingold

Stanford University, Stanford, CA

Stanford University, Stanford, CA
View Profile

,
Aaron Roth

University of Pennsylvania, Philadelphia, PA

University of Pennsylvania, Philadelphia, PA
View Profile

Authors Info & Claims

Communications of the ACM Volume 60 Issue 4April 2017pp 86–93https://doi.org/10.1145/3051088

Published:24 March 2017Publication History

Communications of the ACM

Abstract

Existing approaches to ensuring the validity of inferences drawn from data assume a fixed procedure to be performed, selected before the data are examined. Yet the practice of data analysis is an intrinsically interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused.

In this work, we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. We demonstrate new approaches for addressing the challenges of adaptivity that are based on techniques developed in privacy-preserving data analysis.

As an application of our techniques we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced adaptively by a learning algorithm operating on a training set.

References

Aschwanden, C. Science isn't broken.Google Scholar
Bassily, R., Nissim, K., Smith, A.D., Steinke, T., Stemmer, U., Ullman, J. Algorithmic stability for adaptive data analysis. In STOC, Cambridge, MA, USA (2016), 1046--1059. Google ScholarDigital Library
Benjamini, Y., Hochberg, Y. Controlling the false discovery rate - A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57 (1995), 289--4300.Google ScholarCross Ref
Blum, A., Dwork, C., McSherry, F., Nissim, K. Practical privacy: The SuLQ framework. In PODS, Baltimore, Maryland, USA (2005), 128--138. Google ScholarDigital Library
Blum, A., Hardt, M. The ladder: A reliable leaderboard for machine learning competitions. In ICML, Lille, France (2015), 1006--1014. Google ScholarDigital Library
Bousquet, O., Elisseeff, A. Stability and generalization. JMLR 2 (2002), 499--526. Google ScholarDigital Library
Cawley, G.C., Talbot. N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11 (2010), 2079--2107. Google ScholarDigital Library
Dwork, C. A firm foundation for private data analysis. CACM 54, 1 (2011), 86--95. Google ScholarDigital Library
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A. Preserving statistical validity in adaptive data analysis. CoRR, abs/1411.2664, 2014. Extended abstract in STOC 2015. Google ScholarDigital Library
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A. Generalization in adaptive data analysis and holdout reuse. CoRR, abs/1506, 2015. Extended abstract in NIPS 2015. Google ScholarDigital Library
Dwork, C., McSherry, F., Nissim, K., Smith, A. Calibrating noise to sensitivity in private data analysis. In TCC, New York, NY, USA (2006), 265--284. Google ScholarDigital Library
Dwork, C., Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 34 (2014), 211--407. Google ScholarDigital Library
Freedman, D.A. A note on screening regression equations. Am. Statist. 37, 2 (1983), 152--155.Google Scholar
Gelman, A., Loken, E. The statistical crisis in science. Am. Statist. 102, 6 (2014), 460.Google Scholar
Hardt, M., Rothblum, G. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, Las Vegas, Nevada, USA (2010), 61--70. Google ScholarDigital Library
Hardt, M., Ullman, J. Preventing false discovery in interactive data analysis is hard. In FOCS, Philadelphia, PA, USA (2014), 454--463. Google ScholarDigital Library
Hastie, T., Tibshirani, R., Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag, New York (2009).Google Scholar
Ioannidis, J.P.A. Why most published research findings are false. PLoS Med. 2, 8 (Aug. 2005), 124.Google ScholarCross Ref
Kearns, M. Efficient noise-tolerant learning from statistical queries. J. ACM 45, 6 (1998), 983--1006. Google ScholarDigital Library
Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math. 25, 1-3 (2006), 161--193.Google ScholarCross Ref
Nissim, K., Stemmer, U. On the generalization properties of differential privacy. CoRR (2015), abs/1504.05800.Google Scholar
Reunanen, J. Overfitting in making comparisons between variable selection methods. J. Mach Learn Res. 3 (2003), 1371--1382. Google ScholarDigital Library
Shalev-Shwartz, S., Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 32 Avenue of the Americas, New York, NY 10013-2473, USA (2014). Google ScholarDigital Library
Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K. Learnability, stability and uniform convergence. J. Mach Learn Res. 11 (2010), 2635--2670. Google ScholarDigital Library
Steinke, T., Ullman, J. Interactive fingerprinting codes and the hardness of preventing false discovery. In COLT, Paris, France (2015), 1588--1628.Google Scholar

Index Terms

Guilt-free data reuse
1. Theory of computation

Recommendations

Viewpoint-centred reuse: bridging the gap between reusability and the needs of the reuser

The "Reuse without Reusers" SyndromeThe problems of low productivity, late delivery and poor software quality are, unfortunately, characteristic of the software industry (Horowitz and Munson 1989). Increasingly, software developers are looking towards ...
Read More
Reuse repositories and reuse: the realities
OOPSLA '03: Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications

This panel (part of the 2003 Onward! program) will discuss libraries, repositories, and reuse. While there is so much hype and noise about repositories and metadata - so little understood about how hard it is to design for reuse and to encourage ...
Read More
Unanticipated reuse of large-scale software features
ICSE '06: Proceedings of the 28th international conference on Software engineering

Software reuse has been endorsed as a way to reduce development times and costs while increasing software quality and reliability. Techniques designed to encourage software reuse have concentrated on creating reusable software in the form of frameworks, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Communications of the ACM Volume 60, Issue 4
April 2017
86 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3069398
Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY
Issue’s Table of Contents
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 March 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 11
  Total Citations
  View Citations
- 14,751
  Total Downloads
- Downloads (Last 12 months)4,804
- Downloads (Last 6 weeks)18
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Guilt-free data reuse

Communications of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Viewpoint-centred reuse: bridging the gap between reusability and the needs of the reuser

Reuse repositories and reuse: the realities

Unanticipated reuse of large-scale software features