research-article

Public Access

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Authors:
Yeounoh Chung

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Michael Lind Mortensen

Aarhus University, Aarhus, Denmark

Aarhus University, Aarhus, Denmark
View Profile

,
Carsten Binnig

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

,
Tim Kraska

Brown University, Providence, RI, USA

Brown University, Providence, RI, USA
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 861–876https://doi.org/10.1145/2882903.2882909

Published:14 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 861–876

ABSTRACT

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.

References

P. D. Allison. Handling missing data by maximum likelihood. In SAS global forum, pages 1--21, 2012.Google Scholar
S. Amer-Yahia, A. Doan, J. Kleinberg, N. Koudas, and M. Franklin. Crowds, clouds, and algorithms: Exploring the human side of "big data" applications. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, 2010. Google ScholarDigital Library
J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association, 88(421), 1993.Google Scholar
K. P. Burnham and W. S. Overton. Estimation of the Size of a Closed Population when Capture Probabilities vary Among Animals. Biometrika, 65(3), 1978.Google Scholar
A. Chao. Nonparametric Estimation of the Number of Classes in a Population. SJS, 11(4), 1984.Google Scholar
A. Chao. Species estimation and applications. In Encyclopedia of Statistical Sciences, 2nd Edition, pages 7907--7916. Wiley, New York, 2005.Google Scholar
A. Chao and S. Lee. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association, 87(417):210--217, 1992.Google ScholarCross Ref
M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '00, pages 268--279. ACM, 2000. Google ScholarDigital Library
R. B. D'Agostino Jr and D. B. Rubin. Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association, 95(451):749--759, 2000.Google ScholarCross Ref
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1--38, 1977.Google Scholar
A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86--96, Apr. 2011. Google ScholarDigital Library
D. Florescu, D. Koller, and A. Y. Levy. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, 1997. Google ScholarDigital Library
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: Answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 2011. Google ScholarDigital Library
I. J. Good. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3/4), 1953.Google Scholar
Google. Freebase. https://www.freebase.com, 2015. Accessed: 2015-07-08.Google Scholar
D. Haas, M. Greenstein, K. Kamalov, A. Marcus, M. Olszewski, and M. Piette. Reducing error in context-sensitive crowdsourced tasks. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.Google Scholar
P. J. Haas. Hoeffding Inequalities for Join Selectivity Estimation and Online Aggregation. IBM, 1996.Google Scholar
P. J. Haas et al. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of VLDB, 1995. Google ScholarDigital Library
A. Y. Halevy. Data publishing and sharing using fusion tables. In CIDR, 2013.Google Scholar
L. Kish. Survey sampling. John Wiley and Sons, 1965.Google Scholar
W. Lang, R. V. Nehme, E. Robinson, and J. F. Naughton. Partial results in database systems. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1275--1286. ACM, 2014. Google ScholarDigital Library
U. Leser and F. Naumann. Query planning with information quality bounds. In H. Larsen, T. Andreasen, H. Christiansen, J. Kacprzyk, and S. Zadrozny, editors, Flexible Query Answering Systems, volume 7 of Advances in Soft Computing, pages 85--94. Physica-Verlag HD, 2001.Google ScholarCross Ref
M. Lexa. Useful facts about the kullback-leibler discrimination distance. Houston, Texas, 2004.Google Scholar
X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97--108. Google ScholarDigital Library
J. Liang. Estimation Methods for the Size of Deep Web Textural Data Source: A Survey. cs.uwindsor.ca/richard/cs510/survey_jie_liang.pdf, 2008.Google Scholar
J. Lu and D. Li. Estimating deep web data source size by capture--recapture method. Inf. Retr., 13(1):70--95, Feb. 2010. Google ScholarDigital Library
R. Lynch and B. Kim. Sample size, the margin of error and the coefficient of variation. InterStat, 2010.Google Scholar
M. Magnani and D. Montesi. A survey on uncertainty management in data integration. J. Data and Information Quality, 2(1):5:1--5:33, July 2010. Google ScholarDigital Library
A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Demonstration of qurk: a query processor for humanoperators. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12--16, 2011, pages 1315--1318, 2011. Google ScholarDigital Library
A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9--12, 2011, Online Proceedings, pages 211--214, 2011.Google Scholar
D. A. McAllester and R. E. Schapire. On the convergence rate of good-turing estimators. In COLT, pages 1--6. Citeseer, 2000. Google ScholarDigital Library
J. McClave and T. Sincich. Statistics. Pearson, 2013.Google Scholar
W. Meng, K.-L. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the usefulness of search engines. In Data Engineering, 1999. Proceedings., 15th International Conference on, pages 146--153, Mar 1999. Google ScholarDigital Library
F. Naumann, J.-C. Freytag, and U. Leser. Completeness of integrated information sources. Inf. Syst., 29(7):583--615, Sept. 2004. Google ScholarDigital Library
M. T. Neiling and H.-J. Lenz. Data integration by means of object identification in information systems. In In Proceedings of European Conference on Information Systems, 2000.Google Scholar
F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, volume 86, pages 25--28, 1986. Google ScholarDigital Library
J. W. Osborne. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage, 2012.Google Scholar
A. Parameswaran and N. Polyzotis. Answering Queries using Humans, Algorithms and Databases. In Proc. of CIDR, 2011.Google Scholar
Pew Research Center. How u.s. tech-sector jobs have grown, changed in 15 years. http://pewrsr.ch/PtqZDA, 2014. Accessed: 2015-07-08.Google Scholar
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
S. Razniewski, F. Korn, W. Nutt, and D. Srivastava. Identifying the extent of completeness of query answers over partially complete databases.Google Scholar
J. Rice. Mathematical statistics and data analysis. Cengage Learning, 2006.Google Scholar
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581--592, 1976.Google ScholarCross Ref
B. Saha and D. Srivastava. Data quality: The other face of big data. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 1294--1297, 2014.Google ScholarCross Ref
R. Sapsford. Survey Research. SAGE Publications, 1999.Google Scholar
B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
K. I. Ugland, J. S. Gray, and K. E. Ellingsen. The species--accumulation curve and estimation of species richness. Journal of Animal Ecology, 72(5):888--897, 2003.Google ScholarCross Ref
G. Valiant and P. Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 685--694. ACM, 2011. Google ScholarDigital Library
Wikipedia. 68-95-99.7 rule. https://en.wikipedia.org/wiki/68--95--99.7_rule, 2015. Accessed: 2015-07-08.Google Scholar
Wikipedia. List of u.s. states by gdp. https://en.wikipedia.org/wiki/List_of_U.S._states_by_GDP, 2015. Accessed: 2015-07-08.Google Scholar
T. Yan, V. Kumar, and D. Ganesan. Crowdsearch: Exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, pages 77--90, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
Y. C. Yuan. Multiple imputation for missing data: Concepts and new development (version 9.0). SAS Institute Inc, Rockville, MD, 2010.Google Scholar

Index Terms

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Recommendations

Estimating the Impact of Unknown Unknowns on Aggregate Query Results
Best of SIGMOD 2016 Papers and Regular Papers

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete?...
Read More
An Interpretability Case Study of Unknown Unknowns Taking Clothes Image Classification CNNs as an Example
Advances in Computer Graphics
Abstract
“Unknown unknowns” are instances predicted models assign incorrect labels with high confidence, greatly reducing the generalization ability of models. In practical applications, unknown unknowns may lead to significant decision-making mistakes and ...
Read More
What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition
WWW '22: Proceedings of the ACM Web Conference 2022

Unknown unknowns represent a major challenge in reliable image recognition. Existing methods mainly focus on unknown unknowns identification, leveraging human intelligence to gather images that are potentially difficult for the machine. To drive a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
aggregate query
data integration
open-world assumption
species estimation
unknown unknowns
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 566
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

An Interpretability Case Study of Unknown Unknowns Taking Clothes Image Classification CNNs as an Example

What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

An Interpretability Case Study of Unknown Unknowns Taking Clothes Image Classification CNNs as an Example

What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media