research-article

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Authors:
Yeounoh Chung, Brown University, Providence, RI
Michael Lind Mortensen, Aarhus University, Aarhus C, Denmark
Carsten Binnig, Brown University, Providence, RI
Tim Kraska, Brown University, Providence, RI

ACM Transactions on Database Systems, Volume 43, Issue 1, Article No. 3, pp. 1–37. https://doi.org/10.1145/3167970

Published: 06 March 2018

Abstract

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? and (2) What is the impact of any unknown (i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution; we also propose a parametric model that can be used instead when the data sources are imbalanced. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
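
To make the key idea concrete, here is a minimal sketch of how overlap between sources can hint at how many items are still unobserved. It uses the bias-corrected Chao1 species-richness estimator as a stand-in; it is not necessarily the estimator developed in the article. The function names `chao1_estimate` and `mean_substitution_sum`, the toy sources, and the mean-substitution step for SUM are all illustrative assumptions.

```python
from collections import Counter


def chao1_estimate(sources):
    """Return (observed_count, estimated_total_count) of distinct items.

    Assumption: each source is an iterable of item identifiers, and an item
    counts as "sighted" at most once per source.
    """
    sightings = Counter()
    for source in sources:
        sightings.update(set(source))

    observed = len(sightings)
    f1 = sum(1 for c in sightings.values() if c == 1)  # items seen in exactly one source
    f2 = sum(1 for c in sightings.values() if c == 2)  # items seen in exactly two sources

    # Bias-corrected Chao1: many singletons relative to doubletons suggest
    # that many items have not been seen at all. Stays finite when f2 == 0.
    unseen = f1 * (f1 - 1) / (2 * (f2 + 1))
    return observed, observed + unseen


def mean_substitution_sum(values, estimated_total):
    """Scale the observed SUM by assuming unseen items look like the observed mean.

    This is a deliberately naive, hypothetical correction for illustration only.
    """
    observed_mean = sum(values.values()) / len(values)
    return observed_mean * estimated_total


# Toy example: three overlapping sources reporting item ids.
sources = [{"a", "b", "c", "d"}, {"a", "b", "e"}, {"a", "c", "f"}]
values = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50, "f": 60}

observed, estimated_total = chao1_estimate(sources)
corrected_sum = mean_substitution_sum(values, estimated_total)
print(observed, estimated_total, corrected_sum)
```

In this toy run, three of the six observed items appear in only one source, which is what pushes the estimated total above the observed count. The mean-substitution step ignores any correlation between an item's value and its chance of being observed, so it should be read as a baseline rather than a recommended correction.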

Supplemental Material

chung.zip (748 B): Supplemental movie and image files for "Estimating the Impact of Unknown Unknowns on Aggregate Query Results"

Index Terms

Estimating the Impact of Unknown Unknowns on Aggregate Query Results
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. World Wide Web
    1. Web applications
      1. Crowdsourcing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Incomplete, inconsistent, and uncertain databases

Published in
ACM Transactions on Database Systems Volume 43, Issue 1
Best of SIGMOD 2016 Papers and Regular Papers
March 2018
227 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3194314
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 March 2018
- Revised: 1 November 2017
- Accepted: 1 November 2017
- Received: 1 December 2016
Author Tags
Aggregate query processing
crowdsourcing
species estimation
unknown unknowns
Qualifiers
- research-article
- Research
- Refereed