Abstract
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher-quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? and (2) What is the impact of any unknown (i.e., unobserved) data on query results?
In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution; we also propose a parametric model that can be used instead when the data sources are imbalanced. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable for assessing the results of aggregate queries over integrated data sources.
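To make the overlap idea concrete, the following is a minimal sketch and not the estimators developed in this work. It assumes each source is a mapping from an item key to a numeric value, uses the classic bias-corrected Chao (1984) species estimator to guess how many items no source reported, and then assumes, naively, that each unseen item carries the mean observed value. The function names `estimate_unseen_count` and `estimate_total_sum` are hypothetical.

```python
from collections import Counter


def estimate_unseen_count(sources):
    """Bias-corrected Chao (1984) estimate of how many distinct items no source reported.

    Assumption: each source is a dict keyed by item, so an item counts at most
    once per source.
    """
    # In how many sources does each item appear?
    occurrences = Counter(key for source in sources for key in source)
    f1 = sum(1 for n in occurrences.values() if n == 1)  # items seen in exactly one source
    f2 = sum(1 for n in occurrences.values() if n == 2)  # items seen in exactly two sources
    # Many singletons relative to doubletons suggests many items are still unseen.
    return f1 * (f1 - 1) / (2 * (f2 + 1))


def estimate_total_sum(sources):
    """Observed SUM plus a naive correction for the unseen items.

    Assumption (illustrative only): unseen items have the same mean value
    as the observed ones.
    """
    integrated = {}
    for source in sources:
        integrated.update(source)  # merge sources; later sources win on conflicts
    if not integrated:
        return 0.0
    observed_sum = sum(integrated.values())
    mean_value = observed_sum / len(integrated)
    return observed_sum + estimate_unseen_count(sources) * mean_value


# Three hypothetical sources with partial overlap.
sources = [
    {"a": 10.0, "b": 20.0, "c": 30.0},
    {"b": 20.0, "c": 30.0, "d": 40.0},
    {"c": 30.0, "e": 50.0},
]
unseen = estimate_unseen_count(sources)
corrected = estimate_total_sum(sources)
```

In this toy input, three items appear in only one source and one item appears in exactly two, so the sketch estimates 1.5 unseen items and raises the observed sum of 150 to 195. The mean-substitution step is the weakest assumption here: if the likelihood of an item being reported correlates with its value, the correction will be biased.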