ABSTRACT
It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results?
In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a., unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable to better assess the results of aggregate queries over integrated data sources.
- P. D. Allison. Handling missing data by maximum likelihood. In SAS global forum, pages 1--21, 2012.Google Scholar
- S. Amer-Yahia, A. Doan, J. Kleinberg, N. Koudas, and M. Franklin. Crowds, clouds, and algorithms: Exploring the human side of "big data" applications. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD '10, 2010. Google ScholarDigital Library
- J. Bunge and M. Fitzpatrick. Estimating the Number of Species: A Review. Journal of the American Statistical Association, 88(421), 1993.Google Scholar
- K. P. Burnham and W. S. Overton. Estimation of the Size of a Closed Population when Capture Probabilities vary Among Animals. Biometrika, 65(3), 1978.Google Scholar
- A. Chao. Nonparametric Estimation of the Number of Classes in a Population. SJS, 11(4), 1984.Google Scholar
- A. Chao. Species estimation and applications. In Encyclopedia of Statistical Sciences, 2nd Edition, pages 7907--7916. Wiley, New York, 2005.Google Scholar
- A. Chao and S. Lee. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association, 87(417):210--217, 1992.Google ScholarCross Ref
- M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '00, pages 268--279. ACM, 2000. Google ScholarDigital Library
- R. B. D'Agostino Jr and D. B. Rubin. Estimating and using propensity scores with partially missing data. Journal of the American Statistical Association, 95(451):749--759, 2000.Google ScholarCross Ref
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1--38, 1977.Google Scholar
- A. Doan, R. Ramakrishnan, and A. Y. Halevy. Crowdsourcing systems on the world-wide web. Commun. ACM, 54(4):86--96, Apr. 2011. Google ScholarDigital Library
- D. Florescu, D. Koller, and A. Y. Levy. Using probabilistic information in data integration. In Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB '97, 1997. Google ScholarDigital Library
- M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: Answering queries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 2011. Google ScholarDigital Library
- I. J. Good. The Population Frequencies of Species and the Estimation of Population Parameters. Biometrika, 40(3/4), 1953.Google Scholar
- Google. Freebase. https://www.freebase.com, 2015. Accessed: 2015-07-08.Google Scholar
- D. Haas, M. Greenstein, K. Kamalov, A. Marcus, M. Olszewski, and M. Piette. Reducing error in context-sensitive crowdsourced tasks. In First AAAI Conference on Human Computation and Crowdsourcing, 2013.Google Scholar
- P. J. Haas. Hoeffding Inequalities for Join Selectivity Estimation and Online Aggregation. IBM, 1996.Google Scholar
- P. J. Haas et al. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of VLDB, 1995. Google ScholarDigital Library
- A. Y. Halevy. Data publishing and sharing using fusion tables. In CIDR, 2013.Google Scholar
- L. Kish. Survey sampling. John Wiley and Sons, 1965.Google Scholar
- W. Lang, R. V. Nehme, E. Robinson, and J. F. Naughton. Partial results in database systems. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1275--1286. ACM, 2014. Google ScholarDigital Library
- U. Leser and F. Naumann. Query planning with information quality bounds. In H. Larsen, T. Andreasen, H. Christiansen, J. Kacprzyk, and S. Zadrozny, editors, Flexible Query Answering Systems, volume 7 of Advances in Soft Computing, pages 85--94. Physica-Verlag HD, 2001.Google ScholarCross Ref
- M. Lexa. Useful facts about the kullback-leibler discrimination distance. Houston, Texas, 2004.Google Scholar
- X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, 6(2):97--108. Google ScholarDigital Library
- J. Liang. Estimation Methods for the Size of Deep Web Textural Data Source: A Survey. cs.uwindsor.ca/richard/cs510/survey_jie_liang.pdf, 2008.Google Scholar
- J. Lu and D. Li. Estimating deep web data source size by capture--recapture method. Inf. Retr., 13(1):70--95, Feb. 2010. Google ScholarDigital Library
- R. Lynch and B. Kim. Sample size, the margin of error and the coefficient of variation. InterStat, 2010.Google Scholar
- M. Magnani and D. Montesi. A survey on uncertainty management in data integration. J. Data and Information Quality, 2(1):5:1--5:33, July 2010. Google ScholarDigital Library
- A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Demonstration of qurk: a query processor for humanoperators. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, Athens, Greece, June 12--16, 2011, pages 1315--1318, 2011. Google ScholarDigital Library
- A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9--12, 2011, Online Proceedings, pages 211--214, 2011.Google Scholar
- D. A. McAllester and R. E. Schapire. On the convergence rate of good-turing estimators. In COLT, pages 1--6. Citeseer, 2000. Google ScholarDigital Library
- J. McClave and T. Sincich. Statistics. Pearson, 2013.Google Scholar
- W. Meng, K.-L. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the usefulness of search engines. In Data Engineering, 1999. Proceedings., 15th International Conference on, pages 146--153, Mar 1999. Google ScholarDigital Library
- F. Naumann, J.-C. Freytag, and U. Leser. Completeness of integrated information sources. Inf. Syst., 29(7):583--615, Sept. 2004. Google ScholarDigital Library
- M. T. Neiling and H.-J. Lenz. Data integration by means of object identification in information systems. In In Proceedings of European Conference on Information Systems, 2000.Google Scholar
- F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, volume 86, pages 25--28, 1986. Google ScholarDigital Library
- J. W. Osborne. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage, 2012.Google Scholar
- A. Parameswaran and N. Polyzotis. Answering Queries using Humans, Algorithms and Databases. In Proc. of CIDR, 2011.Google Scholar
- Pew Research Center. How u.s. tech-sector jobs have grown, changed in 15 years. http://pewrsr.ch/PtqZDA, 2014. Accessed: 2015-07-08.Google Scholar
- E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
- S. Razniewski, F. Korn, W. Nutt, and D. Srivastava. Identifying the extent of completeness of query answers over partially complete databases.Google Scholar
- J. Rice. Mathematical statistics and data analysis. Cengage Learning, 2006.Google Scholar
- D. B. Rubin. Inference and missing data. Biometrika, 63(3):581--592, 1976.Google ScholarCross Ref
- B. Saha and D. Srivastava. Data quality: The other face of big data. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 1294--1297, 2014.Google ScholarCross Ref
- R. Sapsford. Survey Research. SAGE Publications, 1999.Google Scholar
- B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
- K. I. Ugland, J. S. Gray, and K. E. Ellingsen. The species--accumulation curve and estimation of species richness. Journal of Animal Ecology, 72(5):888--897, 2003.Google ScholarCross Ref
- G. Valiant and P. Valiant. Estimating the unseen: an n/log (n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 685--694. ACM, 2011. Google ScholarDigital Library
- Wikipedia. 68-95-99.7 rule. https://en.wikipedia.org/wiki/68--95--99.7_rule, 2015. Accessed: 2015-07-08.Google Scholar
- Wikipedia. List of u.s. states by gdp. https://en.wikipedia.org/wiki/List_of_U.S._states_by_GDP, 2015. Accessed: 2015-07-08.Google Scholar
- T. Yan, V. Kumar, and D. Ganesan. Crowdsearch: Exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys '10, pages 77--90, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- Y. C. Yuan. Multiple imputation for missing data: Concepts and new development (version 9.0). SAS Institute Inc, Rockville, MD, 2010.Google Scholar
Index Terms
- Estimating the Impact of Unknown Unknowns on Aggregate Query Results
Recommendations
Estimating the Impact of Unknown Unknowns on Aggregate Query Results
Best of SIGMOD 2016 Papers and Regular PapersIt is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete?...
An Interpretability Case Study of Unknown Unknowns Taking Clothes Image Classification CNNs as an Example
Advances in Computer GraphicsAbstract“Unknown unknowns” are instances predicted models assign incorrect labels with high confidence, greatly reducing the generalization ability of models. In practical applications, unknown unknowns may lead to significant decision-making mistakes and ...
What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition
WWW '22: Proceedings of the ACM Web Conference 2022Unknown unknowns represent a major challenge in reliable image recognition. Existing methods mainly focus on unknown unknowns identification, leveraging human intelligence to gather images that are potentially difficult for the machine. To drive a ...
Comments