research-article

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Published: 06 March 2018

Abstract

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher-quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) Is the integrated data set complete? and (2) What is the impact of any unknown (i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution; we also propose a parametric model that can be used instead when the data sources are imbalanced. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable for better assessing the results of aggregate queries over integrated data sources.
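
To make the overlap idea concrete, the following is a minimal sketch and not the article's actual estimators. It assumes a Good-Turing style sample-coverage estimate of the number of distinct items, and it assumes that every unseen item has the observed mean value (simple mean substitution). The function name `estimate_sum_with_unknowns` and the input layout (one dict per source, mapping item id to value) are hypothetical choices made for illustration.

```python
from collections import Counter


def estimate_sum_with_unknowns(sources):
    """Illustrative SUM correction for unseen items (hypothetical helper).

    sources: list of dicts, one per data source, mapping item id -> numeric value.
    Assumes the data is already cleaned, so sources agree on each item's value.
    """
    # How many sources mention each item: this is the overlap signal.
    counts = Counter(item for src in sources for item in src)
    if not counts:
        raise ValueError("no observations")

    # Merge the sources into one integrated data set.
    values = {item: val for src in sources for item, val in src.items()}

    n = sum(counts.values())                        # total observations
    c = len(counts)                                 # distinct items observed
    f1 = sum(1 for k in counts.values() if k == 1)  # items seen exactly once

    if f1 == n:
        # Every item was seen once: no overlap, so coverage is estimated as 0.
        raise ValueError("no overlap between sources")

    coverage = 1.0 - f1 / n   # Good-Turing sample coverage estimate
    n_hat = c / coverage      # estimated total number of distinct items

    observed_sum = sum(values.values())
    mean_value = observed_sum / c
    # Assumption: each unseen item contributes the observed mean value.
    estimated_sum = observed_sum + (n_hat - c) * mean_value

    return {
        "observed_sum": observed_sum,
        "estimated_items": n_hat,
        "estimated_sum": estimated_sum,
    }


if __name__ == "__main__":
    a = {"a": 10, "b": 20, "c": 30}
    b = {"a": 10, "b": 20, "d": 40}
    print(estimate_sum_with_unknowns([a, b]))
```

With little overlap, many items are seen only once, the coverage estimate drops, and the corrected sum moves further from the observed sum. With heavy overlap, the correction shrinks toward zero. The mean-substitution step is the weakest assumption here, since unseen items may differ systematically from observed ones.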

        • Published in

          ACM Transactions on Database Systems, Volume 43, Issue 1
          Best of SIGMOD 2016 Papers and Regular Papers
          March 2018
          227 pages
          ISSN: 0362-5915
          EISSN: 1557-4644
          DOI: 10.1145/3194314

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 March 2018
          • Revised: 1 November 2017
          • Accepted: 1 November 2017
          • Received: 1 December 2016
          Published in TODS, Volume 43, Issue 1

          Qualifiers

          • research-article
          • Research
          • Refereed
