DOI: 10.1145/2882903.2882909
research-article

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

Published: 14 June 2016

ABSTRACT

It is common practice for data scientists to acquire and integrate disparate data sources to achieve higher quality results. But even with a perfectly cleaned and merged data set, two fundamental questions remain: (1) is the integrated data set complete and (2) what is the impact of any unknown (i.e., unobserved) data on query results?

In this work, we develop and analyze techniques to estimate the impact of the unknown data (a.k.a. unknown unknowns) on simple aggregate queries. The key idea is that the overlap between different data sources enables us to estimate the number and values of the missing data items. Our main techniques are parameter-free and do not assume prior knowledge about the distribution. Through a series of experiments, we show that estimating the impact of unknown unknowns is invaluable for better assessing the results of aggregate queries over integrated data sources.
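To make the overlap idea concrete, here is a minimal sketch under stated assumptions. The abstract does not say which estimator the authors use, so this swaps in a classical species-richness estimator (the Chao1 lower bound), treating every mention of an item in any source as one observation. The SUM correction, which assigns each unseen item the mean observed value, is a naive illustrative assumption and not the paper's method. The function names and the toy data are hypothetical.

```python
from collections import Counter


def estimate_unseen_count(sources):
    """Chao1-style estimate of how many distinct items no source has reported.

    Assumption: each mention of an item in a source is one observation.
    Items seen once (f1) and twice (f2) drive the estimate: many singletons
    relative to doubletons suggest that many items are still unseen.
    """
    counts = Counter(item for source in sources for item in source)
    f1 = sum(1 for c in counts.values() if c == 1)  # items mentioned exactly once
    f2 = sum(1 for c in counts.values() if c == 2)  # items mentioned exactly twice
    if f2 > 0:
        return f1 * f1 / (2 * f2)
    # Bias-corrected form used when there are no doubletons.
    return f1 * (f1 - 1) / 2


def estimate_sum(sources, values):
    """Naive SUM estimate: observed sum plus unseen count times the observed mean.

    Assumption (illustrative only): unseen items have, on average, the same
    value as the observed ones.
    """
    observed = {item for source in sources for item in source}
    if not observed:
        return 0.0
    observed_sum = sum(values[item] for item in observed)
    mean_value = observed_sum / len(observed)
    return observed_sum + estimate_unseen_count(sources) * mean_value


# Toy data: three overlapping sources and a value per observed item.
sources = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]
values = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50}

print(estimate_unseen_count(sources))  # 3 singletons, 1 doubleton -> 4.5
print(estimate_sum(sources, values))   # 150 + 4.5 * 30 -> 285.0
```

In the toy data, three of the five observed items appear in only one source, so the estimator infers that a substantial number of items remain unobserved and raises the SUM accordingly. With heavier overlap between sources, the correction shrinks toward zero.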


Published in: SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, June 2016, 2300 pages

ISBN: 9781450335317
DOI: 10.1145/2882903

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 785 of 4,003 submissions, 20%
