
Answering enumeration queries with the crowd

Published: 21 December 2015

References

  1. Bar-Yossef, Z., Gurevich, M. Efficient search engine measurements. ACM Trans. Web 5, 4 (Oct. 2011), 18:1--18:48.
  2. Broder, A., Fontura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., Xu, Y. Estimating corpus size via queries. In Proceedings of CIKM (2006).
  3. Bunge, J., Fitzpatrick, M. Estimating the number of species: A review. J. Am. Stat. Assoc. 88, 421 (1993), 364--373.
  4. Bunge, J., et al. Comparison of three estimators of the number of species. J. Appl. Stat. 22, 1 (1995), 45--59.
  5. Burnham, K.P., Overton, W.S. Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65, 3 (1978), 625--633.
  6. Chao, A. Species estimation and applications. In Encyclopedia of Statistical Sciences, 2nd edn. N. Balakrishnan, C.B. Read, and B. Vidakovic, eds. Wiley, New York, 2005, 7907--7916.
  7. Chao, A., Lee, S. Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87, 417 (1992), 210--217.
  8. Charikar, M., et al. Towards estimation error guarantees for distinct values. In Proceedings of PODS (2000).
  9. Colwell, R.K., Coddington, J.A. Estimating terrestrial biodiversity through extrapolation. Philos. Trans. Biol. Sci. 345, 1311 (1994), 101--118.
  10. Doan, A., et al. Crowdsourcing applications and platforms: A data management perspective. PVLDB 4, 12 (2011), 1508--1509.
  11. Franklin, M.J., et al. CrowdDB: Answering queries with crowdsourcing. In Proceedings of SIGMOD (2011).
  12. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 40, 3/4 (1953), 237--264.
  13. Gray, J., et al. Quickly generating billion-record synthetic databases. In Proceedings of SIGMOD (1994).
  14. Haas, P.J., et al. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of VLDB (1995).
  15. Heer, J., et al. Crowdsourcing graphical perception: Using Mechanical Turk to assess visualization design. In Proceedings of CHI (2010).
  16. Ipeirotis, P.G., Provost, F., Wang, J. Quality management on Amazon Mechanical Turk. In Proceedings of HCOMP (2010).
  17. Liu, K.-L., Yu, C., Meng, W. Discovering the representative of a search engine. In Proceedings of CIKM (2002).
  18. Lu, J., Li, D. Estimating deep web data source size by capture--recapture method. Inf. Retr. 13, 1 (Feb. 2010), 70--95.
  19. Marcus, A., Wu, E., Madden, S., Miller, R. Crowdsourced databases: Query processing with people. In Proceedings of CIDR (2011).
  20. Parameswaran, A., Polyzotis, N. Answering queries using humans, algorithms and databases. In Proceedings of CIDR (2011).
  21. Shen, T., et al. Predicting the number of new species in further taxonomic sampling. Ecology 84, 3 (2003).
  22. Trushkowsky, B., Kraska, T., Franklin, M.J., Sarkar, P. Crowdsourced enumeration queries. In Proceedings of ICDE (2013).
  23. Wang, J., et al. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proceedings of SIGMOD (2014), 469--480.


        Reviews

        Amos O. Olagunju

        Human crowds are valuable assets for providing, in real time, additional responses to queries whose results would otherwise come solely from relational database management systems (RDBMS). But how should the collection of crowd answers, designed to augment database query results, be terminated so that the response is reliable? Trushkowsky and colleagues offer statistical tools that let users and developers of RDBMS weigh the time and cost of further crowdsourcing against the accuracy and completeness of a query response.

        Central to the approach is the size of the query result (its cardinality): an estimate of how many answers exist makes it possible to judge what fraction of them has been collected so far. The authors compellingly introduce a power-law data model that helps overcome sampling problems attributable to cultural and regional biases and to workers' differing web search strategies. The paper introduces and evaluates an estimator of the stable, convergent cardinality of answer sets gathered through human intelligence tasks (HITs) on Amazon Mechanical Turk (AMT); a sketch of the species-estimation technique underlying it follows this review. The authors also present algorithms that limit the influence of individual workers who might otherwise dominate and bias query responses, and they develop the notions of coverage and variance for the distributions of user responses to crowdsourced queries. Experiments with several thousand HITs on AMT, over United Nations and US data sets, show significant improvement over well-known estimators.

        List walking is a situation in which the total size of a query result is underpredicted because multiple workers submit heavily skewed, near-identical answer sequences copied from a common source. The authors propose and validate a heuristic algorithm, based on binomial probabilities, to detect and overcome list walking (a sketch appears below); it successfully detected severe list walking in the United Nations experiments. The authors also present algorithms for reasoning about the cost-benefit tradeoff of soliciting further answers, that is, for deciding whether additional HITs are worth their price (also sketched below).

        There is no doubt that users should be empowered to contribute to, and reason about, query results in relational database search and retrieval. Given the new light this paper sheds on applications of the well-known power laws [1] and of the binomial distribution, I encourage statisticians and database specialists to read it and to address the open questions the authors raise: How do relational operations such as SELECT, JOIN, and PROJECT behave when applied to real crowdsourced query results? What impact do human behaviors have on the sampling process in crowdsourced queries?
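        The cardinality estimator at the heart of the paper builds on species estimation via sample coverage (Chao and Lee [7]). Below is a minimal Python sketch of the classical Chao92 estimator applied to a stream of crowd answers; the function name and the fallback when every answer is a singleton are choices of this sketch, not of the paper.

```python
from collections import Counter

def chao92(answers):
    """Chao92 sample-coverage estimate of the number of distinct
    answers (Chao and Lee [7]), applied to a list of crowd answers."""
    n = len(answers)                     # total answers received
    if n == 0:
        return 0.0
    freq = Counter(answers)              # observations per distinct answer
    c = len(freq)                        # distinct answers seen so far
    f1 = sum(1 for v in freq.values() if v == 1)   # answers seen exactly once

    coverage = 1.0 - f1 / n              # estimated sample coverage
    if coverage == 0.0:                  # all singletons: the estimator
        return float("inf")              # diverges; no stable estimate yet

    n_hat = c / coverage                 # coverage-adjusted baseline estimate
    # The coefficient-of-variation term corrects for skew, e.g., the
    # power-law popularity of answers discussed in the review.
    f = Counter(freq.values())           # f[i] = #answers seen exactly i times
    gamma_sq = 0.0
    if n > 1:
        gamma_sq = max(
            n_hat * sum(i * (i - 1) * fi for i, fi in f.items())
            / (n * (n - 1)) - 1.0,
            0.0,
        )
    return n_hat + n * (1.0 - coverage) / coverage * gamma_sq

# Example: 10 answers naming 6 distinct states, skewed toward "CA".
print(chao92(["CA", "CA", "CA", "NY", "NY", "TX", "WA", "OR", "NV", "CA"]))
```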
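        The list-walking heuristic can be sketched in the same spirit. The null model below, in which each worker orders answers uniformly at random over the distinct answers seen, is an assumption of this sketch standing in for the paper's exact test; the prefix length and significance threshold are illustrative parameters.

```python
import math

def list_walking_suspected(worker_lists, prefix_len=3, alpha=0.01):
    """Flag possible 'list walking': several workers copying answers,
    in order, from a shared source, which biases cardinality low.
    worker_lists is a list of ordered answer lists, one per worker."""
    prefixes = [tuple(lst[:prefix_len]) for lst in worker_lists
                if len(lst) >= prefix_len]
    if not prefixes:
        return False
    w = len(prefixes)                    # workers with long-enough lists
    c = len({a for lst in worker_lists for a in lst})
    if c < prefix_len:
        return False

    # Most common shared prefix, and how many workers produced it.
    best = max(set(prefixes), key=prefixes.count)
    s = prefixes.count(best)

    # Under the (assumed) null model, the chance one worker emits this
    # exact ordered prefix is 1 / (c * (c-1) * ... * (c-k+1)).
    p = 1.0
    for i in range(prefix_len):
        p /= (c - i)

    # Binomial tail: P[at least s of w workers match by chance].
    tail = sum(math.comb(w, j) * p**j * (1 - p)**(w - j)
               for j in range(s, w + 1))
    return tail < alpha
```

        If the tail probability falls below alpha, independent agreement on the ordering is too unlikely, so the matching answers are better explained by a shared list than by independent sampling.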
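        For the cost-benefit tradeoff, a standard building block is the predictor of Shen et al. [21] for the expected number of new answers in m further samples. A minimal sketch follows, assuming (as a choice of this sketch) that the count of unseen answers is estimated with the Chao1-style formula f1^2 / (2 * f2):

```python
from collections import Counter

def expected_new_answers(answers, m):
    """Predict how many previously unseen distinct answers m further
    crowd responses are likely to contribute (Shen et al. [21])."""
    n = len(answers)
    freq = Counter(answers)
    f1 = sum(1 for v in freq.values() if v == 1)   # seen exactly once
    f2 = sum(1 for v in freq.values() if v == 2)   # seen exactly twice
    if n == 0 or f1 == 0:
        return 0.0                      # sample already looks complete
    # Chao1 estimate of unseen answers; bias-corrected form when f2 == 0.
    f0 = f1 * f1 / (2 * f2) if f2 > 0 else f1 * (f1 - 1) / 2.0
    if f0 == 0:
        return 0.0
    # Probability a given unseen answer stays unseen after one more draw.
    stay_unseen = 1.0 - f1 / (n * f0 + f1)
    return f0 * (1.0 - stay_unseen ** m)
```

        A pay-as-you-go client could stop purchasing HITs once expected_new_answers(answers, batch_size) drops below the marginal value of a new answer for the given batch price.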


        Published in

          Communications of the ACM, Volume 59, Issue 1 (January 2016), 120 pages
          ISSN: 0001-0782
          EISSN: 1557-7317
          DOI: 10.1145/2859829
          Editor: Moshe Y. Vardi

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

