skip to main content
10.1145/3219819.3219984acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Efficient Attribute Recommendation with Probabilistic Guarantee

Published:19 July 2018Publication History

ABSTRACT

We study how to efficiently solve a primitive data exploration problem: Given two ad-hoc predicates which define two subsets of a relational table, find the top-K attributes whose distributions in the two subsets deviate most from each other. The deviation is measured by $\ell1$ or $\ell2$ distance. The exact approach is to query the full table to calculate the deviation for each attribute and then sort them. It is too expensive for large tables. Researchers have proposed heuristic sampling solutions to avoid accessing the entire table for all attributes. However, these solutions have no theoretical guarantee of correctness and their speedup over the exact approach is limited. In this paper, we develop an adaptive querying solution with probabilistic guarantee of correctness and near-optimal sample complexity. We perform experiments in both synthetic and real-world datasets. Compared to the exact approach implemented with a commercial DBMS, previous sampling solutions achieve up to 2× speedup with erroneous answers. Our solution can produce 25× speedup with near-zero error in the answer.

Skip Supplemental Material Section

Supplemental Material

wang_efficient_attribute_recommendation.mp4

mp4

565.2 MB

References

  1. dbibitemlabel{1}C. C. Aggarwal, and J. Han. Frequent Pattern Mining. Springer. 2014.łabelbook:aggarwal2014%mdk%mdk%mdk-data-line=ref.bib:16 Google ScholarGoogle ScholarCross RefCross Ref
  2. dbibitemlabel{2}N. K. Ahmed, N. Duffield, T. L. Willke, and R. A. Rossi. textquotedblleftOn Sampling from Massive Graph Streams.textquotedblright Proc. VLDB Endow. 10 (11): 1430--1441. 2017.łabelvldb:ahmed2017%mdk%mdk%mdk-data-line=ref.bib:349 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. dbibitemlabel{3}Anushka Anand, and Justin Talbot. textquotedblleftAutomatic Selection of Partitioning Variables for Small Multiple Displays.textquotedblright IEEE Transactions on Visualization & Computer Graphics 22 (1): 669--677. 2016.łabelinfovis:anand2015%mdk%mdk%mdk-data-line=ref.bib:472Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. dbibitemlabel{4}R. Bardenet, and O.-A. Maillard. textquotedblleftConcentration Inequalities for Sampling without Replacement.textquotedblright Bernoulli 21 (3): 1361--1385. Aug. 2015.łabelbernoulli:bardenet2015%mdk%mdk%mdk-data-line=ref.bib:171Google ScholarGoogle ScholarCross RefCross Ref
  5. dbibitemlabel{5}S. Bubeck, T. Wang, and N. Viswanathan. textquotedblleftMultiple Identifications in Multi-Armed Bandits.textquotedblright In ICML'13. 2013.łabelicml:bubeck2013%mdk%mdk%mdk-data-line=ref.bib:483 Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. dbibitemlabel{6}S.-O. Chan, I. Diakonikolas, G. Valiant, and P. Valiant. textquotedblleftOptimal Algorithms for Testing Closeness of Discrete Distributions.textquotedblright In SODA'14. 2014.łabelsoda:chan2014%mdk%mdk%mdk-data-line=ref.bib:164 Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. dbibitemlabel{7}S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. textquotedblleftCombinatorial Pure Exploration of Multi-Armed Bandits.textquotedblright In NIPS'14. 2014.łabelnips:chen2014%mdk%mdk%mdk-data-line=ref.bib:33 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. dbibitemlabel{8}T. Chen, Y. Sun, Y. Shi, and L. Hong. textquotedblleftOn Sampling Strategies for Neural Network-Based Collaborative Filtering.textquotedblright In KDD'17. 2017.łabelkdd:chen2017%mdk%mdk%mdk-data-line=ref.bib:40 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. dbibitemlabel{9}G. Cormode, and N. Duffield. textquotedblleftSampling for Big Data: A Tutorial.textquotedblright In KDD'14. 2014.łabelkdd:cormode2014%mdk%mdk%mdk-data-line=ref.bib:68 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. dbibitemlabel{10}H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. textquotedblleftEfficient Recommendation of Aggregate Data Visualizations.textquotedblright IEEE Transactions on Knowledge and Data Engineering 30 (2): 263--277. Feb. 2018.łabeltkde:ehsan2018%mdk%mdk%mdk-data-line=ref.bib:493Google ScholarGoogle ScholarCross RefCross Ref
  11. dbibitemlabel{11}R. El-Yaniv, and D. Pechyony. textquotedblleftStable Transductive Learning.textquotedblright In COLT'06. 2006.łabelcolt:el-yaniv2006%mdk%mdk%mdk-data-line=ref.bib:86 Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. dbibitemlabel{12}J. Han, and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. 2001.łabelbook:han2001%mdk%mdk%mdk-data-line=ref.bib:359 Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. dbibitemlabel{13}S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. textquotedblleftProfiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.textquotedblright In Proc. Intl. Working Conf. on Advanced Visual Interfaces. AVI'12. 2012.łabelavi:kandel2012%mdk%mdk%mdk-data-line=ref.bib:430 Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. dbibitemlabel{14}A. Key, B. Howe, D. Perry, and C. Aragon. textquotedblleftVizDeck: Self-Organizing Dashboards for Visual Analytics.textquotedblright In SIGMOD'12. 2012.łabelsigmod:key2012%mdk%mdk%mdk-data-line=ref.bib:824 Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. dbibitemlabel{15}A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. textquotedblleftRapid Sampling for Visualizations with Ordering Guarantees.textquotedblright PVLDB 8 (5). 2015.łabelpvldb:kimbpimr15%mdk%mdk%mdk-data-line=ref.bib:147 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. dbibitemlabel{16}H. Liu, H. Motoda, and L. Yu. textquotedblleftA Selective Sampling Approach to Active Feature Selection.textquotedblright Artificial Intelligence 159 (1): 49--74. 2004.łabelai:liu2004%mdk%mdk%mdk-data-line=ref.bib:26 Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. dbibitemlabel{17}M. Riondato, and E. Upfal. textquotedblleftMining Frequent Itemsets Through Progressive Sampling with Rademacher Averages.textquotedblright In KDD'15. 2015.łabelkdd:riondato2015%mdk%mdk%mdk-data-line=ref.bib:420 Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. dbibitemlabel{18}S. Sarawagi. textquotedblleftExplaining Differences in Multidimensional Aggregates.textquotedblright In VLDB'99. 1999.łabelvldb:sarawagi1999%mdk%mdk%mdk-data-line=ref.bib:79 Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. dbibitemlabel{19}T. Sellam, and M. Kersten. textquotedblleftFast, Explainable View Detection to Characterize Exploration Queries.textquotedblright In SSDBM'16. 2016.łabelssdbm:sellam2016%mdk%mdk%mdk-data-line=ref.bib:509 Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. dbibitemlabel{20}J. Seo, and B. Shneiderman. textquotedblleftA Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data.textquotedblright Information Visualization 4 (2): 96--113. 2005.łabelinfovis:jinwook2005%mdk%mdk%mdk-data-line=ref.bib:276 Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. dbibitemlabel{21}B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. textquotedblleftExtracting Top-K Insights from Multi-Dimensional Data.textquotedblright In SIGMOD'17. 2017.łabelsigmod:zhang2017%mdk%mdk%mdk-data-line=ref.bib:519 Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. dbibitemlabel{22}M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. textquotedblleftSeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics.textquotedblright Proc. VLDB Endow. 8 (13): 2182--2193. Sep. 2015.łabelvldb:vartak2015%mdk%mdk%mdk-data-line=ref.bib:242 Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. dbibitemlabel{23}A. Wasay, M. Athanassoulis, and S. Idreos. textquotedblleftQueriosity: Automated Data Exploration.textquotedblright In Proc. IEEE Intl. Congress on Big Data. 2015.łabelbigdata:wasay2015%mdk%mdk%mdk-data-line=ref.bib:530 Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. dbibitemlabel{24}K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, and J. Heer. textquotedblleftVoyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations.textquotedblright IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). 2016.łabelinfovis:2016-voyager%mdk%mdk%mdk-data-line=ref.bib:284Google ScholarGoogle Scholar
  25. dbibitemlabel{25}Y. Yan, L. J. Chen, and Z. Zhang. textquotedblleftError-Bounded Sampling for Analytics on Big Sparse Data.textquotedblright In VLDB'14. 2014.łabelvldb:yan2014%mdk%mdk Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient Attribute Recommendation with Probabilistic Guarantee

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
        ISBN:9781450355520
        DOI:10.1145/3219819

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 July 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader