research-article

Efficient Attribute Recommendation with Probabilistic Guarantee

Authors:
Chi Wang

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Kaushik Chakrabarti

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningJuly 2018Pages 2387–2396https://doi.org/10.1145/3219819.3219984

Published:19 July 2018Publication History

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 2387–2396

ABSTRACT

We study how to efficiently solve a primitive data exploration problem: Given two ad-hoc predicates which define two subsets of a relational table, find the top-K attributes whose distributions in the two subsets deviate most from each other. The deviation is measured by $\ell1$ or $\ell2$ distance. The exact approach is to query the full table to calculate the deviation for each attribute and then sort them. It is too expensive for large tables. Researchers have proposed heuristic sampling solutions to avoid accessing the entire table for all attributes. However, these solutions have no theoretical guarantee of correctness and their speedup over the exact approach is limited. In this paper, we develop an adaptive querying solution with probabilistic guarantee of correctness and near-optimal sample complexity. We perform experiments in both synthetic and real-world datasets. Compared to the exact approach implemented with a commercial DBMS, previous sampling solutions achieve up to 2× speedup with erroneous answers. Our solution can produce 25× speedup with near-zero error in the answer.

Supplemental Material

wang_efficient_attribute_recommendation.mp4

mp4

565.2 MB

Download

References

dbibitemlabel{1}C. C. Aggarwal, and J. Han. Frequent Pattern Mining. Springer. 2014.łabelbook:aggarwal2014%mdk%mdk%mdk-data-line=ref.bib:16 Google ScholarCross Ref
dbibitemlabel{2}N. K. Ahmed, N. Duffield, T. L. Willke, and R. A. Rossi. textquotedblleftOn Sampling from Massive Graph Streams.textquotedblright Proc. VLDB Endow. 10 (11): 1430--1441. 2017.łabelvldb:ahmed2017%mdk%mdk%mdk-data-line=ref.bib:349 Google ScholarDigital Library
dbibitemlabel{3}Anushka Anand, and Justin Talbot. textquotedblleftAutomatic Selection of Partitioning Variables for Small Multiple Displays.textquotedblright IEEE Transactions on Visualization & Computer Graphics 22 (1): 669--677. 2016.łabelinfovis:anand2015%mdk%mdk%mdk-data-line=ref.bib:472Google ScholarDigital Library
dbibitemlabel{4}R. Bardenet, and O.-A. Maillard. textquotedblleftConcentration Inequalities for Sampling without Replacement.textquotedblright Bernoulli 21 (3): 1361--1385. Aug. 2015.łabelbernoulli:bardenet2015%mdk%mdk%mdk-data-line=ref.bib:171Google ScholarCross Ref
dbibitemlabel{5}S. Bubeck, T. Wang, and N. Viswanathan. textquotedblleftMultiple Identifications in Multi-Armed Bandits.textquotedblright In ICML'13. 2013.łabelicml:bubeck2013%mdk%mdk%mdk-data-line=ref.bib:483 Google ScholarDigital Library
dbibitemlabel{6}S.-O. Chan, I. Diakonikolas, G. Valiant, and P. Valiant. textquotedblleftOptimal Algorithms for Testing Closeness of Discrete Distributions.textquotedblright In SODA'14. 2014.łabelsoda:chan2014%mdk%mdk%mdk-data-line=ref.bib:164 Google ScholarDigital Library
dbibitemlabel{7}S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. textquotedblleftCombinatorial Pure Exploration of Multi-Armed Bandits.textquotedblright In NIPS'14. 2014.łabelnips:chen2014%mdk%mdk%mdk-data-line=ref.bib:33 Google ScholarDigital Library
dbibitemlabel{8}T. Chen, Y. Sun, Y. Shi, and L. Hong. textquotedblleftOn Sampling Strategies for Neural Network-Based Collaborative Filtering.textquotedblright In KDD'17. 2017.łabelkdd:chen2017%mdk%mdk%mdk-data-line=ref.bib:40 Google ScholarDigital Library
dbibitemlabel{9}G. Cormode, and N. Duffield. textquotedblleftSampling for Big Data: A Tutorial.textquotedblright In KDD'14. 2014.łabelkdd:cormode2014%mdk%mdk%mdk-data-line=ref.bib:68 Google ScholarDigital Library
dbibitemlabel{10}H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. textquotedblleftEfficient Recommendation of Aggregate Data Visualizations.textquotedblright IEEE Transactions on Knowledge and Data Engineering 30 (2): 263--277. Feb. 2018.łabeltkde:ehsan2018%mdk%mdk%mdk-data-line=ref.bib:493Google ScholarCross Ref
dbibitemlabel{11}R. El-Yaniv, and D. Pechyony. textquotedblleftStable Transductive Learning.textquotedblright In COLT'06. 2006.łabelcolt:el-yaniv2006%mdk%mdk%mdk-data-line=ref.bib:86 Google ScholarDigital Library
dbibitemlabel{12}J. Han, and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. 2001.łabelbook:han2001%mdk%mdk%mdk-data-line=ref.bib:359 Google ScholarDigital Library
dbibitemlabel{13}S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. textquotedblleftProfiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.textquotedblright In Proc. Intl. Working Conf. on Advanced Visual Interfaces. AVI'12. 2012.łabelavi:kandel2012%mdk%mdk%mdk-data-line=ref.bib:430 Google ScholarDigital Library
dbibitemlabel{14}A. Key, B. Howe, D. Perry, and C. Aragon. textquotedblleftVizDeck: Self-Organizing Dashboards for Visual Analytics.textquotedblright In SIGMOD'12. 2012.łabelsigmod:key2012%mdk%mdk%mdk-data-line=ref.bib:824 Google ScholarDigital Library
dbibitemlabel{15}A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. textquotedblleftRapid Sampling for Visualizations with Ordering Guarantees.textquotedblright PVLDB 8 (5). 2015.łabelpvldb:kimbpimr15%mdk%mdk%mdk-data-line=ref.bib:147 Google ScholarDigital Library
dbibitemlabel{16}H. Liu, H. Motoda, and L. Yu. textquotedblleftA Selective Sampling Approach to Active Feature Selection.textquotedblright Artificial Intelligence 159 (1): 49--74. 2004.łabelai:liu2004%mdk%mdk%mdk-data-line=ref.bib:26 Google ScholarDigital Library
dbibitemlabel{17}M. Riondato, and E. Upfal. textquotedblleftMining Frequent Itemsets Through Progressive Sampling with Rademacher Averages.textquotedblright In KDD'15. 2015.łabelkdd:riondato2015%mdk%mdk%mdk-data-line=ref.bib:420 Google ScholarDigital Library
dbibitemlabel{18}S. Sarawagi. textquotedblleftExplaining Differences in Multidimensional Aggregates.textquotedblright In VLDB'99. 1999.łabelvldb:sarawagi1999%mdk%mdk%mdk-data-line=ref.bib:79 Google ScholarDigital Library
dbibitemlabel{19}T. Sellam, and M. Kersten. textquotedblleftFast, Explainable View Detection to Characterize Exploration Queries.textquotedblright In SSDBM'16. 2016.łabelssdbm:sellam2016%mdk%mdk%mdk-data-line=ref.bib:509 Google ScholarDigital Library
dbibitemlabel{20}J. Seo, and B. Shneiderman. textquotedblleftA Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data.textquotedblright Information Visualization 4 (2): 96--113. 2005.łabelinfovis:jinwook2005%mdk%mdk%mdk-data-line=ref.bib:276 Google ScholarDigital Library
dbibitemlabel{21}B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. textquotedblleftExtracting Top-K Insights from Multi-Dimensional Data.textquotedblright In SIGMOD'17. 2017.łabelsigmod:zhang2017%mdk%mdk%mdk-data-line=ref.bib:519 Google ScholarDigital Library
dbibitemlabel{22}M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. textquotedblleftSeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics.textquotedblright Proc. VLDB Endow. 8 (13): 2182--2193. Sep. 2015.łabelvldb:vartak2015%mdk%mdk%mdk-data-line=ref.bib:242 Google ScholarDigital Library
dbibitemlabel{23}A. Wasay, M. Athanassoulis, and S. Idreos. textquotedblleftQueriosity: Automated Data Exploration.textquotedblright In Proc. IEEE Intl. Congress on Big Data. 2015.łabelbigdata:wasay2015%mdk%mdk%mdk-data-line=ref.bib:530 Google ScholarDigital Library
dbibitemlabel{24}K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, and J. Heer. textquotedblleftVoyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations.textquotedblright IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). 2016.łabelinfovis:2016-voyager%mdk%mdk%mdk-data-line=ref.bib:284Google Scholar
dbibitemlabel{25}Y. Yan, L. J. Chen, and Z. Zhang. textquotedblleftError-Bounded Sampling for Analytics on Big Sparse Data.textquotedblright In VLDB'14. 2014.łabelvldb:yan2014%mdk%mdk Google ScholarDigital Library

Index Terms

Efficient Attribute Recommendation with Probabilistic Guarantee
1. Information systems
  1. Information systems applications
    1. Data mining
    2. Decision support systems
      1. Data analytics

Recommendations

Efficient Approximation Framework for Attribute Recommendation
PACMMOD

Trend analysis is a fundamental type of analytical query in online analytical processing (OLAP) systems. In trend analysis, a key step is to identify k valuable attributes whose distributions in two subsets under different predicates significantly differ ...
Read More
Efficient processing of top-k dominating queries in distributed environments

Due to the recent massive data generation, preference queries are becoming an increasingly important for users because such queries retrieve only a small number of preferable data objects from a huge multi-dimensional dataset. A top-k dominating query, ...
Read More
LAQy: Efficient and Reusable Query Approximations via Lazy Sampling
PACMMOD

Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
General Chairs:
Yike Guo
Imperial College London
,
Faisal Farooq
IBM
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
exploratory analysis
multi-dimensional data
sampling
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 1,012
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient Attribute Recommendation with Probabilistic Guarantee

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Efficient Approximation Framework for Attribute Recommendation

Efficient processing of top-k dominating queries in distributed environments

LAQy: Efficient and Reusable Query Approximations via Lazy Sampling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient Attribute Recommendation with Probabilistic Guarantee

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Efficient Approximation Framework for Attribute Recommendation

Efficient processing of top-k dominating queries in distributed environments

LAQy: Efficient and Reusable Query Approximations via Lazy Sampling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media