ABSTRACT
We study how to efficiently solve a primitive data exploration problem: Given two ad-hoc predicates which define two subsets of a relational table, find the top-K attributes whose distributions in the two subsets deviate most from each other. The deviation is measured by $\ell1$ or $\ell2$ distance. The exact approach is to query the full table to calculate the deviation for each attribute and then sort them. It is too expensive for large tables. Researchers have proposed heuristic sampling solutions to avoid accessing the entire table for all attributes. However, these solutions have no theoretical guarantee of correctness and their speedup over the exact approach is limited. In this paper, we develop an adaptive querying solution with probabilistic guarantee of correctness and near-optimal sample complexity. We perform experiments in both synthetic and real-world datasets. Compared to the exact approach implemented with a commercial DBMS, previous sampling solutions achieve up to 2× speedup with erroneous answers. Our solution can produce 25× speedup with near-zero error in the answer.
Supplemental Material
- dbibitemlabel{1}C. C. Aggarwal, and J. Han. Frequent Pattern Mining. Springer. 2014.łabelbook:aggarwal2014%mdk%mdk%mdk-data-line=ref.bib:16 Google ScholarCross Ref
- dbibitemlabel{2}N. K. Ahmed, N. Duffield, T. L. Willke, and R. A. Rossi. textquotedblleftOn Sampling from Massive Graph Streams.textquotedblright Proc. VLDB Endow. 10 (11): 1430--1441. 2017.łabelvldb:ahmed2017%mdk%mdk%mdk-data-line=ref.bib:349 Google ScholarDigital Library
- dbibitemlabel{3}Anushka Anand, and Justin Talbot. textquotedblleftAutomatic Selection of Partitioning Variables for Small Multiple Displays.textquotedblright IEEE Transactions on Visualization & Computer Graphics 22 (1): 669--677. 2016.łabelinfovis:anand2015%mdk%mdk%mdk-data-line=ref.bib:472Google ScholarDigital Library
- dbibitemlabel{4}R. Bardenet, and O.-A. Maillard. textquotedblleftConcentration Inequalities for Sampling without Replacement.textquotedblright Bernoulli 21 (3): 1361--1385. Aug. 2015.łabelbernoulli:bardenet2015%mdk%mdk%mdk-data-line=ref.bib:171Google ScholarCross Ref
- dbibitemlabel{5}S. Bubeck, T. Wang, and N. Viswanathan. textquotedblleftMultiple Identifications in Multi-Armed Bandits.textquotedblright In ICML'13. 2013.łabelicml:bubeck2013%mdk%mdk%mdk-data-line=ref.bib:483 Google ScholarDigital Library
- dbibitemlabel{6}S.-O. Chan, I. Diakonikolas, G. Valiant, and P. Valiant. textquotedblleftOptimal Algorithms for Testing Closeness of Discrete Distributions.textquotedblright In SODA'14. 2014.łabelsoda:chan2014%mdk%mdk%mdk-data-line=ref.bib:164 Google ScholarDigital Library
- dbibitemlabel{7}S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. textquotedblleftCombinatorial Pure Exploration of Multi-Armed Bandits.textquotedblright In NIPS'14. 2014.łabelnips:chen2014%mdk%mdk%mdk-data-line=ref.bib:33 Google ScholarDigital Library
- dbibitemlabel{8}T. Chen, Y. Sun, Y. Shi, and L. Hong. textquotedblleftOn Sampling Strategies for Neural Network-Based Collaborative Filtering.textquotedblright In KDD'17. 2017.łabelkdd:chen2017%mdk%mdk%mdk-data-line=ref.bib:40 Google ScholarDigital Library
- dbibitemlabel{9}G. Cormode, and N. Duffield. textquotedblleftSampling for Big Data: A Tutorial.textquotedblright In KDD'14. 2014.łabelkdd:cormode2014%mdk%mdk%mdk-data-line=ref.bib:68 Google ScholarDigital Library
- dbibitemlabel{10}H. Ehsan, M. A. Sharaf, and P. K. Chrysanthis. textquotedblleftEfficient Recommendation of Aggregate Data Visualizations.textquotedblright IEEE Transactions on Knowledge and Data Engineering 30 (2): 263--277. Feb. 2018.łabeltkde:ehsan2018%mdk%mdk%mdk-data-line=ref.bib:493Google ScholarCross Ref
- dbibitemlabel{11}R. El-Yaniv, and D. Pechyony. textquotedblleftStable Transductive Learning.textquotedblright In COLT'06. 2006.łabelcolt:el-yaniv2006%mdk%mdk%mdk-data-line=ref.bib:86 Google ScholarDigital Library
- dbibitemlabel{12}J. Han, and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. 2001.łabelbook:han2001%mdk%mdk%mdk-data-line=ref.bib:359 Google ScholarDigital Library
- dbibitemlabel{13}S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer. textquotedblleftProfiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment.textquotedblright In Proc. Intl. Working Conf. on Advanced Visual Interfaces. AVI'12. 2012.łabelavi:kandel2012%mdk%mdk%mdk-data-line=ref.bib:430 Google ScholarDigital Library
- dbibitemlabel{14}A. Key, B. Howe, D. Perry, and C. Aragon. textquotedblleftVizDeck: Self-Organizing Dashboards for Visual Analytics.textquotedblright In SIGMOD'12. 2012.łabelsigmod:key2012%mdk%mdk%mdk-data-line=ref.bib:824 Google ScholarDigital Library
- dbibitemlabel{15}A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. textquotedblleftRapid Sampling for Visualizations with Ordering Guarantees.textquotedblright PVLDB 8 (5). 2015.łabelpvldb:kimbpimr15%mdk%mdk%mdk-data-line=ref.bib:147 Google ScholarDigital Library
- dbibitemlabel{16}H. Liu, H. Motoda, and L. Yu. textquotedblleftA Selective Sampling Approach to Active Feature Selection.textquotedblright Artificial Intelligence 159 (1): 49--74. 2004.łabelai:liu2004%mdk%mdk%mdk-data-line=ref.bib:26 Google ScholarDigital Library
- dbibitemlabel{17}M. Riondato, and E. Upfal. textquotedblleftMining Frequent Itemsets Through Progressive Sampling with Rademacher Averages.textquotedblright In KDD'15. 2015.łabelkdd:riondato2015%mdk%mdk%mdk-data-line=ref.bib:420 Google ScholarDigital Library
- dbibitemlabel{18}S. Sarawagi. textquotedblleftExplaining Differences in Multidimensional Aggregates.textquotedblright In VLDB'99. 1999.łabelvldb:sarawagi1999%mdk%mdk%mdk-data-line=ref.bib:79 Google ScholarDigital Library
- dbibitemlabel{19}T. Sellam, and M. Kersten. textquotedblleftFast, Explainable View Detection to Characterize Exploration Queries.textquotedblright In SSDBM'16. 2016.łabelssdbm:sellam2016%mdk%mdk%mdk-data-line=ref.bib:509 Google ScholarDigital Library
- dbibitemlabel{20}J. Seo, and B. Shneiderman. textquotedblleftA Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data.textquotedblright Information Visualization 4 (2): 96--113. 2005.łabelinfovis:jinwook2005%mdk%mdk%mdk-data-line=ref.bib:276 Google ScholarDigital Library
- dbibitemlabel{21}B. Tang, S. Han, M. L. Yiu, R. Ding, and D. Zhang. textquotedblleftExtracting Top-K Insights from Multi-Dimensional Data.textquotedblright In SIGMOD'17. 2017.łabelsigmod:zhang2017%mdk%mdk%mdk-data-line=ref.bib:519 Google ScholarDigital Library
- dbibitemlabel{22}M. Vartak, S. Rahman, S. Madden, A. Parameswaran, and N. Polyzotis. textquotedblleftSeeDB: Efficient Data-Driven Visualization Recommendations to Support Visual Analytics.textquotedblright Proc. VLDB Endow. 8 (13): 2182--2193. Sep. 2015.łabelvldb:vartak2015%mdk%mdk%mdk-data-line=ref.bib:242 Google ScholarDigital Library
- dbibitemlabel{23}A. Wasay, M. Athanassoulis, and S. Idreos. textquotedblleftQueriosity: Automated Data Exploration.textquotedblright In Proc. IEEE Intl. Congress on Big Data. 2015.łabelbigdata:wasay2015%mdk%mdk%mdk-data-line=ref.bib:530 Google ScholarDigital Library
- dbibitemlabel{24}K. Wongsuphasawat, D. Moritz, A. Anand, J. Mackinlay, B. Howe, and J. Heer. textquotedblleftVoyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations.textquotedblright IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis). 2016.łabelinfovis:2016-voyager%mdk%mdk%mdk-data-line=ref.bib:284Google Scholar
- dbibitemlabel{25}Y. Yan, L. J. Chen, and Z. Zhang. textquotedblleftError-Bounded Sampling for Analytics on Big Sparse Data.textquotedblright In VLDB'14. 2014.łabelvldb:yan2014%mdk%mdk Google ScholarDigital Library
Index Terms
- Efficient Attribute Recommendation with Probabilistic Guarantee
Recommendations
Efficient Approximation Framework for Attribute Recommendation
PACMMODTrend analysis is a fundamental type of analytical query in online analytical processing (OLAP) systems. In trend analysis, a key step is to identify k valuable attributes whose distributions in two subsets under different predicates significantly differ ...
Efficient processing of top-k dominating queries in distributed environments
Due to the recent massive data generation, preference queries are becoming an increasingly important for users because such queries retrieve only a small number of preferable data objects from a huge multi-dimensional dataset. A top-k dominating query, ...
Efficient Consistent Query Answering Based on Attribute Deletions
CSA '08: Proceedings of the International Symposium on Computer Science and its ApplicationsData integrated from multiple sources may contain inconsistencies. A consistent query answer (CQA) in a possibly inconsistent database is an answer which is true in every minimal repair of the database. It is proved that for most constraints and queries ...
Comments