ABSTRACT
In this paper, we present a comparison of nonparametric estimation methods for computing approximations of the selectivities of queries, in particular range queries. In contrast to previous studies, the focus of our comparison is on metric attributes with large domains which occur for example in spatial and temporal databases. We also assume that only small sample sets of the required relations are available for estimating the selectivity. In addition to the popular histogram estimators, our comparison includes so-called kernel estimation methods. Although these methods have been proven to be among the most accurate estimators known in statistics, they have not been considered for selectivity estimation of database queries, so far. We first show how to generate kernel estimators that deliver accurate approximate selectivities of queries. Thereafter, we reveal that two parameters, the number of samples and the so-called smoothing parameter, are important for the accuracy of both kernel estimators and histogram estimators. For histogram estimators, the smoothing parameter determines the number of bins (histogram classes). We first present the optimal smoothing parameter as a function of the number of samples and show how to compute approximations of the optimal parameter. Moreover, we propose a new selectivity estimator that can be viewed as an hybrid of histogram and kernel estimators. Experimental results show the performance of different estimators in practice. We found in our experiments that kernel estimators are most efficient for continuously distributed data sets, whereas for our real data sets the hybrid technique is most promising.
- 1.Chert, C.M. & Roussopoulos, N. "Adaptive Selectivity Estimation Using Query Feedback" Proc. ACM-SIG- MOD Intl. Conf. on Management of Data 1994. Google ScholarDigital Library
- 2.Ioannidis, Yannis E. & Christodoulakis, S. "Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results" TODS 18(4) 1993.709-748. Google ScholarDigital Library
- 3.Gregory Piatetsky-Shapiro & Charles Connell. "Accurate Estimation of the Number of Tuples Satisfying a Condition" SIGMOD Conference 1984. 256-276. Google ScholarDigital Library
- 4.Yossi Matias, Jeffrey Scott Vitter, Min Wang. "Wavelet-Based Histograms for Selectivity Estimation" SIG- MOD Conference 1998.448-459. Google ScholarDigital Library
- 5.Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya. "Random Sampling for Histogram Con-. struction: How much is enough?" SIGMOD Conference 1998.436-447. Google ScholarDigital Library
- 6.Joseph M. Hellerstein, Peter J. Haas, Helen Wang. "Online Aggregation" SIGMOD Conference 1997. 171-182. Google ScholarDigital Library
- 7.H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, Torsten Sue,1. "Optimal Histograms with Quality Guarantees" VLDB 1998. 275-286. Google ScholarDigital Library
- 8.Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, Eugene J. Shekita. "Improved Histograms for Selectivity Estimation of Range Predicates" SIGMOD Conference 1996. 294-305. Google ScholarDigital Library
- 9.Michael V. Mannino, Paicheng Chu, Thomas Sager. "Statistical Profile Estimation in Database Systems" Computing Surveys 20(3) 1988. 191-221. Google ScholarDigital Library
- 10.Gasser, T. & Engel, J. & Seifert, B. ,,Nonparametric function estimation" in: Rao (Ed.),"Handbook of Sta.tistics Vol. 9", North Holland 1993.Google Scholar
- 11.David W. Scott. "Multivariate Density Estimation" Wiley & Sons 1992.Google Scholar
- 12.Selinger, P.G. & Astrahan, M.M. & Chamberlin, D.D. & Lorie, R.A. & Price, T.T. ,,Access path selection in a relational database management system" Proc. ACM SGMOD Conference 1979.23-34. Google ScholarDigital Library
- 13.Silverman, B.W. ,,Density Estimation for Statistics and Data Analysis" Chapman & Hall 1986.Google Scholar
- 14.Simonoff, J. & Dong, J. "The Construction and Properties of Boundary Kernels for Sparse Multinomials" Journal of Computational and Graphical Statistics 1994.Google Scholar
- 15.Wand, M.P. & Jones, M.C. ,,Kernel Smoothing" Chapman & Hall 1995.Google Scholar
- 16.Brodsky, B.E. & Darkhovsky, B.S. "Nonparametric Methods in change-point problems" Kluwer Academic Publishers 1993.Google Scholar
- 17.http://www.mathematik.uni-marburg.de/DBS/download/data.htmlGoogle Scholar
- 18.http://www, tiger, gov/Google Scholar
- 19.http://www.kdnuggets.com/datasets,htmlGoogle Scholar
Index Terms
- A comparison of selectivity estimators for range queries on metric attributes
Recommendations
Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataSelectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality ...
A comparison of selectivity estimators for range queries on metric attributes
In this paper, we present a comparison of nonparametric estimation methods for computing approximations of the selectivities of queries, in particular range queries. In contrast to previous studies, the focus of our comparison is on metric attributes ...
Selectivity estimators for multidimensional range queries over real attributes
Estimating the selectivity of multidimensional range queries over real valued attributes has significant applications in data exploration and database query optimization. In this paper, we consider the following problem: given a table of d attributes ...
Comments