Many commercial database management systems maintain histograms to summarize the contents of relations so that they can efficiently estimate query result sizes and access plan costs. The accuracy of these estimates is often of critical importance. However, there has never been a systematic study of all aspects of histograms and their effectiveness in providing accurate estimates. In this thesis, we identify (theoretically and experimentally) the most accurate classes of histograms for estimating the sizes and distributions of the results of several important query operators, and we provide efficient (sampling-based) techniques to construct these histograms. All of these histograms are novel and differ in fundamental ways from traditional histograms. We provide a systematic classification of all classes of histograms based on certain canonical aspects of histograms that determine their effectiveness in a given estimation problem. We also provide techniques to capture dependencies between attributes in a relation and show that these techniques are far more accurate than the traditional attribute independence assumption. Finally, we use histograms to effectively balance load during parallel joins, thus demonstrating their versatility. Our overall conclusion from the accuracy and efficiency of histogram-based techniques is that they can be used both to improve accuracy in traditional applications (e.g., query optimizers) and to support novel applications that can benefit from estimates (e.g., approximate query processors and load balancers).
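To make the idea of histogram-based result-size estimation concrete, the following is a minimal sketch of a traditional equi-depth histogram used to estimate the size of a range selection. It is not one of the novel histogram classes studied in the thesis. The names `Bucket`, `build_equi_depth`, and `estimate_range_count` are hypothetical, and the estimator assumes that values are spread uniformly within each bucket.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    lo: float   # smallest attribute value in the bucket (inclusive)
    hi: float   # largest attribute value in the bucket (inclusive)
    count: int  # number of tuples summarized by the bucket


def build_equi_depth(values, num_buckets):
    """Split the sorted values into buckets holding roughly equal tuple counts."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(num_buckets):
        start = b * n // num_buckets
        end = (b + 1) * n // num_buckets
        if start < end:  # skip empty buckets when n < num_buckets
            buckets.append(Bucket(data[start], data[end - 1], end - start))
    return buckets


def estimate_range_count(buckets, lo, hi):
    """Estimate how many tuples satisfy lo <= value <= hi.

    Assumption (illustrative only): values are uniformly distributed
    between a bucket's lo and hi boundaries.
    """
    total = 0.0
    for b in buckets:
        overlap_lo = max(lo, b.lo)
        overlap_hi = min(hi, b.hi)
        if overlap_lo > overlap_hi:
            continue  # the query range does not touch this bucket
        width = b.hi - b.lo
        if width == 0:
            total += b.count  # single-valued bucket lies fully inside the range
        else:
            total += b.count * (overlap_hi - overlap_lo) / width
    return total


if __name__ == "__main__":
    attribute_values = list(range(1, 101))  # hypothetical column contents
    histogram = build_equi_depth(attribute_values, num_buckets=4)
    print(estimate_range_count(histogram, 26, 50))
```

A query optimizer would keep only the few buckets rather than the full column, trading some accuracy for a much smaller summary. How the bucket boundaries are chosen is exactly the kind of design choice that determines how accurate such estimates are.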
Index Terms
- Histogram-based estimation techniques in database systems