Histogram-based estimation techniques in database systems
Publisher:
  • University of Wisconsin at Madison
  • Engineering Experiment Station Madison, WI
  • United States
Order Number: UMI Order No. GAX97-16074
Abstract

Many commercial database management systems maintain histograms to summarize the contents of relations in order to estimate query result sizes and access plan costs efficiently. The accuracy of these estimates is often of critical importance. However, there has never been a systematic study of all aspects of histograms and their effectiveness in providing accurate estimates. In this thesis, we identify (theoretically and experimentally) the most accurate classes of histograms for estimating the sizes and distributions of the results of several important query operators, and we provide efficient (sampling-based) techniques to construct these histograms. All of these histograms are novel and differ in fundamental ways from traditional histograms. We provide a systematic classification of all classes of histograms based on certain canonical aspects of histograms that determine their effectiveness in a given estimation problem. We also provide techniques to capture dependencies between attributes in a relation and show that these techniques are far more accurate than the traditional attribute independence assumption. Finally, we use histograms to effectively balance load during parallel joins, thus demonstrating their versatility. Our overall conclusion from the accuracy and efficiency of histogram-based techniques is that they can be used both for enhanced accuracy in traditional applications (e.g., query optimizers) and in novel applications that can benefit from estimates (e.g., approximate query processors and load balancers).

Cited By

  1. Ba J and Rigger M Testing Database Engines via Query Plan Guidance Proceedings of the 45th International Conference on Software Engineering, (2060-2071)
  2. Hamdan H A mixture model approach to big data clustering and classification Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, (1-6)
  3. Lin X, Zhang Q, Yuan Y and Liu Q (2007). Error minimization in approximate range aggregates, Data & Knowledge Engineering, 62:1, (156-176), Online publication date: 1-Jul-2007.
  4. Wang Y Relations between two common types of rectangular tilings Proceedings of the 17th international conference on Algorithms and Computation, (193-202)
  5. Lin X, Liu Q, Yuan Y, Zhou X and Lu H (2006). Summarizing level-two topological relations in large spatial datasets, ACM Transactions on Database Systems (TODS), 31:2, (584-630), Online publication date: 1-Jun-2006.
  6. Roy P, Mohania M, Bamba B and Raman S Towards automatic association of relevant unstructured content with structured query results Proceedings of the 14th ACM international conference on Information and knowledge management, (405-412)
  7. Muthukrishnan S, Strauss M and Zheng X Workload-optimal histograms on streams Proceedings of the 13th annual European conference on Algorithms, (734-745)
  8. Elmongui H, Mokbel M and Aref W Spatio-temporal histograms Proceedings of the 9th international conference on Advances in Spatial and Temporal Databases, (19-36)
  9. Gilbert A, Kotidis Y, Muthukrishnan S and Strauss M (2005). Domain-Driven Data Synopses for Dynamic Quantiles, IEEE Transactions on Knowledge and Data Engineering, 17:7, (927-938), Online publication date: 1-Jul-2005.
  10. Muthukrishnan S and Suel T (2005). Approximation algorithms for array partitioning problems, Journal of Algorithms, 54:1, (85-104), Online publication date: 1-Jan-2005.
  11. Mamoulis N, Papadias D and Arkoumanis D (2004). Complex Spatial Query Processing, Geoinformatica, 8:4, (311-346), Online publication date: 1-Dec-2004.
  12. Pham H and Sevcik K Structure choices for two-dimensional histogram construction Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, (13-27)
  13. Miled Z, Liu J, Bukhres O, Li H, Martin J, Balagopalakrishna C and Oppelt R (2004). Use and Maintenance of Histograms for Large Scientific Database Access Planning, Journal of Intelligent Information Systems, 23:2, (145-178), Online publication date: 1-Sep-2004.
  14. Zhang Q and Lin X Clustering moving objects for spatio-temporal selectivity estimation Proceedings of the 15th Australasian database conference - Volume 27, (123-130)
  15. Amir A, Kashi R and Netanyahu N Efficient Multidimensional Quantitative Hypotheses Generation Proceedings of the Third IEEE International Conference on Data Mining
  16. Shin H, Moon B and Lee S (2003). Adaptive and Incremental Processing for Distance Join Queries, IEEE Transactions on Knowledge and Data Engineering, 15:6, (1561-1578), Online publication date: 1-Nov-2003.
  17. Lin X, Liu Q, Yuan Y and Zhou X Multiscale histograms Proceedings of the 29th international conference on Very large data bases - Volume 29, (814-825)
  18. An N, Jin J and Sivasubramaniam A (2003). Toward an Accurate Analysis of Range Queries on Spatial Data, IEEE Transactions on Knowledge and Data Engineering, 15:2, (305-323), Online publication date: 1-Feb-2003.
  19. Qiao L, Agrawal D and Abbadi A RHist Proceedings of the eleventh international conference on Information and knowledge management, (469-476)
  20. Gibbons P, Matias Y and Poosala V (2002). Fast incremental maintenance of approximate histograms, ACM Transactions on Database Systems, 27:3, (261-298), Online publication date: 1-Sep-2002.
  21. Bernardino J, Furtado P and Madeira H (2002). Approximate Query Answering Using Data Warehouse Striping, Journal of Intelligent Information Systems, 19:2, (145-167), Online publication date: 1-Sep-2002.
  22. Gilbert A, Kotidis Y, Muthukrishnan S and Strauss M How to summarize the universe Proceedings of the 28th international conference on Very Large Data Bases, (454-465)
  23. Cadez I, Smyth P, McLachlan G and McLaren C (2002). Maximum Likelihood Estimation of Mixture Densities for Binned and Truncated Multivariate Data, Machine Learning, 47:1, (7-34), Online publication date: 1-Apr-2002.
  24. Berman P, DasGupta B and Muthukrishnan S Slice and dice Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, (455-464)
  25. Riedewald M, Agrawal D and El Abbadi A Managing and analyzing massive data sets with data cubes Handbook of massive data sets, (547-578)
  26. Garofalakis M and Gibbons P Approximate Query Processing Proceedings of the 27th International Conference on Very Large Data Bases
  27. Gibbons P Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports Proceedings of the 27th International Conference on Very Large Data Bases, (541-550)
  28. Amir A, Kashi R and Netanyahu N Analyzing Quantitative Databases Proceedings of the 27th International Conference on Very Large Data Bases, (89-98)
  29. Gilbert A, Kotidis Y, Muthukrishnan S and Strauss M Surfing Wavelets on Streams Proceedings of the 27th International Conference on Very Large Data Bases, (79-88)
  30. Berman P, DasGupta B, Muthukrishnan S and Ramaswami S Improved approximation algorithms for rectangle tiling and packing Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, (427-436)
  31. Faloutsos C, Seeger B, Traina A and Traina C (2000). Spatial join selectivity using power laws, ACM SIGMOD Record, 29:2, (177-188), Online publication date: 1-Jun-2000.
  32. Faloutsos C, Seeger B, Traina A and Traina C Spatial join selectivity using power laws Proceedings of the 2000 ACM SIGMOD international conference on Management of data, (177-188)
  33. König A and Weikum G Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-size Estimation Proceedings of the 25th International Conference on Very Large Data Bases, (423-434)
  34. Acharya S, Gibbons P, Poosala V and Ramaswamy S Join synopses for approximate query answering Proceedings of the 1999 ACM SIGMOD international conference on Management of data, (275-286)
  35. Acharya S, Poosala V and Ramaswamy S Selectivity estimation in spatial databases Proceedings of the 1999 ACM SIGMOD international conference on Management of data, (13-24)
  36. Acharya S, Gibbons P, Poosala V and Ramaswamy S (1999). Join synopses for approximate query answering, ACM SIGMOD Record, 28:2, (275-286), Online publication date: 1-Jun-1999.
  37. Acharya S, Poosala V and Ramaswamy S (1999). Selectivity estimation in spatial databases, ACM SIGMOD Record, 28:2, (13-24), Online publication date: 1-Jun-1999.
  38. Alon N, Gibbons P, Matias Y and Szegedy M Tracking join and self-join sizes in limited storage Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, (10-20)
  39. Furtado P and Madeira H Summary Grids Proceedings of the Sixth International Conference on Database Systems for Advanced Applications, (187-194)
  40. Saraç K, Eğecioǧlu Ö and El Abbadi A Iterated DFT based techniques for join size estimation Proceedings of the seventh international conference on Information and knowledge management, (348-355)
  41. Matias Y, Vitter J and Wang M (1998). Wavelet-based histograms for selectivity estimation, ACM SIGMOD Record, 27:2, (448-459), Online publication date: 1-Jun-1998.
  42. Matias Y, Vitter J and Wang M Wavelet-based histograms for selectivity estimation Proceedings of the 1998 ACM SIGMOD international conference on Management of data, (448-459)
  43. Khanna S, Muthukrishnan S and Paterson M On approximating rectangle tiling and packing Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, (384-393)
  44. Gibbons P, Matias Y and Poosala V Fast Incremental Maintenance of Approximate Histograms Proceedings of the 23rd International Conference on Very Large Data Bases, (466-475)
Contributors
  • Nokia Bell Labs
