research-article

Histograms reloaded: the merits of bucket diversity

Authors:
Carl-Christian Kanne

University of Mannheim, Mannheim, Germany

University of Mannheim, Mannheim, Germany
View Profile

,
Guido Moerkotte

University of Mannheim, Mannheim, Germany

University of Mannheim, Mannheim, Germany
View Profile

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of dataJune 2010Pages 663–674https://doi.org/10.1145/1807167.1807239

Published:06 June 2010Publication History

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 663–674

ABSTRACT

Virtually all histograms store for each bucket the number of distinct values it contains and their average frequency. In this paper, we question this paradigm. We start out by investigating the estimation precision of three commercial database systems which also follow the above paradigm. It turns out that huge errors are quite common. We then introduce new bucket types and investigate their accuracy when building optimal histograms with them. The results are ambiguous. There is no clear winner among the bucket types. At this point, we (1) switch to heterogeneous histograms, where different buckets of the same histogram possibly are of different types, and (2) design more bucket types. The nice consequence of introducing heterogeneous histograms is that we can guarantee decent upper error bounds while at the same time heterogeneous histograms require far less space than homogeneous histograms.

References

M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. ACM SIGMOD/SIGACT Conf. on Princ. of Database Syst. (PODS), pages 268--279, 2000. Google ScholarDigital Library
T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990. Google ScholarDigital Library
P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 541--550, 2001. Google ScholarDigital Library
L. Haas, M. Carey, M. Livny, and A. Shukla. Seeking the truth about ad hoc join costs. The VLDB Journal, 6(3):241--256, 1997. Google ScholarDigital Library
Y. Ioannidis. Universality of serial histograms. In Proc. Int. Conf. on Very Large Data Bases (VLDB), pages 256--267, 1993. Google ScholarDigital Library
Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. on Database Systems, 18(4):709--748, 1993. Google ScholarDigital Library
Y. Ioannidis and V. Poosola. Balancing histogram optimality and practicality for query result size estimation. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 233--244, 1995. Google ScholarDigital Library
Y. Ioannidis and V. Poosola. Histogram-based solutions to diverse database estimation problems. IEEE Data Engineering Bulletin, 18(3):10--18, Sept 1995.Google Scholar
R. Kooi. The Optimization of Queries in Relational Databases. PhD thesis, Case Western Reserve University, 1980. Google ScholarDigital Library
G. Moerkotte. Profiles for single attributes. Technical Report TR-2010-003, Department of Mathematics and Computer Science, University of Mannheim, Mannheim, Germany, 2010.Google Scholar
G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proc. Int. Conf. on Very Large Data Bases (VLDB), 2(1):982--993, 2009. Google ScholarDigital Library
B. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 256--276, 1984. Google ScholarDigital Library
V. Poosola, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimates of range predicates. In Proc. of the ACM SIGMOD Conf. on Management of Data, pages 294--305, 1996. Google ScholarDigital Library

Index Terms

Histograms reloaded: the merits of bucket diversity

Recommendations

Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Histograms that guarantee a maximum multiplicative error (q-error) for estimates may significantly improve the plan quality of query optimizers. However, the construction time for histograms with maximum q-error was too high for practical use cases. In ...
Read More
Constructing Join Histograms from Histograms with q-error Guarantees
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Histograms are implemented and used in any database system, usually defined on a single-column of a database table. However, one of the most desired statistical data in such systems are statistics on the correlation among columns. In this paper we ...
Read More
Cell histograms versus color histograms for image representation and retrieval

This paper presents a new approach for content-based image retrieval, which is based on the well-known and widely used color histograms. Previous approaches have used a single global color histogram (GCH) for the whole image, or local color histograms (...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN:9781450300322
DOI:10.1145/1807167
General Chair:
Ahmed Elmagarmid
Purdue University, USA
,
Program Chair:
Divyakant Agrawal
University of California at Santa Barbara, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 June 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Badges
- Results Reproduced / v1.1
Author Tags
cardinality estimation
histograms
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 18
  Total Citations
  View Citations
- 528
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Histograms reloaded: the merits of bucket diversity

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA

Constructing Join Histograms from Histograms with q-error Guarantees

Cell histograms versus color histograms for image representation and retrieval