Article

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

Authors:
Paul G. Brown

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

,
Peter J. Hass

IBM Almaden Research Center, San Jose, CA

IBM Almaden Research Center, San Jose, CA
View Profile

VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29September 2003Pages 668–679

Published:09 September 2003Publication History

VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29

Pages 668–679

ABSTRACT

We present the BHUNT scheme for automatically discovering algebraic constraints between pairs of columns in relational data. The constraints may be "fuzzy" in that they hold for most, but not all, of the records, and the columns may be in the same table or different tables. Such constraints are of interest in the context of both data mining and query optimization, and the BHUNT methodology can potentially be adapted to discover fuzzy functional dependencies and other useful relationships. BHUNT first identifies candidate sets of column value pairs that are likely to satisfy an algebraic constraint. This discovery process exploits both system catalog information and data samples, and employs pruning heuristics to control processing costs. For each candidate, BHUNT constructs algebraic constraints by applying statistical histogramming, segmentation, or clustering techniques to samples of column values. Using results from the theory of tolerance intervals, the sample sizes can be chosen to control the number of "exception" records that fail to satisfy the discovered constraints. In query-optimization mode, BHUNT can automatically partition the data into normal and exception records. During subsequent query processing, queries can be modified to incorporate the constraints; the optimizer uses the constraints to identify new, more efficient access paths. The results are then combined with the results of executing the original query against the (small) set of exception records. Experiments on a very large database using a prototype implementation of BHUNT show reductions in table accesses of up to two orders of magnitude, leading to speedups in query processing by up to a factor of 6.8.

References

{1} M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. Dover, New York, 1972. Ninth printing.Google Scholar
{2} D. Barbarà, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. IEEE Data Engrg. Bull., 20:3-45, 1997.Google Scholar
{3} S. Bell and P. Brockhausen. Discovery of constraints and data dependencies in databases. In Proc. Europ. Conf. Machine Learning (ECML-95), Lecture Notes in Artificial Intellegence 914, pages 267-270, Berlin, 1995. Springer-Verlag. Google ScholarDigital Library
{4} N. Bruno and S. Chaudhuri. Exploiting statistics on query expressions for optimization. In Proc. 2002 ACM SIGMOD Intl. Conf. Managment of Data, pages 263-274. ACM Press, 2002. Google ScholarDigital Library
{5} A. Deshpande, M. N. Garofalakis, and R. R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. 2001 ACM SIGMOD Intl. Conf. Managment of Data, pages 199-210. ACM Press, 2001. Google ScholarDigital Library
{6} M. N. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In Proc. 2002 ACM SIGMOD Intl. Conf. Managment of Data, pages 476-487. ACM Press, 2002. Google ScholarDigital Library
{7} P. Hall and E. J. Hannan. On stochastic complexity and nonparametric density estimation. Biometrika, 75:705-714, 1988.Google ScholarCross Ref
{8} T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.Google ScholarCross Ref
{9} Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Tiovonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J., 42:100-111, 1999.Google ScholarCross Ref
{10} IBM Research. Autonomic computing, 2003. http://www.research.ibm.com/autonomic.Google Scholar
{11} Microsoft Research. The AutoAdmin project, 2003. http://research.microsoft.com/dmx/autoadmin.Google Scholar
{12} J.-M. Petit, F. Toumani, J.-F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proc. 12th Intl. Conf. Data Engrg., pages 218-227. IEEE Computer Society Press, 1996. Google ScholarDigital Library
{13} V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. 23rd Intl. Conf. Very Large Data Bases, pages 486-495, 1997. Google ScholarDigital Library
{14} H. Scheffé and J. W. Tukey. A formula for sample sizes for population tolerance limits. Ann. Math. Statist., 15:217, 1944.Google ScholarCross Ref
{15} H. Scheffé and J. W. Tukey. Nonparametric estimation. I. Validation of order statistics. Ann. Math. Statist., 16:187-192, 1945.Google ScholarCross Ref
{16} D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York, 1992.Google ScholarCross Ref
{17} M. Siegel, E. Sciore, and S. Salveter. A method for automatic rule derivation to support semantic query optimization. ACM Trans. Database Syst., 17:563-600, 1992. Google ScholarDigital Library
{18} M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO -- DB2's LEarning Optimizer. In Proc. 27th Intl. Conf. Very Large Data Bases, pages 19-28, 2001. Google ScholarDigital Library
{19} R. Tibshirani. Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist Soc. B, 63:411-423, 2001.Google ScholarCross Ref
{20} Transaction Processing Performance Council (TPC). TPC Benchmark D (Decision Support) Standard Specification, Revision 2.1. San Jose, CA, 1998. http://www.tpc.org/tpcd.Google Scholar
{21} J. W. Tukey. Nonparametric estimation II. Statistically equivalent blocks and tolerance regions--The continuous case. Ann. Math. Statist., 18:529-539, 1947.Google ScholarCross Ref
{22} G. Weikum, A. Moenkeberg, C. Hasse, and P. Zabback. Self-tuning database technology and information services: from wishful thinking to viable engineering. In Proc. 28th Intl. Conf. Very Large Data Bases, pages 20-34, 2002. Google ScholarDigital Library
{23} S. K. M. Wong, C. J. Butz, and Y. Xiang. Automated database schema design using mined data dependencies. J. Amer. Soc. Inform. Sci., 49:455-470, 1998. Google ScholarDigital Library
{24} C. T. Yu and W. Sun. Automatic knowledge acquisition and maintenance for semantic query optimization. IEEE Trans. Knowledge Data Engrg., 1:362-375, 1989. Google ScholarDigital Library

Index Terms

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Probabilistic reasoning
      2. Vagueness and fuzzy logic
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Relational database model
    2. Query languages
      1. Relational database query languages
  2. Information retrieval
    1. Evaluation of retrieval results

Recommendations

Mining uncertain data for constrained frequent sets
IDEAS '09: Proceedings of the 2009 International Database Engineering & Applications Symposium

Data mining aims to search for implicit, previously unknown, and potentially useful pieces of information---such as sets of items that are frequently co-occurring together---that are embedded in data. The mined frequent sets can be used in the discovery ...
Read More
Efficient algorithms for mining high-utility itemsets in uncertain databases

High-utility itemset mining (HUIM) is a useful set of techniques for discovering patterns in transaction databases, which considers both quantity and profit of items. However, most algorithms for mining high-utility itemsets (HUIs) assume that the ...
Read More
Efficient algorithms for mining constrained frequent patterns from uncertain data
U '09: Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data

Mining of frequent patterns is one of the popular knowledge discovery and data mining (KDD) tasks. It also plays an essential role in the mining of many other patterns such as correlation, sequences, and association rules. Hence, it has been the subject ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29
September 2003
1134 pages
ISBN:0127224424
Editors:
Johann Christoph Freytag,
Peter C. Lockemann,
Serge Abiteboul,
Michael J. Carey,
Patricia G. Selinger,
Andreas Heuer
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 9 September 2003
Qualifiers
- Article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 32
  Total Citations
  View Citations
- 197
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining uncertain data for constrained frequent sets

Efficient algorithms for mining high-utility itemsets in uncertain databases

Efficient algorithms for mining constrained frequent patterns from uncertain data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29

ABSTRACT

References

Cited By

Index Terms

Recommendations

Mining uncertain data for constrained frequent sets

Efficient algorithms for mining high-utility itemsets in uncertain databases

Efficient algorithms for mining constrained frequent patterns from uncertain data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media