skip to main content
10.5555/1315451.1315509dlproceedingsArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

Published:09 September 2003Publication History

ABSTRACT

We present the BHUNT scheme for automatically discovering algebraic constraints between pairs of columns in relational data. The constraints may be "fuzzy" in that they hold for most, but not all, of the records, and the columns may be in the same table or different tables. Such constraints are of interest in the context of both data mining and query optimization, and the BHUNT methodology can potentially be adapted to discover fuzzy functional dependencies and other useful relationships. BHUNT first identifies candidate sets of column value pairs that are likely to satisfy an algebraic constraint. This discovery process exploits both system catalog information and data samples, and employs pruning heuristics to control processing costs. For each candidate, BHUNT constructs algebraic constraints by applying statistical histogramming, segmentation, or clustering techniques to samples of column values. Using results from the theory of tolerance intervals, the sample sizes can be chosen to control the number of "exception" records that fail to satisfy the discovered constraints. In query-optimization mode, BHUNT can automatically partition the data into normal and exception records. During subsequent query processing, queries can be modified to incorporate the constraints; the optimizer uses the constraints to identify new, more efficient access paths. The results are then combined with the results of executing the original query against the (small) set of exception records. Experiments on a very large database using a prototype implementation of BHUNT show reductions in table accesses of up to two orders of magnitude, leading to speedups in query processing by up to a factor of 6.8.

References

  1. {1} M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions. Dover, New York, 1972. Ninth printing.Google ScholarGoogle Scholar
  2. {2} D. Barbarà, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. IEEE Data Engrg. Bull., 20:3-45, 1997.Google ScholarGoogle Scholar
  3. {3} S. Bell and P. Brockhausen. Discovery of constraints and data dependencies in databases. In Proc. Europ. Conf. Machine Learning (ECML-95), Lecture Notes in Artificial Intellegence 914, pages 267-270, Berlin, 1995. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. {4} N. Bruno and S. Chaudhuri. Exploiting statistics on query expressions for optimization. In Proc. 2002 ACM SIGMOD Intl. Conf. Managment of Data, pages 263-274. ACM Press, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. {5} A. Deshpande, M. N. Garofalakis, and R. R. Independence is good: Dependency-based histogram synopses for high-dimensional data. In Proc. 2001 ACM SIGMOD Intl. Conf. Managment of Data, pages 199-210. ACM Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. {6} M. N. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In Proc. 2002 ACM SIGMOD Intl. Conf. Managment of Data, pages 476-487. ACM Press, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. {7} P. Hall and E. J. Hannan. On stochastic complexity and nonparametric density estimation. Biometrika, 75:705-714, 1988.Google ScholarGoogle ScholarCross RefCross Ref
  8. {8} T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  9. {9} Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Tiovonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J., 42:100-111, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  10. {10} IBM Research. Autonomic computing, 2003. http://www.research.ibm.com/autonomic.Google ScholarGoogle Scholar
  11. {11} Microsoft Research. The AutoAdmin project, 2003. http://research.microsoft.com/dmx/autoadmin.Google ScholarGoogle Scholar
  12. {12} J.-M. Petit, F. Toumani, J.-F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proc. 12th Intl. Conf. Data Engrg., pages 218-227. IEEE Computer Society Press, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. {13} V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. 23rd Intl. Conf. Very Large Data Bases, pages 486-495, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. {14} H. Scheffé and J. W. Tukey. A formula for sample sizes for population tolerance limits. Ann. Math. Statist., 15:217, 1944.Google ScholarGoogle ScholarCross RefCross Ref
  15. {15} H. Scheffé and J. W. Tukey. Nonparametric estimation. I. Validation of order statistics. Ann. Math. Statist., 16:187-192, 1945.Google ScholarGoogle ScholarCross RefCross Ref
  16. {16} D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  17. {17} M. Siegel, E. Sciore, and S. Salveter. A method for automatic rule derivation to support semantic query optimization. ACM Trans. Database Syst., 17:563-600, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. {18} M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO -- DB2's LEarning Optimizer. In Proc. 27th Intl. Conf. Very Large Data Bases, pages 19-28, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. {19} R. Tibshirani. Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist Soc. B, 63:411-423, 2001.Google ScholarGoogle ScholarCross RefCross Ref
  20. {20} Transaction Processing Performance Council (TPC). TPC Benchmark D (Decision Support) Standard Specification, Revision 2.1. San Jose, CA, 1998. http://www.tpc.org/tpcd.Google ScholarGoogle Scholar
  21. {21} J. W. Tukey. Nonparametric estimation II. Statistically equivalent blocks and tolerance regions--The continuous case. Ann. Math. Statist., 18:529-539, 1947.Google ScholarGoogle ScholarCross RefCross Ref
  22. {22} G. Weikum, A. Moenkeberg, C. Hasse, and P. Zabback. Self-tuning database technology and information services: from wishful thinking to viable engineering. In Proc. 28th Intl. Conf. Very Large Data Bases, pages 20-34, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. {23} S. K. M. Wong, C. J. Butz, and Y. Xiang. Automated database schema design using mined data dependencies. J. Amer. Soc. Inform. Sci., 49:455-470, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. {24} C. T. Yu and W. Sun. Automatic knowledge acquisition and maintenance for semantic query optimization. IEEE Trans. Knowledge Data Engrg., 1:362-375, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader