ABSTRACT
With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.
- J. Bae and S. Lee. Substring count estimation in extremely long strings. IEICE - Trans. Inf. Syst., E89-D(3):1148--1156, 2006. Google ScholarDigital Library
- S. Bressan and R. Irawan. Morphologic non-word error detection. Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA '04), pages 31--35, 2004. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem, in Proceedings of ICDE 2004, Boston, MA, USA, 2004. Google ScholarDigital Library
- D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.Google Scholar
- D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.Google Scholar
- H. Jagadish, O. Kapitskaia, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal The International Journal on Very Large Data Bases, 9(3):214--230, 2000. Google ScholarDigital Library
- P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In ACM SIGMOD International Conference on Management of Data, pages 12--13. ACM, 1996. Google ScholarDigital Library
- R. Krovetz. Viewing Morphology as an Inference Process,. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191--203, 1993. Google ScholarDigital Library
- K. Kukich. Techniques for automatically correcting words in text. ACM Computer Surveys, 24:379--439, 1992. Google ScholarDigital Library
- M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.Google ScholarCross Ref
- D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.Google Scholar
- F. M. Lian. Word hy-phen-a-tion by com-put-er. PhD thesis, Stanford University, Stanford, August 1983. Google ScholarDigital Library
- J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.Google Scholar
- M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.Google ScholarCross Ref
- Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005. Google ScholarDigital Library
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library
- P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1--11, 1973. Google ScholarDigital Library
- E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.Google ScholarCross Ref
- C. Silverstein, H. Marais, M. Henzinger, M. Moricz. Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, Vol. 33, pp. 6--12, 1999, ISSN:0163-5840 Google ScholarDigital Library
- R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003Google Scholar
- Improved count suffix trees for natural language data
Recommendations
Processing comparable corpora with Bilingual Suffix Trees
EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10We introduce Bilingual Suffix Trees (BST), a data structure that is suitable for exploiting comparable corpora. We discuss algorithms that use BSTs in order to create parallel corpora and learn translations of unseen words from comparable corpora. ...
Improved selectivity estimator for XML queries based on structural synopsis
With the increasing popularity of XML database applications, the use of efficient XML query optimizers is becoming very essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best ...
Comments