skip to main content
10.1145/1451940.1451972acmconferencesArticle/Chapter ViewAbstractPublication PagesideasConference Proceedingsconference-collections
research-article

Improved count suffix trees for natural language data

Published:10 September 2008Publication History

ABSTRACT

With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.

References

  1. J. Bae and S. Lee. Substring count estimation in extremely long strings. IEICE - Trans. Inf. Syst., E89-D(3):1148--1156, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Bressan and R. Irawan. Morphologic non-word error detection. Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA '04), pages 31--35, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem, in Proceedings of ICDE 2004, Boston, MA, USA, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.Google ScholarGoogle Scholar
  5. D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.Google ScholarGoogle Scholar
  6. H. Jagadish, O. Kapitskaia, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal The International Journal on Very Large Data Bases, 9(3):214--230, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In ACM SIGMOD International Conference on Management of Data, pages 12--13. ACM, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. Krovetz. Viewing Morphology as an Inference Process,. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191--203, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. K. Kukich. Techniques for automatically correcting words in text. ACM Computer Surveys, 24:379--439, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.Google ScholarGoogle ScholarCross RefCross Ref
  11. D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.Google ScholarGoogle Scholar
  12. F. M. Lian. Word hy-phen-a-tion by com-put-er. PhD thesis, Stanford University, Stanford, August 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.Google ScholarGoogle Scholar
  14. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.Google ScholarGoogle ScholarCross RefCross Ref
  15. Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1--11, 1973. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.Google ScholarGoogle ScholarCross RefCross Ref
  19. C. Silverstein, H. Marais, M. Henzinger, M. Moricz. Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, Vol. 33, pp. 6--12, 1999, ISSN:0163-5840 Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003Google ScholarGoogle Scholar
  1. Improved count suffix trees for natural language data

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications
        September 2008
        289 pages
        ISBN:9781605581880
        DOI:10.1145/1451940

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 September 2008

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate74of210submissions,35%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader