research-article

Improved count suffix trees for natural language data

Authors:
Guido Sautter

Universität Karlsruhe (TH), Karlsruhe

Universität Karlsruhe (TH), Karlsruhe
View Profile

,
Cristina Abba

Corso Francia, Torino

Corso Francia, Torino
View Profile

,
Klemens Böhm

Universität Karlsruhe (TH), Karlsruhe

Universität Karlsruhe (TH), Karlsruhe
View Profile

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applicationsSeptember 2008Pages 231–236https://doi.org/10.1145/1451940.1451972

Published:10 September 2008Publication History

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications

Pages 231–236

ABSTRACT

With more and more natural language text stored in databases, handling respective query predicates becomes very important. Optimizing queries with predicates includes (sub)string estimation, i.e., estimating the selectivity of query terms based on small summary statistics before query execution. Count Suffix Trees (CST) are commonly used to this end. While CST yield good estimates, they are expensive to build and require a large amount of memory to be stored. To fit in the data dictionary of database systems, they have to be severely pruned. Existing pruning techniques are based on suffix frequency or tree depth. In this paper, we propose new filtering and pruning techniques that reduce both the size of CST over natural-language texts and the cost of building them. The core idea is to exploit features of the natural language data, i.e., regarding only the suffixes that are useful in a linguistic sense. The most important innovations are (a) a new aggressive approximate syllabification technique to filter out suffixes, (b) a new affix and prefix stripping procedure that conflates more terms than conventional stemming techniques, (c) the deployment of state-of-the-art trigram techniques and a new syllable-based mechanism to filter out non-words (i.e., misspellings and other language anomalies like foreign words), which would cause an over-proportional growth of the CST otherwise. -- Our evaluation with large English text corpora shows that our new mechanisms in combination decrease the size of a CST by up to 80% and shorten the build phase significantly. From a different perspective, if storage space remains unchanged, the accuracy of selectivity estimates computed from the CST increases by up to 70%.

References

J. Bae and S. Lee. Substring count estimation in extremely long strings. IEICE - Trans. Inf. Syst., E89-D(3):1148--1156, 2006. Google ScholarDigital Library
S. Bressan and R. Irawan. Morphologic non-word error detection. Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA '04), pages 31--35, 2004. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem, in Proceedings of ICDE 2004, Boston, MA, USA, 2004. Google ScholarDigital Library
D. W. Cummings. American English Spelling: An Informal Description. Johns Hopkins University Press, 1988.Google Scholar
D. Graff. The aquaint corpus of english news text. Linguistic Data Consortium, Philadelphia, 2002.Google Scholar
H. Jagadish, O. Kapitskaia, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal The International Journal on Very Large Data Bases, 9(3):214--230, 2000. Google ScholarDigital Library
P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In ACM SIGMOD International Conference on Management of Data, pages 12--13. ACM, 1996. Google ScholarDigital Library
R. Krovetz. Viewing Morphology as an Inference Process,. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 191--203, 1993. Google ScholarDigital Library
K. Kukich. Techniques for automatically correcting words in text. ACM Computer Surveys, 24:379--439, 1992. Google ScholarDigital Library
M. Lennon, D. Pierce, B. Tarry, and P. Willett. An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3(177--183), 1981.Google ScholarCross Ref
D. D. Lewis. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/.Google Scholar
F. M. Lian. Word hy-phen-a-tion by com-put-er. PhD thesis, Stanford University, Stanford, August 1983. Google ScholarDigital Library
J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22--31, 1968.Google Scholar
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.Google ScholarCross Ref
Y. Tian, S. Tata, R. A. Hankins, and J. M. Patel. Practical methods for constructing suffix trees. The VLDB Journal, 14(3):281--299, 2005. Google ScholarDigital Library
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library
P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pages 1--11, 1973. Google ScholarDigital Library
E. M. Zamora, J. Pollock, and A. Zamora. The use of trigram analysis for spelling error detection. Information of Processing and Management, 17:305--316, 1981.Google ScholarCross Ref
C. Silverstein, H. Marais, M. Henzinger, M. Moricz. Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, Vol. 33, pp. 6--12, 1999, ISSN:0163-5840 Google ScholarDigital Library
R. Giegerich, S. Kurtz, J. Stoye, Efficient Implementation of Lazy Suffix Trees, Software: Practice and Experience, Vol. 33, No 11, 2003Google Scholar

Improved count suffix trees for natural language data
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Recommendations

Processing comparable corpora with Bilingual Suffix Trees
EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10

We introduce Bilingual Suffix Trees (BST), a data structure that is suitable for exploiting comparable corpora. We discuss algorithms that use BSTs in order to create parallel corpora and learn translations of unseen words from comparable corpora. ...
Read More
Improved selectivity estimator for XML queries based on structural synopsis

With the increasing popularity of XML database applications, the use of efficient XML query optimizers is becoming very essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best ...
Read More
Introduction to Chinese Natural Language Processing
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications
September 2008
289 pages
ISBN:9781605581880
DOI:10.1145/1451940
General Chair:
Bipin C. Desai
Concordia University
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 September 2008
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
count suffix tree
pruning
query optimization
selectivity estimation
text data
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate74of210submissions,35%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 171
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Improved count suffix trees for natural language data

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications

ABSTRACT

References

Cited By

Recommendations

Processing comparable corpora with Bilingual Suffix Trees

Improved selectivity estimator for XML queries based on structural synopsis

Introduction to Chinese Natural Language Processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Improved count suffix trees for natural language data

IDEAS '08: Proceedings of the 2008 international symposium on Database engineering & applications

ABSTRACT

References

Cited By

Recommendations

Processing comparable corpora with Bilingual Suffix Trees

Improved selectivity estimator for XML queries based on structural synopsis

Introduction to Chinese Natural Language Processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media