ABSTRACT
In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text T. Specifically, for a string T of length n, we present a data structure which does the following: given a pattern P, if there is a substring of T within the distance R from P, it reports a (possibly different) substring of T within distance cR from P. The length of the pattern P, denoted by m, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses Õ(n1+1/c) space1 and with Õ(n1/c + mno(1)) query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length m is fixed in advance. In addition, our data structure can be constructed in time Õ(n1+1/c + n1+o(1)M1/3), where M is an upper bound for m. This essentially matches the preprocessing bound of [Ind98] as long as the term Õ(n1+1/c) dominates the running time, which is the case when, e.g., c < 3.We also extend our results to the case where the distances are measured according to the l1 distance. The query time and the space bound are essentially the same, while the preprocessing time becomes Õ(n1+1/c + n1+o(1)M2/3).
Index Terms
- Efficient algorithms for substring near neighbor problem
Recommendations
Linear Time Algorithms for Generalizations of the Longest Common Substring Problem
In its simplest form, the longest common substring problem is to find a longest substring common to two or multiple strings. Using (generalized) suffix trees, this problem can be solved in linear time and space. A first generalization is the k -common ...
Linear Time Algorithms for Generalizations of the Longest Common Substring Problem
In its simplest form, the longest common substring problem is to find a longest substring common to two or multiple strings. Using (generalized) suffix trees, this problem can be solved in linear time and space. A first generalization is the k-common ...
More Efficient Algorithms for Closest String and Substring Problems
The closest string problem and the closest substring problem are all natural theoretical computer science problems and find important applications in computational biology. Given $n$ input strings, the closest string (substring) problem finds a new ...
Comments