Abstract
Similarity search of complex structures is an important operation in graph-related applications since exact matching is often too restrictive. In this article, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed feature misses, our structural filtering algorithm can filter graphs without performing pairwise similarity computation. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy that could maximize the filtering capability. We prove that the complexity of optimal feature set selection is Ω(2m) in the worst case, where m is the number of features for selection. In practice, we identify several criteria to build effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly within a multifilter composition framework. The proposed feature-based filtering concept can be generalized and applied to searching approximate nonconsecutive sequences, trees, and other structured data as well.
- Beretti, S., Bimbo, A., and Vicario, E. 2001. Efficient matching and indexing of graph models in content based retrieval. IEEE Trans. Patt. Analys. Mach. Intell. 23, 1089--1105. Google ScholarDigital Library
- Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., and Bourne, P. 2000. The protein data bank. Nucl. Acids Res. 28, 235--242.Google ScholarCross Ref
- Bunke, H. and Shearer, K. 1998. A graph distance metric based on the maximal common subgraph. Patt. Recogn. Lett. 19, 255--259. Google ScholarDigital Library
- Erickson, J. 1996. Lower bounds for fundamental geometric problems. Ph.D. dissertation. University of California at Berkeley, Berkley, CA. Google ScholarDigital Library
- Feige, U. 1998. A threshold of ln n for approximating set cover. J. ACM 45, 634--652. Google ScholarDigital Library
- Garey, M. and Johnson, D. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman & Co., New York, NY. Google ScholarDigital Library
- Giugno, R. and Shasha, D. 2002. Graphgrep: A fast and universal method for querying graphs. Proceedings of the 2002 International Conference on Pattern Recognition (ICPR'02). 112--115.Google Scholar
- Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a dbms for approximate string processing. Data Eng. Bull. 24, 28--37.Google Scholar
- Hagadone, T. 1992. Molecular substructure similarity searching: Efficient retrieval in two-dimensional structure databases. J. Chem. Inf. Comput. Sci. 32, 515--521.Google ScholarCross Ref
- Hochbaum, D. 1997. Approximation Algorithms for NP-Hard Problems. PWS Publishing, Boston, MA. Google ScholarDigital Library
- Holder, L., Cook, D., and Djoko, S. 1994. Substructure discovery in the subdue system. In Proceedings of the AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94). 169--180.Google Scholar
- Kailing, K., Kriegel, H., Schnauer, S., and Seidl, T. 2004. Efficient similarity search for hierarchical data in large databases. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT'04). 676--693.Google Scholar
- Kanehisa, M. and Goto, S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucl. Acids Res. 28, 27--30.Google ScholarCross Ref
- Kislicyn, S. S. 1964. On the selection of the kth element of an ordered set by pairwise comparisons. Sibirskii Mat. Z. (in Russian) 5, 557--564.Google Scholar
- Kuramochi, M. and Karypis, G. 2001. Frequent subgraph discovery. In Proceedings of the 2001 International Conference on Data Mining (ICDM'01). 313--320. Google ScholarDigital Library
- Messmer, B. and Bunke, H. 1998. A new algorithm for error-tolerant subgraph isomorphism detection. IEEE Trans. Patt. Analys. Mach. Intell. 20, 493--504. Google ScholarDigital Library
- Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 31--88. Google ScholarDigital Library
- Nilsson, N. 1980. Principles of Artificial Intelligence. Morgan Kaufmann, Palo Alto, CA. Google ScholarDigital Library
- Padberg, M. 1995. Linear Optimization and Extensions. Springer-Verlag, Berlin, Germany.Google Scholar
- Petrakis, E. and Faloutsos, C. 1997. Similarity searching in medical image databases. Knowl. Data Eng. 9, 3, 435--447. Google ScholarDigital Library
- Raymond, J., Gardiner, E., and Willett, P. 2002. Rascal: Calculation of graph similarity using maximum common edge subgraphs. Comput. J. 45, 631--644.Google ScholarCross Ref
- Shasha, D., Wang, J., and Giugno, R. 2002. Algorithmics and applications of tree and graph searching. In Proceedings of the 21th ACM Symposium on Principles of Database Systems (PODS'02). 39--52. Google ScholarDigital Library
- Srinivasa, S. and Kumar, S. 2003. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proceedings of the 2003 International Conference on Very Large Data Bases (VLDB'03). 975--986. Google ScholarDigital Library
- Ukkonen, E. 1992. Approximate string matching with q-grams and maximal matches. Theoret. Comput. Sci. 191--211. Google ScholarDigital Library
- Ullmann, J. 1977. Binary n-gram technique for automatic correction of substitution, deletion, insertion, and reversal errors in words. Comput. J. 20, 141--147.Google ScholarCross Ref
- Wang, J., Zhang, K., Jeong, K., and Shasha, D. 1994. A system for approximate tree matching. IEEE Trans. Knowl. Data Eng. 6, 559--571. Google ScholarDigital Library
- Willett, P., Barnard, J., and Downs, G. 1998. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38, 983--996.Google ScholarCross Ref
- Yan, X., Yu, P., and Han, J. 2004. Graph indexing: A frequent structure-based approach. In Proceedings of the 2004 ACM International Conference on Management of Data (SIGMOD'04). 335--346. Google ScholarDigital Library
- Yan, X., Yu, P., and Han, J. 2005. Substructure similarity search in graph databases. In Proceedings of the 2005 ACM International Conference on Management of Data (SIGMOD'05). 766--777. Google ScholarDigital Library
Index Terms
- Feature-based similarity search in graph structures
Recommendations
Efficient processing of graph similarity search
A graph similarity search is to find a set of graphs from a graph database that are similar to a given query graph. Existing works solve this problem by first defining a similarity measure between two graphs, and then presenting a filtering mechanism ...
Finding top-k similar graphs in graph databases
EDBT '12: Proceedings of the 15th International Conference on Extending Database TechnologyQuerying similar graphs in graph databases has been widely studied in graph query processing in recent years. Existing works mainly focus on subgraph similarity search and supergraph similarity search. In this paper, we study the problem of finding top-...
Index Structures for Fast Similarity Search for Binary Vectors
This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
Comments