Article

Similarity search: a matching based approach

Authors:
Anthony K. H. Tung

National Univ. of Singapore

National Univ. of Singapore
View Profile

,
Rui Zhang

Univ. of Melbourne

Univ. of Melbourne
View Profile

,
Nick Koudas

Univ. of Toronto

Univ. of Toronto
View Profile

,
Beng Chin Ooi

National Univ. of Singapore

National Univ. of Singapore
View Profile

Authors Info & Claims

VLDB '06: Proceedings of the 32nd international conference on Very large data basesSeptember 2006Pages 631–642

Published:01 September 2006Publication History

VLDB '06: Proceedings of the 32nd international conference on Very large data bases

Pages 631–642

ABSTRACT

Similarity search is a crucial task in multimedia retrieval and data mining. Most existing work has modelled this problem as the nearest neighbor (NN) problem, which considers the distance between the query object and the data objects over a fixed set of features. Such an approach has two drawbacks: 1) it leaves many partial similarities uncovered; 2) the distance is often affected by a few dimensions with high dissimilarity. To overcome these drawbacks, we propose the k-n-match problem in this paper.The k-n-match problem models similarity search as matching between the query object and the data objects in n dimensions, where n is a given integer smaller than dimensionality d and these n dimensions are determined dynamically to make the query object and the data objects returned in the answer set match best. The k-n-match query is expected to be superior to the kNN query in discovering partial similarities, however, it may not be as good in identifying full similarity since a single value of n may only correspond to a particular aspect of an object instead of the entirety. To address this problem, we further introduce the frequent k-n-match problem, which finds a set of objects that appears in the k-n-match answers most frequently for a range of n values. Moreover, we propose search algorithms for both problems. We prove that our proposed algorithm is optimal in terms of the number of individual attributes retrieved, which is especially useful for information retrieval from multiple systems. We can also apply the proposed algorithmic strategy to achieve a disk based algorithm for the (frequent) k-n-match query. By a thorough experimental study using both real and synthetic data sets, we show that: 1) the k-n-match query yields better result than the kNN query in identifying similar objects by partial similarities; 2) our proposed method (for processing the frequent k-n-match query) outperforms existing techniques for similarity search in terms of both effectiveness and efficiency.

References

{1} http://www1.cs.columbia.edu/CAVE/research/ softlib/coil-100.html.]]Google Scholar
{2} ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.]]Google Scholar
{3} http://kdd.ics.uci.edu/databases/CorelFeatures/ CorelFeatures.data.html.]]Google Scholar
{4} C. C. Aggarwal. Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In ICDE, 2002.]]Google ScholarCross Ref
{5} C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In ICDT, 2001.]] Google ScholarDigital Library
{6} C. C. Aggarwal and Philip S. Yu. The igrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space. In KDD, 2000.]] Google ScholarDigital Library
{7} S. Berchtold, D. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In VLDB, 1996.]] Google ScholarDigital Library
{8} K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbors meaningful? In ICDT, 1999.]] Google ScholarDigital Library
{9} S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, 2001.]]Google ScholarDigital Library
{10} T. Chiueh. Content-based image indexing. In VLDB, 1994.]] Google ScholarDigital Library
{11} R. Fagin. Combining fuzzy information from multiple systems. In PODS, 1996.]] Google ScholarDigital Library
{12} R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In SIGMOD, 2003.]] Google ScholarDigital Library
{13} R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.]] Google ScholarDigital Library
{14} C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic, and R. Barber. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3):231-262, 1994.]] Google ScholarDigital Library
{15} Richard W. Hamming. Error detecting and error correcting codes. Bell Systems Technical Journal, 29:147-160, 1950.]]Google ScholarCross Ref
{16} A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In VLDB, 2000.]] Google ScholarDigital Library
{17} H. V. Jagadish. A retrieval technique for similar shapes. In SIGMOD, 1991.]] Google ScholarDigital Library
{18} E. Y. Chang K.-S. Goh, B. Li. Dyndex: a dynamic and non-metric space indexer. In ACM Multimedia, 2002.]] Google ScholarDigital Library
{19} Karen Kukich. Techniques for automatically correcting words in text. ACM Computing survey, 24(4):377-439, 1992.]] Google ScholarDigital Library
{20} D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. TODS, 30(1):41-82, 2005.]] Google ScholarDigital Library
{21} R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.]] Google ScholarDigital Library
{22} D. A. White and R. Jain. Similarity indexing with the ss-tree. In ICDE, 1996.]] Google ScholarDigital Library

Index Terms

Similarity search: a matching based approach

Recommendations

Efficient similarity search: arbitrary similarity measures, arbitrary composition
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Given a (large) set of objects and a query, similarity search aims to find all objects similar to the query. A frequent approach is to define a set of base similarity measures for the different aspects of the objects, and to build light-weight ...
Read More
Flexible aggregate similarity search
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Aggregate similarity search, a.k.a. aggregate nearest neighbor (Ann) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves the most (or top-k) similar object to Q from a database P, ...
Read More
Cost-aware query planning for similarity search

Similarity search aims to find all objects similar to a query object. Typically, some base similarity measures for the different properties of the objects are defined, and light-weight similarity indexes for these measures are built. A query plan ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
VLDB '06: Proceedings of the 32nd international conference on Very large data bases
September 2006
1269 pages
Editors:
Umeshwar Dayal,
Khu-Yong Whang,
David Lomet,
Gustavo Alonso,
Guy Lohman,
Martin Kersten,
Sang K. Cha,
Young-Kuk Kim
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 September 2006
Check for updates
Qualifiers
- Article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 626
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Similarity search: a matching based approach

VLDB '06: Proceedings of the 32nd international conference on Very large data bases

ABSTRACT

References

Cited By

Index Terms

Recommendations

Efficient similarity search: arbitrary similarity measures, arbitrary composition

Flexible aggregate similarity search

Cost-aware query planning for similarity search