skip to main content
10.5555/1182635.1164182acmconferencesArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
Article

Similarity search: a matching based approach

Published:01 September 2006Publication History

ABSTRACT

Similarity search is a crucial task in multimedia retrieval and data mining. Most existing work has modelled this problem as the nearest neighbor (NN) problem, which considers the distance between the query object and the data objects over a fixed set of features. Such an approach has two drawbacks: 1) it leaves many partial similarities uncovered; 2) the distance is often affected by a few dimensions with high dissimilarity. To overcome these drawbacks, we propose the k-n-match problem in this paper.The k-n-match problem models similarity search as matching between the query object and the data objects in n dimensions, where n is a given integer smaller than dimensionality d and these n dimensions are determined dynamically to make the query object and the data objects returned in the answer set match best. The k-n-match query is expected to be superior to the kNN query in discovering partial similarities, however, it may not be as good in identifying full similarity since a single value of n may only correspond to a particular aspect of an object instead of the entirety. To address this problem, we further introduce the frequent k-n-match problem, which finds a set of objects that appears in the k-n-match answers most frequently for a range of n values. Moreover, we propose search algorithms for both problems. We prove that our proposed algorithm is optimal in terms of the number of individual attributes retrieved, which is especially useful for information retrieval from multiple systems. We can also apply the proposed algorithmic strategy to achieve a disk based algorithm for the (frequent) k-n-match query. By a thorough experimental study using both real and synthetic data sets, we show that: 1) the k-n-match query yields better result than the kNN query in identifying similar objects by partial similarities; 2) our proposed method (for processing the frequent k-n-match query) outperforms existing techniques for similarity search in terms of both effectiveness and efficiency.

References

  1. {1} http://www1.cs.columbia.edu/CAVE/research/ softlib/coil-100.html.]]Google ScholarGoogle Scholar
  2. {2} ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.]]Google ScholarGoogle Scholar
  3. {3} http://kdd.ics.uci.edu/databases/CorelFeatures/ CorelFeatures.data.html.]]Google ScholarGoogle Scholar
  4. {4} C. C. Aggarwal. Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In ICDE, 2002.]]Google ScholarGoogle ScholarCross RefCross Ref
  5. {5} C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In ICDT, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. {6} C. C. Aggarwal and Philip S. Yu. The igrid index: Reversing the dimensionality curse for similarity indexing in high dimensional space. In KDD, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. {7} S. Berchtold, D. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In VLDB, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. {8} K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbors meaningful? In ICDT, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. {9} S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, 2001.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. {10} T. Chiueh. Content-based image indexing. In VLDB, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. {11} R. Fagin. Combining fuzzy information from multiple systems. In PODS, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. {12} R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In SIGMOD, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. {13} R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. {14} C. Faloutsos, W. Equitz, M. Flickner, W. Niblack, D. Petkovic, and R. Barber. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3):231-262, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. {15} Richard W. Hamming. Error detecting and error correcting codes. Bell Systems Technical Journal, 29:147-160, 1950.]]Google ScholarGoogle ScholarCross RefCross Ref
  16. {16} A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In VLDB, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. {17} H. V. Jagadish. A retrieval technique for similar shapes. In SIGMOD, 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. {18} E. Y. Chang K.-S. Goh, B. Li. Dyndex: a dynamic and non-metric space indexer. In ACM Multimedia, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. {19} Karen Kukich. Techniques for automatically correcting words in text. ACM Computing survey, 24(4):377-439, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. {20} D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. TODS, 30(1):41-82, 2005.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. {21} R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. {22} D. A. White and R. Jain. Similarity indexing with the ss-tree. In ICDE, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Similarity search: a matching based approach

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader