DOI: 10.1145/1873951.1874018
Research article

Building contextual visual vocabulary for large-scale image applications

Published: 25 October 2010

ABSTRACT

Notwithstanding the great success and wide adoption of the Bag-of-Visual-Words representation, visual vocabularies built from single local features are often shown to be ineffective, largely for three reasons. First, many detected local features are not stable, resulting in many noisy and non-descriptive visual words. Second, a single visual word discards the rich spatial contextual information among local features, which has proven valuable for visual matching. Third, the distance metric commonly used to generate the vocabulary does not take semantic context into consideration, making it prone to noise. To address these three challenges, we propose an effective visual vocabulary generation framework containing three novel contributions: 1) an effective unsupervised local feature refinement strategy; 2) consideration of local features in groups to model their spatial contexts; 3) a learned discriminative distance metric between local feature groups, which we call the discriminant group distance. This group distance is further leveraged to induce a visual vocabulary from groups of local features. We name it the contextual visual vocabulary, as it captures both spatial and semantic contexts. We evaluate the proposed local feature refinement strategy and the contextual visual vocabulary in two large-scale image applications: large-scale near-duplicate image retrieval on a dataset containing 1.5 million images, and image search re-ranking. Our experimental results show that the contextual visual vocabulary yields significant improvement over the classic visual vocabulary. Moreover, it outperforms the state-of-the-art Bundled Feature in terms of retrieval precision, memory consumption, and efficiency.
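To make the baseline concrete, the classic Bag-of-Visual-Words pipeline the paper improves on quantizes each local descriptor to its nearest visual word (a cluster centroid) and represents the image as a word histogram. The sketch below is an illustration only, not the authors' method: the toy 2-D "descriptors" and the hand-picked 3-word vocabulary are made-up stand-ins for SIFT descriptors and a k-means codebook.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary and
    return an L1-normalized bag-of-visual-words histogram."""
    # Pairwise squared Euclidean distances, shape (n_descriptors, n_words)
    d = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy example: four 2-D "descriptors", three visual words
vocab = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
desc = np.array([[0.1, -0.2], [4.8, 5.1], [5.2, 4.9], [9.7, 0.3]])
print(bovw_histogram(desc, vocab))  # one descriptor near word 0 and 2, two near word 1
```

Each descriptor is mapped independently of its neighbors, which is exactly the loss of spatial context the contextual visual vocabulary addresses by quantizing groups of features under a learned group distance instead.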

References

  1. J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. Proc. ICCV, 2003.
  2. D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. Proc. CVPR, pp. 2161--2168, 2006.
  3. D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2): 91--110, Nov. 2004.
  4. F. Jurie and B. Triggs. Creating efficient codebooks for visual recognition. Proc. ICCV, pp. 17--21, 2005.
  5. S. Lazebnik and M. Raginsky. Supervised learning of quantizer codebooks by information loss minimization. T-PAMI, 31(7): 1294--1309, July 2009.
  6. F. Perronnin. Universal and adapted vocabularies for generic visual categorization. T-PAMI, 30(7): 1243--1256, July 2008.
  7. J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: from visual words to visual phrases. Proc. CVPR, 2007.
  8. Y. Zheng, M. Zhao, S. Y. Neo, T. Chua, and Q. Tian. Visual synset: a higher-level visual representation. Proc. CVPR, 2008.
  9. D. Liu, G. Hua, P. Viola, and T. Chen. Integrated feature selection and higher-order spatial feature extraction for object categorization. Proc. CVPR, pp. 1--8, 2008.
  10. S. Savarese, J. Winn, and A. Criminisi. Discriminative object class models of appearance and shape by correlations. Proc. CVPR, 2006.
  11. F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. Proc. CVPR, pp. 1--8, 2007.
  12. J. Liu, Y. Yang, and M. Shah. Learning semantic visual vocabularies using diffusion distance. Proc. CVPR, 2009.
  13. L. Yang, P. Meer, and D. J. Foran. Multiple class segmentation using a unified framework over mean-shift patches. Proc. CVPR, 2007.
  14. J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. Proc. ICCV, pp. 17--21, 2005.
  15. S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proc. CVPR, pp. 2169--2178, 2006.
  16. K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. Proc. ICCV, pp. 1458--1465, 2005.
  17. J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. Proc. CVPR, 2009.
  18. F. Moosmann, E. Nowak, and F. Jurie. Randomized clustering forests for image classification. T-PAMI, 30(9): 1632--1646, Sep. 2008.
  19. M. Marszalek and C. Schmid. Spatial weighting for bag-of-features. Proc. CVPR, pp. 2118--2125, 2006.
  20. L. Wu, S. C. H. Hoi, and N. Yu. Semantic-preserving bag-of-words models for efficient image annotation. Proc. ACM Workshop on LSMRM, pp. 19--26, 2009.
  21. Y. Jiang, C. Ngo, and S. Chang. Semantic context transfer across heterogeneous sources for domain adaptive video search. Proc. ACM Multimedia, 2009.
  22. F. Wang, Y. G. Jiang, and C. W. Ngo. Video event detection using motion relativity and visual relatedness. Proc. ACM Multimedia, 2008.
  23. D. Xu and S. F. Chang. Video event recognition using kernel methods with multilevel temporal alignment. T-PAMI, 30(11): 1985--1997, Nov. 2008.
  24. X. Zhou, X. D. Zhuang, S. C. Yan, S. F. Chang, M. H. Johnson, and T. S. Huang. SIFT-bag kernel for video event analysis. Proc. ACM Multimedia, pp. 229--238, 2008.
  25. S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li. Descriptive visual words and visual phrases for image applications. Proc. ACM Multimedia, 2009.
  26. Z. Wu, Q. Ke, and J. Sun. Bundling features for large-scale partial-duplicate web image search. Proc. CVPR, 2009.
  27. O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: finding a (thick) needle in a haystack. Proc. CVPR, 2009.
  28. M. Perdoch, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. Proc. CVPR, 2009.
  29. H. Jegou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. Proc. CVPR, 2010.
  30. P. Viola and M. Jones. Robust real-time face detection. Proc. ICCV, 2001.
  31. A. Globerson and S. Roweis. Metric learning by collapsing classes. Advances in Neural Information Processing Systems, 18: 451--458, 2006.
  32. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. Proc. CVPR, 2009.
  33. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. Proc. BMVC, 2002.
  34. Y. Jing and S. Baluja. VisualRank: applying PageRank to large-scale image search. T-PAMI, 30(11): 1877--1890, 2008.
  35. X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, and X. Hua. Bayesian video search reranking. Proc. ACM Multimedia, pp. 131--140, 2008.
  36. D. Liu, X. Hua, L. Yang, M. Wang, and H. Zhang. Tag ranking. Proc. WWW, 2009.
  37. S. Deerwester, S. Dumais, and R. Harshman. Indexing by latent semantic analysis. J-ASIS, 41(6): 391--407, 1990.
  38. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proc. VLDB, pp. 518--529, 1999.

Published in

MM '10: Proceedings of the 18th ACM International Conference on Multimedia
October 2010, 1836 pages
ISBN: 9781605589336
DOI: 10.1145/1873951

Copyright © 2010 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
