ABSTRACT
Notwithstanding its great success and wide adoption, the Bag-of-visual-Words representation built from single local features is often ineffective, largely for three reasons. First, many detected local features are not stable, resulting in noisy and non-descriptive visual words. Second, single visual words discard the rich spatial contextual information among local features, which has proven valuable for visual matching. Third, the distance metric commonly used for generating visual vocabularies ignores semantic context, leaving them prone to noise. To address these three challenges, we propose an effective visual vocabulary generation framework with three novel contributions: 1) an effective unsupervised local feature refinement strategy; 2) grouping of local features to model their spatial contexts; and 3) a learned discriminative distance metric between local feature groups, which we call the discriminant group distance. This group distance is then leveraged to induce a visual vocabulary from groups of local features. We name the result the contextual visual vocabulary, as it captures both spatial and semantic contexts. We evaluate the proposed local feature refinement strategy and the contextual visual vocabulary on two large-scale image applications: near-duplicate image retrieval on a dataset of 1.5 million images, and image search re-ranking. Experimental results show that the contextual visual vocabulary significantly improves over the classic visual vocabulary and outperforms the state-of-the-art Bundled Feature in terms of retrieval precision, memory consumption, and efficiency.
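For readers unfamiliar with the baseline being improved upon, the classic visual-vocabulary pipeline the abstract refers to can be sketched as follows: cluster a sample of local descriptors (e.g. SIFT) into k "visual words", then quantize each image's descriptors against those words to form a Bag-of-visual-Words histogram. This is a minimal toy sketch with synthetic 8-D descriptors, a flat k-means, and hypothetical function names; real systems use 128-D SIFT descriptors and large hierarchical vocabularies, and it does not implement the paper's contextual vocabulary or group distance.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    # Toy flat k-means over local descriptors; large-scale systems
    # typically use hierarchical k-means (vocabulary trees) instead.
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Distance from every descriptor to every cluster center.
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    # Quantize each descriptor to its nearest visual word,
    # then count word occurrences and L1-normalize.
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Demo with synthetic 8-D "descriptors" standing in for SIFT features.
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 8))   # descriptors pooled from many images
vocab = build_vocabulary(train, k=16)
h = bow_histogram(rng.normal(size=(50, 8)), vocab)  # one image's BoW
print(h.shape)
```

The paper's argument is that each step of this pipeline loses information: unstable detections pollute the clusters, quantizing descriptors one at a time discards spatial layout, and the Euclidean distance used here ignores semantics.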