ABSTRACT
Notwithstanding its great success and wide adoption, the Bag-of-visual-Words representation built from single local features is often ineffective, largely for three reasons. First, many detected local features are not stable, resulting in noisy and non-descriptive visual words. Second, single visual words discard the rich spatial contextual information among local features, which has proven valuable for visual matching. Third, the distance metric commonly used for generating visual vocabularies ignores semantic context, leaving them prone to noise. To address these three challenges, we propose an effective visual vocabulary generation framework with three novel contributions: 1) an effective unsupervised local feature refinement strategy; 2) grouping of local features to model their spatial contexts; and 3) a learned discriminative distance metric between local feature groups, which we call the discriminant group distance. This group distance is then leveraged to induce a visual vocabulary from groups of local features. We name the result the contextual visual vocabulary, as it captures both spatial and semantic contexts. We evaluate the proposed local feature refinement strategy and the contextual visual vocabulary on two large-scale image applications: near-duplicate image retrieval on a dataset of 1.5 million images, and image search re-ranking. Experimental results show that the contextual visual vocabulary significantly improves over the classic visual vocabulary and outperforms the state-of-the-art Bundled Feature in terms of retrieval precision, memory consumption, and efficiency.
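For readers unfamiliar with the baseline being improved upon, the classic visual-vocabulary pipeline the abstract refers to can be sketched as follows: cluster a sample of local descriptors (e.g. SIFT) into k "visual words", then quantize each image's descriptors against those words to form a Bag-of-visual-Words histogram. This is a minimal toy sketch with synthetic 8-D descriptors, a flat k-means, and hypothetical function names; real systems use 128-D SIFT descriptors and large hierarchical vocabularies, and it does not implement the paper's contextual vocabulary or group distance.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    # Toy flat k-means over local descriptors; large-scale systems
    # typically use hierarchical k-means (vocabulary trees) instead.
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Distance from every descriptor to every cluster center.
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    # Quantize each descriptor to its nearest visual word,
    # then count word occurrences and L1-normalize.
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Demo with synthetic 8-D "descriptors" standing in for SIFT features.
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 8))   # descriptors pooled from many images
vocab = build_vocabulary(train, k=16)
h = bow_histogram(rng.normal(size=(50, 8)), vocab)  # one image's BoW
print(h.shape)
```

The paper's argument is that each step of this pipeline loses information: unstable detections pollute the clusters, quantizing descriptors one at a time discards spatial layout, and the Euclidean distance used here ignores semantics.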