Abstract
Feature representation for visual content is key to progress in many fundamental applications, such as annotation and cross-modal retrieval. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application to domains where high-quality, large-scale training data are expensive to obtain. In this article, we propose a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, in particular social images and tags. Unlike existing feature learning approaches that rely on high-quality image-label supervision, our weak supervision is acquired by mining visual-semantic embeddings from noisy, sparse, and diverse social image collections. The resultant image-word embedding space can be used to (1) fine-tune deep visual models for low-level feature extraction and (2) seek sparse representations as high-level cross-modal features for both image and text. We offer an easy-to-use implementation of the proposed paradigm, which is fast and compatible with any state-of-the-art deep architecture. Extensive experiments on several benchmarks demonstrate that the cross-modal features learned by our paradigm significantly outperform others in various applications such as content-based retrieval, classification, and image captioning.
Index Terms
- Learning from Collective Intelligence: Feature Learning Using Social Images and Tags