Learning from Collective Intelligence: Feature Learning Using Social Images and Tags

Authors:
Hanwang Zhang, National University of Singapore, Singapore
Xindi Shang, National University of Singapore, Singapore
Huanbo Luan, Tsinghua University, Beijing, China
Meng Wang, Hefei University of Technology, China
Tat-Seng Chua, National University of Singapore, Singapore

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 1, Article No. 1, pp. 1–23. https://doi.org/10.1145/2978656

Published: 02 November 2016

Abstract

Feature representation for visual content is the key to the progress of many fundamental applications such as annotation and cross-modal retrieval. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application domains where high-quality and large-scale training data are expensive to obtain. In this article, we propose a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, in particular, largely social images and tags. Differing from existing feature learning approaches that rely on high-quality image-label supervision, our weak supervision is acquired by mining the visual-semantic embeddings from noisy, sparse, and diverse social image collections. The resultant image-word embedding space can be used to (1) fine-tune deep visual models for low-level feature extraction and (2) seek sparse representations as high-level cross-modal features for both image and text. We offer an easy-to-use implementation of the proposed paradigm, which is fast and compatible with any state-of-the-art deep architecture. Extensive experiments on several benchmarks demonstrate that the cross-modal features learned by our paradigm significantly outperform others in various applications such as content-based retrieval, classification, and image captioning.
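
As a rough illustration of use (2) above, the following minimal sketch encodes one image embedding as a sparse combination of word embeddings in a shared space. It is not the authors' implementation: the solver (ISTA, i.e. a gradient step followed by soft-thresholding), the L1 weight `lam`, the function names, and the toy 2-D vectors and tags are all assumptions chosen for illustration only.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of the L1 norm)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_code(x, words, lam=0.1, n_iter=500):
    """Approximately minimise 0.5*||x - sum_j a[j]*words[j]||^2 + lam*||a||_1 with ISTA.

    x     : embedding of one image (list of d floats)
    words : list of V word embeddings, each a list of d floats
    Returns the coefficient vector a (length V).
    """
    d, V = len(x), len(words)
    # The squared Frobenius norm upper-bounds the Lipschitz constant of the gradient.
    L = sum(w_k * w_k for w in words for w_k in w) or 1.0
    a = [0.0] * V
    for _ in range(n_iter):
        # Residual r = sum_j a[j]*words[j] - x
        r = [sum(a[j] * words[j][k] for j in range(V)) - x[k] for k in range(d)]
        # Gradient step on the quadratic term, then soft-thresholding.
        a = [
            soft_threshold(a[j] - sum(words[j][k] * r[k] for k in range(d)) / L, lam / L)
            for j in range(V)
        ]
    return a


# Toy 2-D embedding space with three hypothetical tags (not from the article).
vocab = ["dog", "beach", "car"]
words = [[1.0, 0.0], [0.0, 1.0], [-0.6, 0.8]]
image = [0.9, 0.05]  # lies close to the "dog" direction

code = sparse_code(image, words)
active = [vocab[j] for j, c in enumerate(code) if abs(c) > 1e-6]
print(active)
```

On this toy input only the "dog" coefficient should remain nonzero, so the sparse code itself reads as a short list of tags. The same routine could in principle be applied to a text embedding, which is what would make such a code cross-modal.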

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 13, Issue 1
February 2017
278 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3012406
Editor:
Alberto Del Bimbo
University of Firenze, Italy
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 November 2016
- Accepted: 1 June 2016
- Revised: 1 May 2016
- Received: 1 February 2016
Author Tags
Representation learning
cross-media analysis
visual-semantic embedding
Qualifiers
- research-article
- Research
- Refereed