research-article

Learning from Collective Intelligence: Feature Learning Using Social Images and Tags

Published: 02 November 2016

Abstract

Feature representation for visual content is key to progress in many fundamental applications, such as annotation and cross-modal retrieval. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application domains where high-quality, large-scale training data are expensive to obtain. In this article, we propose a novel deep feature learning paradigm based on social collective intelligence, which can be acquired from the inexhaustible social multimedia content on the Web, in particular, social images and tags. Unlike existing feature learning approaches that rely on high-quality image-label supervision, our weak supervision is acquired by mining visual-semantic embeddings from noisy, sparse, and diverse social image collections. The resultant image-word embedding space can be used to (1) fine-tune deep visual models for low-level feature extraction and (2) seek sparse representations as high-level cross-modal features for both image and text. We offer an easy-to-use implementation of the proposed paradigm, which is fast and compatible with any state-of-the-art deep architecture. Extensive experiments on several benchmarks demonstrate that the cross-modal features learned by our paradigm significantly outperform others in various applications such as content-based retrieval, classification, and image captioning.
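To make the idea of seeking a sparse representation over an image-word embedding space more concrete, the following is a minimal sketch and not the article's implementation. It assumes a toy dictionary of three orthonormal word embeddings and one image embedding, and it uses ISTA (iterative soft-thresholding), a standard solver for L1-regularised least squares that is chosen here purely for illustration. The function names, tag names, vectors, and parameter values are all hypothetical.

```python
# Illustrative only: sparse-code one embedding vector over a small dictionary
# of word embeddings with ISTA. Nothing here is taken from the article; the
# data and names are made up for the example.

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal step of the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sparse_code(dictionary, target, l1_weight=0.1, step=0.1, iterations=500):
    """Approximately minimise
    0.5 * ||target - sum_k code[k] * dictionary[k]||^2 + l1_weight * ||code||_1.

    The step size must not exceed 1 / (largest eigenvalue of the Gram matrix
    of the dictionary); 0.1 is safe for the orthonormal toy dictionary below.
    """
    dim = len(target)
    code = [0.0] * len(dictionary)
    for _ in range(iterations):
        # Reconstruct the target from the current code.
        recon = [sum(code[k] * atom[d] for k, atom in enumerate(dictionary))
                 for d in range(dim)]
        residual = [recon[d] - target[d] for d in range(dim)]
        # Gradient step on the squared error, followed by soft-thresholding.
        for k, atom in enumerate(dictionary):
            grad = sum(atom[d] * residual[d] for d in range(dim))
            code[k] = soft_threshold(code[k] - step * grad, step * l1_weight)
    return code


# Hypothetical word embeddings (one per tag) and a hypothetical image embedding.
tags = ["dog", "cat", "car"]
word_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.9, 0.05, 0.0]

code = sparse_code(word_embeddings, image_embedding)
for tag, weight in zip(tags, code):
    print(tag, round(weight, 4))
```

In this toy setting the weak "cat" component falls below the L1 threshold and is set exactly to zero, so the image is described by a single active tag. That is the sense in which a sparse code over word embeddings can act as a compact, high-level feature.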



    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 1
      February 2017
      278 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3012406

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 November 2016
      • Accepted: 1 June 2016
      • Revised: 1 May 2016
      • Received: 1 February 2016
      Published in TOMM, Volume 13, Issue 1


      Qualifiers

      • research-article
      • Research
      • Refereed
