
Few-shot Food Recognition via Multi-view Representation Learning

Published: 14 July 2020

Abstract

This article considers the problem of few-shot learning for food recognition. Automatic food recognition can support various applications, e.g., dietary assessment and food journaling. Most existing works focus on food recognition with large numbers of labelled samples and fail to recognize food categories with few samples. To address this problem, we propose a Multi-View Few-Shot Learning (MVFSL) framework that exploits additional ingredient information for few-shot food recognition. Besides category-oriented deep visual features, we introduce an ingredient-supervised deep network to extract ingredient-oriented features. As general, intermediate attributes of food, ingredient-oriented features are informative and complementary to category-oriented features, and thus play an important role in improving food recognition. In few-shot food recognition in particular, ingredient information can bridge the gap between disjoint training and test categories. To take advantage of ingredient information, we fuse the two kinds of features by first combining the feature maps from their respective deep networks and then convolving the combined feature maps. This convolution is further incorporated into a multi-view relation network that compares pairwise images to enable fine-grained feature learning. MVFSL is trained in an end-to-end fashion to jointly optimize the two feature-learning subnetworks and the relation subnetwork. Extensive experiments on different food datasets consistently demonstrate the advantage of MVFSL in multi-view feature fusion. Furthermore, we extend two other types of networks, namely the Siamese Network and the Matching Network, by introducing ingredient information for few-shot food recognition. Experimental results also show that introducing ingredient information into these two networks improves the performance of few-shot food recognition.
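The fusion step described above (combining the category-oriented and ingredient-oriented feature maps, then convolving the result) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the shapes, variable names, and the use of a 1x1 convolution (expressed here as a per-pixel linear map in numpy) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from the two subnetworks, shaped (channels, H, W).
category_map = rng.standard_normal((64, 6, 6))    # category-oriented view
ingredient_map = rng.standard_normal((64, 6, 6))  # ingredient-oriented view

def fuse(cat_map, ing_map, weights):
    """Concatenate the two views along the channel axis, then mix them
    with a 1x1 convolution (a linear map applied at every spatial location)."""
    stacked = np.concatenate([cat_map, ing_map], axis=0)   # (128, H, W)
    # A 1x1 conv is a tensordot over channels: (out_c, in_c) x (in_c, H, W).
    return np.tensordot(weights, stacked, axes=([1], [0]))  # (out_c, H, W)

W = rng.standard_normal((64, 128)) * 0.05  # hypothetical 1x1 conv weights
fused = fuse(category_map, ingredient_map, W)
print(fused.shape)  # (64, 6, 6)
```

In the actual framework, the fused maps would then feed the relation subnetwork, which scores pairwise image similarity for few-shot classification.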


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3 (August 2020), 364 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3409646

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 July 2020
        • Online AM: 7 May 2020
        • Accepted: 1 March 2020
        • Revised: 1 January 2020
        • Received: 1 July 2019


        Qualifiers

        • research-article
        • Research
        • Refereed
