research-article

Few-shot Food Recognition via Multi-view Representation Learning

Authors:
Shuqiang Jiang (ORCID 0000-0002-1596-4326), Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Weiqing Min, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Yongqiang Lyu, Qingdao KingAgroot Precision Agriculture Technology Co., Ltd, Qingdao, Shandong, China
Linhu Liu, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3, Article No. 87, pp. 1–20. https://doi.org/10.1145/3391624

Published: 14 July 2020

Abstract

This article considers the problem of few-shot learning for food recognition. Automatic food recognition can support various applications, e.g., dietary assessment and food journaling. Most existing work focuses on food recognition with large numbers of labelled samples and fails to recognize food categories that have only a few samples. To address this problem, we propose a Multi-View Few-Shot Learning (MVFSL) framework that exploits additional ingredient information for few-shot food recognition. Besides category-oriented deep visual features, we introduce an ingredient-supervised deep network to extract ingredient-oriented features. As general, intermediate attributes of food, ingredient-oriented features are informative and complementary to category-oriented features, and thus play an important role in improving food recognition. In few-shot food recognition in particular, ingredient information can bridge the gap between the disjoint training and test categories. To take advantage of ingredient information, we fuse the two kinds of features by first combining the feature maps from their respective deep networks and then convolving the combined feature maps. This convolution is further incorporated into a multi-view relation network, which compares pairs of images to enable fine-grained feature learning. MVFSL is trained end-to-end, jointly optimizing the two feature-learning subnetworks and the relation subnetworks. Extensive experiments on different food datasets consistently demonstrate the advantage of MVFSL in multi-view feature fusion. Furthermore, we extend two other types of networks, namely the Siamese Network and the Matching Network, by introducing ingredient information for few-shot food recognition. Experimental results also show that introducing ingredient information into these two networks can improve the performance of few-shot food recognition.
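The fusion and pairwise comparison described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: feature maps are nested lists indexed as map[channel][row][col]; the fusion convolution is taken to be a 1x1 convolution with ReLU (the abstract does not give the kernel size); the relation module is reduced to global average pooling, one linear layer and a sigmoid, whereas the original relation network stacks convolution blocks before its fully connected layers; the episode is 1-shot; and all weights are random rather than learned. Every function name and the class labels are hypothetical.

```python
import math
import random

# Hypothetical representation: a feature map is a nested list indexed as
# map[channel][row][col], standing in for a C x H x W tensor.


def concat_channels(a, b):
    """Stack two feature maps along the channel axis; spatial sizes must match."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("feature maps must share the same height and width")
    return a + b


def conv1x1_relu(x, weights, bias):
    """1x1 convolution plus ReLU: at every position, mix all input channels.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    height, width = len(x[0]), len(x[0][0])
    out = []
    for w_o, b_o in zip(weights, bias):
        channel = [
            [max(0.0, b_o + sum(w_oc * x[c][i][j] for c, w_oc in enumerate(w_o)))
             for j in range(width)]
            for i in range(height)
        ]
        out.append(channel)
    return out


def fuse_views(category_map, ingredient_map, weights, bias):
    """Combine category- and ingredient-oriented maps, then convolve the result."""
    return conv1x1_relu(concat_channels(category_map, ingredient_map), weights, bias)


def global_avg_pool(x):
    """Average each channel over its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def relation_score(query_map, support_map, rel_w, rel_b):
    """Pairwise score in (0, 1): concatenate two fused maps, pool, linear, sigmoid."""
    pooled = global_avg_pool(concat_channels(query_map, support_map))
    z = rel_b + sum(w * p for w, p in zip(rel_w, pooled))
    return 1.0 / (1.0 + math.exp(-z))


def classify(query_map, support_maps, rel_w, rel_b):
    """Return the support label with the highest relation score, plus all scores."""
    scores = {label: relation_score(query_map, m, rel_w, rel_b)
              for label, m in support_maps.items()}
    return max(scores, key=scores.get), scores


def random_map(rng, channels, height, width):
    """Stand-in for a backbone's output feature map (random values)."""
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


# Toy 3-way 1-shot episode with random, untrained weights.
rng = random.Random(0)
C_CAT, C_ING, C_OUT, H, W = 4, 3, 5, 2, 2
fuse_w = [[rng.uniform(-1, 1) for _ in range(C_CAT + C_ING)] for _ in range(C_OUT)]
fuse_b = [0.1] * C_OUT
rel_w = [rng.uniform(-1, 1) for _ in range(2 * C_OUT)]
rel_b = 0.0


def fused_image():
    """Fuse a random category-oriented map with a random ingredient-oriented map."""
    return fuse_views(random_map(rng, C_CAT, H, W), random_map(rng, C_ING, H, W),
                      fuse_w, fuse_b)


support = {label: fused_image() for label in ["dumplings", "noodles", "fried_rice"]}
query = fused_image()
predicted, scores = classify(query, support, rel_w, rel_b)
print(predicted, {k: round(v, 3) for k, v in scores.items()})
```

Concatenating the two views along the channel axis before the convolution lets the convolution learn interactions between category and ingredient channels at every spatial position, rather than merging two already-pooled vectors. Training is omitted here; in the original relation network, scores for matching pairs are regressed toward 1 and non-matching pairs toward 0 with a mean squared error loss, and in an end-to-end setup the gradients would reach both feature subnetworks, the fusion convolution and the relation module.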

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition
      2. Computer vision representations
        Image representations

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 3
August 2020
364 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3409646
Editor:
Alberto Del Bimbo
University of Firenze, Italy
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 July 2020
- Online AM: 7 May 2020
- Accepted: 1 March 2020
- Revised: 1 January 2020
- Received: 1 July 2019
Author Tags
Food recognition
deep learning
few-shot learning
visual recognition
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total citations: 24
- Total downloads: 651
- Downloads (last 12 months): 92
- Downloads (last 6 weeks): 5