skip to main content
research-article

Single-stage Instance Segmentation

Authors Info & Claims
Published:05 July 2020Publication History
Skip Abstract Section

Abstract

Albeit the highest accuracy of object detection is generally acquired by multi-stage detectors, like R-CNN and its extension approaches, the single-stage object detectors also achieve remarkable performance with faster execution and higher scalability. Inspired by this, we propose a single-stage framework to tackle the instance segmentation task. Building on a single-stage object detection network in hand, our model outputs the detected bounding box of each instance, the semantic segmentation result, and the pixel affinity simultaneously. After that, we generate the final instance masks via a fast post-processing method with the help of the three outputs above. As far as we know, it is the first attempt to segment instances in a single-stage pipeline on challenging datasets. Extensive experiments demonstrate the efficiency of our post-processing method, and the proposed framework obtains competitive results as a single-stage instance segmentation method. We achieve 32.5 box AP and 26.0 mask AP on the COCO validation set with 500 pixels input scale and 22.9 mask AP on the Cityscapes test set.

References

  1. Anurag Arnab and Philip H. S. Torr. 2016. Bottom-up instance segmentation using deep higher-order CRFs. arXiv:1609.02583. (2016).Google ScholarGoogle Scholar
  2. Anurag Arnab and Philip H. S. Torr. 2017. Pixelwise instance segmentation with a dynamically instantiated network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle Scholar
  3. Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, 12 (2017), 2481--2495.Google ScholarGoogle ScholarCross RefCross Ref
  4. Min Bai and Raquel Urtasun. 2017. Deep watershed transform for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  5. Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross Girshick. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  6. Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. 2019. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’19).Google ScholarGoogle ScholarCross RefCross Ref
  7. Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. 2018. MaskLab: Instance segmentation by refining object detection with semantic and direction features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).Google ScholarGoogle ScholarCross RefCross Ref
  8. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarCross RefCross Ref
  9. Yadang Chen, Chuanyan Hao, Alex X. Liu, and Enhua Wu. 2019. Appearance-consistent video object segmentation based on a multinomial event model. ACM Trans. Multimedia Comput. Commun. Applic. 15, 2 (2019), 40.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  11. Jifeng Dai, Kaiming He, and Jian Sun. 2016. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  12. Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. 2016. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS’16).Google ScholarGoogle Scholar
  13. Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  14. Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303--338.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ruochen Fan, Qibin Hou, Ming-Ming Cheng, Tai-Jiang Mu, and Shi-Min Hu. 2017. S net: Single stage salient-instance segmentation. arXiv:1711.07618. (2017).Google ScholarGoogle Scholar
  16. Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P. Murphy. 2017. Semantic instance segmentation via deep metric learning. arXiv:1703.10277. (2017).Google ScholarGoogle Scholar
  17. Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. 2017. DSSD: Deconvolutional single shot detector. arXiv:1701.06659. (2017).Google ScholarGoogle Scholar
  18. Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. 2019. SSAP: Single-shot instance segmentation with affinity pyramid. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’19).Google ScholarGoogle ScholarCross RefCross Ref
  19. Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. 2018. Detectron. Retrieved from https://github.com/facebookresearch/detectron.Google ScholarGoogle Scholar
  20. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS’10).Google ScholarGoogle Scholar
  21. Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2014. Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’14).Google ScholarGoogle ScholarCross RefCross Ref
  22. Zeeshan Hayder, Xuming He, and Mathieu Salzmann. 2017. Boundary-aware instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  23. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17).Google ScholarGoogle Scholar
  24. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV’14).Google ScholarGoogle ScholarCross RefCross Ref
  25. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  26. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. (2017).Google ScholarGoogle Scholar
  27. Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, and Ross Girshick. 2018. Learning to segment every thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).Google ScholarGoogle ScholarCross RefCross Ref
  28. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. 2017. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  29. Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. 2017. InstanceCut: From edges to instances with multicut. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  30. Hei Law and Jia Deng. 2018. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2017. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  32. Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. 2017. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  33. Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. 2018. DetNet: Design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarCross RefCross Ref
  34. Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Jianchao Yang, Liang Lin, and Shuicheng Yan. 2017. Proposal-free network for instance-level object segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 40, 12 (2017), 2978--2991.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle Scholar
  36. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17).Google ScholarGoogle ScholarCross RefCross Ref
  37. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV’14).Google ScholarGoogle Scholar
  38. Songtao Liu, Di Huang, and Yunhong Wang. 2018. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarCross RefCross Ref
  39. Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. 2017. SGN: Sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17).Google ScholarGoogle ScholarCross RefCross Ref
  40. Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. 2018. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).Google ScholarGoogle ScholarCross RefCross Ref
  41. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV’16).Google ScholarGoogle Scholar
  42. Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. 2018. Affinity derivation and graph merge for instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarCross RefCross Ref
  43. Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15).Google ScholarGoogle ScholarCross RefCross Ref
  44. Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShufflenNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV’18).Google ScholarGoogle ScholarCross RefCross Ref
  45. Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. 2015. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’15).Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Pedro O. Pinheiro, Ronan Collobert, and Piotr Dollár. 2015. Learning to segment object candidates. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS’15).Google ScholarGoogle Scholar
  47. Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. 2016. Learning to refine object segments. In Proceedings of the European Conference on Computer Vision (ECCV’16).Google ScholarGoogle ScholarCross RefCross Ref
  48. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  49. Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv:1804.02767. (2018).Google ScholarGoogle Scholar
  50. Mengye Ren and Richard S. Zemel. 2017. End-to-end instance segmentation with recurrent attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle Scholar
  51. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS’15).Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Bernardino Romera-Paredes and Philip Hilaire Sean Torr. 2016. Recurrent instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’16).Google ScholarGoogle ScholarCross RefCross Ref
  53. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention (MICCAI’15).Google ScholarGoogle ScholarCross RefCross Ref
  54. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 3 (2015), 211--252.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).Google ScholarGoogle ScholarCross RefCross Ref
  56. Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, and Abhinav Gupta. 2016. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851. (2016).Google ScholarGoogle Scholar
  57. Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. 2018. IGCV3: Interleaved low-rank group convolutions for efficient deep neural networks. In Proceedings of the British Machine Vision Conference (BMVC’18).Google ScholarGoogle Scholar
  58. Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. 2019. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’19).Google ScholarGoogle ScholarCross RefCross Ref
  59. Jonas Uhrig, Marius Cordts, Uwe Franke, and Thomas Brox. 2016. Pixel-level encoding and depth layering for instance-level semantic labeling. In Proceedings of the German Conference on Pattern Recognition (GCPR’16).Google ScholarGoogle ScholarCross RefCross Ref
  60. Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. 2018. Understanding convolution for semantic segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’18).Google ScholarGoogle ScholarCross RefCross Ref
  61. Zifeng Wu, Chunhua Shen, and Anton van den Hengel. 2016. Bridging category-level and instance-level semantic image segmentation. arXiv:1605.06885. (2016).Google ScholarGoogle Scholar
  62. Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. 2019. PolarMask: Single shot instance segmentation with polar representation. arXiv:1909.13226. (2019).Google ScholarGoogle Scholar
  63. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17).Google ScholarGoogle ScholarCross RefCross Ref
  64. Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. 2019. Explicit shape encoding for real-time instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’19).Google ScholarGoogle ScholarCross RefCross Ref
  65. Bo Zhang, Nicola Conci, and Francesco G. B. De Natale. 2015. Segmentation of discriminative patches in human activity video. ACM Trans. Multimedia Comput. Commun. Applic. 12, 1 (2015), 4.Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Qianni Zhang and Ebroul Izquierdo. 2013. Multifeature analysis and semantic context learning for image classification. ACM Trans. Multimedia Comput. Commun. Applic. 9, 2 (2013), 12.Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. 2018. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18).Google ScholarGoogle Scholar
  68. Ziyu Zhang, Sanja Fidler, and Raquel Urtasun. 2016. Instance-level segmentation for autonomous driving with deep densely connected MRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Single-stage Instance Segmentation

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Multimedia Computing, Communications, and Applications
        ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 3
        August 2020
        364 pages
        ISSN:1551-6857
        EISSN:1551-6865
        DOI:10.1145/3409646
        Issue’s Table of Contents

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 5 July 2020
        • Online AM: 7 May 2020
        • Accepted: 1 March 2020
        • Revised: 1 February 2020
        • Received: 1 August 2019
        Published in tomm Volume 16, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format .

      View HTML Format