Luequan Wang, Hongbin Xu, Wenxiong Kang. MVContrast: Unsupervised Pretraining for Multi-view 3D Object Recognition. Machine Intelligence Research, vol. 20, no. 6, pp. 872–883, 2023. https://doi.org/10.1007/s11633-023-1430-z

MVContrast: Unsupervised Pretraining for Multi-view 3D Object Recognition

doi: 10.1007/s11633-023-1430-z
  • Author Bio:

    Luequan Wang received the B. Sc. degree in automation from South China University of Technology, China in 2020. He is a master's student in automation science and engineering at South China University of Technology, China. His research interests include self-supervised learning, 3D vision and deep learning. E-mail: 875713197@qq.com ORCID iD: 0000-0001-9320-6873

    Hongbin Xu received the M. Sc. degree from South China University of Technology, China in 2021. He is currently a Ph. D. degree candidate in automation science and engineering at South China University of Technology (SCUT), China. His research interests include 3D vision, multi-view stereo and self-supervised learning. E-mail: hongbinxu1013@gmail.com ORCID iD: 0000-0002-3455-1527

    Wenxiong Kang received the M. Sc. degree from Northwestern Polytechnical University, China in 2003, and the Ph. D. degree in automation science and engineering from South China University of Technology, China in 2009. He is currently a professor with the School of Automation Science and Engineering, South China University of Technology, China. His research interests include biometrics identification, image processing, pattern recognition and computer vision. E-mail: auwxkang@scut.edu.cn (Corresponding author) ORCID iD: 0000-0001-9023-7252

  • Received Date: 2022-11-01
  • Accepted Date: 2023-02-24
  • Publish Online: 2023-05-10
  • Publish Date: 2023-12-01
  • 3D shape recognition has drawn much attention in recent years, and view-based approaches currently perform best. However, current multi-view methods are almost all fully supervised, and their pretrained models are almost all based on ImageNet. Although ImageNet pretraining gives impressive results, there is still a significant discrepancy between multi-view datasets and ImageNet: multi-view datasets naturally retain rich 3D information. In addition, large-scale datasets such as ImageNet require considerable cleaning and annotation work, so building a second dataset of that kind is difficult. In contrast, unsupervised learning methods can learn general feature representations without any extra annotation. To this end, we propose a three-stage unsupervised joint pretraining model. Specifically, we decouple the final representations into three fine-grained representations. Data augmentation is used to obtain pixel-level representations within each view, spatially invariant features are strengthened at the view level, and global information is exploited at the shape level through a novel extract-and-swap module. Experimental results demonstrate that the proposed method yields significant gains in 3D object classification and retrieval tasks and generalizes to cross-dataset tasks.
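
    The abstract does not state the training objective. As a rough illustration of how a view-level contrastive stage could work, the sketch below assumes an InfoNCE-style loss, a common choice in contrastive pretraining, and is not taken from the paper. The function names, the temperature value, and the toy embeddings are all hypothetical. Each row of `view_a` and `view_b` stands for the embedding of one shape seen from two different views.

    ```python
    import math

    def l2_normalize(v):
        """Scale a vector to unit length so dot products become cosine similarities."""
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def info_nce(anchors, positives, temperature=0.1):
        """Mean InfoNCE loss (assumed objective, not confirmed by the abstract).

        anchors[i] should be most similar to positives[i]; every other row of
        positives acts as a negative for anchors[i].
        """
        a = [l2_normalize(v) for v in anchors]
        p = [l2_normalize(v) for v in positives]
        total = 0.0
        for i, ai in enumerate(a):
            logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
            m = max(logits)  # subtract the max for a numerically stable log-sum-exp
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denom - logits[i]
        return total / len(a)

    # Hypothetical toy embeddings: two views of each of three shapes.
    view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    view_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

    aligned = info_nce(view_a, view_b)
    # Pair every anchor with a view of a different shape instead.
    shuffled = info_nce(view_a, [view_b[1], view_b[2], view_b[0]])
    ```

    With correctly paired views the loss is close to zero, while pairing each anchor with a view of a different shape gives a much larger value. This is the signal that pushes views of the same shape toward a shared, viewpoint-invariant representation.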

     

