Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, Bo Xu. VLP: A Survey on Vision-language Pre-training. Machine Intelligence Research, vol. 20, no. 1, pp.38-56, 2023. https://doi.org/10.1007/s11633-022-1369-5

VLP: A Survey on Vision-language Pre-training

doi: 10.1007/s11633-022-1369-5
More Information
  • Author Bio:

    Fei-Long Chen received the B. Sc. degree in computer science from Hefei University of Technology, China in 2018. He is currently a Ph. D. candidate in pattern recognition and intelligent system at Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China. His research interests include theoretical research on vision-language pre-training, multi-modal question answering and dialog. E-mail: chenfeilong2018@ia.ac.cn. ORCID iD: 0000-0002-4860-8483

    Du-Zhen Zhang received the B. Sc. degree in software engineering from Shandong University, China in 2019. He is currently a Ph. D. candidate in pattern recognition and intelligent system at Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China. His research interests include theoretical research on reinforcement learning, natural language processing, and spiking neural networks. E-mail: zhangduzhen2019@ia.ac.cn

    Ming-Lun Han received the B. Sc. degree in electronic and information engineering from Harbin Institute of Technology, China in 2018. He is a Ph. D. candidate in pattern recognition and intelligent system at Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China. His research interests include speech recognition, speech synthesis, and speech chain. E-mail: hanminglun2018@ia.ac.cn

    Xiu-Yi Chen received the B. Sc. degree in automation from Department of Control Science and Engineering, Jilin University, China in 2017, and the Ph. D. degree in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 2022. His research interests include cross-modal retrieval, multimodal learning, dialogue system, knowledge-grounded generation and speech separation. E-mail: chenxiuyi2017@ia.ac.cn

    Jing Shi received the B. Sc. degree in automation from School of Instrumentation and Optoelectronic Engineering, Beihang University, China in 2012, and the Ph. D. degree in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 2021. He is a research assistant at Institute of Automation, Chinese Academy of Sciences, China. His research interests include cross-modal modeling, multimodal learning, dialogue system, speech recognition and speech separation. E-mail: shijing2014@ia.ac.cn

    Shuang Xu received the B. Sc. and M. Sc. degrees in measuring and testing technologies and instruments from Yanshan University, China in 2001 and 2004, respectively, and the Ph. D. degree in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 2009. She is a professor at Institute of Automation, Chinese Academy of Sciences, China. Her research interests include natural language processing and understanding, and human-AI hybrid intelligence. E-mail: shuang.xu@ia.ac.cn

    Bo Xu received the B. Sc. degree in electrical engineering from Zhejiang University, China in 1988, and the M. Sc. and Ph. D. degrees in pattern recognition and intelligent system from Institute of Automation, Chinese Academy of Sciences, China in 1992 and 1997, respectively. He is a professor and the director of Institute of Automation, Chinese Academy of Sciences, China, and also deputy director of the Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, China. His research interests include brain-inspired intelligence, brain-inspired cognitive models, natural language processing and understanding, and brain-inspired robotics. E-mail: xubo@ia.ac.cn (Corresponding author). ORCID iD: 0000-0002-1111-1529

  • Received Date: 2022-06-05
  • Accepted Date: 2022-08-17
  • In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. Substantial work has shown that pre-trained models benefit downstream uni-modal tasks and avoid training a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
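    The abstract names pre-training objectives as one of the five aspects surveyed. As an illustration only (not code from the paper), the NumPy sketch below shows one widely used VLP objective: the symmetric image-text contrastive (InfoNCE) loss, in which matched image-text pairs in a batch are positives and all other pairings are negatives. The function name and toy embeddings are our own.

    ```python
    import numpy as np

    def contrastive_image_text_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric contrastive loss over a batch of paired image/text embeddings.

        Row i of img_emb and row i of txt_emb are assumed to describe the same
        image-text pair; these diagonal pairs are the positives.
        """
        # L2-normalize so the dot product is cosine similarity.
        img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
        txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

        logits = img @ txt.T / temperature      # (N, N) similarity matrix
        labels = np.arange(len(logits))         # positives lie on the diagonal

        def cross_entropy(l, y):
            l = l - l.max(axis=1, keepdims=True)  # numerical stability
            log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
            return -log_probs[np.arange(len(y)), y].mean()

        # Average the image-to-text and text-to-image directions.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
    ```

    With aligned embeddings the diagonal dominates and the loss is small; shuffling one modality against the other drives the loss up, which is what pushes matched pairs together during pre-training.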


  • loading
  • [1]
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, ACM, Long Beach, USA, pp. 6000–6010, 2017.
    [2]
    J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
    [3]
    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 161×6 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021.
    [4]
    S. Schneider, A. Baevski, R. Collobert, M. Auli. Wav2Vec: Unsupervised pre-training for speech recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, pp. 3465–3469, 2019. DOI: 10.21437/Interspeech.2019-1873.
    [5]
    X. P. Qiu, T. X. Sun, Y. G. Xu, Y. F. Shao, N. Dai, X. J. Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020. DOI: 10.1007/s11431-020-1647-3.
    [6]
    X. Han, Z. Y. Zhang, N. Ding, Y. X. Gu, X. Liu, Y. Q. Huo, J. Z. Qiu, Y. Yao, A. Zhang, L. Zhang, W. T. Han, M. L. Huang, Q. Jin, Y. Y. Lan, Y. Liu, Z. Y. Liu, Z. W. Lu, X. P. Qiu, R. H. Song, J. Tang, J. R. Wen, J. H. Yuan, W. X. Zhao, J. Zhu. Pre-trained models: Past, present and future. AI Open, vol. 2, pp. 225–250, 2021. DOI: 10.1016/j.aiopen.2021.08.002.
    [7]
    K. Han, Y. H. Wang, H. T. Chen, X. H. Chen, J. Y. Guo, Z. H. Liu, Y. H. Tang, A. Xiao, C. J. Xu, Y. X. Xu, Z. H. Yang, Y. M. Zhang, D. C. Tao. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2022.3152247.
    [8]
    J. S. Lu, D. Batra, D. Parikh, S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 13–23, 2019.
    [9]
    L. H. Li, M. Yatskar, D. Yin, C. J. Hsieh, K. W. Chang. VisualBERT: A simple and performant baseline for vision and language. [Online], Available: https://arxiv.org/abs/1908.03557, 2019.
    [10]
    X. J. Li, X. Yin, C. Y. Li, P. C. Zhang, X. W. Hu, L. Zhang, L. J. Wang, H. D. Hu, L. Dong, F. R. Wei, Y. Choi, J. F. Gao. OSCAR: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 121–137, 2020. DOI: 10.1007/978-3-030-58577-8_8.
    [11]
    S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, ACM, Montreal, Canada, pp. 91–99, 2015.
    [12]
    P. Anderson, X. D. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6077–6086, 2018. DOI: 10.1109/CVPR.2018.00636.
    [13]
    G. Li, N. Duan, Y. J. Fang, M. Gong, D. X. Jiang. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 11336–11344, 2020.
    [14]
    Z. R. Wang, J. H. Yu, A. W. Yu, Z. H. Dai, Y. Tsvetkov, Y. Cao. SimVLM: Simple visual language model pretraining with weak supervision. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [15]
    H. Z. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, X. L. Chen. In defense of grid features for visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10267–10276, 2020. DOI: 10.1109/CVPR42600.2020.01028.
    [16]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
    [17]
    H. S. Luo, L. Ji, M. Zhong, Y. Chen, W. Lei, N. Duan, T. R. Li. CLIP4Clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, vol. 508, pp. 293–304, 2022. DOI: 10.1016/j.neucom.2022.07.028.
    [18]
    H. Fang, P. F. Xiong, L. H. Xu, Y. Chen. CLIP2Video: Mastering video-text retrieval via image clip. [Online], Available: https://arxiv.org/abs/2106.11097, 2021.
    [19]
    K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
    [20]
    J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, pp. 248–255, 2009. DOI: 10.1109/CVPR.2009.5206848.
    [21]
    C. Feichtenhofer, H. Q. Fan, J. Malik, K. M. He. SlowFast networks for video recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 6202–6211, 2019. DOI: 10.1109/ICCV.2019.00630.
    [22]
    J. Carreira, A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6299–6308, 2017. DOI: 10.1109/CVPR.2017.502.
    [23]
    W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman. The kinetics human action video dataset. [Online], Available: https://arxiv.org/abs/1705.06950, 2017.
    [24]
    Y. H. Liu, M. Ott, N. Goyal, J. F. Du, M. Joshi, D. Q. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. [Online], Available: https://arxiv.org/abs/1907.11692, 2019.
    [25]
    Z. Z. Lan, M. D. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2019
    [26]
    Z. L. Yang, Z. H. Dai, Y. M. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, ACM, Vancouver, Canada, pp. 5753–5763, 2019.
    [27]
    P. C. Zhang, X. J. Li, X. W. Hu, J. W. Yang, L. Zhang, L. J. Wang, Y. Choi, J. F. Gao. VinVL: Revisiting visual representations in vision-language models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5579–5588, 2021. DOI: 10.1109/CVPR46437.2021.00553.
    [28]
    Y. Zeng, X. S. Zhang, H. Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, pp. 25994–26009, 2022.
    [29]
    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, pp. 10347–10357, 2021.
    [30]
    Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: Universal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: 10.1007/978-3-030-58577-8_7.
    [31]
    L. W. Zhou, H. Palangi, L. Zhang, H. D. Hu, J. Corso, J. F. Gao. Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 13041–13049, 2020.
    [32]
    S. Y. Zhang, T. Jiang, T. Wang, K. Kuang, Z. Zhao, J. K. Zhu, J. Yu, H. X. Yang, F Wu. DeVLBert: Learning deconfounded visio-linguistic representations. In Proceedings of the 28th ACM International Conference on Multimedia, ACM, Seattle, USA, pp. 4373–4382, 2020. DOI: 10.1145/3394171.3413518.
    [33]
    Z. Y. Dou, Y. C. Xu, Z. Gan, J. F. Wang, S. H. Wang, L. J. Wang, C. G. Zhu, P. C. Zhang, L. Yuan, N. Y. Peng, Z. C. Liu, M. Zeng. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 18145–18155, 2022. DOI: 10.1109/CVPR52688.2022.01763.
    [34]
    W. L. Taylor. “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly, vol. 30, no. 4, pp. 415–433, 1953. DOI: 10.1177/107769905303000401.
    [35]
    J. N. Li, R. R. Selvaraju, A. D. Gotmare, S. Joty, C. M. Xiong, S. C. H. Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of 35th Conference on Neural Information Processing Systems, pp. 9694–9705, 2021.
    [36]
    L. J. Li, Y. C. Chen, Y. Cheng, Z. Gan, L. C. Yu, J. J. Liu. HERO: Hierarchical encoder for video + language omni-representation pre-training. In Proceedings of Conference on Empirical Methods in Natural Language Processing, pp. 2046–2065, 2020. DOI: 10.18653/v1/2020.emnlp-main.161.
    [37]
    S. Antol, A. Agrawal, J. S. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh. VQA: Visual question answering. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2425–2433, 2015. DOI: 10.1109/ICCV.2015.279.
    [38]
    J. Lei, L. C. Yu, M. Bansal, T. L. Berg. TVQA: Localized, compositional video question answering. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1369–1379, 2018. DOI: 10.18653/v1/D18-1167.
    [39]
    O. Vinyals, A. Toshev, S. Bengio, D. Erhan. Show and tell: A neural image caption generator. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 3156–3164, 2015. DOI: 10.1109/CVPR.2015.7298935.
    [40]
    S. Bai, S. An. A survey on automatic image caption generation. Neurocomputing, vol. 311, pp. 291–304, 2018. DOI: 10.1016/j.neucom.2018.05.080.
    [41]
    Q. L. Xia, H. Y. Huang, N. Duan, D. D. Zhang, L. Ji, Z. Sui, E. Cui, T. Bharti, M. Zhou. XGPT: Cross-modal generative pre-training for image captioning. In Proceedings of the 10th CCF International Conference on Natural Language Processing and Chinese Computing, Springer, Qingdao, China, pp. 786–797, 2021. DOI: 10.1007/978-3-030-88480-2_63.
    [42]
    A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1780–1790, 2021. DOI: 10.1109/ICCV48922.2021.00180.
    [43]
    M. Zhuge, D. H. Gao, D. P. Fan, L. B. Jin, B. Chen, H. M. Zhou, M. H. Qiu, L. Shao. Kaleido-BERT: Vision-language pre-training on fashion domain. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12647–12657, 2021. DOI: 10.1109/CVPR46437.2021.01246.
    [44]
    V. Ordonez, G. Kulkarni, T. L. Berg. Im2TEXT: Describing images using 1 million captioned photographs. In Proceedings of the 24th International Conference on Neural Information Processing Systems, ACM, Granada, Spain, pp. 1143–1151, 2011.
    [45]
    P. Young, A. Lai, M. Hodosh, J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014. DOI: 10.1162/tacl_a_00166.
    [46]
    T. Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 740–755, 2014. DOI: 10.1007/978-3-319-10602-1_48.
    [47]
    R. Krishna, Y. K. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, vol. 123, pp. 1, pp. 32–73, 2017. DOI: 10.1007/s11263-016-0981-7.
    [48]
    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6904–6913, 2017. DOI: 10.1109/CVPR.2017.670.
    [49]
    A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nie\ssner, M. Savva, S. R. Song, A. Zeng, Y. D. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In Proceedings of International Conference on 3D Vision, IEEE, Qingdao, China, pp. 667–676, 2017. DOI: 10.1109/3DV.2017.00081.
    [50]
    N. Rostamzadeh, S. Hosseini, T. Boquet, W. Stokowiec, Y. Zhang, C. Jauvin, C. Pal. Fashion-Gen: The generative fashion dataset and challenge. [Online], Available: https://arxiv.org/abs/1806.08317, 2018.
    [51]
    P. Sharma, N. Ding, S. Goodman, R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2556–2565, 2018. DOI: 10.18653/v1/P18-1238.
    [52]
    D. A. Hudson, C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6700–6709, 2019. DOI: 10.1109/CVPR.2019.00686.
    [53]
    D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. [Online], Available: https://arxiv.org/abs/2001.07966, 2020.
    [54]
    S. Changpinyo, P. Sharma, N. Ding, R. Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3558–3568, 2021. DOI: 10.1109/CVPR46437.2021.00356.
    [55]
    C. Jia, Y. F. Yang, Y. Xia, Y. T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. H. Sung, Z. Li, T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 4904–4916, 2021.
    [56]
    A. Miech, D. Zhukov, J. B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 2630–2640, 2019. DOI: 10.1109/ICCV.2019.00272.
    [57]
    M. Bain, A. Nagrani, G. Varol, A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1728–1738, 2021. DOI: 10.1109/ICCV48922.2021.00175.
    [58]
    C. Sun, A. Myers, C. Vondrick, K. Murphy, C. Schmid. VideoBERT: A joint model for video and language representation learning. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 7463–7472, 2019. DOI: 10.1109/ICCV.2019.00756.
    [59]
    Q. Wu, D. Teney, P. Wang, C. H. Shen, A. Dick, A. Van Den Hengel. Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding, vol. 163, pp. 21–40, 2017. DOI: 10.1016/j.cviu.2017.05.001.
    [60]
    K. Kafle, C. Kanan. Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding, vol. 163, pp. 3–20, 2017. DOI: 10.1016/j.cviu.2017.06.005.
    [61]
    K. Kafle, C. Kanan. An analysis of visual question answering algorithms. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 1965–1973, 2017. DOI: 10.1109/ICCV.2017.217.
    [62]
    S. J. Geng, J. Zhang, H. Zhang, A. Elgammal, D. N. Metaxas. 2nd place solution to the GQA challenge 2019. [Online], Available: https://arxiv.org/abs/1907.06794, 2019.
    [63]
    Y. Bitton, G. Stanovsky, R. Schwartz, M. Elhadad. Automatic generation of contrast sets from scene graphs: Probing the compositional consistency of GQA. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 94–105, 2021. DOI: 10.18653/v1/2021.naacl-main.9.
    [64]
    J. C. Li, S. L. Tang, L. C. Zhu, H. C. Shi, X. W. Huang, F. Wu, Y. Yang, Y. T. Zhuang. Adaptive hierarchical graph reasoning with semantic coherence for video-and-language inference. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1867–1877, 2021. DOI: 10.1109/ICCV48922.2021.00188.
    [65]
    A. Chaudhary. Robust Vision and Language Inference via Semantics Transformed Adversarial Training, Ph. D. dissertation, Arizona State University, Phoenix, USA, 2021.
    [66]
    N. Xie, F. Lai, D. Doran, A. Kadav. Visual entailment: A novel task for fine-grained image understanding. [Online], Available: https://arxiv.org/abs/1901.06706, 2019.
    [67]
    H. Y. Song, L. Dong, W. N. Zhang, T. Liu, F. R. Wei. CLIP models are few-shot learners: Empirical studies on VQA and visual entailment. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 6088–6100, 2022. DOI: 10.18653/v1/2022.acl-long.421.
    [68]
    N. Xie, F. Lai, D. Doran, A. Kadav. Visual entailment task for visually-grounded language learning. [Online], Available: https://arxiv.org/abs/1811.10582, 2018.
    [69]
    R. Zellers, Y. Bisk, A. Farhadi, Y. Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6720–6731, 2019. DOI: 10.1109/CVPR.2019.00688.
    [70]
    W. J. Yu, J. W. Zhou, W. H. Yu, X. D. Liang, N. Xiao. Heterogeneous graph learning for visual commonsense reasoning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, ACM, Vancouver, Canada, pp. 2769–2779, 2019.
    [71]
    K. Ye, A. Kovashka. A case study of the shortcut effects in visual commonsense reasoning. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 35, no. 4, pp. 3181–3189, 2021.
    [72]
    A. Suhr, M. Lewis, J. Yeh, Y. Artzi. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, pp. 217–223, 2017. DOI: 10.18653/v1/P17-2034.
    [73]
    A. Marasović, C. Bhagavatula, J. S. Park, R. Le Bras, N. A. Smith, Y. Choi. Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs. In Proceedings of Conference on Empitical Methods in Natural Language Processing, pp. 2810–2829, 2020. DOI: 10.18653/v1/2020.findings-emnlp.253.
    [74]
    X. H. Liu, Z. H. Wang, J. Shao, X. G. Wang, H. S. Li. Improving referring expression grounding with cross-modal attention-guided erasing. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1950–1959, 2019. DOI: 10.1109/CVPR.2019.00205.
    [75]
    S. B. Yang, G. B. Li, Y. Z. Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 4145–4154, 2019. DOI: 10.1109/CVPR.2019.00427.
    [76]
    H. W. Zhang, Y. L. Niu, S. F. Chang. Grounding referring expressions in images by variational context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 4158–4166, 2018. DOI: 10.1109/CVPR.2018.00437.
    [77]
    D. Ghosal, M. S. Akhtar, D. Chauhan, S. Poria, A. Ekbal, P. Bhattacharyya. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3454–3466, 2018. DOI: 10.18653/v1/D18-1382.
    [78]
    M. S. Akhtar, D. Chauhan, D. Ghosal, S. Poria, A. Ekbal, P. Bhattacharyya. Multi-task learning for multi-modal emotion recognition and sentiment analysis. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 370–379, 2019. DOI: 10.18653/v1/N19-1034.
    [79]
    J. M. Liu, P. X. Zhang, Y. Liu, W. D. Zhang, J. Fang. Summary of multi-modal sentiment analysis technology. Journal of Frontiers of Computer Science and Technology, vol. 15, no. 7, pp. 1165–1182, 2021. DOI: 10.3778/j.issn.1673-9418.2012075. (in Chinese)
    [80]
    D. Z. Zhang, X. Y. Chen, S. Xu, B. Xu. Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, pp. 4429–4440, 2020. DOI: 10.18653/v1/2020.coling-main.392.
    [81]
    K. Y. Wang, Q. Y. Yin, W. Wang, S. Wu, L. Wang. A comprehensive survey on cross-modal retrieval. [Online], Available: https://arxiv.org/abs/1607.06215, 2016.
    [82]
    N. C. Mithun, J. C. Li, F. Metze, A. K. Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of ACM on International Conference on Multimedia Retrieval, Yokohama, Japan, pp. 19–27, 2018. DOI: 10.1145/3206025.3206064.
    [83]
    H. Chen, G. G. Ding, X. D. Liu, Z. J. Lin, J. Liu, J. Han. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 12652–12660, 2020. DOI: 10.1109/CVPR42600.2020.01267.
    [84]
    F. L. Chen, X. Y. Chen, J. X. Shi, D. Z. Zhang, J. L. Chang, Q. Tian. HiVLP: Hierarchical vision-language pre-training for fast image-text retrieval. [Online], Available: https://arxiv.org/abs/2205.12105, 2022.
    [85]
    K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ACM, Lille, France, pp. 2048–2057, 2015.
    [86]
    B. R. Wang, L. Ma, W. Zhang, W. Liu. Reconstruction network for video captioning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7622–7631, 2018. DOI: 10.1109/CVPR.2018.00795.
    [87]
    H. Agrawal, K. Desai, Y. F. Wang, X. L. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, P. Anderson. Nocaps: Novel object captioning at scale. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Republic of Korea, pp. 8948–8957, 2019. DOI: 10.1109/ICCV.2019.00904.
    [88]
    Q. Y. Feng, Y. Wu, H. H. Fan, C. G. Yan, M. L. Xu, Y. Yang. Cascaded revision network for novel object captioning. IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3413–3421, 2020. DOI: 10.1109/TCSVT.2020.2965966.
    [89]
    A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, D. Batra. Visual dialog. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 326–335, 2017. DOI: 10.1109/CVPR.2017.121.
    [90]
    F. L. Chen, F. D. Meng, J. M. Xu, P. Li, B. Xu, J. Zhou. DMRM: A dual-channel multi-hop reasoning model for visual dialog. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 34, no. 5, pp. 7504–7511, 2020. DOI: 10.1609/aaai.v34i05.6248.
    [91] F. L. Chen, X. Y. Chen, F. D. Meng, P. Li, J. Zhou. GoG: Relation-aware graph-over-graph network for visual dialog. In Proceedings of International Joint Conference on Natural Language Processing, pp. 230–243, 2021. DOI: 10.18653/v1/2021.findings-acl.20.
    [92] F. L. Chen, F. D. Meng, X. Y. Chen, P. Li, J. Zhou. Multimodal incremental transformer with visual grounding for visual dialogue generation. In Proceedings of International Joint Conference on Natural Language Processing, pp. 436–446, 2021. DOI: 10.18653/v1/2021.findings-acl.38.
    [93] L. Specia, S. Frank, K. Sima'an, D. Elliott. A shared task on multimodal machine translation and crosslingual image description. In Proceedings of the 1st Conference on Machine Translation, Berlin, Germany, pp. 543–553, 2016. DOI: 10.18653/v1/W16-2346.
    [94] Y. J. Yin, F. D. Meng, J. S. Su, C. L. Zhou, Z. Y. Yang, J. Zhou, J. B. Luo. A novel graph-based multi-modal fusion encoder for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3025–3035, 2020. DOI: 10.18653/v1/2020.acl-main.273.
    [95] Y. H. Su, K. Fan, N. Bach, C. C. J. Kuo, F. Huang. Unsupervised multi-modal neural machine translation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 10482–10491, 2019. DOI: 10.1109/CVPR.2019.01073.
    [96] X. Wang, Q. Y. Huang, A. Celikyilmaz, J. F. Gao, D. H. Shen, Y. F. Wang, W. Y. Wang, L. Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6629–6638, 2019. DOI: 10.1109/CVPR.2019.00679.
    [97] F. D. Zhu, Y. Zhu, X. J. Chang, X. D. Liang. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10012–10022, 2020. DOI: 10.1109/CVPR42600.2020.01003.
    [98] J. Gu, E. Stefani, Q. Wu, J. Thomason, X. Wang. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 7606–7623, 2022. DOI: 10.18653/v1/2022.acl-long.524.
    [99] S. Mori, H. Nishida, H. Yamada. Optical Character Recognition, New York, USA: J. Wiley, 1999.
    [100] J. Memon, M. Sami, R. A. Khan, M. Uddin. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access, vol. 8, pp. 142642–142668, 2020. DOI: 10.1109/ACCESS.2020.3012542.
    [101] R. Strudel, R. Garcia, I. Laptev, C. Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 7242–7252, 2021. DOI: 10.1109/ICCV48922.2021.00717.
    [102] Y. J. Mo, Y. Wu, X. N. Yang, F. L. Liu, Y. J. Liao. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing, vol. 493, pp. 626–646, 2022. DOI: 10.1016/j.neucom.2022.01.005.
    [103] Z. Q. Zhao, P. Zheng, S. T. Xu, X. D. Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019. DOI: 10.1109/TNNLS.2018.2876865.
    [104] Y. X. Fang, B. C. Liao, X. G. Wang, J. M. Fang, J. Y. Qi, R. Wu, J. W. Niu, W. Y. Liu. You only look at one sequence: Rethinking transformer in vision through object detection. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 26183–26197, 2021.
    [105] C. Sun, F. Baradel, K. Murphy, C. Schmid. Learning video representations using contrastive bidirectional transformer. [Online], Available: https://arxiv.org/abs/1906.05743, 2019.
    [106] H. S. Luo, L. Ji, B. T. Shi, H. Y. Huang, N. Duan, T. R. Li, J. Li, T. Bharti, M. Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. [Online], Available: https://arxiv.org/abs/2002.06353, 2020.
    [107] N. Rethmeier, I. Augenstein. Long-tail zero and few-shot learning via contrastive pretraining on and for small data. Computer Sciences and Mathematics Forum, vol. 3, no. 1, Article number 10, 2022. DOI: 10.3390/cmsf2022003010.
    [108] W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [109] Y. Wang, S. R. Joty, M. R. Lyu, I. King, C. M. Xiong, S. C. H. Hoi. VD-BERT: A unified vision and dialog transformer with BERT. In Proceedings of Conference on Empirical Methods in Natural Language Processing, pp. 3325–3338, 2020. DOI: 10.18653/v1/2020.emnlp-main.269.
    [110] J. H. Dong, Y. Cong, G. Sun, B. N. Zhong, X. W. Xu. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 4022–4031, 2020. DOI: 10.1109/CVPR42600.2020.00408.
    [111] J. H. Dong, Y. Cong, G. Sun, Z. Fang, Z. M. Ding. Where and how to transfer: Knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2021.3128560.
    [112] H. B. Bao, W. H. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, F. R. Wei. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. [Online], Available: https://arxiv.org/abs/2111.02358, 2021.
    [113] H. Tan, M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 5100–5111, 2019. DOI: 10.18653/v1/D19-1514.
    [114] C. Alberti, J. Ling, M. Collins, D. Reitter. Fusion of detected objects in text for visual question answering. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 2131–2140, 2019. DOI: 10.18653/v1/D19-1219.
    [115] J. S. Lu, V. Goswami, M. Rohrbach, D. Parikh, S. Lee. 12-in-1: Multi-task vision and language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10437–10446, 2020. DOI: 10.1109/CVPR42600.2020.01045.
    [116] V. Murahari, D. Batra, D. Parikh, A. Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 336–352, 2020. DOI: 10.1007/978-3-030-58523-5_20.
    [117] W. T. Hao, C. Y. Li, X. J. Li, L. Carin, J. F. Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 13134–13143, 2020. DOI: 10.1109/CVPR42600.2020.01315.
    [118] J. Y. Lin, A. Yang, Y. C. Zhang, J. Liu, J. R. Zhou, H. X. Yang. InterBERT: Vision-and-language interaction for multi-modal pretraining. [Online], Available: https://arxiv.org/abs/2003.13198, 2020.
    [119] Z. C. Huang, Z. Y. Zeng, B. Liu, D. M. Fu, J. L. Fu. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. [Online], Available: https://arxiv.org/abs/2004.00849, 2020.
    [120] Y. C. Hong, Q. Wu, Y. K. Qi, C. R. Opazo, S. Gould. VLN-BERT: A recurrent vision-and-language BERT for navigation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 1643–1653, 2021. DOI: 10.1109/CVPR46437.2021.00169.
    [121] D. H. Gao, L. B. Jin, B. Chen, M. H. Qiu, P. Li, Y. Wei, Y. Hu, H. Wang. FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2251–2260, 2020. DOI: 10.1145/3397271.3401430.
    [122] Z. Gan, Y. C. Chen, L. J. Li, C. Zhu, Y. Cheng, J. J. Liu. Large-scale adversarial training for vision-and-language representation learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems, ACM, Vancouver, Canada, pp. 6616–6628, 2020.
    [123] F. Yu, J. J. Tang, W. C. Yin, Y. Sun, H. Tian, H. Wu, H. F. Wang. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, pp. 3208–3216, 2021.
    [124] M. J. Chiou, R. Zimmermann, J. S. Feng. Visual relationship detection with visual-linguistic knowledge from multimodal representations. IEEE Access, vol. 9, pp. 50441–50451, 2021. DOI: 10.1109/ACCESS.2021.3069041.
    [125] J. Cho, J. Lei, H. Tan, M. Bansal. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, pp. 1931–1942, 2021.
    [126] W. Kim, B. Son, I. Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 5583–5594, 2021.
    [127] Z. C. Huang, Z. Y. Zeng, Y. P. Huang, B. Liu, D. M. Fu, J. L. Fu. Seeing out of the box: End-to-end pre-training for vision-language representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12976–12985, 2021. DOI: 10.1109/CVPR46437.2021.01278.
    [128] H. Y. Xu, M. Yan, C. L. Li, B. Bi, S. F. Huang, W. M. Xiao, F. Huang. E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 503–513, 2021. DOI: 10.18653/v1/2021.acl-long.42.
    [129] H. W. Xue, Y. P. B. Liu, H. W. Peng, J. L. Fu, H. Q. Li, J. B. Luo. Probing inter-modality: Visual parsing with self-attention for vision-language pre-training. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 4514–4528, 2021.
    [130] S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K. W. Chang, Z. W. Yao, K. Keutzer. How much can CLIP benefit vision-and-language tasks? In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [131] A. Jain, M. Guo, K. Srinivasan, T. Chen, S. Kudugunta, C. Jia, Y. F. Yang, J. Baldridge. MURAL: Multimodal, multitask retrieval across languages. [Online], Available: https://arxiv.org/abs/2109.05125, 2021.
    [132] J. Y. Yang, J. L. Duan, S. Tran, Y. Xu, S. Chanda, L. Q. Chen, B. Zeng, T. Chilimbi, J. Z. Huang. Vision-language pre-training with triple contrastive learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 15650–15659, 2022. DOI: 10.1109/CVPR52688.2022.01522.
    [133] S. N. Xie, C. Sun, J. Huang, Z. W. Tu, K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. [Online], Available: https://arxiv.org/abs/1712.04851, 2017.
    [134] A. Urooj, A. Mazaheri, N. da Vitoria Lobo, M. Shah. MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering. In Proceedings of Conference on Empirical Methods in Natural Language Processing, pp. 4648–4660, 2020. DOI: 10.18653/v1/2020.findings-emnlp.417.
    [135] L. C. Zhu, Y. Yang. ActBERT: Learning global-local video-text representations. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 8743–8752, 2020. DOI: 10.1109/CVPR42600.2020.00877.
    [136] R. Yan, M. Z. Shou, Y. X. Ge, A. J. Wang, X. D. Lin, G. Y. Cai, J. H. Tang. Video-text pre-training with learned regions. [Online], Available: https://arxiv.org/abs/2112.01194, 2021.
    [137] H. Zhu, M. D. Luo, R. Wang, A. H. Zheng, R. He. Deep audio-visual learning: A survey. International Journal of Automation and Computing, vol. 18, no. 3, pp. 351–376, 2021. DOI: 10.1007/s11633-021-1293-0.
    [138] J. H. Tao, J. Huang, Y. Li, Z. Lian, M. Y. Niu. Correction to: Semi-supervised ladder networks for speech emotion recognition. International Journal of Automation and Computing, vol. 18, no. 4, pp. 680–680, 2021. DOI: 10.1007/s11633-019-1215-6.
    [139] H. Akbari, L. Z. Yuan, R. Qian, W. H. Chuang, S. F. Chang, Y. Cui, B. Q. Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 24206–24221, 2021.
    [140] J. Liu, X. X. Zhu, F. Liu, L. T. Guo, Z. J. Zhao, M. Z. Sun, W. N. Wang, H. Q. Lu, S. Y. Zhou, J. J. Zhang, J. Q. Wang. OPT: Omni-perception pre-trainer for cross-modal understanding and generation. [Online], Available: https://arxiv.org/abs/2107.00249, 2021.
    [141] A. Guzhov, F. Raue, J. Hees, A. Dengel. AudioCLIP: Extending CLIP to image, text and audio. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 976–980, 2022. DOI: 10.1109/ICASSP43922.2022.9747631.
    [142] R. Zellers, J. S. Lu, X. M. Lu, Y. Yu, Y. P. Zhao, M. Salehi, A. Kusupati, J. Hessel, A. Farhadi, Y. Choi. MERLOT reserve: Neural script knowledge through vision and language and sound. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, New Orleans, USA, pp. 16354–16366, 2022. DOI: 10.1109/CVPR52688.2022.01589.
    [143] K. Z. Chen, Q. Y. Huang, Y. Bisk, D. McDuff, J. F. Gao. KB-VLP: Knowledge based vision and language pretraining. In Proceedings of the 38th International Conference on Machine Learning, 2021.
    [144] M. Tsimpoukelli, J. Menick, S. Cabi, S. M. Ali Eslami, O. Vinyals, F. Hill. Multimodal few-shot learning with frozen language models. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 200–212, 2021.
    [145] A. Fan, E. Grave, A. Joulin. Reducing transformer depth on demand with structured dropout. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [146] V. Sanh, L. Debut, J. Chaumond, T. Wolf. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. [Online], Available: https://arxiv.org/abs/1910.01108, 2019.
    [147] O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat. Q8BERT: Quantized 8Bit BERT. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, IEEE, Vancouver, Canada, pp. 36–39, 2019. DOI: 10.1109/EMC2-NIPS53020.2019.00016.
    [148] Z. Y. Fang, J. F. Wang, X. W. Hu, L. J. Wang, Y. Z. Yang, Z. C. Liu. Compressing visual-linguistic model via knowledge distillation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1428–1438, 2021. DOI: 10.1109/ICCV48922.2021.00146.
    [149] Y. G. Li, F. Liang, L. C. Zhao, Y. F. Cui, W. L. Ouyang, J. Shao, F. W. Yu, J. J. Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In Proceedings of the 10th International Conference on Learning Representations, 2022.
    [150] C. Saharia, W. Chan, S. Saxena, L. L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. [Online], Available: https://arxiv.org/abs/2205.11487, 2022.
    [151] X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, T. B. Hashimoto. Diffusion-LM improves controllable text generation. [Online], Available: https://arxiv.org/abs/2205.14217, 2022.
    [152] W. Z. Chen, X. Han, Y. K. Lin, H. X. Zhao, Z. Y. Liu, P. Li, M. S. Sun, J. Zhou. Fully hyperbolic neural networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, pp. 5672–5686, 2022. DOI: 10.18653/v1/2022.acl-long.389.
    [153] M. M. Bronstein, J. Bruna, T. Cohen, P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. [Online], Available: https://arxiv.org/abs/2104.13478, 2021.
    [154] W. Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997. DOI: 10.1016/S0893-6080(97)00011-7.
    [155] D. Z. Zhang, T. L. Zhang, S. C. Jia, Q. Y. Wang, B. Xu. Recent advances and new frontiers in spiking neural networks. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, 2022.
    [156] D. Z. Zhang, T. L. Zhang, S. C. Jia, X. Cheng, B. Xu. Population-coding and dynamic-neurons improved spiking actor network for reinforcement learning. [Online], Available: https://arxiv.org/abs/2106.07854, 2021.