Citation: Guyue Hu, Bin He, Hanwang Zhang. Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos. Machine Intelligence Research, vol. 20, no. 2, pp. 249-262, 2023. https://doi.org/10.1007/s11633-022-1409-1

Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos

doi: 10.1007/s11633-022-1409-1
More Information
  • Author Bio:

    Guyue Hu received the B. Eng. degree in automation from Hefei University of Technology, China in 2016, and the Ph. D. degree in pattern recognition and intelligent systems from National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2021. He was also a research fellow with School of Computing, National University of Singapore, Singapore from 2021 to 2022. He is currently a research fellow with School of Computer Science and Engineering, Nanyang Technological University, Singapore. He serves as a regular reviewer for a number of international journals and conferences, such as TPAMI, TMM, TCSVT, CVPR, ICCV, and ECCV. His research interests include computer vision, pattern recognition, and computational neuroscience, especially multi-modal learning, video understanding, and human activity analysis. E-mail: guyue.hu@ntu.edu.sg (Corresponding author) ORCID iD: 0000-0002-6198-8230

    Bin He received the B. Eng. degree in automation from Harbin University of Science and Technology, China in 2014, and the Ph. D. degree in mechanical and electronic engineering from Harbin University of Science and Technology, China in 2020. As a joint Ph. D. student, he completed his entire doctoral research at National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He is currently an engineer at North China Computing Technology Institute (also known as the 15th Research Institute of China Electronics Technology Group Corporation), China. His research interests include computer vision, pattern recognition, and intelligent decision-making, especially military intelligence. E-mail: binhe.cas@foxmail.com ORCID iD: 0000-0002-3845-7335

    Hanwang Zhang received the B. Eng. degree in computer science from Zhejiang University, China in 2009, and the Ph. D. degree in computer science from National University of Singapore, Singapore in 2014. He was a research scientist with Department of Computer Science, Columbia University, USA from 2017 to 2018, and a research fellow with National University of Singapore, Singapore from 2014 to 2016. He is currently an assistant professor at School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has authored more than 150 scientific papers in top journals and conferences, including TPAMI, TIP, ICLR, NeurIPS, CVPR, ICCV, ECCV, ACL, and EMNLP. He received the Best Demo Runner-up Award at ACM MM 2012, the Best Student Paper Award at ACM MM 2013, the Best Paper Honorable Mention at ACM SIGIR 2016, and the TOMM Best Paper Award in 2018. He also won the Best Ph. D. Thesis Award of School of Computing, National University of Singapore, Singapore in 2014. His research interests include computer vision and multimedia, especially the fusion of deep learning and reasoning in these fields. E-mail: hanwangzhang@ntu.edu.sg ORCID iD: 0000-0001-7374-8739

  • Received Date: 2022-06-30
  • Accepted Date: 2022-12-13
  • Publish Online: 2023-03-02
  • Publish Date: 2023-04-01
  • Abstract: Instructional videos are very useful for completing complex daily tasks and naturally contain abundant clip-narration pairs. Existing works on procedure understanding pretrain various video-language models with these pairs and then finetune downstream classifiers and localizers in a predetermined category space. Such video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework that understands long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (an action prompt, an object prompt, and a procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Moreover, this task reformulation enables CPL to perform well in zero-shot, few-shot, and fully supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
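
    As a rough, non-authoritative illustration of the video-text matching reformulation described in the abstract, the Python sketch below scores each candidate procedure by composing an action prompt, an object prompt, and a procedure prompt, and then matching them against a pooled video embedding. All names (encode_clips, encode_text, compose_prompts), the prompt templates, and the toy random encoders are assumptions made for illustration only; they are not the authors' implementation, which uses learnable prompts on top of a pretrained short-term video-language model.

        # Minimal sketch of classification-as-video-text-matching with
        # compositional prompts. Toy encoders stand in for a pretrained
        # short-term video-language model; this is NOT the authors' code.
        import numpy as np

        def encode_clips(clips):
            """Stand-in visual encoder: one L2-normalised embedding per clip."""
            rng = np.random.default_rng(0)
            embs = rng.standard_normal((len(clips), 512))
            return embs / np.linalg.norm(embs, axis=1, keepdims=True)

        def encode_text(sentences):
            """Stand-in text encoder: deterministic toy embeddings per sentence."""
            seed = sum(ord(c) for s in sentences for c in s)
            rng = np.random.default_rng(seed)
            embs = rng.standard_normal((len(sentences), 512))
            return embs / np.linalg.norm(embs, axis=1, keepdims=True)

        def compose_prompts(action, obj, procedure):
            """Compose action, object, and procedure prompts for one candidate."""
            return [
                f"a video of the action {action}",        # action prompt
                f"a video containing the object {obj}",   # object prompt
                f"a video of the procedure {procedure}",  # procedure prompt
            ]

        def score_procedure(video_emb, prompts):
            """Average cosine similarity between the video and its prompts."""
            text_embs = encode_text(prompts)
            return float((text_embs @ video_emb).mean())

        # Pool short-term clip embeddings into one long-term video embedding.
        clip_embs = encode_clips(["clip_1", "clip_2", "clip_3"])
        video_emb = clip_embs.mean(axis=0)
        video_emb = video_emb / np.linalg.norm(video_emb)

        # Hypothetical candidate procedures with action/object descriptions.
        candidates = {
            "make coffee": ("pour water", "coffee beans", "make coffee"),
            "change a tire": ("loosen bolts", "car tire", "change a tire"),
        }
        scores = {name: score_procedure(video_emb, compose_prompts(*parts))
                  for name, parts in candidates.items()}
        print("predicted procedure:", max(scores, key=scores.get))

    Because every task is cast as matching a video against natural-language prompts, the same scoring loop applies whether the candidate procedures were seen during training (fully supervised), partially seen (few-shot), or entirely new (zero-shot).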

     
