Citation: Yang Liu, Yu-Shen Wei, Hong Yan, Guan-Bin Li, Liang Lin. Causal Reasoning Meets Visual Representation Learning: A Prospective Study. Machine Intelligence Research, vol. 19, no. 6, pp. 485–511, 2022. https://doi.org/10.1007/s11633-022-1362-z

Causal Reasoning Meets Visual Representation Learning: A Prospective Study

doi: 10.1007/s11633-022-1362-z
More Information
  • Author Bio:

    Yang Liu received the B. Sc. degree in telecommunications engineering from Chang'an University, China in 2014, and the Ph. D. degree in telecommunications and information systems from Xidian University, China in 2019. He is currently a research associate professor at the School of Computer Science and Engineering, Sun Yat-sen University, China. He has authored or co-authored more than 20 papers in top-tier academic journals and conferences. He has been serving as a reviewer for numerous academic journals and conferences such as IEEE TIP, TNNLS, TMM, TCSVT, TCyb, CVPR, ICCV, AAAI, and ECCV. He is a member of IEEE and CSIG. His research interests include video understanding, causal reasoning, and computer vision. E-mail: liuy856@mail.sysu.edu.cn ORCID iD: 0000-0002-9423-9252

    Yu-Shen Wei received the B. Sc. degree in computer science and technology from Sun Yat-sen University, China in 2020. He is currently a master's student at the School of Computer Science and Engineering, Sun Yat-sen University, China. His current research interests include video understanding, computer vision, and machine learning. E-mail: weiysh8@mail2.sysu.edu.cn ORCID iD: 0000-0002-0527-5463

    Hong Yan received the B. Sc. degree in computer science and technology from Nanchang University, China in 2020. He is currently a master's student at the School of Computer Science and Engineering, Sun Yat-sen University, China. His research interests include video understanding, computer vision, and machine learning. E-mail: yanh36@mail2.sysu.edu.cn ORCID iD: 0000-0003-4100-6751

    Guan-Bin Li received the Ph. D. degree from the University of Hong Kong, China in 2016. He is currently an associate professor at the School of Computer Science and Engineering, Sun Yat-sen University, China. He is a recipient of the ICCV 2019 Best Paper Nomination Award. He has authored or co-authored more than 70 papers in top-tier academic journals and conferences. He serves as an area chair for the VISAPP conference. He has been serving as a reviewer for numerous academic journals and conferences such as TPAMI, IJCV, TIP, TMM, TCyb, CVPR, ICCV, ECCV, and NeurIPS. His research interests include computer vision, image processing, and machine learning. E-mail: liguanbin@mail.sysu.edu.cn ORCID iD: 0000-0002-4805-0926

    Liang Lin received the Ph. D. degree from Beijing Institute of Technology, China in 2008. He is a full professor of computer science at Sun Yat-sen University, China. He served as the executive director and distinguished scientist of SenseTime Group from 2016 to 2018, leading R&D teams in transferring cutting-edge technology. He has authored or co-authored more than 200 papers in leading academic journals and conferences, and his papers have been cited more than 21,000 times. He is an associate editor of IEEE Transactions on Neural Networks and Learning Systems and IEEE Transactions on Human-Machine Systems, and has served as an area chair for numerous conferences such as CVPR, ICCV, SIGKDD, and AAAI. He is the recipient of numerous awards and honors, including the Wu Wen-Jun Artificial Intelligence Award, the First Prize of the China Society of Image and Graphics, an ICCV Best Paper Nomination in 2019, the Annual Best Paper Award of Pattern Recognition (Elsevier) in 2018, the Best Paper Diamond Award at IEEE ICME 2017, and a Google Faculty Award in 2012. Ph. D. students under his supervision have received the ACM China Doctoral Dissertation Award, the CCF Best Doctoral Dissertation Award, and the CAAI Best Doctoral Dissertation Award. He is a Fellow of the IET and IAPR. His research interests include artificial intelligence, computer vision, machine learning, multimedia, and NLP/dialogue. E-mail: linliang@ieee.org (Corresponding author) ORCID iD: 0000-0003-2248-3755

  • Received Date: 2022-05-09
  • Accepted Date: 2022-08-01
  • Publish Online: 2022-11-03
  • Publish Date: 2022-12-01
  • Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. With the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization has become a challenge for existing visual models. Most existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, so there is no unified guidance or analysis of why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, and bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications.
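
    To make the abstract's contrast between fitting statistical associations and modeling causal relations concrete, the following minimal Python sketch (not from the paper; the toy variables and generating process are illustrative assumptions) compares a naive associational estimate of an effect with a Pearl-style backdoor adjustment over a simulated confounder:

    # Minimal sketch (illustrative assumptions, not from the paper): models that
    # fit the association P(Y|X) absorb confounding bias, whereas the backdoor
    # adjustment estimates the interventional quantity P(Y|do(X)).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Confounder Z (e.g., scene context) drives both the feature X and label Y.
    z = rng.binomial(1, 0.5, n)
    x = rng.binomial(1, 0.2 + 0.6 * z)            # Z raises the chance of X = 1
    y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)  # true causal effect of X on Y: 0.2

    # Naive associational estimate E[Y|X=1] - E[Y|X=0] is inflated by Z.
    naive = y[x == 1].mean() - y[x == 0].mean()

    # Backdoor adjustment: P(Y=1|do(X=v)) = sum_z E[Y|X=v, Z=z] * P(Z=z).
    def p_y_do_x(v):
        return sum(y[(x == v) & (z == u)].mean() * (z == u).mean() for u in (0, 1))

    adjusted = p_y_do_x(1) - p_y_do_x(0)
    print(f"naive association: {naive:.3f}")     # about 0.50 on this toy data
    print(f"backdoor adjusted: {adjusted:.3f}")  # close to the true effect 0.20

    On this simulation the naive estimate is about 0.50 while the adjusted one recovers the true effect of 0.20, mirroring how visual models that only fit correlations can collapse into dataset bias, the failure mode that the causal reasoning methods surveyed here aim to address.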

  • [1]
    K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.1109/CVPR.2016.90.
    [2]
    T. S. Chen, L. Lin, R. Q. Chen, X. L. Hui, H. F. Wu. Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1371–1384, 2022. DOI: 10.1109/TPAMI.2020.3025814.
    [3]
    A. R. Akula, K. Z. Wang, C. S. Liu, S. Saba-Sadiya, H. J. Lu, S. Todorovic, J. Chai, S. C. Zhu. CX-ToM: Counterfactual explanations with theory-of-mind for enhancing human trust in image recognition models. iScience, vol. 25, no. 1, Article number 103581, 2022. DOI: 10.1016/j.isci.2021.103581.
    [4]
    L. M. Wang, Y. J. Xiong, Z. Wang, Y. Qiao, D. H. Lin, X. O. Tang, L. van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 20–36, 2016. DOI: 10.1007/978-3-319-46484-8_2.
    [5]
    B. L. Zhou, A. Andonian, A. Oliva, A. Torralba. Temporal relational reasoning in videos. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 831–846, 2018. DOI: 10.1007/978-3-030-01246-5_49.
    [6]
    J. Lin, C. Gan, S. Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 7082–7092, 2019. DOI: 10.1109/ICCV.2019.00718.
    [7]
    Y. Liu, K. Z. Wang, L. B. Liu, H. Y. Lan, L. Lin. TCGL: Temporal contrastive graph for self-supervised video representation learning. IEEE Transactions on Image Processing, vol. 31, pp. 1978–1993, 2022. DOI: 10.1109/TIP.2022.3147032.
    [8]
    M. Bușta, L. Neumann, J. Matas. Deep TextSpotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 2223–2231, 2017. DOI: 10.1109/ICCV.2017.242.
    [9]
    X. X. Chen, L. W. Jin, Y. Z. Zhu, C. J. Luo, T. W. Wang. Text recognition in the wild: A survey. ACM Computing Surveys, vol. 54, no. 2, Article number 42, 2022.
    [10]
    R. Rastgoo, K. Kiani, S. Escalera. Sign language recognition: A deep survey. Expert Systems with Applications, vol. 164, Article number 113794, 2021.
    [11]
    R. H. Gao, T. H. Oh, K. Grauman, L. Torresani. Listen to look: Action recognition by previewing audio. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10454–10464, 2020. DOI: 10.1109/CVPR42600.2020.01047.
    [12]
    Y. Cheng, R. Z. Wang, Z. H. Pan, R. Feng, Y. J. Zhang. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, USA, pp. 3884–3892, 2020. DOI: 10.1145/3394171.3413869.
    [13]
    Y. B. Chen, Y. Q. Xian, A. S. Koepke, Y. Shan, Z. Akata. Distilling audio-visual knowledge by compositional contrastive learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7012–7021, 2021. DOI: 10.1109/CVPR46437.2021.00694.
    [14]
    H. Y. Lan, Y. Liu, L. Lin. Audio-visual contrastive learning for self-supervised action recognition. [Online], Available: https://arxiv.org/abs/2204.13386, 2022.
    [15]
    Y. Liu, Z. Y. Lu, J. Li, T. Yang. Hierarchically learned view-invariant representations for cross-view action recognition. IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2416–2430, 2019. DOI: 10.1109/TCSVT.2018.2868123.
    [16]
    Y. Liu, Z. Y. Lu, J. Li, C. Yao, Y. Z. Deng. Transferable feature representation for visible-to-infrared cross-dataset human action recognition. Complexity, vol. 2018, Article number 5345241, 2018.
    [17]
    Y. Liu, Z. Y. Lu, J. Li, T. Yang, C. Yao. Deep image-to-video adaptation and fusion networks for action recognition. IEEE Transactions on Image Processing, vol. 29, pp. 3168–3182, 2020. DOI: 10.1109/TIP.2019.2957930.
    [18]
    Y. Liu, K. Z. Wang, G. B. Li, L. Lin. Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition. IEEE Transactions on Image Processing, vol. 30, pp. 5573–5588, 2021. DOI: 10.1109/TIP.2021.3086590.
    [19]
    Y. Y. Zhu, Y. Zhang, L. B. Liu, Y. Liu, G. B. Li, M. Z. Mao, L. Lin. Hybrid-order representation learning for electricity theft detection. IEEE Transactions on Industrial Informatics, to be published. DOI: 10.1109/TII.2022.3179243.
    [20]
    G. B. Li, Y. Xie, L. Lin, Y. Z. Yu. Instance-level salient object segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 247–256, 2017. DOI: 10.1109/CVPR.2017.34.
    [21]
    X. D. Liang, K. Gong, X. H. Shen, L. Lin. Look into person: Joint body parsing & pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 871–885, 2019. DOI: 10.1109/TPAMI.2018.2820063.
    [22]
    S. B. Yang, G. B. Li, Y. Z. Yu. Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 8, pp. 2765–2779, 2021. DOI: 10.1109/TPAMI.2020.2973983.
    [23]
    X. Q. Zhang, R. H. Jiang, C. X. Fan, T. Y. Tong, T. Wang, P. C. Huang. Advances in deep learning methods for visual tracking: Literature review and fundamentals. International Journal of Automation and Computing, vol. 18, no. 3, pp. 311–333, 2021. DOI: 10.1007/s11633-020-1274-8.
    [24]
    Z. W. Wang, Q. She, A. Smolic. ACTION-Net: Multipath excitation for action recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 13209–13218, 2021. DOI: 10.1109/CVPR46437.2021.01301.
    [25]
    G. S. Pang, C. Yan, C. H. Shen, A. van den Hengel, X. Bai. Self-trained deep ordinal regression for end-to-end video anomaly detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 12170–12179, 2020. DOI: 10.1109/CVPR42600.2020.01219.
    [26]
    Y. Liu, Z. Y. Lu, J. Li, T. Yang, C. Yao. Global temporal representation based cnns for infrared action recognition. IEEE Signal Processing Letters, vol. 25, no. 6, pp. 848–852, 2018. DOI: 10.1109/LSP.2018.2823910.
    [27]
    L. F. Wu, Q. Wang, M. Jian, Y. Qiao, B. X. Zhao. A comprehensive review of group activity recognition in videos. International Journal of Automation and Computing, vol. 18, no. 3, pp. 334–350, 2021. DOI: 10.1007/s11633-020-1258-8.
    [28]
    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh. Making the v in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6325–6334, 2017. DOI: 10.1109/CVPR.2017.670.
    [29]
    Q. X. Cao, B. L. Li, X. D. Liang, K. Z. Wang, L. Lin. Knowledge-routed visual question reasoning: Challenges for deep representation embedding. IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 7, pp. 2758–2767, 2022. DOI: 10.1109/TNNLS.2020.3045034.
    [30]
    Q. X. Cao, W. T. Wan, K. Z. Wang, X. D. Liang, L. Lin. Linguistically routing capsule network for out-of-distribution visual question answering. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1594–1603, 2021. DOI: 10.1109/ICCV48922.2021.00164.
    [31]
    Y. Liu, J. Li, Z. Y. Lu, T. Yang, Z. J. Liu. Combining multiple features for cross-domain face sketch recognition. In Proceedings of the 11th Chinese Conference on Biometric Recognition, Springer, Chengdu, China, pp. 139–146, 2016. DOI: 10.1007/978-3-319-46654-5_16.
    [32]
    L. C. Wang, Z. M. Ding, Z. Q. Tao, Y. Y. Liu, Y. Fu. Generative multi-view human action recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 6211–6220, 2019. DOI: 10.1109/ICCV.2019.00631.
    [33]
    J. Y. Ni, R. Sarbajna, Y. Liu, A. H. H. Ngu, Y. Yan. Cross-modal knowledge distillation for vision-to-sensor action recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 4448–4452, 2022. DOI: 10.1109/ICASSP43922.2022.9746752.
    [34]
    R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [35]
    R. Shetty, B. Schiele, M. Fritz. Not using the car to see the sidewalk—Quantifying and controlling the effects of context in classification and segmentation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 8210–8218, 2019. DOI: 10.1109/CVPR.2019.00841.
    [36]
    D. Hendrycks, T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [37]
    A. Azulay, Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research, vol. 20, no. 184, pp. 1–25, 2019.
    [38]
    J. Peters, D. Janzing, B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms, Cambridge, USA: The MIT Press, 2017.
    [39]
    J. Pearl. Causality, 2nd ed., New York, USA: Cambridge University Press, 2009.
    [40]
    B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, vol. 109, no. 5, pp. 612–634, 2021. DOI: 10.1109/JPROC.2021.3058954.
    [41]
    L. Cheng, R. C. Guo, R. Moraffah, P. Sheth, K. S. Candan, H. Liu. Evaluation methods and measures for causal learning algorithms. IEEE Transactions on Artificial Intelligence, to be published. DOI: 10.1109/TAI.2022.3150264.
    [42]
    Q. S. Zhang, S. C. Zhu. Visual interpretability for deep learning: A survey. Frontiers of Information Technology &Electronic Engineering, vol. 19, no. 1, pp. 27–39, 2018. DOI: 10.1631/FITEE.1700808.
    [43]
    Q. S. Zhang, Y. N. Wu, S. C. Zhu. Interpretable convolutional neural networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 8827–8836, 2018. DOI: 10.1109/CVPR.2018.00920.
    [44]
    Q. S. Zhang, Y. Yang, H. T. Ma, Y. N. Wu. Interpreting CNNs via decision trees. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6254–6263, 2019. DOI: 10.1109/CVPR.2019.00642.
    [45]
    Q. S. Zhang, X. Wang, R. M. Cao, Y. N. Wu, F. Shi, S. C. Zhu. Extraction of an explanatory graph to interpret a CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3863–3877, 2021. DOI: 10.1109/TPAMI.2020.2992207.
    [46]
    Q. S. Zhang, J. Ren, G. Huang, R. M. Cao, Y. N. Wu, S. C. Zhu. Mining interpretable AOG representations from convolutional networks via active question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 3949–3963, 2021. DOI: 10.1109/TPAMI.2020.2993147.
    [47]
    Q. S. Zhang, X. Wang, Y. N. Wu, H. L. Zhou, S. C. Zhu. Interpretable CNNs for object classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3416–3431, 2021. DOI: 10.1109/TPAMI.2020.2982882.
    [48]
    K. Yu, X. J. Guo, L. Liu, J. Y. Li, H. Wang, Z. L. Ling, X. D. Wu. Causality-based feature selection: Methods and evaluations. ACM Computing Surveys, vol. 53, no. 5, Article number 111, 2021. DOI: 10.1145/3409382.
    [49]
    K. Yu, L. Liu, J. Y. Li. A unified view of causal and non-causal feature selection. ACM Transactions on Knowledge Discovery from Data, vol. 15, no. 4, Article number 63, 2021. DOI: 10.1145/3436891.
    [50]
    K. Yu, Y. J. Yang, W. Ding. Causal feature selection with missing data. ACM Transactions on Knowledge Discovery from Data, vol. 16, no. 4, Article number 66, 2022. DOI: 10.1145/3488055.
    [51]
    X. J. Guo, K. Yu, F. Y. Cao, P. P. Li, H. Wang. Error-aware Markov blanket learning for causal feature selection. Information Sciences, vol. 589, pp. 849–877, 2022. DOI: 10.1016/j.ins.2021.12.118.
    [52]
    X. Li, Z. Z. Zhang, G. Q. Wei, C. L. Lan, W. J. Zeng, X. Jin, Z. B. Chen. Confounder identification-free causal visual feature learning. [Online], Available: https://arxiv.org/abs/2111.13420, 2021.
    [53]
    K. Yu, M. Z. Cai, X. Y. Wu, L. Liu, J. Y. Li. Multilabel feature selection: A local causal structure learning approach. IEEE Transactions on Neural Networks and Learning Systems, to be published. DOI: 10.1109/TNNLS.2021.3111288.
    [54]
    M. Y. Yang, F. R. Liu, Z. T. Chen, X. W. Shen, J. Y. Hao, J. Wang. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, 2021, pp. 9588–9597. DOI: 10.1109/CVPR46437.2021.00947.
    [55]
    S. Yang, H. Wang, K. Yu, F. Y. Cao, X. D. Wu. Towards efficient local causal structure learning. IEEE Transactions on Big Data, to be published. DOI: 10.1109/TBDATA.2021.3062937.
    [56]
    L. Z. Li, Y. J. Lin, H. Zhao, J. K. Chen, S. Z. Li. Causality-based online streaming feature selection. Concurrency and Computation:Practice and Experience, vol. 33, no. 20, Article number e6347, 2021. DOI: 10.1002/cpe.6347.
    [57]
    Z. L. Ling, K. Yu, H. Wang, L. Li, X. D. Wu. Using feature selection for local causal structure learning. IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 5, no. 4, pp. 530–540, 2021. DOI: 10.1109/TETCI.2020.2978238.
    [58]
    K. Yu, L. Liu, J. Y. Li, W. Ding, T. D. Le. Multi-source causal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 9, pp. 2240–2256, 2020. DOI: 10.1109/TPAMI.2019.2908373.
    [59]
    X. Y. Wu, B. B. Jiang, K. Yu, H. Y. Chen, H. H. Chen. Accurate Markov boundary discovery for causal feature selection. IEEE Transactions on Cybernetics, vol. 50, no. 12, pp. 4983–4996, 2020. DOI: 10.1109/TCYB.2019.2940509.
    [60]
    K. Yu, L. Liu, J. Y. Li. Learning Markov blankets from multiple interventional data sets. IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 6, pp. 2005–2019, 2020. DOI: 10.1109/TNNLS.2019.2927636.
    [61]
    T. Wang, C. Zhou, Q. R. Sun, H. W. Zhang. Causal attention for unbiased visual recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 3071–3080, 2021. DOI: 10.1109/ICCV48922.2021.00308.
    [62]
    Z. Q. Yue, T. Wang, Q. R. Sun, X. S. Hua, H. W. Zhang. Counterfactual zero-shot and open-set visual recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 15399–15409, 2021. DOI: 10.1109/CVPR46437.2021.01515.
    [63]
    J. Q. Huang, Y. Qin, J. X. Qi, Q. R. Sun, H. W. Zhang. Deconfounded visual grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, pp. 998–1006, 2022.
    [64]
    C. Zhang, B. X. Jia, M. Edmonds, S. C. Zhu, Y. X. Zhu. ACRE: Abstract causal reasoning beyond covariation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 10638–10648, 2021. DOI: 10.1109/CVPR46437.2021.01050.
    [65]
    D. Zhang, H. W. Zhang, J. H. Tang, X. S. Hua, Q. R. Sun. Causal intervention for weakly-supervised semantic segmentation. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 56, 2020. DOI: 10.5555/3495724.3495780.
    [66]
    K. H. Tang, Y. L. Niu, J. Q. Huang, J. X. Shi, H. W. Zhang. Unbiased scene graph generation from biased training. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 3713–3722, 2020. DOI: 10.1109/CVPR42600.2020.00377.
    [67]
    K. H. Tang, J. Q. Huang, H. W. Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 128, 2020. DOI: 10.5555/3495724.3495852.
    [68]
    T. Wang, J. Q. Huang, H. W. Zhang, Q. R. Sun. Visual commonsense R-CNN. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10757–10767, 2020. DOI: 10.1109/CVPR42600.2020.01077.
    [69]
    L. Chen, H. W. Zhang, J. Xiao, X. N. He, S. L. Pu, S. F. Chang. Counterfactual critic multi-agent training for scene graph generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 4612–4622, 2019. DOI: 10.1109/ICCV.2019.00471.
    [70]
    J. X. Shi, H. W. Zhang, J. Z. Li. Explainable and explicit visual reasoning over scene graphs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 8368–8376, 2019. DOI: 10.1109/CVPR.2019.00857.
    [71]
    K. H. Tang, M. Y. Tao, H. W. Zhang. Adversarial visual robustness by causal intervention. [Online], Available: https://arxiv.org/abs/2106.09534, 2021.
    [72]
    X. T. Hu, K. H. Tang, C. Y. Miao, X. S. Hua, H. W. Zhang. Distilling causal effect of data in class-incremental learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3956–3965, 2021. DOI: 10.1109/CVPR46437.2021.00395.
    [73]
    Z. Q. Yue, Q. R. Sun, X. S. Hua, H. W. Zhang. Transporting causal mechanisms for unsupervised domain adaptation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 8579–8588, 2021. DOI: 10.1109/ICCV48922.2021.00848.
    [74]
    Z. Q. Yue, H. W. Zhang, Q. R. Sun, X. S. Hua. Interventional few-shot learning. In Proceedings of the 34th Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 2734–2746, 2020.
    [75]
    S. Yang, K. Yu, F. Y. Cao, L. Liu, H. Wang, J. Y. Li. Learning causal representations for robust domain adaptation. IEEE Transactions on Knowledge and Data Engineering, to be published. DOI: 10.1109/TKDE.2021.3119185.
    [76]
    R. Christiansen, N. Pfister, M. E. Jakobsen, N. Gnecco, J. Peters. A causal framework for distribution generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2021.3094760.
    [77]
    C. Z. Mao, A. Cha, A. Gupta, H. Wang, J. F. Yang, C. Vondrick. Generative interventions for causal learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 3946–3955, 2021. DOI: 10.1109/CVPR46437.2021.00394.
    [78]
    T. Kyono, M. van der Schaar. Exploiting causal structure for robust model selection in unsupervised domain adaptation. IEEE Transactions on Artificial Intelligence, vol. 2, no. 6, pp. 494–507, 2021. DOI: 10.1109/TAI.2021.3101185.
    [79]
    F. Wu, X. Y. Duan, J. Xiao, Z. Zhao, S. L. Tang, Y. Zhang, Y. T. Zhuang. Temporal interaction and causal influence in community-based question answering. IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2304–2317, 2017. DOI: 10.1109/TKDE.2017.2720737.
    [80]
    Y. L. Niu, H. W. Zhang. Introspective distillation for robust question answering. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 16292–16304, 2021.
    [81]
    Y. L. Niu, K. H. Tang, H. W. Zhang, Z. W. Lu, X. S. Hua, J. R. Wen. Counterfactual VQA: A cause-effect look at language bias. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 12695–12705, 2021. DOI: 10.1109/CVPR46437.2021.01251.
    [82]
    X. Yang, H. W. Zhang, G. J. Qi, J. F. Cai. Causal attention for vision-language tasks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 9842–9852, 2021. DOI: 10.1109/CVPR46437.2021.00972.
    [83]
    J. X. Qi, Y. L. Niu, J. Q. Huang, H. W. Zhang. Two causal principles for improving visual dialog. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10857–10866, 2020. DOI: 10.1109/CVPR42600.2020.01087.
    [84]
    L. Chen, X. Yan, J. Xiao, H. W. Zhang, S. L. Pu, Y. T. Zhuang. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 10797–10806, 2020. DOI: 10.1109/CVPR42600.2020.01081.
    [85]
    P. Wu, J. Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, vol. 30, pp. 3513–3527, 2021. DOI: 10.1109/TIP.2021.3062192.
    [86]
    W. J. Shi, G. Huang, S. J. Song, C. Wu. Temporal-spatial causal interpretations for vision-based reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2021.3133717.
    [87]
    X. H. Zhang, Y. K. Wong, X. F. Wu, J. W. Lu, M. Kankanhalli, X. D. Li, W. D. Geng. Learning causal representation for training cross-domain pose estimator via generative interventions. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 11250–11260, 2021. DOI: 10.1109/ICCV48922.2021.01108.
    [88]
    Z. W. Xu, X. D. Shen, Y. Wong, M. S. Kankanhalli. Unsupervised motion representation learning with capsule autoencoders. In Proceedings of the 35th Conference on Neural Information Processing Systems, pp. 3205–3217, 2021.
    [89]
    A. Fire, S. C. Zhu. Inferring hidden statuses and actions in video by causal reasoning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Honolulu, USA, pp. 48–56, 2017. DOI: 10.1109/CVPRW.2017.13.
    [90]
    V. N. Gangapure, S. Nanda, A. S. Chowdhury. Superpixel-based causal multisensor video fusion. IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1263–1272, 2018. DOI: 10.1109/TCSVT.2017.2662743.
    [91]
    C. M. Xiong, N. Shukla, W. L. Xiong, S. C. Zhu. Robot learning with a spatial, temporal, and causal and-or graph. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Stockholm, Sweden, pp. 2144–2151, 2016. DOI: 10.1109/ICRA.2016.7487364.
    [92]
    Y. Liu, K. Z. Wang, H. Y. Lan, L. Lin. Temporal contrastive graph learning for video action recognition and retrieval. [Online], Available: https://arxiv.org/abs/2101.00820, 2021.
    [93]
    X. Yang, H. W. Zhang, J. F. Cai. Deconfounded image captioning: A causal retrospect. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2021.3121705.
    [94]
    Z. Y. Shen, J. S. Liu, Y. He, X. X. Zhang, R. Z. Xu, H. Yu, P. Cui. Towards out-of-distribution generalization: A survey. [Online], Available: https://arxiv.org/abs/2108.13624, 2021.
    [95]
    J. W. Chen, H. D. Dong, X. Wang, F. L. Feng, M. N. Wang, X. He. Bias and debias in recommender system: A survey and future directions. [Online], Available: https://arxiv.org/abs/2010.03240, 2020.
    [96]
    J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, C. Blundell. Representation learning via invariant causal mechanisms. In Proceedings of the 9th International Conference on Learning Representations, 2021.
    [97]
    X. W. Shen, F. R. Liu, H. Z. Dong, Q. Lian, Z. T. Chen, T. Zhang. Disentangled generative causal representation learning. [Online], Available: https://arxiv.org/abs/2010.02637, 2020.
    [98]
    R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 580–587, 2014. DOI: 10.1109/CVPR.2014.81.
    [99]
    K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015. DOI: 10.1109/TPAMI.2015.2389824.
    [100]
    R. Girshick. Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1440–1448, 2015. DOI: 10.1109/ICCV.2015.169.
    [101]
    S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 91–99, 2015. DOI: 10.5555/2969239.2969250.
    [102]
    T. Y. Lin, P. Dollár, R. Girshick, K. M. He, B. Hariharan, S. Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 936–944, 2017. DOI: 10.1109/CVPR.2017.106.
    [103]
    J. F. Dai, Y. Li, K. M. He, J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 379–387, 2016. DOI: 10.5555/3157096.3157139.
    [104]
    K. M. He, G. Gkioxari, P. Dollár, R. Girshick. Mask R-CNN. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 2980–2988, 2017. DOI: 10.1109/ICCV.2017.322.
    [105]
    D. Erhan, C. Szegedy, A. Toshev, D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 2155–2162, 2014. DOI: 10.1109/CVPR.2014.276.
    [106]
    D. Yoo, S. Park, J. Y. Lee, A. S. Paek, I. S. Kweon. Attentionnet: Aggregating weak directions for accurate object detection. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2659–2667, 2015. DOI: 10.1109/ICCV.2015.305.
    [107]
    M. Najibi, M. Rastegari, L. S. Davis. G-CNN: AN iterative grid based object detector. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 2369–2377, 2016. DOI: 10.1109/CVPR.2016.260.
    [108]
    J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 779–788, 2016. DOI: 10.1109/CVPR.2016.91.
    [109]
    W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 21–37, 2016. DOI: 10.1007/978-3-319-46448-0_2.
    [110]
    J. Redmon, A. Farhadi. YOLO9000: Better, faster, stronger. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 6517–6525, 2017. DOI: 10.1109/CVPR.2017.690.
    [111]
    Z. Q. Shen, Z. Liu, J. G. Li, Y. G. Jiang, Y. R. Chen, X. Y. Xue. DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 1937–1945, 2017. DOI: 10.1109/ICCV.2017.212.
    [112]
    C. Y. Fu, W. Liu, A. Ranga, A. Tyagi, A. C. Berg. DSSD: Deconvolutional single shot detector. [Online], Available: https://arxiv.org/abs/1701.06659, 2017.
    [113]
    G. B. Li, Y. Xie, T. H. Wei, K. Z. Wang, L. Lin. Flow guided recurrent neural encoder for video salient object detection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 3243–3252, 2018. DOI: 10.1109/CVPR.2018.00342.
    [114]
    H. F. Li, G. Q. Chen, G. B. Li, Y. Z. Yu. Motion guided attention for video salient object detection. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 7273–7282, 2019. DOI: 10.1109/ICCV.2019.00737.
    [115]
    P. X. Yan, G. B. Li, Y. Xie, Z. Li, C. Wang, T. S. Chen, L. Lin. Semi-supervised video salient object detection using pseudo-labels. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 7283–7292, 2019. DOI: 10.1109/ICCV.2019.00738.
    [116]
    I. Armeni, Z. Y. He, A. Zamir, J. Gwak, J. Malik, M. Fischer, S. Savarese. 3D scene graph: A structure for unified semantics, 3D space, and camera. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 5663–5672, 2019. DOI: 10.1109/ICCV.2019.00576.
    [117]
    J. Johnson, R. Krishna, M. Stark, L. J. Li, D. A. Shamma, M. S. Bernstein, F. F. Li. Image retrieval using scene graphs. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3668–3678, 2015. DOI: 10.1109/CVPR.2015.7298990.
    [118]
    R. Z. Wang, Z. Y. Wei, P. J. Li, Q. Zhang, X. J. Huang. Storytelling from an image stream using scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 5, 2020, pp. 9185–9192. DOI: 10.1609/aaai.v34i05.6455.
    [119]
    H. Qi, Y. L. Xu, T. Yuan, T. F. Wu, S. C. Zhu. Scene-centric joint parsing of cross-view videos. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, Article number 893, 2018.
    [120]
    B. Dai, Y. Q. Zhang, D. H. Lin. Detecting visual relationships with deep relational networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 3298–3308, 2017. DOI: 10.1109/CVPR.2017.352.
    [121]
    H. W. Zhang, Z. Kyaw, S. F. Chang, T. S. Chua. Visual translation embedding network for visual relation detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 3107–3115, 2017. DOI: 10.1109/CVPR.2017.331.
    [122]
    Z. S. Hung, A. Mallya, S. Lazebnik. Union visual translation embedding for visual relationship detection and scene graph generation. [Online], Available: https://arxiv.org/abs/1905.11624v1, 2019.
    [123]
    Y. N. Chen, Y. J. Wang, Y. Zhang, Y. W. Guo. PANet: A context based predicate association network for scene graph generation. In Proceedings of IEEE International Conference on Multimedia and Expo, Shanghai, China, pp. 508–513, 2019. DOI: 10.1109/ICME.2019.00094.
    [124]
    K. H. Tang, H. W. Zhang, B. Y. Wu, W. H. Luo, W. Liu. Learning to compose dynamic tree structures for visual contexts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6612–6621, 2019. DOI: 10.1109/CVPR.2019.00678.
    [125]
    Y. K. Li, W. L. Ouyang, X. G. Wang, X. O. Tang. ViP-CNN: Visual phrase guided convolutional neural network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 7244–7253, 2017. DOI: 10.1109/CVPR.2017.766.
    [126]
    Y. Z. Liang, Y. L. Bai, W. Zhang, X. M. Qian, L. Zhu, T. Mei. VrR-VG: Refocusing visually-relevant relationships. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 10402–10411, 2019. DOI: 10.1109/ICCV.2019.01050.
    [127]
    Y. K. Li, W. L. Ouyang, B. L. Zhou, J. P. Shi, C. Zhang, X. G. Wang. Factorizable Net: An efficient subgraph-based framework for scene graph generation. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 346–363, 2018. DOI: 10.1007/978-3-030-01246-5_21.
    [128]
    M. S. Qi, W. J. Li, Z. Y. Yang, Y. H. Wang, J. B. Luo. Attentive relational networks for mapping images to scene graphs. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 3952–3961, 2019. DOI: 10.1109/CVPR.2019.00408.
    [129]
    C. W. Lu, R. Krishna, M. Bernstein, F. F. Li. Visual relationship detection with language priors. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 852–869, 2016. DOI: 10.1007/978-3-319-46448-0_51.
    [130]
    T. S. Chen, W. H. Yu, R. Q. Chen, L. Lin. Knowledge-embedded routing network for scene graph generation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6156–6164, 2019. DOI: 10.1109/CVPR.2019.00632.
    [131]
    J. X. Gu, H. D. Zhao, Z. Lin, S. Li, J. F. Cai, M. Y. Ling. Scene graph generation with external knowledge and image reconstruction. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1969–1978, 2019. DOI: 10.1109/CVPR.2019.00207.
    [132]
    R. Zellers, M. Yatskar, S. Thomson, Y. Choi. Neural motifs: Scene graph parsing with global context. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 5831–5840, 2018. DOI: 10.1109/CVPR.2018.00611.
    [133]
    B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 1946–1955, 2017. DOI: 10.1109/ICCV.2017.213.
    [134]
    S. B. Yang, G. B. Li, Y. Z. Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 4140–4149, 2019. DOI: 10.1109/CVPR.2019.00427.
    [135]
    X. R. Lin, G. B. Li, Y. Z. Yu. Scene-intuitive agent for remote embodied visual grounding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern, IEEE, Nashville, USA, pp. 7032–7041, 2021. DOI: 10.1109/CVPR46437.2021.00696.
    [136]
    H. L. Liu, A. R. Lin, X. G. Han, L. Yang, Y. Z. Yu, S. G. Cui. Refer-it-in-RGBD: A bottom-up approach for 3D visual grounding in RGBD images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 6028–6037, 2021. DOI: 10.1109/CVPR46437.2021.00597.
    [137]
    M. J. Sun, J. M. Xiao, E. G. Lim. Iterative shrinking for referring expression grounding using deep reinforcement learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 14055–14064, 2021. DOI: 10.1109/CVPR46437.2021.01384.
    [138]
    A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, N. Carion. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1760–1770, 2021. DOI: 10.1109/ICCV48922.2021.00180.
    [139]
    J. J. Deng, Z. Y. Yang, T. L. Chen, W. G. Zhou, H. Q. Li. TransVG: End-to-end visual grounding with transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 1749–1759, 2021. DOI: 10.1109/ICCV48922.2021.00179.
    [140]
    J. Wu, G. B. Li, S. Liu, L. Lin. Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 12386–12393, 2020. DOI: 10.1609/aaai.v34i07.6924.
    [141]
    L. Chen, W. B. Ma, J. Xiao, H. W. Zhang, S. F. Chang. REF-NMS: Breaking proposal bottlenecks in two-stage referring expression grounding. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, pp. 1036–1044, 2021.
    [142]
    J. Wu, G. B. Li, X. G. Han, L. Lin. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, USA, pp. 1283–1291, 2020. DOI: 10.1145/3394171.3413862.
    [143]
    R. A. Yeh, M. N. Do, A. G. Schwing. Unsupervised textual grounding: Linking words to image concepts. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6125–6134, 2018. DOI: 10.1109/CVPR.2018.00641.
    [144]
    C. L. Zitnick, P. Dollár. Edge boxes: Locating object proposals from edges. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 391–405, 2014. DOI: 10.1007/978-3-319-10602-1_26.
    [145]
    J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013. DOI: 10.1007/s11263-013-0620-5.
    [146]
    Y. F. Liu, B. Wan, L. Ma, X. M. He. Relation-aware instance refinement for weakly supervised visual grounding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 5608–5617, 2021. DOI: 10.1109/CVPR46437.2021.00556.
    [147]
    L. W. Wang, J. Huang, Y. Li, K. Xu, Z. Y. Yang, D. Yu. Improving weakly supervised visual grounding by contrastive knowledge distillation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 14085–14095, 2021. DOI: 10.1109/CVPR46437.2021.01387.
    [148]
    J. Wang, L. Specia. Phrase localization without paired training examples. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 4662–4671, 2019. DOI: 10.1109/ICCV.2019.00476.
    [149]
    S. B. Yang, G. B. Li, Y. Z. Yu. Dynamic graph attention for referring expression comprehension. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 4643–4652, 2019. DOI: 10.1109/ICCV.2019.00474.
    [150]
    S. B. Yang, G. B. Li, Y. Z. Yu. Graph-structured referring expression reasoning in the wild. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9949–9958, 2020. DOI: 10.1109/CVPR42600.2020.00997.
    [151]
    R. Zellers, Y. Bisk, A. Farhadi, Y. Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 6713–6724, 2019. DOI: 10.1109/CVPR.2019.00688.
    [152]
    A. M. Wu, L. C. Zhu, Y. H. Han, Y. Yang. Connective cognition network for directional visual commonsense reasoning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 509, 2019. DOI: 10.5555/3454287.3454796.
    [153]
    W. J. Yu, J. W. Zhou, W. H. Yu, X. D. Liang, N. Xiao. Heterogeneous graph learning for visual commonsense reasoning. In Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 2765–2775, 2019.
    [154]
    J. X. Lin, U. Jain, A. G. Schwing. TAB-VCR: Tags and attributes based visual commonsense reasoning baselines. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Vancouver, Canada, 2019.
    [155]
    X. Zhang, F. F. Zhang, C. S. Xu. Multi-level counterfactual contrast for visual commonsense reasoning. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 1793–1802, 2021. DOI: 10.1145/3474085.3475328.
    [156]
    J. S. Lu, D. Batra, D. Parikh, S. Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 2, 2019.
    [157]
    Y. C. Chen, L. J. Li, L. C. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. J. Liu. UNITER: Universal image-text representation learning. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 104–120, 2020. DOI: 10.1007/978-3-030-58577-8_7.
    [158]
    W. J. Su, X. Z. Zhu, Y. Cao, B. Li, L. W. Lu, F. R. Wei, J. F. Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [159]
    P. Sharma, N. Ding, S. Goodman, R. Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 2556–2565, 2018. DOI: 10.18653/v1/P18-1238.
    [160]
    J. Carreira, A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 4724–4733, 2017. DOI: 10.1109/CVPR.2017.502.
    [161]
    C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, C. Schmid. Actor-centric relation network. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 335–351, 2018. DOI: 10.1007/978-3-030-01252-6_20.
    [162]
    C. Y. Wu, C. Feichtenhofer, H. Q. Fan, K. M. He, P. Krähenbühl, R. Girshick. Long-term feature banks for detailed video understanding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 284–293, 2019. DOI: 10.1109/CVPR.2019.00037.
    [163]
    C. Y. Yang, Y. H. Xu, J. P. Shi, B. Dai, B. L. Zhou. Temporal pyramid network for action recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 588–597, 2020. DOI: 10.1109/CVPR42600.2020.00067.
    [164]
    C. Feichtenhofer. X3D: Expanding architectures for efficient video recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 200–210, 2020.
    [165]
    C. Feichtenhofer, H. Q. Fan, J. Malik, K. M. He. Slowfast networks for video recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 6201–6210, 2019. DOI: 10.1109/ICCV.2019.00630.
    [166]
    W. T. Bao, Q. Yu, Y. Kong. Evidential deep learning for open set action recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 13329–13338, 2021. DOI: 10.1109/ICCV48922.2021.01310.
    [167]
    A. Aich, M. Zheng, S. Karanam, T. Chen, A. K. Roy-Chowdhury, Z. Y. Wu. Spatio-temporal representation factorization for video-based person re-identification. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 152–162, 2021. DOI: 10.1109/ICCV48922.2021.00022.
    [168]
    J. Tan, J. Q. Tang, L. M. Wang, G. S. Wu. Relaxed transformer decoders for direct action proposal generation. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 13506–13515, 2021. DOI: 10.1109/ICCV48922.2021.01327.
    [169]
    G. Bertasius, H. Wang, L. Torresani. Is space-time attention all you need for video understanding? In Proceedings of the 38th International Conference on Machine Learning, pp. 813–824, 2021.
    [170]
    X. Wang, S. W. Zhang, Z. W. Qing, Y. J. Shao, Z. R. Zuo, C. X. Gao, N. Sang. OadTR: Online action detection with transformers. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 7545–7555, 2021. DOI: 10.1109/ICCV48922.2021.00747.
    [171]
    C. H. Zhang, A. Gupta, A. Zisserman. Temporal query networks for fine-grained video understanding. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, 2021, pp. 4484–4494. DOI: 10.1109/CVPR46437.2021.00446.
    [172]
    S. J. Yan, Y. J. Xiong, D. H. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, Article number. 912, 2018. DOI: 10.5555/3504035.3504947.
    [173]
    C. Y. Si, W. T. Chen, W. Wang, L. Wang, T. N. Tan. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1227–1236, 2019. DOI: 10.1109/CVPR.2019.00132.
    [174]
    L. Shi, Y. F. Zhang, J. Cheng, H. Q. Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 12018–12027, 2019. DOI: 10.1109/CVPR.2019.01230.
    [175]
    K. Lin, L. J. Wang, Z. C. Liu. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 1954–1963, 2021. DOI: 10.1109/CVPR46437.2021.00199.
    [176]
    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. H. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the 9th International Conference on Learning Representations, 2021.
    [177]
    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Luçiç, C. Schmid. ViViT: A video vision transformer. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, USA, pp. 6816–6826, 2021. DOI: 10.1109/ICCV48922.2021.00676.
    [178]
    P. Anderson, X. D. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6077–6086, 2018. DOI: 10.1109/CVPR.2018.00636.
    [179]
    S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. DOI: 10.1162/neco.1997.9.8.1735.
    [180]
    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000–6010, 2017. DOI: 10.5555/3295222.3295349.
    [181]
    J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, pp. 4171–4186, 2019. DOI: 10.18653/v1/N19-1423.
    [182]
    S. Antol, A. Agrawal, J. S. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh. VQA: Visual question answering. In Proceedings of IEEE International Conference on Computer Vision, Santiago, Chile, pp. 2425–2433, 2015. DOI: 10.1109/ICCV.2015.279.
    [183]
    Z. C. Yang, X. D. He, J. F. Gao, L. Deng, A. Smola. Stacked attention networks for image question answering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 21–29, 2016. DOI: 10.1109/CVPR.2016.10.
    [184]
    D. J. Xu, Z. Zhao, J. Xiao, F. Wu, H. W. Zhang, X. N. He, Y. T. Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, USA, pp. 1645–1653, 2017. DOI: 10.1145/3123266.3123427.
    [185]
    T. M. Le, V. Le, S. Venkatesh, T. Tran. Hierarchical conditional relation networks for video question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9969–9978, 2020. DOI: 10.1109/CVPR42600.2020.00999.
    [186]
    P. Jiang, Y. H. Han. Reasoning with heterogeneous graph alignment for video question answering. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 11109–11116, 2020. DOI: 10.1609/aaai.v34i07.6767.
    [187]
    D. Huang, P. H. Chen, R. H. Zeng, Q. Du, M. K. Tan, C. Gan. Location-aware graph convolutional networks for video question answering. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 11021–11028, 2020. DOI: 10.1609/aaai.v34i07.6737.
    [188]
    J. Lei, L. J. Li, L. W. Zhou, Z. Gan, T. L. Berg, M. Bansal, J. J. Liu. Less is more: CLIPBERT for video-and-language learning via sparse sampling. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 7327–7337, 2021. DOI: 10.1109/CVPR46437.2021.00725.
    [189]
    F. Liu, J. Liu, W. N. Wang, H. Q. Lu. HAIR: Hierarchical visual-semantic relational reasoning for video question answering. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, USA, pp. 1678–1787, 2021. DOI: 10.1109/ICCV48922.2021.00172.
    [190]
    A. Agrawal, D. Batra, D. Parikh, A. Kembhavi. Don′t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 4971–4980, 2018. DOI: 10.1109/CVPR.2018.00522.
    [191]
    V. Agarwal, R. Shetty, M. Fritz. Towards causal VQA: Revealing and reducing spurious correlations by invariant and covariant semantic editing. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9687–9695, 2020. DOI: 10.1109/CVPR42600.2020.00971.
    [192]
    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, MIT Press, Montreal, Canada, pp. 2672–2680, 2014. DOI: 10.5555/2969033.2969125.
    [193]
    S. Y. Zhang, T. Jiang, T. Wang, K. Kuang, Z. Zhao, J. K. Zhu, J. Yu, H. X. Yang, F. Wu. DeVLBert: Learning deconfounded visio-linguistic representations. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, USA, pp. 4373–4382, 2020. DOI: 10.1145/3394171.3413518.
[194] Y. C. Li, X. Wang, J. B. Xiao, W. Ji, T. S. Chua. Invariant grounding for video question answering. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2928–2937, 2022.
[195] R. Y. Liu, H. Liu, G. Li, H. D. Hou, T. H. Yu, T. Yang. Contextual debiasing for visual recognition with causal mechanisms. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12755–12765, 2022.
[196] Y. J. Liu, R. Cadei, J. Schweizer, S. Bahmani, A. Alahi. Towards robust and adaptive motion forecasting: A causal representation perspective. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17081–17092, 2022.
[197] F. R. Lv, J. Liang, S. Li, B. Zang, C. H. Liu, Z. T. Wang, D. Liu. Causality inspired representation learning for domain generalization. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8046–8056, 2022.
[198] X. R. Lin, Y. Y. Chen, G. B. Li, Y. Z. Yu. A causal inference look at unsupervised video anomaly detection. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, pp. 1620–1629, 2022. DOI: 10.1609/aaai.v36i2.20053.
[199] X. R. Lin, Z. Y. Wu, G. Q. Chen, G. B. Li, Y. Z. Yu. A causal debiasing framework for unsupervised salient object detection. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, pp. 1610–1619, 2022. DOI: 10.1609/aaai.v36i2.20052.
[200] Y. Liu, G. B. Li, L. Lin. Cross-modal causal relational reasoning for event-level visual question answering. [Online], Available: https://arxiv.org/abs/2207.12647, 2022.
[201] M. Ren, Y. L. Wang, Z. F. He. Towards interpretable defense against adversarial attacks via causal inference. Machine Intelligence Research, vol. 19, no. 3, pp. 209–226, 2022. DOI: 10.1007/s11633-022-1330-7.
[202] R. J. Bowden, D. A. Turkington. Instrumental Variables, Cambridge, UK: Cambridge University Press, 1984.
[203] J. Y. Zhu, T. Park, P. Isola, A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 2242–2251, 2017. DOI: 10.1109/ICCV.2017.244.
[204] D. P. Kingma, M. Welling. Auto-encoding variational Bayes. [Online], Available: https://arxiv.org/abs/1312.6114, 2013.
[205] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, D. Parikh. Yin and Yang: Balancing and answering binary visual questions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 5014–5022, 2016. DOI: 10.1109/CVPR.2016.542.
[206] H. Tan, M. Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, pp. 5100–5111, 2019. DOI: 10.18653/v1/D19-1514.
[207] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763, 2021.
[208] L. J. Li, J. Lei, Z. Gan, J. J. Liu. Adversarial VQA: A new benchmark for evaluating the robustness of VQA models. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, Canada, pp. 2022–2031, 2021. DOI: 10.1109/ICCV48922.2021.00205.
[209] X. L. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. [Online], Available: https://arxiv.org/abs/1504.00325, 2015.
[210] K. X. Yi, C. Gan, Y. Z. Li, P. Kohli, J. J. Wu, A. Torralba, J. B. Tenenbaum. CLEVRER: Collision events for video representation and reasoning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
[211] V. Gupta, B. N. Patro, H. Parihar, V. P. Namboodiri. VQuAD: Video question answering diagnostic dataset. In Proceedings of IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, IEEE, Waikoloa, USA, pp. 282–291, 2022. DOI: 10.1109/WACVW54805.2022.00034.
[212] Z. F. Chen, K. X. Yi, Y. Z. Li, M. Y. Ding, A. Torralba, J. B. Tenenbaum, C. Gan. ComPhy: Compositional physical reasoning of objects and events from videos. In Proceedings of the 10th International Conference on Learning Representations, 2022.
[213] M. Grunde-McLaughlin, R. Krishna, M. Agrawala. AGQA: A benchmark for compositional spatio-temporal reasoning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 11282–11292, 2021. DOI: 10.1109/CVPR46437.2021.01113.
[214] L. Xu, H. Huang, J. Liu. SUTD-TrafficQA: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 9873–9883, 2021. DOI: 10.1109/CVPR46437.2021.00975.
[215] J. B. Xiao, X. D. Shang, A. Yao, T. S. Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Nashville, USA, pp. 9772–9781, 2021. DOI: 10.1109/CVPR46437.2021.00965.
[216] D. W. Zhang, W. Y. Zeng, J. R. Yao, J. W. Han. Weakly supervised object detection using proposal- and semantic-level relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3349–3363, 2022. DOI: 10.1109/TPAMI.2020.3046647.
[217] D. W. Zhang, J. W. Han, G. Cheng, M. H. Yang. Weakly supervised object localization and detection: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5866–5885, 2022. DOI: 10.1109/TPAMI.2021.3074313.
[218] W. Wang, J. Y. Gao, C. S. Xu. Weakly-supervised video object grounding via causal intervention. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published. DOI: 10.1109/TPAMI.2022.3180025.
[219] E. Tjoa, C. T. Guan. A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 11, pp. 4793–4813, 2021. DOI: 10.1109/TNNLS.2020.3027314.
[220] Á. Parafita, J. Vitrià. Explaining visual models by causal attribution. In Proceedings of IEEE/CVF International Conference on Computer Vision Workshop, IEEE, Seoul, Korea, pp. 4167–4175, 2019. DOI: 10.1109/ICCVW.2019.00512.
[221] T. Narendra, A. Sankaran, D. Vijaykeerthy, S. Mani. Explaining deep learning models using causal inference. [Online], Available: https://arxiv.org/abs/1811.04376, 2018.
[222] M. Harradon, J. Druce, B. Ruttenberg. Causal learning and explanation of deep neural networks via autoencoded activations. [Online], Available: https://arxiv.org/abs/1802.00541, 2018.
[223] A. Chattopadhyay, P. Manupriya, A. Sarkar, V. N. Balasubramanian. Neural network attributions: A causal perspective. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, USA, pp. 981–990, 2019.
[224] R. Moraffah, M. Karami, R. C. Guo, A. Raglin, H. Liu. Causal interpretability for machine learning - problems, methods and evaluation. ACM SIGKDD Explorations Newsletter, vol. 22, no. 1, pp. 18–33, 2020. DOI: 10.1145/3400051.3400058.
[225] M. O'Shaughnessy, G. Canal, M. Connor, M. Davenport, C. Rozell. Generative causal explanations of black-box classifiers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, pp. 5453–5467, 2020. DOI: 10.5555/3495724.3496182.
[226] W. Y. Lin, H. Lan, B. C. Li. Generative causal explanations for graph neural networks. In Proceedings of the 38th International Conference on Machine Learning, pp. 6666–6679, 2021.
[227] J. von Kügelgen, L. Gresele, B. Schölkopf. Simpson's paradox in COVID-19 case fatality rates: A mediation analysis of age-related causal effects. IEEE Transactions on Artificial Intelligence, vol. 2, no. 1, pp. 18–27, 2021. DOI: 10.1109/TAI.2021.3073088.
[228] Y. Zheng, C. Gao, X. Li, X. N. He, Y. Li, D. P. Jin. Disentangling user interest and conformity for recommendation with causal embedding. In Proceedings of the Web Conference, ACM, Ljubljana, Slovenia, pp. 2980–2991, 2021. DOI: 10.1145/3442381.3449788.
[229] D. G. Liu, P. X. Cheng, H. Zhu, Z. H. Dong, X. Q. He, W. K. Pan, Z. Ming. Mitigating confounding bias in recommendation via information bottleneck. In Proceedings of the 15th ACM Conference on Recommender Systems, Amsterdam, The Netherlands, pp. 351–360, 2021. DOI: 10.1145/3460231.3474263.
[230] T. X. Wei, F. L. Feng, J. W. Chen, Z. W. Wu, J. F. Yi, X. N. He. Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, pp. 1791–1800, 2021. DOI: 10.1145/3447548.3467289.
[231] W. J. Wang, F. L. Feng, X. N. He, H. W. Zhang, T. S. Chua. Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1288–1297, 2021. DOI: 10.1145/3404835.3462962.
[232] Y. Zhang, F. L. Feng, X. N. He, T. X. Wei, C. G. Song, G. H. Ling, Y. D. Zhang. Causal intervention for leveraging popularity bias in recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 11–20, 2021.
[233] K. C. Stocking, A. Gopnik, C. Tomlin. From robot learning to robot understanding: Leveraging causal graphical models for robotics. In Proceedings of Conference on Robot Learning, pp. 1776–1781, 2022.
[234] T. E. Lee, J. A. Zhao, A. S. Sawhney, S. Girdhar, O. Kroemer. Causal reasoning in simulation for structure and transfer learning of robot manipulation policies. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, pp. 4776–4782, 2021.
[235] S. C. Smith, S. Ramamoorthy. Counterfactual explanation and causal inference in service of robustness in robot control. In Proceedings of the 10th IEEE Joint International Conference on Development and Learning and Epigenetic Robotics, IEEE, 2020.
[236] F. Hou, Y. Pei, J. Sun. Mobile Crowd Sensing: Incentive Mechanism Design, Springer, 2019.
[237] Y. Zheng, L. Capra, O. Wolfson, H. Yang. Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology, vol. 5, no. 3, pp. 1–55, 2014.