Citation: | Zhi-Wei Xu, Xiao-Jun Wu, Josef Kittler. STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition. International Journal of Automation and Computing, vol. 18, no. 5, pp.718-730, 2021. https://doi.org/10.1007/s11633-021-1289-9 |
[1] |
C. M. Bishop. Pattern Recognition and Machine Learning, New York, USA: Springer, 2006.
|
[2] |
D. Michie, D. J. Spiegelhalter, C. C. Taylor. Machine Learning, Neural and Statistical Classification, Englewood Cliffs, USA: Prentice Hall, 1994.
|
[3] |
Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, vol. 521, no. 7553, pp. 436–444, 2015. DOI: 10.1038/nature14539.
|
[4] |
A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, ACM, Lake Tahoe, USA, pp. 1097−1105, 2012.
|
[5] |
K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 770−778, 2016.
|
[6] |
C. Szegedy, W. Liu, Y. Q. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 1−9, 2015.
|
[7] |
J. W. Han, D. W. Zhang, G. Cheng, N. A. Liu, D. Xu. Advanced deep-learning techniques for salient and category-specific object detection: A survey. IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018. DOI: 10.1109/MSP.2017.2749125.
|
[8] |
J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 779−788, 2016.
|
[9] |
H. Noh, S. Hong, B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1520−1528, 2015.
|
[10] |
E. Shelhamer, J. Long, T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017. DOI: 10.1109/TPAMI.2016.2572683.
|
[11] |
J. Carreira, A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 6299−6308, 2017.
|
[12] |
X. F. Ji, Q. Q. Wu, Z. J. Ju, Y. Y. Wang. Study of human action recognition based on improved spatio-temporal features. International Journal of Automation and Computing, vol. 11, no. 5, pp. 500–509, 2014. DOI: 10.1007/s11633-014-0831-4.
|
[13] |
L. M. Wang, Y. Qiao, X. O. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 4305−4314, 2015.
|
[14] |
X. L. Wang, A. Farhadi, A. Gupta. Actions ~ Transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 2658−2667, 2016.
|
[15] |
K. Simonyan, A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems, ACM, Montreal, Canada, pp. 568−576, 2014.
|
[16] |
L. M. Wang, Y. J. Xiong, Z. Wang, Y. Qiao, D. H. Lin, X. O. Tang, L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 20−36, 2016.
|
[17] |
D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 4489−4497, 2015.
|
[18] |
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, F. F. Li. Large-scale video classification with convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, USA, pp. 1725−1732, 2014.
|
[19] |
B. W. Zhang, L. M. Wang, Z. Wang, Y. Qiao, H. L. Wang. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 2718−2726, 2016.
|
[20] |
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6450−6459, 2018.
|
[21] |
B. Y. Jiang, M. M. Wang, W. H. Gan, W. Wu, J. J. Yan. STM: SpatioTemporal and motion encoding for action recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 2000−2009, 2019.
|
[22] |
Z. G. Tu, H. Y. Li, D. J. Zhang, J. Dauwels, B. X. Li, J. S. Yuan. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing, vol. 28, no. 6, pp. 2799–2812, 2019. DOI: 10.1109/TIP.2018.2890749.
|
[23] |
K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. [Online], Available: https://arxiv.org/abs/1409.1556, 2014.
|
[24] |
G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 4700−4708, 2017.
|
[25] |
I. Laptev. On space-time interest points. International Journal of Computer Vision, vol. 64, no. 2–3, pp. 107–123, 2005. DOI: 10.1007/s11263-005-1838-7.
|
[26] |
H. Wang, C. Schmid. Action recognition with improved trajectories. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Sydney, Australia, pp. 3551−3558, 2013.
|
[27] |
L. M. Wang, Y. Qiao, X. O. Tang. MoFAP: A multi-level representation for action recognition. International Journal of Computer Vision, vol. 119, no. 3, pp. 254–271, 2016. DOI: 10.1007/s11263-015-0859-0.
|
[28] |
X. L. Song, C. L. Lan, W. J. Zeng, J. L. Xing, X. Y. Sun, J. Y. Yang. Temporal-spatial mapping for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 748–759, 2020. DOI: 10.1109/TCSVT.2019.2896029.
|
[29] |
S. W. Ji, W. Xu, M. Yang, K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013. DOI: 10.1109/TPAMI.2012.59.
|
[30] |
J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 4694−4702, 2015. DOI: 10.1109/CVPR.2015.7299101.
|
[31] |
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, T. Darrell, K. Saenko. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 2625−2634, 2015.
|
[32] |
S. J. Yan, Y. J. Xiong, D. H. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, USA, pp. 7444−7452, 2018.
|
[33] |
C. Wu, X. J. Wu, J. Kittler. Spatial residual layer and dense connection block enhanced spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision Workshop, IEEE, Seoul, Korea, pp. 1740−1748, 2019.
|
[34] |
H. S. Wang, L. Wang. Beyond joints: Learning representations from primitive geometries for skeleton-based action recognition and detection. IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4382–4394, 2018. DOI: 10.1109/TIP.2018.2837386.
|
[35] |
B. K. P. Horn, B. G. Schunck. Determining optical flow. Artificial Intelligence, vol. 17, no. 1−3, pp. 185–203, 1981. DOI: 10.1016/0004-3702(81)90024-2.
|
[36] |
H. Sak, A. W. Senior, F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association, Singapore, pp. 338−342, 2014.
|
[37] |
C. H. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Q. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, J. Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 6047−6056, 2018.
|
[38] |
R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, R. Memisevic. The “something something” video database for learning and evaluating visual common sense. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 5843−5851, 2017.
|
[39] |
L. M. Wang, W. Li, W. Li, L. Van Gool. Appearance-and-relation networks for video classification. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 1430−1439, 2018.
|
[40] |
Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. DOI: 10.1109/5.726791.
|
[41] |
M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, A. Baskurt. Sequential deep learning for human action recognition. In Proceedings of the 2nd International Workshop on Human Behavior Understanding, Springer, Amsterdam, The Netherlands, pp. 29−39, 2011.
|
[42] |
L. Sun, K. Jia, D. Y. Yeung, B. E. Shi. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 4597−4605, 2015.
|
[43] |
Z. F. Qiu, T. Yao, T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 5533−5541, 2017.
|
[44] |
R. Memisevic. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1829–1846, 2013. DOI: 10.1109/TPAMI.2013.53.
|
[45] |
B. L. Zhou, A. Andonian, A. Oliva, A. Torralba. Temporal relational reasoning in videos. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 803−818, 2018.
|
[46] |
R. H. Zeng, W. B. Huang, C. Gan, M. K. Tan, Y. Rong, P. L. Zhao, J. Z. Huang. Graph convolutional networks for temporal action localization. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 7093−7102, 2019.
|
[47] |
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, ACM, Long Beach, USA, pp. 5998−6008, 2017.
|
[48] |
H. Z. Chen, G. H. Tian, G. L. Liu. A selective attention guided initiative semantic cognition algorithm for service robot. International Journal of Automation and Computing, vol. 15, no. 5, pp. 559–569, 2018. DOI: 10.1007/s11633-018-1139-6.
|
[49] |
T. V. Nguyen, Z. Song, S. C. Yan. STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 1, pp. 77–86, 2015. DOI: 10.1109/TCSVT.2014.2333151.
|
[50] |
X. Long, C. Gan, G. De Melo, J. J. Wu, X. Liu, S. L. Wen. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7834−7843, 2018.
|
[51] |
X. Zhang, Q. Yang. Transfer hierarchical attention network for generative dialog system. International Journal of Automation and Computing, vol. 16, no. 6, pp. 720–736, 2019. DOI: 10.1007/s11633-019-1200-0.
|
[52] |
X. L. Wang, R. Girshick, A. Gupta, K. M. He. Non-local neural networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7794−7803, 2018.
|
[53] |
C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI, San Francisco, USA, pp. 4278−4284, 2017.
|
[54] |
Y. Z. Zhou, X. Y. Sun, C. Luo, Z. J. Zha, W. J. Zeng. Spatiotemporal fusion in 3D CNNs: A probabilistic view. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9829−9838, 2020.
|
[55] |
H. S. Su, J. Su, D. L. Wang, W. H. Gan, W. Wu, M. M. Wang, J. J. Yan, Y. Qiao. Collaborative distillation in the parameter and spectrum domains for video action recognition. [Online], Available: https://arxiv.org/abs/2009.06902, 2020.
|
[56] |
C. Feichtenhofer, H. Q. Fan, J. Malik, K. M. He. SlowFast networks for video recognition. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 6201−6210, 2019.
|
[57] |
M. Zolfaghari, K. Singh, T. Brox. ECO: Efficient convolutional network for online video understanding. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 695−712, 2018.
|
[58] |
K. Soomro, A. R. Zamir, M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. [Online], Available: https://arxiv.org/abs/1212.0402, 2012.
|
[59] |
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre. HMDB: A large video database for human motion recognition. In Proceedings of International Conference on Computer Vision, IEEE, Barcelona, Spain, pp. 2556−2563, 2011.
|
[60] |
X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, JMLR, Sardinia, Italy, pp. 249−256, 2010.
|
[61] |
A. Diba, M. Fayyaz, V. Sharma, M. M. Arzani, R. Yousefzadeh, J. Gall, L. Van Gool. Spatio-temporal channel correlation networks for action classification. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 284−299, 2018.
|
[62] |
S. N. Xie, C. Sun, J. Huang, Z. W. Tu, K. Murphy. Rethinking spatiotemporal feature learning for video understanding. [Online], Available: https://arxiv.org/abs/1712.04851, 2017.
|
[63] |
J. Lin, C. Gan, S. Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of IEEE/CVF International Conference on Computer Vision, IEEE, Seoul, Korea, pp. 7082−7092, 2019.
|
[64] |
Y. S. Tang, J. W. Lu, J. Zhou. Comprehensive instructional video analysis: The COIN dataset and performance evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. DOI: 10.1109/TPAMI.2020.2980824.
|