Bidirectional LSTM with saliency-aware 3D-CNN features for human action recognition

  • Sheeraz Arif, Beijing Institute of Technology
  • Jing Wang, Beijing Institute of Technology

Abstract

Deep convolutional neural networks (DCNNs) and recurrent neural networks (RNNs) have received increasing attention in multimedia understanding and have achieved remarkable action recognition performance. However, videos contain rich motion information at varying scales, and existing recurrent pipelines fail to capture long-term motion dynamics in videos with diverse motion scales and complex actions performed by multiple actors. Considering contextual and salient features is more important than mapping a video frame into a static video representation. This work presents a novel pipeline that analyzes and processes video information using a 3D convolutional (C3D) network and a newly introduced deep bidirectional LSTM. Like the popular two-stream ConvNet, we adopt a two-stream framework, with one modification: the optical-flow stream is replaced by a saliency-aware stream to avoid its computational cost. First, we generate the saliency-aware video stream by applying a saliency-aware detection method. Second, a two-stream 3D convolutional network (C3D) is applied to the two streams, i.e., the RGB and saliency-aware video streams, to extract both spatial and semantic information. Next, a deep bidirectional LSTM network learns the sequential deep temporal dynamics. Finally, a time-series pooling layer and a softmax layer classify the human activity. The proposed system can learn long-term temporal dependencies and is able to recognize complex human actions. Experimental results demonstrate significant improvements in action recognition accuracy on several benchmark datasets.
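
Below is a minimal sketch of the described pipeline, written in PyTorch for illustration only. It is not the authors' implementation: the C3DBackbone stand-in, feature dimensions, hidden sizes, and mean pooling are assumptions. It shows how per-clip features from an RGB stream and a saliency-aware stream can be fused, passed through a deep bidirectional LSTM, temporally pooled, and classified.

```python
import torch
import torch.nn as nn


class C3DBackbone(nn.Module):
    """Small stand-in for a pretrained C3D network (illustrative only)."""
    def __init__(self, out_dim=4096):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, out_dim)

    def forward(self, clip):                               # clip: (B, 3, frames, H, W)
        return self.fc(self.features(clip).flatten(1))     # per-clip feature (B, out_dim)


class SaliencyAwareBiLSTM(nn.Module):
    def __init__(self, c3d_dim=4096, hidden=512, num_classes=101):
        super().__init__()
        self.c3d_rgb = C3DBackbone(out_dim=c3d_dim)        # RGB stream
        self.c3d_sal = C3DBackbone(out_dim=c3d_dim)        # saliency-aware stream
        # Deep (2-layer) bidirectional LSTM over the sequence of clip features.
        self.bilstm = nn.LSTM(input_size=2 * c3d_dim, hidden_size=hidden,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, rgb_clips, sal_clips):
        # rgb_clips, sal_clips: (B, T, 3, 16, 112, 112) -- T clips per video
        T = rgb_clips.shape[1]
        feats = []
        for t in range(T):
            f_rgb = self.c3d_rgb(rgb_clips[:, t])          # (B, c3d_dim)
            f_sal = self.c3d_sal(sal_clips[:, t])          # (B, c3d_dim)
            feats.append(torch.cat([f_rgb, f_sal], dim=1)) # fuse the two streams
        seq = torch.stack(feats, dim=1)                    # (B, T, 2 * c3d_dim)
        out, _ = self.bilstm(seq)                          # (B, T, 2 * hidden)
        pooled = out.mean(dim=1)                           # time-series pooling over clips
        return self.classifier(pooled)                     # class scores; softmax in the loss
```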

Author Biographies

Sheeraz Arif, Beijing Institute of Technology
I am pursuing my PhD in Communication Engineering at Beijing Institute of Technology.
Jing Wang, Beijing Institute of Technology
Associate Professor in the School of Information and Electronics Engineering, Beijing Institute of Technology.

Published
2021-09-01