Spatio-Temporal RGBD Cuboids Feature for Human Activity Recognition

H. A. Sial, M. H. Yousaf, F. Hussain


Human activity recognition is a promising research area in computer vision. Color cameras are frequently used in the literature for human activity recognition systems. These cameras map 4D real-world activities to a 3D digital space, discarding important depth information, and the loss of depth degrades recognition performance. This work therefore presents a robust approach to human activity recognition that uses both the aligned RGB and depth channels to form a combined RGBD representation. Furthermore, to handle occlusion and background challenges in the RGB domain, a Spatio-Temporal Interest Point (STIP) based scheme is applied to both the RGB and depth channels. The proposed scheme extracts interest points only from the depth video (D-STIP), and the same interest points are then used to extract cuboid descriptors from the RGB (RGB-DESC) and depth (D-DESC) channels. Finally, a concatenated feature vector comprising features from both channels is passed to a bag-of-visual-words scheme for human activity recognition. The proposed combined RGBD feature approach has been tested on the challenging MSR activity dataset and shows improved capability over a single-channel approach.
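The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the interest points are taken as given (the D-STIP detector is not specified here), the cuboid size `half` is arbitrary, `describe` is a placeholder descriptor (flattened, L2-normalised intensities), and the codebook is passed in rather than learned. All function names are hypothetical.

```python
import numpy as np

def extract_cuboid(video, point, half=(2, 4, 4)):
    """Cut a spatio-temporal block around point = (t, y, x).

    video is (T, H, W) for depth or (T, H, W, 3) for RGB; the caller is
    assumed to supply points far enough from the video borders.
    """
    t, y, x = point
    ht, hy, hx = half
    return video[t - ht:t + ht + 1, y - hy:y + hy + 1, x - hx:x + hx + 1]

def describe(cuboid):
    """Placeholder descriptor (assumption): flattened, L2-normalised values."""
    v = cuboid.astype(np.float64).ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def rgbd_descriptors(rgb, depth, points):
    """Concatenate RGB-DESC and D-DESC at each shared (depth-derived) point."""
    return np.stack([
        np.concatenate([describe(extract_cuboid(rgb, p)),
                        describe(extract_cuboid(depth, p))])
        for p in points
    ])

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword; return a normalised histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()
```

In practice the codebook would typically be learned by clustering descriptors from training videos, and the resulting histograms would be fed to a classifier; both steps are left out of this sketch.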






