Activity Recognition - A Camera-Based Framework | Research Plan

Introduction

Activity recognition methods have achieved unprecedented success due to advances in deep learning techniques. Assisted daily living is an activity recognition and recommendation mechanism for elderly people living in apartments. Activities are differentiated at two levels: the coarse level and the fine-grained level. At the coarse level, simple activities such as drink, pour, cook, walk, sit down, use mobile, and read book will be recognized. At the fine-grained level, complex activities such as drink from the cup, drink from the bottle, and drink from the glass will be recognized.
The boost in activity recognition is also due to the availability of a large number of datasets for activities of daily living (ADL). We use the Toyota Smarthome and Epic Kitchens datasets for our project. These datasets contain both coarse and fine-grained activities, and they comprise videos of different durations; the durations of videos for different activities may vary greatly.

Some activities (e.g., walk, sit down) can be recognized using pose estimation techniques. For other activities (e.g., drink from a bottle, cup, or can) we need to detect and recognize the involved objects (e.g., bottle, cup, or can). A hypothetical dispatch illustrating this two-branch intuition is sketched below.
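
The pose model, detector, and classifier interfaces in this sketch are placeholders, not an existing API; it only makes the two-branch intuition concrete.

```python
# Hypothetical dispatch: pose-only activities vs. object-dependent ones.
POSE_ONLY = {"walk", "sit down"}

def recognize(clip, pose_model, detector, classifier):
    pose = pose_model(clip)                  # 2D/3D joint estimates per frame
    coarse = classifier.from_pose(pose)      # coarse label, e.g. "walk", "drink"
    if coarse in POSE_ONLY:
        return coarse                        # pose alone suffices
    objects = detector(clip)                 # e.g. bottle, cup, can
    return classifier.refine(coarse, objects)  # e.g. "drink from bottle"
```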

Recently, activities of daily living have gained more attention from the research community. The ADL recognition process involves several challenges: object detection is difficult when the object is occluded, and recognizing simultaneous activities is also a key challenge.

Background

In this section, we discuss some definitions.

Camera framing: how the video is captured. The camera is fixed at one point (low camera framing) and the video is recorded.
Duration variation: the duration of the video may vary greatly between activities. More variation means more challenge for activity recognition.
Coarse activities: simple activities such as walk, drink, etc.
Fine-grained activities: activities carrying detailed information about the involved objects, such as a bottle or a cup. Drink from bottle is a fine-grained activity.

State of the art

Traditional approaches use local features for activity recognition. Local feature extraction is the main component of these approaches, but they are effective only on small datasets. For large datasets, convolutional neural networks (CNNs) are used to learn features, and these learned features are concatenated with local features for better recognition [4, 5, 8]. Donahue et al. [5] perform spatial-feature-based recognition: spatial features are first extracted using a CNN and then fed to a recurrent network. A limitation of this method is that the extracted features themselves do not capture temporal dynamics.
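
As a minimal sketch of this CNN-plus-recurrent pipeline (not the exact architecture of [5]; the backbone, hidden size, and class count are placeholders), assuming PyTorch:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNRNNClassifier(nn.Module):
    """Per-frame CNN features fed to an LSTM, in the spirit of [5]."""
    def __init__(self, num_classes=10, hidden_size=512):
        super().__init__()
        backbone = models.resnet18(weights=None)                    # placeholder backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size,
                           batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, video):                   # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))   # (B*T, 512, 1, 1) spatial features
        feats = feats.flatten(1).view(b, t, -1)
        out, _ = self.rnn(feats)                # recurrent temporal modeling
        return self.fc(out[:, -1])              # classify from the last step
```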

3D pose information [7, 11] is also used for human activity recognition. These pose-based methods are successful for simple activities (e.g., walk, run, sit down), but they fail when other objects are involved in the activity (e.g., activities in the kitchen). 3D convolutional networks are used to learn spatiotemporal features [10], which are then used to recognize the human activity. This method achieves high accuracy, but it is not able to capitalize on the salient parts of videos.
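
A toy 3D convolutional block in the spirit of C3D [10] follows; the layer sizes are illustrative, not the published configuration:

```python
import torch
import torch.nn as nn

class Simple3DConvNet(nn.Module):
    """Toy 3D ConvNet learning spatiotemporal features; sizes illustrative."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),   # input (B, 3, T, H, W)
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),                      # pool space only
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((2, 2, 2)),                      # pool time and space
            nn.AdaptiveAvgPool3d(1),                      # global average pool
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        return self.fc(self.features(clip).flatten(1))
```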

Attention-based activity recognition methods are able to focus on the parts of a video that are salient for the target activity. A spatiotemporal attention mechanism (STA) is proposed in [9, 12]: the salient regions are learnt by the spatial attention component, and the important frames are identified by the temporal attention component. In the spatial attention component, convolutional networks are used to learn the salient parts of the feature map, while recurrent networks are used for temporal attention.
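
An illustrative sketch of this spatial-plus-temporal weighting follows; it is not the exact model of [9, 12], and the channel and hidden sizes are placeholders:

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Convolutional spatial weights plus recurrent temporal weights (sketch)."""
    def __init__(self, channels=128, hidden=64):
        super().__init__()
        self.spatial = nn.Conv3d(channels, 1, kernel_size=1)   # per-location score
        self.temporal = nn.GRU(channels, hidden, batch_first=True)
        self.frame_score = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (B, C, T, H, W)
        s = torch.sigmoid(self.spatial(x))       # spatial weights (B, 1, T, H, W)
        x = x * s                                # emphasize salient regions
        pooled = x.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C)
        h, _ = self.temporal(pooled)
        t = torch.softmax(self.frame_score(h), dim=1)    # frame weights (B, T, 1)
        return (pooled * t).sum(dim=1)           # attended clip feature (B, C)
```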

On the Kinetics dataset, Inflated 3D convolution (I3D) is used for activity recognition [3, 6]; I3D extracts the relevant features from the videos. This spatiotemporal convolution mechanism, combined with object detection techniques, performs well even for complex activities. Wang et al. [2] proposed non-local blocks within the I3D architecture. These non-local blocks are able to capture long-range dependencies: the non-local operation computes the response at a position as a weighted sum over all positions. Because it relies on the appearance of the activity, this non-local mechanism fails to discriminate between similar-looking activities. To overcome this problem, Das et al. [1] proposed a separable spatiotemporal attention mechanism (separable STA).
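
The weighted sum over all positions can be written compactly; the following is an embedded-Gaussian non-local block in the style of Wang et al. [2], with illustrative channel sizes:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block in the style of [2]."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x):                               # x: (B, C, T, H, W)
        b = x.shape[0]
        q = self.theta(x).flatten(2).transpose(1, 2)    # (B, N, C')
        k = self.phi(x).flatten(2)                      # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)        # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)             # weights over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, *x.shape[2:])
        return x + self.out(y)                          # residual connection
```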

In the separable STA, the spatial and temporal attention mechanisms are separated. Complex activities involving human-object interactions require spatial attention, while simple activities such as walk and sit down require only temporal attention.

In the separable STA, convolutional feature maps are obtained from the I3D network; the spatial and temporal attention weights are then applied separately to these feature maps. These weights are computed by an RNN attention model whose input is the 3D skeleton pose.
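
A rough sketch of this idea follows; the actual separable STA [1] differs in detail, and the pose encoder, feature dimensions, and attention heads here are placeholders:

```python
import torch
import torch.nn as nn

class SeparableSTA(nn.Module):
    """Separable spatial/temporal attention driven by skeleton pose (sketch)."""
    def __init__(self, t=8, hw=49, joints=13, hidden=256):
        super().__init__()
        self.pose_rnn = nn.GRU(joints * 3, hidden, batch_first=True)
        self.spatial_head = nn.Linear(hidden, hw)   # one weight per location
        self.temporal_head = nn.Linear(hidden, t)   # one weight per time step

    def forward(self, feat, pose):
        # feat: I3D feature map (B, C, T, H, W); pose: (B, T_pose, joints*3)
        b, c, t, h, w = feat.shape
        _, hn = self.pose_rnn(pose)
        hn = hn.squeeze(0)                                   # (B, hidden)
        sw = torch.softmax(self.spatial_head(hn), dim=-1)    # (B, H*W)
        tw = torch.softmax(self.temporal_head(hn), dim=-1)   # (B, T)
        feat = feat * sw.view(b, 1, 1, h, w)   # spatial attention
        feat = feat * tw.view(b, 1, t, 1, 1)   # temporal attention, applied separately
        return feat.mean(dim=(2, 3, 4))        # pooled clip descriptor (B, C)
```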

Research Challenges

Activities of daily living pose great challenges for activity recognition. In ADL, humans interact with many different types of objects, and the presence of many objects makes recognition difficult. Objects of the same class may also have different shapes and textures; a cup, for instance, comes in many shapes. Detecting objects of the same class but with different shapes, and mapping them to the correct ground-truth class, is difficult, so generalization is a key challenge.

The occlusion of objects is another challenge. Sometimes the whole object may be occluded. For example, while drinking tea from a cup, the cup may be so small that all or most of it is occluded. In these situations, recognition becomes difficult.

Recognizing fine-grained activities poses further challenges, because a human may be interacting with more than one object. Suppose a person is drinking coffee from a cup while a book lies on the table in front of them, and the cup is partially occluded while the book is fully visible. Recognizing that the person is actually drinking coffee from the cup, and not reading the book, is a key challenge.

Most activity recognition approaches depend on the availability of large labeled datasets. Another key challenge is how to reduce this dependence on labeled data.

Motivation

Activity recognition has many applications, ranging from sports to smart homes, including surveillance, security, life logging, and assistance. ADL recognition in particular applies to life logging and assistance: elderly people living alone in apartments need this type of assistance, and with accurate techniques we can help them lead better lives.

The availability of RGB, 3D, and skeleton data related to ADL motivates us to work on recognition of activities of daily living. Recent advances in deep learning lead to better accuracy, and the unprecedented accuracy achieved by these techniques further motivates us to adopt them.

References

[1] Srijan Das, Rui Dai, Michal Koperski, Luca Minciullo, Lorenzo Garattoni, Francois Bremond, and Gianpiero Francesca. Toyota smarthome: Real-world activities of daily living. In ICCV, 2019.

[2] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.

[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.

[4] Guilhem Cheron, Ivan Laptev, and Cordelia Schmid. P-CNN: Pose-based CNN features for action recognition. In ICCV, 2015.

[5] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[6] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[7] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[8] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[9] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI Conference on Artificial Intelligence, pages 4263–4270, 2017.

[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, Washington, DC, USA, 2015. IEEE Computer Society.

[11] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[12] Fabien Baradel, Christian Wolf, and Julien Mille. Human action recognition: Pose-based attention draws focus to hands. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 604–613, Oct 2017.
