Detecting Actions Performed in a Video Using ResNet34 for the Spatial and Temporal Streams
The action recognition task involves identifying different actions in videos, where the action may or may not be performed throughout the entire duration of the video.
Classifying actions (jumping, picking, playing, etc.) in still images has been achieved with high accuracy using deep convolutional networks. For action recognition in videos, however, existing work is limited and its accuracy is poor.
Unlike still images, action recognition in videos requires capturing context from the whole video rather than extracting information from each frame independently.
We aim to extend the existing deep convolutional neural network models used for classifying actions in still images. Relevant prior work uses stacked frames as input to a CNN, but its results are poor. To improve on that approach, we implement a new architecture for action recognition in videos, inspired by the human visual cortex, which uses two streams to detect and recognise motion.
A video can be decomposed into two components, spatial and temporal, and each of the two streams is implemented using a deep CNN.
We use the well-known UCF101 dataset of human actions for our project. UCF101 has 101 classes, which are divided into five types:
- Human-Object Interaction
- Body-Motion Only
- Human-Human Interaction
- Playing Musical Instruments
- Sports
The dataset consists of realistic user-uploaded videos with cluttered backgrounds and camera motion. It offers the largest diversity in terms of actions, with large variations in camera motion, background clutter, and illumination conditions. We created our own UCF10 dataset, extracted from UCF101 for this project, which includes 2K videos in total, 200 videos per class. The train and test data lists are present in the Data folder.
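A small helper can load such a split list. The line format below ("relative/path.avi label") is an assumption for illustration; the actual lists in the Data folder may be laid out differently.

```python
import os
import tempfile
from pathlib import Path

def load_split(list_file):
    """Parse a train/test split list. Assumed format, one sample per line:
    'relative/path.avi label' (the real lists may differ)."""
    samples = []
    for line in Path(list_file).read_text().splitlines():
        if not line.strip():
            continue
        parts = line.split()
        label = int(parts[1]) if len(parts) > 1 else None
        samples.append((parts[0], label))
    return samples

# Demo with a temporary list file in the assumed format.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("Archery/v_Archery_g01_c01.avi 1\n"
          "Basketball/v_Basketball_g01_c01.avi 2\n")
tmp.close()
samples = load_split(tmp.name)
print(samples[0])  # ('Archery/v_Archery_g01_c01.avi', 1)
os.unlink(tmp.name)
```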
The motivation is to capture both the spatial features of the video and the temporal ones. For example, to classify whether the person in a video is performing archery or playing basketball, the spatial network captures the still-frame appearance of the action; essentially, it performs a classification task on individual video frames. The temporal network tries to distinguish the action using motion in the video; for this motion we use segmentation-based optical flow in both the horizontal and vertical directions. Both models use ResNet34 as the underlying network.