In this research, we propose a novel articulated pose estimation framework for the simultaneous detection of multiple humans (as a whole and as their constituent body parts) in crowded scenes, and for the subsequent tracking of those humans and their body parts. As an exemplary application, the framework is tested on identifying high emotions (represented by abnormal fluctuations in body posture) in crowds of sports event spectators. The detection component is based on a single discriminative classifier, a linear support vector machine (SVM) trained on histogram of oriented gradients (HOG) features, which searches for dependent limbs and thereby alleviates the independent-inference limitation of other state-of-the-art models. The proposed detection framework is a hierarchical model that detects humans at both macro and micro levels by fusing global and local detectors. The methodology is validated on a publicly available crowd dataset captured indoors in a sports stadium. Detection results are assessed using the percentage of correctly localized parts (PCP) evaluation metric and compared against competing baselines. Our experiments report a mean detection accuracy of 85% for the global upper body, 95% for the head, 82% for the torso, and 71% and 60% for the upper and lower arms, respectively. A systematic analysis of the results also verifies that the proposed model outperforms the state-of-the-art models in terms of detection rate, accuracy and computational cost.

For the subsequent tracking of people detected by the hierarchical framework, selected state-of-the-art models with embedded particle filters have been adapted to track multiple humans and their body parts in crowd scenarios. Specifically, incremental visual tracking (IVT), the joint probabilistic data association filter (JPDAF) and their integration (IVT+JPDAF) have been included.
These approaches are validated on the same crowd dataset, in which the spectators undergo large changes in pose, rotations, and occlusions. Performance is assessed with the PCP evaluation metric, which compares the obtained tracking results with the corresponding manually labelled ground-truth locations. For tracking multiple people in the crowd video sequences of the benchmark dataset, extensive experiments and a systematic analysis of the results verify that the JPDAF algorithm alone significantly outperforms the IVT and IVT+JPDAF approaches in terms of tracking-to-detection rate and accuracy. Specifically, the tracking-to-detection rate attained by JPDAF is 100%, while IVT and IVT+JPDAF recorded 77% and 60%, respectively. In addition, the average accuracy of tracking 6 body parts over all frames is 80% for JPDAF, 62% for IVT+JPDAF and 50% for IVT.

The emotions of the people constituting a crowd in an area under surveillance are estimated by tracking individuals and collectively analyzing their (social) behaviors. Escalating anti-social behavior in a crowd often stems from changes in emotions that typically lead to aggression and violence. Emotion recognition is therefore of paramount importance for assessing security threats and for supporting immediate inference and response. Accordingly, a possible application of the developed detection-and-tracking framework has been investigated, in which a novel machine learning framework is proposed for the classification of crowd emotions. In the proposed model, emotions are inferred only from upper-body gestures and postures (i.e. facial expressions are excluded). A neural network (NN) is used to classify upper-body patterns as normal or abnormal emotions. The NN classifier converts the sequential motions of the body parts of tracked individuals into emotion classifications, so that the distribution of abnormal emotions within the crowd can be assessed.
Using the same public crowd dataset, the crowd emotion detection results are assessed by cross-validation, comparing the classifications predicted by the NN classifier with the corresponding manually labelled ground-truth decisions over the tested video fragments. Overall classification accuracy is almost 60%; the low performance of the NN model can be attributed to the small number of abnormal cases available in the training stage.
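
The PCP metric used above for both detection and tracking can be sketched as follows. This is a minimal illustration that assumes the commonly used definition, in which a part counts as correctly localized when both of its predicted endpoints lie within a fraction (here 0.5) of the ground-truth part length from the corresponding ground-truth endpoints. The thesis may use a different variant; the function name `pcp`, the `alpha` parameter and the endpoint-pair data layout are hypothetical and chosen only for illustration.

```python
import math


def pcp(pred, gt, alpha=0.5):
    """Fraction of parts that are correctly localized.

    pred, gt: sequences of parts, each part given as two (x, y) endpoints.
    This layout is an assumption for illustration, not the thesis's format.
    """
    correct = 0
    for (p0, p1), (g0, g1) in zip(pred, gt):
        # Tolerance scales with the ground-truth part length.
        limit = alpha * math.dist(g0, g1)
        # Both endpoints must fall within the tolerance.
        if math.dist(p0, g0) <= limit and math.dist(p1, g1) <= limit:
            correct += 1
    return correct / len(gt)


# Two parts of length 10, so each endpoint may deviate by at most 5.
gt = [((0, 0), (0, 10)), ((0, 0), (10, 0))]
pred = [((1, 0), (0, 12)), ((0, 6), (10, 0))]  # second part: one endpoint is 6 off
print(pcp(pred, gt))  # 0.5
```

Averaging this score per body part over all frames would give per-part figures of the kind reported in the abstract, under the stated assumptions.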
| Date of Award | May 2018 |
|---|---|
| Original language | American English |
| Supervisor | Andrzej Sluzek (Supervisor) |
- Crowd scenes
- detection
- tracking
- emotion inference
- hierarchical model
- joint model
- particle filters
- neural network
- HOG
- SVM
- IVT
- JPDAF
- PCP
Body Parts Detection and Tracking from Stationary Video Feeds for Crowd Emotion Identification and Visualization
Alyammahi, S. (Author). May 2018
Student thesis: Doctoral Thesis