Video understanding through spatial and temporal modeling
This project studies action recognition from video by extracting frame-level spatial features with a CNN and comparing RNN/LSTM sequence models for temporal classification.
Challenge
- Video classification requires modeling how visual features change over time.
- A frame classifier alone cannot represent temporal movement patterns.
- Comparisons should explain why sequence models differ, not only report accuracy.
System architecture
Data and inputs
- UCF50 Action Recognition dataset from Kaggle.
- 6,681 videos across 50 action categories.
- 15 evenly spaced frames sampled per video, resized to 64×64 and normalized to [0, 1].
Technical approach
- Use a three-block CNN to extract frame-level features.
- Feed feature sequences into RNN and LSTM classifiers.
- Compare validation accuracy and macro F1 to understand temporal modeling behavior.
Evaluation and results
6,681 videos
50 action classes
CNN+LSTM validation F1 0.4548
- CNN+LSTM achieved 0.4734 validation accuracy and 0.4548 validation macro F1.
- CNN+RNN achieved 0.4263 validation accuracy and 0.4071 validation macro F1.
- The project emphasizes the comparison under constrained resolution and training budget.
Implementation and code
Implementation focus
The implementation connects data preparation, modeling, evaluation, and interpretation in a structured workflow that makes the technical decisions clear.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.
Future development
- Use higher-resolution frames and longer training.
- Compare 3D CNNs or transformer-based video models.
- Add per-class confusion analysis for visually similar actions.
Technical contribution
The project demonstrates the difference between image classification and video understanding by combining spatial feature extraction with temporal sequence modeling.