Visual Intelligence & Deep Learning Nov 6, 2025 Published project
UCF50 CNN-LSTM Action Recognition

Video understanding through spatial and temporal modeling

This project studies action recognition from video by extracting frame-level spatial features with a CNN and comparing RNN/LSTM sequence models for temporal classification.

PythonTensorFlowKerasCNNRNNLSTM

Challenge

  • Video classification requires modeling how visual features change over time.
  • A frame classifier alone cannot represent temporal movement patterns.
  • Comparisons should explain why sequence models differ, not only report accuracy.

System architecture

Video clipsSampled frames
CNN encoderSpatial features
Sequence modelRNN or LSTM
Action prediction50-class evaluation

Data and inputs

  • UCF50 Action Recognition dataset from Kaggle.
  • 6,681 videos across 50 action categories.
  • 15 evenly spaced frames sampled per video, resized to 64×64 and normalized to [0, 1].

Technical approach

  • Use a three-block CNN to extract frame-level features.
  • Feed feature sequences into RNN and LSTM classifiers.
  • Compare validation accuracy and macro F1 to understand temporal modeling behavior.

Evaluation and results

Key indicators

6,681 videos

Key indicators

50 action classes

Key indicators

CNN+LSTM validation F1 0.4548

  • CNN+LSTM achieved 0.4734 validation accuracy and 0.4548 validation macro F1.
  • CNN+RNN achieved 0.4263 validation accuracy and 0.4071 validation macro F1.
  • The project emphasizes the comparison under constrained resolution and training budget.

Implementation and code

Implementation focus

The implementation connects data preparation, modeling, evaluation, and interpretation in a structured workflow that makes the technical decisions clear.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Open source code

Scope and responsible use

The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.

Future development

  • Use higher-resolution frames and longer training.
  • Compare 3D CNNs or transformer-based video models.
  • Add per-class confusion analysis for visually similar actions.

Technical contribution

The project demonstrates the difference between image classification and video understanding by combining spatial feature extraction with temporal sequence modeling.