Accuracy–efficiency tradeoffs in visual classification
This project compares a fine-tuned Vision Transformer with a lightweight CNN baseline on an MNIST subset. It focuses on the practical tradeoff between transfer-learning accuracy and training efficiency.
Challenge
- Higher accuracy can come with substantial compute cost.
- Simple baselines are valuable when speed and resource constraints matter.
- Model selection should consider accuracy, latency, training cost, and the intended use context.
System architecture
Data and inputs
- MNIST subset with 4,000 training images and 1,000 test images.
- Vision Transformer branch resizes grayscale digits to 224×224 RGB.
- CNN branch keeps the original 28×28 grayscale format.
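The two input formats above can be sketched as follows. This is a minimal illustration rather than the project's actual preprocessing: it assumes nearest-neighbor upscaling (224 is exactly 8 × 28) and copies the grayscale channel three times to form RGB. The helper names `to_vit_input` and `to_cnn_input` are hypothetical, and a real pipeline may use a different interpolation and add normalization.

```python
import numpy as np

VIT_SIZE = 224    # ViT input side length, from the text
MNIST_SIZE = 28   # native MNIST side length, from the text


def to_vit_input(digit: np.ndarray) -> np.ndarray:
    """Upscale one 28x28 grayscale digit to a 224x224x3 array.

    Assumption: nearest-neighbor upscaling plus channel replication.
    The project itself may use a different interpolation.
    """
    if digit.shape != (MNIST_SIZE, MNIST_SIZE):
        raise ValueError(f"expected a 28x28 image, got {digit.shape}")
    factor = VIT_SIZE // MNIST_SIZE  # 8: each pixel becomes an 8x8 block
    upscaled = np.repeat(np.repeat(digit, factor, axis=0), factor, axis=1)
    # Copy the single grayscale channel into three identical RGB channels.
    return np.stack([upscaled] * 3, axis=-1)


def to_cnn_input(digit: np.ndarray) -> np.ndarray:
    """Keep the native 28x28 resolution and add a single channel axis."""
    return digit[..., np.newaxis]
```

Per image, the ViT branch then handles 224 × 224 × 3 = 150,528 values against 784 for the CNN branch, which is one source of the gap in training cost.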
Technical approach
- Fine-tune a pre-trained Vision Transformer for digit classification.
- Train a simple CNN baseline from scratch.
- Compare confusion matrices, accuracy, and training time.
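The three comparison metrics can be computed with a few small helpers. This is a sketch under stated assumptions, not the project's evaluation code: the function names are hypothetical, labels are assumed to be integer digits 0 to 9, and training time is measured as wall-clock time around whatever training function is passed in.

```python
import time

import numpy as np


def confusion_matrix(y_true, y_pred, num_classes=10):
    """Count predictions; rows are true digits, columns are predicted digits."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label, predicted_label] += 1
    return matrix


def accuracy(matrix):
    """Fraction of correct predictions, which lie on the diagonal."""
    return np.trace(matrix) / matrix.sum()


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Toy usage with made-up labels (not project data).
toy_matrix = confusion_matrix([0, 1, 2, 2], [0, 1, 2, 1], num_classes=3)
toy_accuracy = accuracy(toy_matrix)
```

Applying the same helpers to both models keeps the comparison consistent: each model's predictions go through `confusion_matrix`, and each training call is wrapped in `timed`.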
Evaluation and results
- The Vision Transformer reached 98.10% accuracy.
- The CNN reached 91.50% accuracy but trained in 3.76 seconds compared with 441.03 seconds for ViT.
- The CNN trained about 117× faster (441.03 s versus 3.76 s) at a cost of 6.60 percentage points of accuracy.
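The reported figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in this write-up. The error counts assume that both accuracies were measured on the full 1,000-image test set; the variable names are illustrative.

```python
# Figures reported in this write-up.
VIT_ACCURACY, CNN_ACCURACY = 98.10, 91.50   # percent
VIT_SECONDS, CNN_SECONDS = 441.03, 3.76     # training time
TEST_IMAGES = 1000                          # assumed evaluation set size

speedup = VIT_SECONDS / CNN_SECONDS
accuracy_gap = VIT_ACCURACY - CNN_ACCURACY  # percentage points

# Misclassified test images implied by each accuracy.
vit_errors = round((1 - VIT_ACCURACY / 100) * TEST_IMAGES)
cnn_errors = round((1 - CNN_ACCURACY / 100) * TEST_IMAGES)

print(f"CNN speedup: {speedup:.1f}x")
print(f"Accuracy gap: {accuracy_gap:.2f} points")
print(f"Test errors: ViT {vit_errors}, CNN {cnn_errors}")
```

Under that assumption the ViT misclassifies 19 test digits and the CNN 85, so the extra training time buys roughly 66 additional correct predictions per 1,000 images.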
Implementation and code
Implementation focus
The implementation runs as a single workflow: data preparation for both input formats, model training, evaluation with confusion matrices, accuracy, and training time, and interpretation of the results.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.
Future development
- Evaluate more architectures and larger sample sizes.
- Add latency and memory-use comparisons.
- Test robustness on shifted or noisy digit inputs.
Technical contribution
The project demonstrates disciplined model comparison by weighing accuracy against speed, simplicity, and practical use constraints.