MNIST ViT vs CNN Comparison

Visual Intelligence & Deep Learning | Dec 8, 2025 | Published project

Accuracy–efficiency tradeoffs in visual classification

This project compares a fine-tuned Vision Transformer with a lightweight CNN baseline on an MNIST subset. It focuses on the practical tradeoff between transfer-learning accuracy and training efficiency.

Python, PyTorch, TensorFlow, ViT, CNN, MNIST

Challenge

Higher accuracy can come with substantial compute cost.
Simple baselines are valuable when speed and resource constraints matter.
Model selection should consider accuracy, latency, training cost, and the intended use context.

System architecture

MNIST subset: digits 0–9
ViT branch: transfer learning
CNN baseline: lightweight training
Tradeoff review: accuracy and time

Data and inputs

MNIST subset with 4,000 training images and 1,000 test images.
Vision Transformer branch resizes grayscale digits to 224×224 RGB.
CNN branch keeps the original 28×28 grayscale format.
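
The two input formats above can be illustrated with a small dependency-free sketch. It draws a fixed-size random subset of indices and turns a 28×28 grayscale image into a 224×224 three-channel image by nearest-neighbor upsampling (224 / 28 = 8) and channel replication. The function names, the seeds, the use of the standard MNIST split sizes (60,000 and 10,000) as sampling pools, and the choice of nearest-neighbor interpolation are assumptions for illustration; the project may sample and resize differently, and a PyTorch pipeline would more likely rely on library transforms.

```python
import random

def sample_subset(n_total, n_keep, seed=0):
    """Return n_keep distinct indices drawn from range(n_total)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))

def to_vit_input(gray, size=224):
    """Upsample a square grayscale image (list of rows) to size x size
    and replicate it into 3 identical channels (channel-first layout)."""
    side = len(gray)
    if size % side != 0:
        raise ValueError("size must be a multiple of the input side length")
    factor = size // side  # 224 // 28 == 8
    upsampled = [
        [gray[r // factor][c // factor] for c in range(size)]
        for r in range(size)
    ]
    # Copy the rows for each channel so channels do not share list objects.
    return [[row[:] for row in upsampled] for _ in range(3)]

# Hypothetical 28x28 image with one bright pixel at row 3, column 5
image = [[0] * 28 for _ in range(28)]
image[3][5] = 255

train_idx = sample_subset(60_000, 4_000)         # assumed sampling pool
test_idx = sample_subset(10_000, 1_000, seed=1)  # assumed sampling pool
vit_image = to_vit_input(image)                  # 3 x 224 x 224
cnn_image = image                                # CNN branch keeps 28 x 28
```

Each source pixel becomes an 8×8 block in every channel, so the ViT branch processes 3 × 224 × 224 = 150,528 values per image where the CNN branch processes 784.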

Technical approach

Fine-tune a pre-trained Vision Transformer for digit classification.
Train a simple CNN baseline from scratch.
Compare confusion matrices, accuracy, and training time.
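
The comparison step can be sketched without any framework. The helpers below build a confusion matrix, derive accuracy from it, and time an arbitrary callable. All names and the toy labels are hypothetical and are not taken from the project's code, which may use library utilities instead.

```python
import time

def confusion_matrix(y_true, y_pred, n_classes=10):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy labels for illustration only, not project data
y_true = [0, 1, 2, 2, 9, 9]
y_pred = [0, 1, 2, 7, 9, 4]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)                     # 4 of 6 correct
_, seconds = timed(sum, range(1000))   # stand-in for a training call
```

Wrapping each model's training call in the same timer keeps the two timings comparable, provided both runs use the same hardware.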

Evaluation and results

Key indicators

4,000 training images

ViT accuracy 98.10%

CNN trained about 117× faster

The Vision Transformer reached 98.10% accuracy.
The CNN reached 91.50% accuracy, 6.60 percentage points lower, but trained in 3.76 seconds compared with 441.03 seconds for the ViT.
That is about 117× faster training, so the ViT's accuracy gain comes at a large cost in training time.
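
The reported figures can be checked with a few lines of arithmetic. The error counts assume that both accuracies were measured on the full 1,000-image test set, which the page implies but does not state; the variable names are illustrative.

```python
vit_acc, cnn_acc = 0.9810, 0.9150
vit_seconds, cnn_seconds = 441.03, 3.76
n_test = 1_000  # assumed evaluation set size

speedup = vit_seconds / cnn_seconds          # about 117.3
gap_points = (vit_acc - cnn_acc) * 100       # about 6.60 percentage points
vit_errors = round((1 - vit_acc) * n_test)   # 19 misclassified digits
cnn_errors = round((1 - cnn_acc) * n_test)   # 85 misclassified digits
error_ratio = cnn_errors / vit_errors        # about 4.5
```

Under that assumption the CNN makes roughly 4.5 times as many errors as the ViT while training about 117 times faster.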

Implementation and code

Implementation focus

The implementation follows a single workflow from data preparation through modeling and evaluation to interpretation, so each technical decision can be traced to the step where it is made.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Scope and responsible use

The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.

Future development

Evaluate more architectures and larger sample sizes.
Add latency and memory-use comparisons.
Test robustness on shifted or noisy digit inputs.
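
As a starting point for the robustness item, shifted and noisy inputs could be generated as sketched below. The function names, the fill value, and the example shift and noise levels are assumptions chosen for illustration, not part of the project.

```python
import random

def shift_image(gray, dx, dy, fill=0):
    """Translate an image dx columns right and dy rows down,
    filling vacated pixels with `fill` and dropping pixels that leave the frame."""
    h, w = len(gray), len(gray[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = gray[r][c]
    return out

def add_gaussian_noise(gray, sigma, seed=0):
    """Add zero-mean Gaussian noise per pixel and clamp to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in gray
    ]

# Hypothetical 28x28 image with one bright pixel at row 10, column 10
image = [[0] * 28 for _ in range(28)]
image[10][10] = 200

shifted = shift_image(image, dx=2, dy=-3)   # pixel moves to row 7, column 12
noisy = add_gaussian_noise(image, sigma=25)
```

Evaluating both models on the same perturbed copies of the test images would show whether the accuracy gap widens or narrows under distribution shift.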

Technical contribution

The project demonstrates disciplined model comparison by weighing accuracy against speed, simplicity, and practical use constraints.