Comparative transformer strategies for text classification
This project compares several transformer-based strategies for SMS spam detection: zero-shot classification, fine-tuned BERT, few-shot BERT, and Flan-T5 prompt-based classification.
Challenge
- Text classification performance depends on task framing and label wording.
- Zero-shot classification can fail when a label carries domain-specific meaning, as "ham" does for non-spam messages.
- A fair comparison needs identical data splits for every method and results interpreted method by method.
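The label-wording problem can be sketched as follows, assuming the Hugging Face `transformers` zero-shot pipeline. The model name, the descriptive labels, and the hypothesis template are illustrative assumptions, not the project's actual configuration. The idea is to score plain-language descriptions and map the winner back to the dataset labels, rather than asking an NLI model what "ham" means.

```python
# Illustrative sketch only: the wording of these labels is an assumption.
DESCRIPTIVE_TO_DATASET = {
    "an unsolicited advertisement or scam": "spam",
    "a normal personal message": "ham",
}


def to_dataset_label(result):
    """Map one zero-shot result dict to 'ham' or 'spam'.

    The pipeline returns "labels" sorted from highest to lowest score,
    so the first entry is the predicted description.
    """
    return DESCRIPTIVE_TO_DATASET[result["labels"][0]]


def classify_zero_shot(texts, model_name="facebook/bart-large-mnli"):
    """Classify messages with an NLI-based zero-shot pipeline."""
    # Imported lazily so the mapping helper works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=model_name)
    results = classifier(
        list(texts),
        candidate_labels=list(DESCRIPTIVE_TO_DATASET),
        hypothesis_template="This text message is {}.",
    )
    # Some versions return a single dict when only one text is passed.
    if isinstance(results, dict):
        results = [results]
    return [to_dataset_label(r) for r in results]
```

Swapping the dictionary keys for other phrasings is one way to measure how much the zero-shot result depends on wording alone.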
System architecture
Data and inputs
- SMS Spam Collection dataset with 5,572 messages.
- Binary labels: ham and spam.
- 3,900 training messages, 836 validation messages, and 836 test messages.
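The three counts sum to 5,572 and correspond to roughly a 70/15/15 split. A minimal sketch of a reproducible split with those exact sizes is shown here; the function name and seed are hypothetical, and the project may instead have used a library splitter or stratified sampling.

```python
import random


def split_indices(n_total, n_train, n_val, seed=42):
    """Shuffle indices once and cut them into train/validation/test lists."""
    if n_train + n_val > n_total:
        raise ValueError("train and validation sizes exceed the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split stable
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(5572, 3900, 836)
print(len(train_idx), len(val_idx), len(test_idx))  # 3900 836 836
```

Computing the indices once and reusing them for every method is what keeps the comparison on identical data.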
Technical approach
- Compare zero-shot NLI-style classification, supervised fine-tuning, few-shot fine-tuning, and generative prompting.
- Evaluate accuracy and confusion matrices for each approach.
- Interpret failures caused by label ambiguity and output parsing.
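Output parsing matters because a generative model such as Flan-T5 returns free text rather than a class index. The sketch below shows one cautious way to turn that text into a label; `parse_generated_label` is a hypothetical helper, not the project's own parser.

```python
import re


def parse_generated_label(text):
    """Return 'spam', 'ham', or None when the output is unusable."""
    # Whole-word matching avoids counting "hamster" as "ham".
    tokens = re.findall(r"[a-z]+", text.lower())
    found = {t for t in tokens if t in ("spam", "ham")}
    if len(found) == 1:
        return found.pop()
    return None  # empty, off-topic, or contradictory output


print(parse_generated_label("Spam."))        # spam
print(parse_generated_label("spam or ham"))  # None
```

Counting the `None` cases separately shows whether a low score comes from wrong predictions or from outputs that could not be parsed. This sketch does not handle negation such as "not spam".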
Evaluation and results
- Fine-tuned BERT achieved 0.9952 accuracy and the strongest class-level scores.
- Few-shot BERT achieved 0.9294 accuracy with only 20 labeled training examples.
- Zero-shot and generative prompting results were sensitive to label wording and to how prompts and model outputs were handled.
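Accuracy and the confusion matrix can be computed with a few lines of plain Python, sketched below with hypothetical function names. The final two lines are an inference, not a reported result: assuming the accuracies were measured on the 836-message test split, 0.9952 and 0.9294 would correspond to 832 and 777 correct predictions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return {k: counts[k] for k in ("tp", "fp", "fn", "tn")}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Assumption: both accuracies refer to the 836-message test split.
print(round(832 / 836, 4))  # 0.9952
print(round(777 / 836, 4))  # 0.9294
```

Under that assumption the gap between the two models is 4 versus 59 misclassified messages, which is easier to reason about than the difference between two decimals.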
Implementation and code
Implementation focus
The implementation connects data preparation, modeling, evaluation, and interpretation in a single structured workflow, so the technical decision made at each stage is visible.
Source code
The source code is available for inspecting the implementation details and extending the experiment.
Scope and responsible use
The project focuses on language-data modeling and evaluation. Broader use would require domain-specific validation, edge-case assessment, monitoring, and testing on fresh data.
Future development
- Add calibration and cost-sensitive evaluation for spam filtering.
- Test additional label wording and prompt templates.
- Compare lightweight deployable models for latency-sensitive settings.
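Cost-sensitive evaluation could start from something as simple as the sketch below. The cost values are placeholders chosen only to express the assumption that blocking a legitimate message (false positive) is worse than letting spam through (false negative); they are not taken from the project.

```python
def expected_cost(tp, fp, fn, tn, fp_cost=5.0, fn_cost=1.0):
    """Average misclassification cost per message.

    fp_cost and fn_cost are hypothetical weights, not measured values.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (fp * fp_cost + fn * fn_cost) / total


# Two made-up classifiers with the same accuracy (990 of 1,000 correct).
cautious = expected_cost(tp=90, fp=2, fn=8, tn=900)
aggressive = expected_cost(tp=96, fp=8, fn=2, tn=894)
print(cautious < aggressive)  # True
```

The two made-up classifiers are indistinguishable by accuracy, yet the one that blocks fewer legitimate messages has the lower cost, which is the distinction a cost-sensitive metric is meant to expose.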
Technical contribution
The project demonstrates that modern language models need careful task framing, evaluation, and method comparison; no single strategy can be assumed to work best on every task.