Language Systems & Model Evaluation | Dec 8, 2025 | Published project
SMS Spam Transformer Classification

Comparative transformer strategies for text classification

This project compares several transformer-based strategies for SMS spam detection: zero-shot classification, fine-tuned BERT, few-shot BERT, and Flan-T5 prompt-based classification.

Python, Transformers, BERT, Flan-T5, Zero-shot, Few-shot

Challenge

  • Text classification performance depends on task framing and label wording.
  • Zero-shot behavior can fail when labels carry domain-specific meaning.
  • A fair comparison needs consistent splits and method-level interpretation.
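
To illustrate the label-wording issue, here is a minimal sketch that assumes the Hugging Face `zero-shot-classification` pipeline. The candidate phrasings, the `LABEL_WORDINGS` map, and both helper names are hypothetical and not taken from the project code. The project's NLI model is not stated, so `facebook/bart-large-mnli` is an assumption as well.

```python
# Hypothetical mapping from descriptive candidate labels to dataset labels.
# "ham" carries little meaning for an NLI model, so each class gets a
# plain-English phrasing that is mapped back to the dataset label afterwards.
LABEL_WORDINGS = {
    "unsolicited advertising or scam": "spam",
    "ordinary personal message": "ham",
}


def build_classifier(model_name="facebook/bart-large-mnli"):
    """Create a zero-shot pipeline (assumed model; downloads weights on first use)."""
    from transformers import pipeline  # lazy import keeps the helper below dependency-free

    return pipeline("zero-shot-classification", model=model_name)


def classify_zero_shot(classifier, text, wordings=LABEL_WORDINGS):
    """Return 'spam' or 'ham' for one message using the top-ranked candidate label."""
    result = classifier(text, candidate_labels=list(wordings))
    # The pipeline returns candidate labels sorted by descending score.
    top_label = result["labels"][0]
    return wordings[top_label]
```

Swapping the keys of `LABEL_WORDINGS` for other phrasings (for example the raw words "ham" and "spam") is one way to measure how much the zero-shot result depends on wording alone.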

System architecture

SMS dataset: ham and spam labels
Task framing: zero-shot / few-shot
Transformer model: BERT and Flan-T5
Comparison: accuracy and confusion matrices

Data and inputs

  • SMS Spam Collection dataset with 5,572 messages.
  • Binary labels: ham and spam.
  • 3,900 training messages, 836 validation messages, and 836 test messages.
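
The split sizes above correspond to roughly 70% / 15% / 15% of the 5,572 messages. The sketch below shows one way to produce such a split while preserving the ham/spam ratio in each part. Whether the project stratified its split is not stated, so stratification, the function name `stratified_split`, and the seed are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)

    # Group example indices by class so each class is split separately.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Because each class is rounded separately, the exact part sizes can differ by a message or two from a split computed on the full dataset at once.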

Technical approach

  • Compare zero-shot NLI-style classification, supervised fine-tuning, few-shot fine-tuning, and generative prompting.
  • Evaluate accuracy and confusion matrices for each approach.
  • Interpret failures caused by label ambiguity and output parsing.
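
Output parsing matters for the generative approach because a model such as Flan-T5 returns free text rather than a class index. The following is a minimal sketch of a prompt builder and a strict parser. The prompt wording and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_generated_label` are hypothetical; the project's actual prompt and parsing rules are not documented here.

```python
import re

# Hypothetical prompt; the wording used in the project is not documented here.
PROMPT_TEMPLATE = (
    "Classify the following SMS message as spam or ham.\n"
    "Message: {message}\n"
    "Answer:"
)


def build_prompt(message):
    """Fill the template with one SMS message."""
    return PROMPT_TEMPLATE.format(message=message)


def parse_generated_label(generated_text, valid_labels=("ham", "spam")):
    """Map free-form model output to a label, or None if it cannot be parsed."""
    tokens = re.findall(r"[a-z]+", generated_text.lower())
    found = {tok for tok in tokens if tok in valid_labels}
    # Exactly one recognised label is a clean parse; zero or several is ambiguous.
    return found.pop() if len(found) == 1 else None
```

Returning `None` for unparseable outputs, instead of silently defaulting to one class, keeps parsing failures visible as their own category when the results are interpreted.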

Evaluation and results

Key indicators

5,572 SMS messages

Fine-tuned BERT accuracy 0.9952

Few-shot BERT accuracy 0.9294

  • Fine-tuned BERT achieved 0.9952 accuracy and the strongest class-level scores.
  • Few-shot BERT achieved 0.9294 accuracy with only 20 labeled training examples.
  • Zero-shot and generative prompting results showed sensitivity to label wording and prompt/output handling.
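
For reference, accuracy and a confusion matrix can be computed from true and predicted labels as sketched below. This is a plain-Python illustration with hypothetical function names, not the project's evaluation code, which may rely on a library instead.

```python
def confusion_matrix(y_true, y_pred, labels=("ham", "spam")):
    """Build a count matrix: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal (correct) out of all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

If the reported accuracies were measured on the 836-message test set (an assumption, since the evaluation split is not stated), 0.9952 corresponds to 832 correct predictions (4 errors) and 0.9294 corresponds to 777 correct predictions (59 errors).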

Implementation and code

Implementation focus

The implementation follows one workflow from data preparation through modeling, evaluation, and interpretation, so the technical decisions behind each method are easy to trace.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Scope and responsible use

The project focuses on language-data modeling and evaluation. Broader use would require domain-specific validation, edge-case assessment, monitoring, and testing on fresh data.

Future development

  • Add calibration and cost-sensitive evaluation for spam filtering.
  • Test additional label wording and prompt templates.
  • Compare lightweight deployable models for latency-sensitive settings.
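
As a rough illustration of what cost-sensitive evaluation could look like, the sketch below weights the two error types differently, treating spam as the positive class. The function name and the cost values are purely illustrative assumptions; they are not part of the project and would need to be chosen for a specific deployment.

```python
def expected_cost(false_positives, false_negatives, fp_cost=10.0, fn_cost=1.0):
    """Total misclassification cost; the default weights are illustrative only.

    A false positive is a ham message flagged as spam (a lost legitimate
    message); a false negative is a spam message that reaches the inbox.
    """
    return false_positives * fp_cost + false_negatives * fn_cost
```

With unequal weights like these, two models with the same accuracy can receive different costs depending on which kind of error they make.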

Technical contribution

The project demonstrates that modern language models call for careful task framing, evaluation, and method comparison; no single strategy can be assumed to work best.