Security text classification and model comparison
This project builds and evaluates a phishing-email detection workflow using labeled email text. It compares classical machine-learning models with attention to phishing recall, false positives, feature interpretation, and practical monitoring needs.
Challenge
- Security classification needs more than headline accuracy; phishing recall and false positives matter.
- Text features can capture useful cues but may also learn dataset-specific artifacts.
- A useful model should balance performance, simplicity, and interpretability.
System architecture
Data and inputs
- 18,650 raw rows, reduced to 17,538 after cleaning (1,112 missing or duplicate rows removed).
- Safe Email and Phishing Email classes with an 80/20 train-test split.
- 5,000 TF-IDF features used to represent email text.
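The preparation steps listed above can be sketched as follows, assuming pandas and scikit-learn. This is a minimal illustration, not the project's actual code: the column names "text" and "label", the tiny inline frame, the stratified split, the random seed, and fitting the vocabulary on training text only are all assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Tiny stand-in frame; the real data and its column names are not shown here.
raw = pd.DataFrame({
    "text": [
        "Verify your account now to avoid suspension",
        "Verify your account now to avoid suspension",  # duplicate row
        None,                                            # missing text
        "Meeting moved to 3pm, agenda attached",
        "You won a prize, click here to claim",
        "Lunch on Friday?",
        "Urgent: confirm your password immediately",
        "Quarterly report draft for review",
        "Claim your free gift card today",
        "Notes from yesterday's standup",
        "Your invoice is overdue, pay via this link",
        "Can you review my pull request?",
    ],
    "label": [
        "Phishing Email", "Phishing Email", "Safe Email", "Safe Email",
        "Phishing Email", "Safe Email", "Phishing Email", "Safe Email",
        "Phishing Email", "Safe Email", "Phishing Email", "Safe Email",
    ],
})

# Drop rows with missing text, then rows whose text repeats an earlier row.
clean = (
    raw.dropna(subset=["text"])
       .drop_duplicates(subset=["text"])
       .reset_index(drop=True)
)

# 80/20 split; stratifying (an assumption) keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    clean["text"], clean["label"],
    test_size=0.2, stratify=clean["label"], random_state=42,
)

# Cap the vocabulary at 5,000 terms and fit it on training text only,
# so no test-set vocabulary leaks into the features.
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(len(raw), len(clean), X_train_tfidf.shape[0], X_test_tfidf.shape[0])
```

On the toy frame, cleaning removes one missing row and one duplicate, leaving ten rows that split into eight for training and two for testing.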
Technical approach
- Clean missing and duplicate email records before modeling.
- Represent email text with TF-IDF features.
- Compare Random Forest, Support Vector Machine (linear and RBF kernels), and Logistic Regression models.
- Review confusion matrices, phishing recall, and top phishing-related terms.
Evaluation and results
Key figures: 17,538 cleaned email rows; 5,000 TF-IDF features; selected linear SVM accuracy 0.9763.
- All evaluated models exceeded 97% accuracy.
- The linear SVM reached 0.9763 accuracy and was selected for its balance of phishing recall and operational simplicity.
- The RBF SVM scored slightly higher at 0.9772 accuracy (a 0.0009 difference), but the linear SVM is simpler to interpret because its weights map directly to individual terms.
- The selected model detected about 98% of phishing emails in the test set (phishing recall of about 0.98).
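To put the accuracy figures above in perspective, the following back-of-the-envelope calculation estimates how many test emails separate the two SVM variants. The test-set size is an assumption derived from the 80/20 split of 17,538 rows; the exact count depends on how the split was rounded.

```python
import math

cleaned_rows = 17_538
test_fraction = 0.20

# Assumed test-set size; the real split may differ by a row.
test_rows = math.ceil(cleaned_rows * test_fraction)

linear_acc = 0.9763  # linear SVM accuracy reported above
rbf_acc = 0.9772     # RBF SVM accuracy reported above

# Approximate number of correctly classified test emails for each model.
linear_correct = round(linear_acc * test_rows)
rbf_correct = round(rbf_acc * test_rows)
gap_emails = rbf_correct - linear_correct

print(test_rows, linear_correct, rbf_correct, gap_emails)
```

Under that assumption, the 0.0009 accuracy gap corresponds to roughly three emails out of about 3,508, which is consistent with preferring the simpler linear model.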
Implementation and code
Implementation focus
The implementation links data preparation, modeling, evaluation, and interpretation in a single structured workflow, so the reasoning behind each technical decision stays visible.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
The project demonstrates detection modeling on available data. Operational security use would require continuous data refresh, monitoring, adversarial testing, and privacy-aware logging.
Future development
- Evaluate on newer and more diverse phishing datasets.
- Add calibration, threshold tuning, and cost-sensitive error analysis.
- Test transformer-based encoders against the classical TF-IDF baseline.
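As one possible starting point for the threshold-tuning item above, the sketch below picks a decision threshold that meets a target phishing recall and then counts the false positives that threshold would produce. The function names, the example scores, and the "flag when score >= threshold" rule are hypothetical; the scores could come from a linear SVM's decision function, but nothing here is taken from the project's code.

```python
import numpy as np


def threshold_for_recall(scores, y_true, target_recall):
    """Return the highest observed phishing score that, used as a
    'flag if score >= threshold' cutoff, meets the target recall."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    positives = np.sort(scores[y_true == 1])[::-1]  # phishing scores, high to low
    if positives.size == 0:
        raise ValueError("need at least one phishing example")
    # Number of phishing emails that must be flagged to reach the target.
    needed = max(int(np.ceil(target_recall * positives.size)), 1)
    return float(positives[needed - 1])


def false_positives_at(scores, y_true, threshold):
    """Count safe emails (label 0) that the threshold would flag."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    return int(np.sum((scores >= threshold) & (y_true == 0)))


# Made-up decision scores; 1 = phishing, 0 = safe.
scores = [2.1, 1.4, 0.3, -0.2, 0.5, -0.8, -1.5, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for target in (0.75, 1.0):
    t = threshold_for_recall(scores, labels, target)
    print(target, t, false_positives_at(scores, labels, t))
```

On these made-up scores, raising the recall target from 0.75 to 1.0 lowers the threshold from 0.3 to -0.2 and doubles the false positives from one to two, which is the kind of trade-off a cost-sensitive error analysis would quantify.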
Technical contribution
The project demonstrates careful security-oriented model evaluation: comparing baselines, prioritizing recall, analyzing false positives, and interpreting text features with domain awareness.