Security Analytics & Risk Detection · Dec 5, 2025 · Published project
Malicious URL Detection with ML

Lexical security classification and model evaluation

This project builds a lightweight machine-learning workflow for classifying URLs as benign or malicious without visiting the target website. It extracts structural and lexical URL patterns, compares multiple models, and evaluates detection quality with confusion matrices, ROC, and precision-recall analysis.

Python · Scikit-learn · XGBoost · Random Forest · Logistic Regression · Feature Engineering · ROC/PR

Challenge

  • Malicious URLs can support phishing, malware delivery, defacement, spam, and fraud.
  • A safe detection workflow should not require opening suspicious websites.
  • Security use cases need recall, precision, and false-negative analysis rather than headline accuracy alone.

System architecture

URL corpus: benign and malicious labels
Feature extraction: lexical URL signals
Model comparison: LR · RF · XGBoost
Evaluation: ROC, PR, and errors

Data and inputs

  • 651,191 URLs with original labels for benign, phishing, malware, and defacement.
  • The task is converted to binary classification: benign versus malicious.
  • The data is split 80/20 into stratified train and test sets, and the models are trained on lexical URL features.
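
The label conversion and split described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function name stratified_split, the seed, the toy rows, and the choice to stratify on the binary label (rather than the four original labels) are all assumptions. In a scikit-learn workflow the same effect is usually obtained with train_test_split and its stratify argument.

```python
import random
from collections import defaultdict

# Original dataset labels mapped to the binary task (1 = malicious).
BINARY_LABEL = {"benign": 0, "phishing": 1, "malware": 1, "defacement": 1}


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split (url, label) rows so each binary class keeps its share in both parts."""
    by_class = defaultdict(list)
    for url, label in rows:
        target = BINARY_LABEL[label]
        by_class[target].append((url, target))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Toy rows, made up for illustration (not taken from the project data).
rows = (
    [(f"http://example.com/page{i}", "benign") for i in range(60)]
    + [(f"http://login-verify{i}.example.net/", "phishing") for i in range(20)]
    + [(f"http://files{i}.example.org/setup.exe", "malware") for i in range(10)]
    + [(f"http://example.edu/index{i}.php", "defacement") for i in range(10)]
)
train, test = stratified_split(rows)
malicious_share_train = sum(y for _, y in train) / len(train)
malicious_share_test = sum(y for _, y in test) / len(test)
```

With stratification, the malicious share is the same in the train and test parts, which keeps precision and recall on the test set comparable to what the model saw during training.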

Technical approach

  • Extract features such as URL length, digit count, special-character count, dot count, IP-address flag, HTTPS flag, and security-related keywords.
  • Train and compare Logistic Regression, Random Forest, and XGBoost.
  • Review accuracy, precision, recall, F1-score, ROC AUC, average precision, confusion matrix, and feature importance.
  • Frame the model as a decision-support layer, not a sole blocking authority.
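
The lexical features listed above can be computed from the URL string alone, without any network request. The sketch below is one possible implementation under stated assumptions: the function names, the feature keys, and in particular the SUSPICIOUS_KEYWORDS list are hypothetical, since the project text does not spell out its keyword list or exact feature definitions.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical keyword list; the project's actual list is not given in the text.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update", "bank")


def host_is_ip(host):
    """Return True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Compute lexical features from the URL string only; the site is never visited."""
    # urlsplit only recognises the host when a "//" separator is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "digit_count": sum(ch.isdigit() for ch in url),
        "special_char_count": len(re.findall(r"[^A-Za-z0-9]", url)),
        "dot_count": url.count("."),
        "has_ip_host": int(host_is_ip(host)),
        "is_https": int(parts.scheme == "https"),
        "keyword_hits": sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),
    }


# Made-up example URL using a private IP address as the host.
features = extract_features("http://192.168.0.1/secure-login.php?id=42")
```

Each URL becomes one fixed-length row of numbers, so the same feature table can be passed to Logistic Regression, Random Forest, and XGBoost for a like-for-like comparison.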

Evaluation and results

Key indicators

651,191 URLs · Best model: Random Forest · Accuracy 0.876 · ROC AUC 0.934

  • Random Forest achieved the strongest reported test performance with 0.876 accuracy, 0.840 precision, 0.788 recall, and 0.813 F1-score.
  • XGBoost performed competitively with 0.866 accuracy and 0.796 F1-score.
  • Random Forest reached ROC AUC 0.934 and Average Precision 0.901.
  • Feature importance highlighted structural signals such as special characters, dots, URL length, and digit count.
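
The reported metrics all derive from the confusion matrix, which is why accuracy alone can hide missed detections. The sketch below shows the standard formulas and checks that the reported Random Forest precision (0.840) and recall (0.788) are consistent with the reported F1-score (0.813). The function names and the confusion-matrix counts are illustrative assumptions; the project's actual counts are not stated in the text.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Derive the headline metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        # Share of malicious URLs the model lets through.
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
    }


# Consistency check on the reported Random Forest figures.
reported_f1 = round(f1_from(0.840, 0.788), 3)

# Made-up counts: a model that never flags anything on a 90/10 class mix.
never_flags = classification_metrics(tp=0, fp=0, fn=100, tn=900)
```

In the made-up example, the model that never flags anything still scores 0.9 accuracy while its recall is 0 and its false-negative rate is 1, which is the failure mode that recall and false-negative analysis are meant to expose.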

Implementation and code

Implementation focus

The implementation runs data preparation, modeling, evaluation, and interpretation as one structured workflow, so the technical decision made at each stage is visible.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Scope and responsible use

The project demonstrates detection modeling on available data. Operational security use would require continuous data refresh, monitoring, adversarial testing, and privacy-aware logging.

Future development

  • Add DNS, content-based, and network-level features.
  • Evaluate character-level deep learning models and adversarial manipulation.
  • Build a lightweight warning interface for real-time URL screening.

Technical contribution

The project demonstrates security-oriented model evaluation: designing safe features, comparing baselines, interpreting errors, and treating model outputs as support signals rather than automatic enforcement.