Health-data classification workflow
This project builds a supervised classification workflow for diagnostic tabular data. It focuses on exploratory analysis, feature scaling, probabilistic modeling, and evaluation that distinguishes overall accuracy from class-level recall and precision.
Challenge
- Clinical-style tabular datasets need careful evaluation: a missed malignant case (a false negative) is typically costlier than a false alarm, so class-level recall can matter more than a single accuracy value.
- Numerical diagnostic features need exploration, scaling, and interpretation before model results can be trusted.
- A lightweight baseline is useful for understanding whether meaningful class separation exists before introducing more complex models.
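To make the gap between accuracy and class-level recall concrete, here is a small worked example on made-up labels. The counts are hypothetical and chosen only for illustration; they are not taken from this project's results.

```python
# Hypothetical toy labels (not project data): 1 = malignant, 0 = benign.
y_true = [1] * 10 + [0] * 90
# A classifier that finds only 4 of the 10 malignant cases and raises no false alarms.
y_pred = [1] * 4 + [0] * 6 + [0] * 90


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all cases classified correctly
recall = tp / (tp + fn)              # share of malignant cases that were found
precision = tp / (tp + fp)           # share of malignant predictions that were right

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
# accuracy=0.94 recall=0.40 precision=1.00
```

Accuracy looks strong at 0.94 while 6 of the 10 malignant cases are missed, which is why the workflow reports recall and precision per class.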
System architecture
Data and inputs
- Breast Cancer Wisconsin dataset from scikit-learn.
- 569 samples with 30 numerical diagnostic features.
- Binary target with benign and malignant classes.
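The inputs listed above can be checked with a few lines. This is a minimal sketch that assumes the data comes from scikit-learn's `load_breast_cancer` loader. Note that in scikit-learn's copy of the dataset, label 0 is malignant and label 1 is benign, which matters when reading per-class metrics later.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                   # (569, 30)
print(list(data.target_names))   # ['malignant', 'benign'] -> 0 = malignant, 1 = benign

# Class balance: number of samples per label.
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count}")
```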
Technical approach
- Review feature distributions, correlations, and class patterns.
- Scale numerical features before model training.
- Train Gaussian Naive Bayes as a fast probabilistic baseline.
- Evaluate accuracy, precision, recall, F1-score, and the confusion matrix.
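The four steps above can be sketched end to end as follows. This is an illustration rather than the project's actual code: the 80/20 stratified split, the `random_state` value, and the use of `StandardScaler` inside a pipeline are assumptions, so the numbers it prints will not necessarily match the reported results.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target  # 0 = malignant, 1 = benign

# 1. Explore: feature correlations and per-class feature means.
corr = np.corrcoef(X, rowvar=False)        # 30 x 30 correlation matrix
malignant_means = X[y == 0].mean(axis=0)
benign_means = X[y == 1].mean(axis=0)

# 2. Split first (assumed 80/20, stratified) so the scaler is fit on
#    training data only and the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Scale and train: the pipeline fits the scaler and the model together.
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(X_train, y_train)

# 4. Evaluate: overall accuracy plus class-level metrics.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows = true, cols = predicted
report = classification_report(
    y_test, y_pred, labels=[0, 1], target_names=list(data.target_names)
)

print(f"test accuracy: {test_accuracy:.3f}")
print(cm)
print(report)
```

Keeping the scaler inside the pipeline means the same fitted transformation is applied at prediction time, which avoids leaking test-set statistics into training.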
Evaluation and results
- The model achieved 96% test accuracy.
- The malignant class reached 0.99 recall in the reported evaluation.
- Train and test accuracy were close, suggesting no severe overfitting in this baseline workflow.
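The two checks described above, the train/test gap and malignant recall, can be computed as in the sketch below. The split parameters are assumptions for illustration, so the values may differ from the reported 96% and 0.99. Because malignant is label 0 in scikit-learn's copy of the dataset, `pos_label=0` is needed to get recall for the malignant class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

# Assumed split settings; the original project's values are not stated.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), GaussianNB()).fit(X_train, y_train)

# A small gap between train and test accuracy suggests limited overfitting.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
gap = train_accuracy - test_accuracy

# Recall for the malignant class specifically (label 0).
malignant_recall = recall_score(y_test, model.predict(X_test), pos_label=0)

print(f"train={train_accuracy:.3f} test={test_accuracy:.3f} gap={gap:+.3f}")
print(f"malignant recall={malignant_recall:.3f}")
```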
Implementation and code
Implementation focus
The implementation runs data preparation, modeling, evaluation, and interpretation as one structured workflow, so each technical decision can be traced to the step where it is made.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
This project demonstrates modeling and evaluation on health-related data and is not intended for clinical decision-making. Any clinical use would require external validation, expert review, calibration, and regulatory oversight.
Future development
- Compare additional models and calibrated probability outputs.
- Add explainability views for influential diagnostic features.
- Evaluate robustness across external datasets and different train/test splits.
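As one possible starting point for the split-robustness item, repeated stratified cross-validation shows how much accuracy and malignant recall vary from split to split. This is a sketch of a future experiment, not something the project already reports; the fold count, repeat count, and seed are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign
model = make_pipeline(StandardScaler(), GaussianNB())

# Score both overall accuracy and recall on the malignant class (label 0).
scoring = {
    "accuracy": "accuracy",
    "malignant_recall": make_scorer(recall_score, pos_label=0),
}

# 5 folds x 3 repeats = 15 different train/test splits (assumed settings).
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

acc = scores["test_accuracy"]
rec = scores["test_malignant_recall"]
print(f"accuracy:         mean={acc.mean():.3f} std={acc.std():.3f}")
print(f"malignant recall: mean={rec.mean():.3f} std={rec.std():.3f}")
```

A large standard deviation across splits would indicate that a single train/test split can give an overly optimistic or pessimistic picture.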
Technical contribution
The project demonstrates disciplined model evaluation for sensitive tabular classification: exploring the data, building an interpretable baseline, and reading class-level metrics instead of relying on accuracy alone.