Structured-data decision modeling workflow
This project builds a supervised classification workflow for the Titanic survival task. It focuses on exploratory analysis, careful feature engineering, preprocessing decisions, class-imbalance review, and model comparison for interpretable predictive modeling.
Challenge
- Structured data often contains missing values, skewed variables, categorical fields, and correlated features.
- Predictive modeling needs careful preprocessing before model comparison can be meaningful.
- Decision tradeoffs should be evaluated with class-level precision, recall, and F1-score rather than accuracy alone.
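The last point can be made concrete with a small sketch. The labels below are invented for illustration only (they are not the project's predictions), and `class_metrics` is a hypothetical helper name. It computes precision, recall, and F1 for one class at a time, so they can be compared against overall accuracy.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, made up for this example: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
survivor = class_metrics(y_true, y_pred, positive=1)
non_survivor = class_metrics(y_true, y_pred, positive=0)

print(f"accuracy      = {accuracy:.3f}")
print("survivor      P/R/F1 = " + " / ".join(f"{v:.3f}" for v in survivor))
print("non-survivor  P/R/F1 = " + " / ".join(f"{v:.3f}" for v in non_survivor))
```

On these toy labels accuracy is 0.70, yet only half of the true survivors are recovered (recall 0.50). A single accuracy figure hides that kind of gap.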
System architecture
Data and inputs
- 891 passenger records with the Survived target.
- Stratified 80/20 train-test split with 712 training rows and 179 test rows.
- Final feature matrix includes 21 engineered or encoded features.
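The split sizes above follow directly from how scikit-learn rounds an 80/20 split of 891 rows. The sketch below uses random placeholder data of the same shape (not the Titanic records) and an arbitrary `random_state`, both assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data with the same shape as the final feature matrix:
# 891 rows, 21 columns. The values and the positive rate are arbitrary.
X = rng.normal(size=(891, 21))
y = (rng.random(891) < 0.4).astype(int)

# stratify=y keeps the class proportions (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(f"positive rate: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```

With `test_size=0.2`, scikit-learn rounds the test share up (0.2 × 891 = 178.2, so 179 rows) and assigns the remaining 712 rows to training.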
Technical approach
- Analyze missing values, outliers, survival distribution, and survival-related patterns.
- Engineer features such as title, deck, fare transformation, family size, travel-alone flag, age bins, and fare bins.
- Apply imputation, encoding, scaling, and stratified splitting, then run the SMOTE experiment only after the split so that no resampled data leaks into the test set.
- Compare Logistic Regression, SVC, Gaussian Naive Bayes, SMOTE variants, and GridSearchCV-tuned models.
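As a rough illustration of the feature-engineering step, the sketch below derives a title, deck, family size, travel-alone flag, log-transformed fare, and age bins with pandas. The column names (`Name`, `Cabin`, `SibSp`, `Parch`, `Fare`, `Age`) assume the common Kaggle layout of the dataset. The function name, bin edges, and labels are hypothetical choices, not necessarily those used in the project.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative engineered columns; input columns are assumed, see lead-in."""
    out = df.copy()
    # Title is the token between the comma and the first period in the name.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Deck is the first letter of the cabin code; missing cabins get a sentinel.
    out["Deck"] = out["Cabin"].str[0].fillna("Unknown")
    # Family size counts siblings/spouses, parents/children, and the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    # log1p compresses the long right tail of the fare distribution.
    out["FareLog"] = np.log1p(out["Fare"])
    # Hypothetical age bins; the edges are an assumption for this sketch.
    out["AgeBin"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 40, 60, 120],
        labels=["child", "teen", "adult", "middle_aged", "senior"],
    )
    return out


# Three illustrative rows in the assumed column layout.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
        ],
        "Cabin": [np.nan, "C85", np.nan],
        "SibSp": [1, 1, 0],
        "Parch": [0, 0, 0],
        "Fare": [7.25, 71.2833, 7.925],
        "Age": [22.0, 38.0, 26.0],
    }
)

features = engineer_features(sample)
print(features[["Title", "Deck", "FamilySize", "IsAlone", "FareLog", "AgeBin"]])
```

Fare bins could be added the same way, for example with quantile-based binning. They are left out here because quantile edges are not meaningful on three rows.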
Evaluation and results
Key figures: 891 passenger records, 21 final features, tuned test accuracy 0.83799.
- Tuned Logistic Regression and tuned SVC both reached 0.83799 test accuracy, which corresponds to 150 of the 179 test rows classified correctly.
- Tuned Logistic Regression showed slightly stronger survivor recall and F1-score.
- Tuned SVC showed stronger survivor precision, making the final choice dependent on the preferred error tradeoff.
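A comparison of this kind can be sketched with `GridSearchCV` and `classification_report`. The example below runs on synthetic data from `make_classification`, so its numbers will not match the results above. The parameter grids, the `f1` scoring choice, and the `candidates` and `reports` names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the project's feature matrix.
X, y = make_classification(
    n_samples=891, n_features=21, n_informative=8,
    weights=[0.6, 0.4], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical search spaces; the project's actual grids are not stated.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "svc": (SVC(), {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}),
}

reports = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    reports[name] = classification_report(y_test, y_pred, output_dict=True)
    print(name, search.best_params_)
    print(classification_report(y_test, y_pred, digits=3))
```

Reading the per-class rows side by side, for example `reports["logreg"]["1"]` against `reports["svc"]["1"]`, is what exposes a precision-versus-recall tradeoff when two models tie on accuracy.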
Implementation and code
Implementation focus
The implementation connects data preparation, modeling, evaluation, and interpretation in a structured workflow that makes the technical decisions clear.
Source code
The code is available for exploring the implementation details and extending the experiment when needed.
Scope and responsible use
The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.
Future development
- Add calibrated probability outputs and explainability with SHAP.
- Compare ensemble stacking and additional cross-validation diagnostics.
- Package the workflow into a more reusable training and evaluation pipeline.
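One possible shape for the reusable pipeline mentioned in the last item is a scikit-learn `Pipeline` that bundles imputation, encoding, scaling, and the classifier into a single object. This is a minimal sketch under stated assumptions: the column names, the `build_pipeline` helper, and the toy rows are hypothetical, and the imputation strategies are illustrative defaults rather than the project's choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groups; adjust to the real feature set.
NUMERIC_COLS = ["Age", "Fare", "FamilySize"]
CATEGORICAL_COLS = ["Sex", "Embarked"]


def build_pipeline() -> Pipeline:
    """Preprocessing and model in one estimator, so nothing is fit outside training data."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC_COLS),
        ("cat", categorical, CATEGORICAL_COLS),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy rows invented for this example, including missing values.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "FamilySize": [2, 2, 1, 2, 1, 5],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", np.nan, "S", "S"],
})
labels = [0, 1, 1, 1, 0, 0]

model = build_pipeline().fit(toy, labels)
probabilities = model.predict_proba(toy)[:, 1]
print(probabilities.round(3))
```

Because every preprocessing step is fit inside the pipeline, the same object can be passed to cross-validation or a grid search without refitting imputers or scalers on held-out rows. A resampling step such as SMOTE would need a pipeline implementation that applies it during fitting only.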
Technical contribution
The project demonstrates disciplined structured-data modeling: feature design, preprocessing, imbalance review, model tuning, and interpretation of precision-recall tradeoffs.