Titanic Survival Prediction

Data Science & Decision Modeling | May 8, 2025 | Published project

Structured-data decision modeling workflow

This project builds a supervised classification workflow for the Titanic survival task. It focuses on exploratory analysis, careful feature engineering, preprocessing decisions, class-imbalance review, and model comparison for interpretable predictive modeling.

Python, Pandas, Scikit-learn, SMOTE, GridSearchCV, Logistic Regression, SVC

Challenge

Structured data often contains missing values, skewed variables, categorical fields, and correlated features.
Predictive modeling needs careful preprocessing before model comparison can be meaningful.
Decision tradeoffs should be evaluated with class-level precision, recall, and F1-score rather than accuracy alone.
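
To illustrate why accuracy alone can mislead, here is a minimal sketch that computes class-level precision, recall, and F1-score by hand. The labels are toy values invented for illustration (1 = survived, 0 = did not survive), and the helper name `per_class_metrics` is hypothetical; none of it comes from the project's code or results.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
survivor = per_class_metrics(y_true, y_pred, positive=1)

print(f"accuracy={accuracy:.2f}")
print("survivor precision/recall/f1 = %.2f / %.2f / %.2f" % survivor)
```

In this toy case the classifier is right 70% of the time yet finds only one of the four survivors, which is the kind of gap that class-level metrics expose.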

System architecture

Passenger data: demographic and ticket fields

Feature design: impute, encode, transform

Model comparison: baseline and tuned

Evaluation: tradeoff interpretation

Data and inputs

891 passenger records with the Survived target.
Stratified 80/20 train-test split with 712 training rows and 179 test rows.
Final feature matrix includes 21 engineered or encoded features.
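
The split sizes above can be reproduced with scikit-learn's `train_test_split`, as sketched below. The feature values, the assumed positive-class rate of roughly 38%, and the `random_state` values are placeholders rather than the project's data or settings; only the row count, feature count, and 80/20 stratified split come from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 891 rows and 21 features, with a synthetic binary target.
# The ~38% positive rate is an assumption for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(891, 21))
y = (rng.random(891) < 0.38).astype(int)

# stratify=y keeps the class ratio nearly identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # (712, 21) (179, 21)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With `test_size=0.2`, scikit-learn rounds 0.2 × 891 = 178.2 up to 179 test rows, leaving 712 for training.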

Technical approach

Analyze missing values, outliers, survival distribution, and survival-related patterns.
Engineer features such as title, deck, fare transformation, family size, travel-alone flag, age bins, and fare bins.
Apply imputation, encoding, scaling, and stratified splitting, then run a SMOTE experiment on the training split only, after the split, so that no synthetic samples leak into the test set.
Compare Logistic Regression, SVC, Gaussian Naive Bayes, SMOTE variants, and GridSearchCV-tuned models.
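
A few of the engineered features listed above could be derived along the following lines. This is a sketch, not the project's implementation: the column names (`Name`, `SibSp`, `Parch`, `Fare`, `Cabin`) assume the common Kaggle-style Titanic schema, the output column names are hypothetical, the `"U"` placeholder for unknown decks is an assumption, and the three rows are invented.

```python
import numpy as np
import pandas as pd

# Invented sample rows using assumed Kaggle-style column names.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.93],
    "Cabin": [None, "C85", None],
})

# Title: the token between the comma and the next period in the name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Deck: first letter of the cabin code; "U" marks a missing cabin.
df["Deck"] = df["Cabin"].str[0].fillna("U")

# Family size counts siblings/spouses, parents/children, and the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# log1p compresses the right-skewed fare distribution and handles zero fares.
df["FareLog"] = np.log1p(df["Fare"])

print(df[["Title", "Deck", "FamilySize", "IsAlone", "FareLog"]])
```

Categorical outputs such as `Title` and `Deck` would still need encoding before modeling, for example with one-hot columns.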

Evaluation and results

Key indicators

891 passenger records

21 final features

Tuned test accuracy 0.83799

Tuned Logistic Regression and tuned SVC both reached 0.83799 test accuracy, which corresponds to 150 of the 179 test rows classified correctly.
Tuned Logistic Regression showed slightly stronger survivor recall and F1-score.
Tuned SVC showed stronger survivor precision, making the final choice dependent on the preferred error tradeoff.

Implementation and code

Implementation focus

The implementation connects data preparation, modeling, evaluation, and interpretation in a structured workflow that makes the technical decisions clear.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.

Scope and responsible use

The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.

Future development

Add calibrated probability outputs and explainability with SHAP.
Compare ensemble stacking and additional cross-validation diagnostics.
Package the workflow into a more reusable training and evaluation pipeline.
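
One possible shape for such a reusable pipeline is sketched below with scikit-learn's `Pipeline` and `GridSearchCV`. The data is synthetic (generated with `make_classification`), and the parameter grid, scoring choice, and step names are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 891 x 21 feature matrix (not the real data).
X, y = make_classification(
    n_samples=891, n_features=21, n_informative=8,
    weights=[0.62, 0.38], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the regularization strength.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(classification_report(y_test, search.predict(X_test), digits=3))
```

Keeping preprocessing and the estimator in one object means the same fitted pipeline can be reused for evaluation and prediction. A resampling step such as SMOTE would need a pipeline that supports samplers, for example the one provided by imbalanced-learn, so that resampling is applied only while fitting.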

Technical contribution

The project demonstrates disciplined structured-data modeling: feature design, preprocessing, imbalance review, model tuning, and interpretation of precision-recall tradeoffs.