Data Science & Decision Modeling | May 8, 2025 | Published project
Titanic Survival Prediction

Structured-data decision modeling workflow

This project builds a supervised classification workflow for the Titanic survival task. It focuses on exploratory analysis, careful feature engineering, preprocessing decisions, class-imbalance review, and model comparison for interpretable predictive modeling.

Python, Pandas, Scikit-learn, SMOTE, GridSearchCV, Logistic Regression, SVC

Challenge

  • Structured data often contains missing values, skewed variables, categorical fields, and correlated features.
  • Predictive modeling needs careful preprocessing before model comparison can be meaningful.
  • Decision tradeoffs should be evaluated with class-level precision, recall, and F1-score rather than accuracy alone.
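As a small illustration of the last point, the sketch below computes accuracy next to per-class precision, recall, and F1. The labels are hypothetical values chosen for the example, not project output, and the variable names are assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels for illustration only: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)

# Per-class metrics, returned as arrays ordered like `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)

print(f"accuracy: {accuracy:.3f}")
for label, p, r, f in zip([0, 1], precision, recall, f1):
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case the accuracy is 0.7, yet only half of the true survivors are recovered (recall 0.5 for class 1). A single accuracy figure hides that kind of gap.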

System architecture

Passenger data: demographic and ticket fields
Feature design: impute, encode, transform
Model comparison: baseline and tuned
Evaluation: tradeoff interpretation

Data and inputs

  • 891 passenger records with the Survived target.
  • Stratified 80/20 train-test split with 712 training rows and 179 test rows.
  • Final feature matrix includes 21 engineered or encoded features.
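The split sizes above can be reproduced with a short sketch. It uses placeholder data with 891 rows; the class counts (342 survivors, 549 non-survivors) assume the standard Kaggle Titanic training file, which this page does not state, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder target with 891 rows. The 342/549 class counts are an
# assumption based on the standard Kaggle Titanic training file.
y = np.array([1] * 342 + [0] * 549)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-column feature matrix

# stratify=y keeps the survivor rate nearly equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

n_train, n_test = len(y_train), len(y_test)
print(n_train, n_test)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With `test_size=0.2`, scikit-learn rounds the test partition up (0.2 × 891 = 178.2, so 179 rows), leaving 712 rows for training.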

Technical approach

  • Analyze missing values, outliers, survival distribution, and survival-related patterns.
  • Engineer features such as title, deck, fare transformation, family size, travel-alone flag, age bins, and fare bins.
  • Apply imputation, encoding, scaling, and stratified splitting, then run a SMOTE experiment after the split, on the training rows only, so that synthetic samples cannot leak into the test set.
  • Compare Logistic Regression, SVC, Gaussian Naive Bayes, SMOTE variants, and GridSearchCV-tuned models.
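The feature-engineering step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the standard Kaggle column names (`Name`, `Cabin`, `SibSp`, `Parch`, `Fare`, `Age`). The function name, output column names, and age-bin edges are hypothetical.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add illustrative engineered columns (names and bin edges are assumptions)."""
    out = df.copy()
    # Title: the token between the comma and the period in "Surname, Title. Given names".
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Deck: first letter of the cabin code, "U" (unknown) when the cabin is missing.
    out["Deck"] = out["Cabin"].str[0].fillna("U")
    # Family size counts the passenger plus siblings/spouses and parents/children.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    # log1p compresses the right-skewed fare distribution and is defined at zero.
    out["FareLog"] = np.log1p(out["Fare"])
    # Age bins; missing ages stay missing here and would be imputed separately.
    out["AgeBin"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 40, 60, 120],
        labels=["child", "teen", "adult", "middle", "senior"],
    )
    return out


# Three illustrative rows in the assumed Kaggle layout.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
        ],
        "Age": [22.0, 38.0, 26.0],
        "SibSp": [1, 1, 0],
        "Parch": [0, 0, 0],
        "Fare": [7.25, 71.2833, 7.925],
        "Cabin": [None, "C85", None],
    }
)

features = engineer_features(sample)
print(features[["Title", "Deck", "FamilySize", "IsAlone", "FareLog", "AgeBin"]])
```

Rare titles would normally be grouped before encoding, and fare bins could be added the same way as the age bins; both are left out to keep the sketch short.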

Evaluation and results

Key indicators

  • 891 passenger records
  • 21 final features
  • Tuned test accuracy: 0.83799

  • Tuned Logistic Regression and tuned SVC both reached 0.83799 test accuracy, which corresponds to 150 of the 179 test rows classified correctly.
  • Tuned Logistic Regression showed slightly stronger survivor recall and F1-score.
  • Tuned SVC showed stronger survivor precision, making the final choice dependent on the preferred error tradeoff.
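A comparison of this kind could be set up along the following lines. The sketch runs on synthetic data with the same shape as the project's matrix (891 rows, 21 features), so its numbers are not the project's results. The parameter grids, the `f1` scoring choice, and all variable names are assumptions, since the actual search spaces are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the engineered feature matrix (not the Titanic data).
X, y = make_classification(
    n_samples=891, n_features=21, n_informative=8,
    weights=[0.62, 0.38], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search spaces; step names follow make_pipeline's lowercase convention.
candidates = {
    "logreg": (
        LogisticRegression(max_iter=1000),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "svc": (
        SVC(),
        {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each CV training fold.
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", pos_label=1
    )
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Reporting positive-class precision and recall next to accuracy is what exposes the tradeoff described above: two models can tie on accuracy while one misses fewer survivors and the other raises fewer false alarms.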

Implementation and code

Implementation focus

The implementation runs data preparation, modeling, evaluation, and interpretation as one structured workflow, so each technical decision is visible at the step where it is made.

Source code

The code is available for exploring the implementation details and extending the experiment when needed.


Scope and responsible use

The project is a focused modeling and evaluation study. Broader use should be supported by validation on additional data, robustness checks, monitoring, and domain-specific evaluation.

Future development

  • Add calibrated probability outputs and explainability with SHAP.
  • Compare ensemble stacking and additional cross-validation diagnostics.
  • Package the workflow into a more reusable training and evaluation pipeline.
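One possible starting point for the packaging item is a single scikit-learn `Pipeline` that bundles preprocessing with the model. The sketch below is an assumption about how that could look, not the project's code. The function name, column names, imputation strategies, and toy rows are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(numeric_cols, categorical_cols):
    """Bundle imputation, encoding, scaling, and a classifier (illustrative only)."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])


# Hypothetical toy rows, used only to show that the pipeline fits end to end.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
    "Embarked": ["S", "C", "S", np.nan, "S", "S", "S", "C"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

pipeline = build_pipeline(["Age", "Fare"], ["Sex", "Embarked"])
pipeline.fit(toy.drop(columns="Survived"), toy["Survived"])
proba = pipeline.predict_proba(toy.drop(columns="Survived"))
print(proba.shape)
```

Because every preprocessing step is fit inside the pipeline, passing it to cross-validation or a grid search refits imputers and scalers on each training fold, which keeps held-out rows out of the fitted statistics.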

Technical contribution

The project demonstrates disciplined structured-data modeling: feature design, preprocessing, imbalance review, model tuning, and interpretation of precision-recall tradeoffs.