This is an initiation project to introduce RAMP and help you get familiar with how it works.
The goal is to develop prediction models able to identify people who survived the sinking of the Titanic, based on gender, age, and ticketing information.
The data we will manipulate comes from the Titanic Kaggle challenge.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
We can use some utilities from RAMP to load the public train and test sets.
import problem
X_train, y_train = problem.get_train_data()
X_train contains information about each passenger who held a ticket for the voyage, while y_train indicates whether or not that passenger survived.
X_train.head()
y_train[:5]
We can first look at some descriptive statistics regarding our data.
X_train.describe()
It is also good to get an overview of the types of data we are dealing with and to check whether there are any missing values.
X_train.info()
The original training data frame has 891 rows. In the starting kit, we give you a subset of 356 rows. Some passengers have missing information: in particular, Age and Cabin can be missing. The meaning of the columns is explained on the challenge website.
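For instance, we can count the missing values per column directly (a quick check):
# Number of missing entries in each column
X_train.isnull().sum()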
The goal is to predict whether a passenger survived from the other known attributes. Let us group the data according to the Survived column:
_ = pd.Series(y_train).value_counts().rename({0: "deceased", 1: "survived"}).plot(kind="bar")
About two-thirds of the passengers perished in the event. A dummy classifier that systematically returns "0" (deceased) would therefore reach an accuracy of about 62%, higher than that of a uniformly random model.
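As a sanity check, this baseline can be reproduced with scikit-learn's DummyClassifier; the sketch below always predicts the majority class, and its training accuracy should be roughly the 62% quoted above (the exact value depends on the class balance of the subset):
from sklearn.dummy import DummyClassifier
# Always predict the majority class (here: not survived)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline.score(X_train, y_train)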
A scatterplot matrix allows us to visualize pairwise relationships between features, colored by survival:
# Add the target as a column so that plots can be colored by survival
df = X_train.copy()
df["Survived"] = y_train
features = ['Fare', 'Age', 'Survived']
_ = sns.pairplot(
    df[features], hue="Survived"
)
# Survival rate by passenger class, split by sex
g = sns.catplot(
    x="Pclass", y="Survived", hue="Sex", data=df,
    height=6, kind="bar", palette="muted"
)
The Fare variable has a very heavy tail. We can log-transform it.
df["LogFare"] = np.log(df["Fare"] + 10)
features = ['LogFare', 'Age', 'Survived']
_ = sns.pairplot(
    df[features], hue="Survived"
)
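To verify that the transform tamed the tail, we can compare the skewness of the raw and log-transformed fares (an illustrative check; exact values depend on the data subset):
# Skewness should drop substantially after the log transform
print(df["Fare"].skew(), df["LogFare"].skew())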
Another way of visualizing relationships between variables is to plot their bivariate distributions.
# Joint density of age and log-fare among survivors
_ = sns.jointplot(
    data=df[df.Survived == 1], x="Age", y="LogFare",
    kind="kde", height=7, space=0, color="b"
)
# Joint density of age and log-fare among the deceased
_ = sns.jointplot(
    data=df[df.Survived == 0], x="Age", y="LogFare",
    kind="kde", height=7, space=0, color="y"
)
Below, we present a basic prediction workflow using scikit-learn.
First, we will perform some simple preprocessing of our data:

- one-hot encode the categorical columns Sex, Pclass, and Embarked;
- for the numerical columns Age, SibSp, Parch, and Fare, fill in missing values with a default value (-1).

This can be done succinctly with make_column_transformer, which performs specific transformations on specific features.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# Categorical features: one-hot encode, ignoring categories unseen during fit
categorical_cols = ['Sex', 'Pclass', 'Embarked']
categorical_pipeline = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
# Numerical features: standardize, then replace missing values with -1
# (StandardScaler disregards NaNs during fit, so scaling before imputation is safe)
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
numerical_pipeline = make_pipeline(
    StandardScaler(), SimpleImputer(strategy='constant', fill_value=-1)
)
preprocessor = make_column_transformer(
    (categorical_pipeline, categorical_cols),
    (numerical_pipeline, numerical_cols),
)
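To sanity-check the preprocessor on its own, we can fit and transform the training data and inspect the resulting shape (an optional check, not part of the prediction workflow):
# Rows are unchanged; columns = one-hot encoded categories + 4 numerical features
X_encoded = preprocessor.fit_transform(X_train)
X_encoded.shape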
The preprocessor object created with make_column_transformer can be used in a scikit-learn Pipeline. A Pipeline assembles several steps together and can be used to cross-validate an entire workflow. Generally, transformation steps are combined with a final estimator.
We will create a pipeline consisting of the preprocessor created above and a final estimator, LogisticRegression.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('transformer', preprocessor),
    ('classifier', LogisticRegression()),
])
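As an aside, make_pipeline (already imported above) builds an equivalent object with auto-generated step names:
# Equivalent construction; step names are derived from the class names
alt_pipeline = make_pipeline(preprocessor, LogisticRegression())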
We can cross-validate our pipeline using cross_val_score. Below we specify cv=8, meaning 8-fold cross-validation will be used (for a classifier, scikit-learn defaults to stratified folds when cv is an integer). The Area Under the Receiver Operating Characteristic Curve (ROC AUC) score is calculated for each split. The output scores will be an array of 8 values, one per fold. The mean and standard deviation of the 8 scores are printed at the end.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=8, scoring='roc_auc')
print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))
Once you have created a model with cross-validation scores you are happy with, you can test how well it performs on the independent test data.
First we will read in our test data:
X_test, y_test = problem.get_test_data()
X_test.head()
Next we need to fit our pipeline on our training data:
clf = pipeline.fit(X_train, y_train)
Now we can predict on our test data:
y_pred = pipeline.predict(X_test)
Finally, we can calculate how well our model performed on the test data. Since ROC AUC is computed from a continuous score, we pass the predicted probability of the positive class rather than the hard class predictions:
from sklearn.metrics import roc_auc_score
# Probability of class 1 ("survived") for each test passenger
y_proba = pipeline.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, y_proba)
score
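For comparison with the ~62% majority-class baseline discussed earlier, we can also compute plain accuracy on the test set (an optional check using the hard predictions from above):
from sklearn.metrics import accuracy_score
# Fraction of correctly classified test passengers
accuracy_score(y_test, y_pred)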
To submit your code, you can refer to the online documentation.