Benoit Playe (Institut Curie/Mines ParisTech), Chloé-Agathe Azencott (Institut Curie/Mines ParisTech), Alex Gramfort (LTCI/Télécom ParisTech), Balázs Kégl (LAL/CNRS)
This is an introductory project that presents RAMP and shows you how it works.
The goal is to develop prediction models able to identify the people who survived the sinking of the Titanic, based on gender, age, and ticketing information.
The data we will manipulate comes from the Titanic Kaggle challenge.
%matplotlib inline
import os
import glob
import numpy as np
from scipy import io
import matplotlib.pyplot as plt
import pandas as pd
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)
y_df = data['Survived']
X_df = data.drop(['Survived', 'PassengerId'], axis=1)
X_df.head(5)
data.describe()
data.count()
The original training data frame has 891 rows. In the starting kit, we give you a subset of 445 rows. Some passengers have missing information: in particular, Age and Cabin info can be missing. The meaning of the columns is explained on the challenge website.
The goal is to predict whether a passenger has survived from other known attributes. Let us group the data according to the Survived column:
data.groupby('Survived').count()
Most of the passengers (about 62%) perished in the event. A dummy classifier that systematically returns "0" would therefore have an accuracy of about 62%, higher than the 50% expected from random guessing on a binary problem.
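The majority-class baseline can be checked directly. The following is a minimal sketch using scikit-learn's DummyClassifier on made-up labels (62 zeros and 38 ones, chosen only to mirror the 62% figure; they are not the Titanic data, and the names X_toy, y_toy and baseline are illustrative).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 62 passengers who perished (0), 38 who survived (1).
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```

Any real model should beat this number on accuracy before it is worth submitting.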
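A scatterplot matrix allows us to visualize the pairwise relationships between features, with a kernel density estimate of each feature on the diagonal: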
from pandas.plotting import scatter_matrix
scatter_matrix(data.get(['Fare', 'Pclass', 'Age']), alpha=0.2,
               figsize=(8, 8), diagonal='kde');
The Fare variable has a very heavy tail. We can log-transform it.
# Keep all columns and add the log-transformed fare (shifted by 10 to tame small fares).
data_plot = data.assign(LogFare=lambda x: np.log(x.Fare + 10.))
scatter_matrix(data_plot.get(['Age', 'LogFare']), alpha=0.2, figsize=(8, 8), diagonal='kde');
data_plot.plot(kind='scatter', x='Age', y='LogFare', c='Survived', s=50, cmap=plt.cm.Paired);
Another way of visualizing relationships between variables is to plot their bivariate distributions.
import seaborn as sns
sns.set()
sns.set_style("whitegrid")
# Recent seaborn releases take x and y as keywords and use `height` instead of `size`.
sns.jointplot(x=data_plot.Age[data_plot.Survived == 1],
              y=data_plot.LogFare[data_plot.Survived == 1],
              kind="kde", height=7, space=0, color="b");
sns.jointplot(x=data_plot.Age[data_plot.Survived == 0],
              y=data_plot.LogFare[data_plot.Survived == 0],
              kind="kde", height=7, space=0, color="y");
For submitting at the RAMP site, you will have to write two classes, saved in two different files:
the class FeatureExtractor, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features);
the class Classifier, which predicts survival.
The feature extractor implements a transform member function. It is saved in the file submissions/starting_kit/feature_extractor.py. It receives the pandas dataframe X_df defined at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the classification.
Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.
%%file submissions/starting_kit/feature_extractor.py
import pandas as pd


class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        pass

    def transform(self, X_df):
        # Numeric columns are kept as is; categorical columns are one-hot encoded.
        X_df_new = pd.concat(
            [X_df.get(['Fare', 'Age', 'SibSp', 'Parch']),
             pd.get_dummies(X_df.Sex, prefix='Sex', drop_first=True),
             pd.get_dummies(X_df.Pclass, prefix='Pclass', drop_first=True),
             pd.get_dummies(
                 X_df.Embarked, prefix='Embarked', drop_first=True)],
            axis=1)
        # Missing values are flagged with -1.
        X_df_new = X_df_new.fillna(-1)
        XX = X_df_new.values
        return XX
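To see what the transform produces, the same steps can be run on a tiny hand-made frame. This is only a sketch: the four passengers below are invented for illustration, and the final cast to float is an addition made here so that the printed array has a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical four-passenger frame using the same column names as train.csv.
toy_df = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 2, 1],
    "Embarked": ["S", "C", "S", "Q"],
})

# Same steps as FeatureExtractor.transform above.
toy_features = pd.concat(
    [toy_df.get(["Fare", "Age", "SibSp", "Parch"]),
     pd.get_dummies(toy_df.Sex, prefix="Sex", drop_first=True),
     pd.get_dummies(toy_df.Pclass, prefix="Pclass", drop_first=True),
     pd.get_dummies(toy_df.Embarked, prefix="Embarked", drop_first=True)],
    axis=1)
toy_features = toy_features.fillna(-1)

print(list(toy_features.columns))
toy_array = toy_features.values.astype(float)  # cast added for this example
print(toy_array.shape)
```

Note that the dummy columns depend on the categories present in the frame passed to transform, so a frame that lacks a category would yield fewer columns.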
The classifier follows a classical scikit-learn classifier template. It should be saved in the file submissions/starting_kit/classifier.py. In its simplest form it takes a scikit-learn pipeline, assigns it to self.clf in __init__, then calls its fit and predict_proba functions in the corresponding member functions.
%%file submissions/starting_kit/classifier.py
from sklearn.linear_model import LogisticRegression
# SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('classifier', LogisticRegression(C=1.))
        ])

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
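The shape of what predict_proba returns matters for scoring later on. Here is a minimal sketch of the same kind of pipeline fitted on synthetic data; the data, the random seed and the names X_toy, y_toy and pipe are all made up for illustration, and SimpleImputer is used because that is where median imputation lives in recent scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)

# Synthetic data: 200 samples, 3 features, label driven by the first feature.
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_toy[rng.rand(200) < 0.1, 1] = np.nan  # knock out some values for the imputer

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", LogisticRegression(C=1.0)),
])
pipe.fit(X_toy, y_toy)

# One row per sample, one column per class (0 then 1); each row sums to 1.
proba = pipe.predict_proba(X_toy)
print(proba.shape)
print(proba[:3])
```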
It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.
First pip install ramp-workflow or install it from the GitHub repo. Make sure that the python files classifier.py and feature_extractor.py are in the submissions/starting_kit folder, and the data train.csv and test.csv are in data. Then run
ramp_test_submission
If it runs and prints training and test errors on each fold, then you can submit the code.
!ramp_test_submission
Once you have found a good feature extractor and classifier, you can submit them to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then find an open event on the particular problem, for example, the event titanic for this RAMP. Sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.
Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) feature_extractor.py and classifier.py from submissions/starting_kit. Save the submission, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you receive an email, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., locally, and check them with ramp_test_submission. The script prints mean cross-validation scores:
----------------------------
train auc = 0.85 ± 0.005
train acc = 0.81 ± 0.006
train nll = 0.45 ± 0.007
valid auc = 0.87 ± 0.023
valid acc = 0.81 ± 0.02
valid nll = 0.44 ± 0.024
test auc = 0.83 ± 0.006
test acc = 0.76 ± 0.003
test nll = 0.5 ± 0.005
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the area under the ROC curve ("auc"), so the relevant line in the output of ramp_test_submission is valid auc = 0.87 ± 0.023. When the score is good enough, you can submit it at the RAMP.
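A "mean ± standard deviation" line of this kind can be reproduced in a few lines. The sketch below is not ramp-workflow's implementation: the synthetic data, the choice of StratifiedShuffleSplit, the number of folds and the seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic binary problem standing in for the Titanic features.
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)

# Assumed CV scheme for illustration: 8 random stratified 80/20 splits.
cv = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=57)

valid_aucs = []
for train_is, valid_is in cv.split(X_toy, y_toy):
    model = LogisticRegression(C=1.0, max_iter=1000)
    model.fit(X_toy[train_is], y_toy[train_is])
    # AUC needs the predicted probability of class 1 on the held-out part.
    proba = model.predict_proba(X_toy[valid_is])[:, 1]
    valid_aucs.append(roc_auc_score(y_toy[valid_is], proba))

print("valid auc = {:.2f} ± {:.3f}".format(
    np.mean(valid_aucs), np.std(valid_aucs)))
```

The spread across folds gives a rough idea of how much a difference between two submissions can be trusted.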
You can also keep several other submissions in your work directory submissions. random_forest_20_5 uses the same feature extractor as starting_kit but another classifier.
%%file submissions/random_forest_20_5/feature_extractor.py
import pandas as pd


class FeatureExtractor():
    def __init__(self):
        pass

    def fit(self, X_df, y):
        pass

    def transform(self, X_df):
        # Numeric columns are kept as is; categorical columns are one-hot encoded.
        X_df_new = pd.concat(
            [X_df.get(['Fare', 'Age', 'SibSp', 'Parch']),
             pd.get_dummies(X_df.Sex, prefix='Sex', drop_first=True),
             pd.get_dummies(X_df.Pclass, prefix='Pclass', drop_first=True),
             pd.get_dummies(
                 X_df.Embarked, prefix='Embarked', drop_first=True)],
            axis=1)
        # Missing values are flagged with -1.
        X_df_new = X_df_new.fillna(-1)
        XX = X_df_new.values
        return XX
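%%file submissions/random_forest_20_5/classifier.py
from sklearn.ensemble import RandomForestClassifier
# SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator


class Classifier(BaseEstimator):
    def __init__(self):
        # Assumption: hyperparameters are inferred from the folder name
        # 'random_forest_20_5' (20 trees, maximum depth 5); adjust if needed.
        self.clf = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('classifier', RandomForestClassifier(
                n_estimators=20, max_depth=5))
        ])

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)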
You can train this by specifying the subfolder in the test script.
!ramp_test_submission --submission random_forest_20_5
When you are developing and debugging your submission, you may want to stay in the notebook and execute the workflow step by step. You can import problem.py and call the ingredients directly, or even deconstruct the code from ramp-workflow.
import imp
problem = imp.load_source('', 'problem.py')
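The imp module was removed in Python 3.12, so on recent interpreters the import above fails. A possible replacement built on importlib is sketched below; the helper name load_source and the throwaway demo file are illustrative and not part of ramp-workflow.

```python
import importlib.util
import os
import tempfile


def load_source(module_name, path):
    """Load a Python file as a module, similar in spirit to imp.load_source."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstration on a throwaway file; in the notebook the path would be
# 'problem.py'.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_problem.py")
    with open(demo_path, "w") as f:
        f.write("problem_title = 'demo'\n")
    demo = load_source("demo_problem", demo_path)

print(demo.problem_title)
```

With this helper, problem = load_source('problem', 'problem.py') would play the role of the imp call.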
Get the training data.
X_train, y_train = problem.get_train_data()
Get the first cv fold, creating training and validation indices.
train_is, test_is = list(problem.get_cv(X_train, y_train))[0]
test_is
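Each fold is a pair of integer index arrays, one for training and one for validation. The sketch below shows that structure with scikit-learn's ShuffleSplit on toy data; whether problem.get_cv uses this exact splitter and these settings is an assumption, so check problem.py for the real definition.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 20  # toy size, unrelated to the Titanic data
X_toy = np.arange(n_samples).reshape(-1, 1)

# Assumed settings for illustration: 3 random 80/20 splits.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
folds = list(cv.split(X_toy))

# Taking element [0] selects the first (train indices, validation indices) pair.
train_is, test_is = folds[0]
print(len(train_is), len(test_is))
print(test_is)
```

The two index arrays never overlap, which is what makes the validation score an honest estimate.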
Train your starting kit.
fe, clf = problem.workflow.train_submission(
'submissions/starting_kit', X_train, y_train, train_is)
Get the full prediction (train and validation).
y_pred = problem.workflow.test_submission((fe, clf), X_train)
Print the training and validation scores.
score_function = problem.score_types[0]
score_function is callable, wrapping scikit-learn's roc_auc_score. It expects a 0/1 vector as ground truth (since our labels are 0 and 1, y_train can be passed as is), and a 1D vector of predicted probabilities of class '1', which means we need the second column of y_pred.
score_train = score_function(y_train[train_is], y_pred[:, 1][train_is])
print(score_train)
score_valid = score_function(y_train[test_is], y_pred[:, 1][test_is])
print(score_valid)
You can check that it is just a wrapper of roc_auc_score.
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_train[train_is], y_pred[:, 1][train_is]))
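To make the column selection concrete, here is a small worked example of roc_auc_score on a made-up two-column probability array. The four samples are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Hypothetical predict_proba output:
# column 0 is P(class 0), column 1 is P(class 1).
y_proba = np.array([[0.90, 0.10],
                    [0.60, 0.40],
                    [0.65, 0.35],
                    [0.20, 0.80]])

# Three of the four (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, y_proba[:, 1])
print(auc)

# Passing the wrong column reverses the ranking and gives 1 - auc.
auc_wrong_column = roc_auc_score(y_true, y_proba[:, 0])
print(auc_wrong_column)
```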
Get the independent test data.
X_test, y_test = problem.get_test_data()
Test the submission on it.
y_test_pred = problem.workflow.test_submission((fe, clf), X_test)
score_test = score_function(y_test, y_test_pred[:, 1])
print(score_test)
If you want to execute training step by step, go to the feature_extractor_classifier, feature_extractor, and classifier workflows and deconstruct them.
First load the submission files and instantiate the feature extractor and classifier objects.
feature_extractor = imp.load_source(
'', 'submissions/starting_kit/feature_extractor.py')
fe = feature_extractor.FeatureExtractor()
classifier = imp.load_source(
'', 'submissions/starting_kit/classifier.py')
clf = classifier.Classifier()
Select the training folds.
X_train_train_df = X_train.iloc[train_is]
y_train_train = y_train[train_is]
Fit the feature extractor.
fe.fit(X_train_train_df, y_train_train)
Transform the training dataframe into a numpy array.
X_train_train_array = fe.transform(X_train_train_df)
Fit the classifier.
clf.fit(X_train_train_array, y_train_train)
Transform the whole (training + validation) dataframe into a numpy array and compute the prediction.
X_train_array = fe.transform(X_train)
y_pred = clf.predict_proba(X_train_array)
Print the training and validation scores.
score_train = score_function(y_train[train_is], y_pred[:, 1][train_is])
print(score_train)
score_valid = score_function(y_train[test_is], y_pred[:, 1][test_is])
print(score_valid)
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.