Balázs Kégl (LAL/CNRS), Alex Gramfort (LTCI/Telecom ParisTech), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)
The data set was donated to us by an unnamed company handling flight ticket reservations. The data is thin: its main column is log_PAX, which is related to the number of passengers (the actual numbers were altered for privacy reasons). The goal is to predict the log_PAX column. The prediction quality is measured by RMSE.
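Concretely, if $y_i$ is the true log_PAX value of the $i$-th row and $\hat{y}_i$ its prediction, the score is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

so lower is better.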
The data is obviously limited, but since date and location information is available, it can be joined to external data sets. The challenge in this RAMP is to find good data that can be correlated to flight traffic.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
# !pip install -U seaborn # if you don't have it, or pip3 for python3
# optional
import seaborn as sns; sns.set()
data = pd.read_csv("data/train.csv.bz2")
print(min(data['DateOfDeparture']))
print(max(data['DateOfDeparture']))
data.head()
data['Departure'].unique()
data.hist(column='log_PAX', bins=50);
data.hist('std_wtd', bins=50);
data.hist('WeeksToDeparture', bins=50);
data.describe()
data.dtypes
data.shape
print(data['log_PAX'].mean())
print(data['log_PAX'].std())
Getting dates into numerical columns is a common operation when time series are analyzed with non-parametric predictors. The code below makes all possible choices: ordered columns for the year, month, day, weekday, week, and the number of days since 1970-01-01, and one-hot columns for the year, month, day, weekday, and week.
The departure and arrival airports are also converted into one-hot columns.
data_encoded = data
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['Departure'], prefix='d'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['Arrival'], prefix='a'))
data_encoded = data_encoded.drop('Departure', axis=1)
data_encoded = data_encoded.drop('Arrival', axis=1)
# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
data_encoded['DateOfDeparture'] = pd.to_datetime(data_encoded['DateOfDeparture'])
data_encoded['year'] = data_encoded['DateOfDeparture'].dt.year
data_encoded['month'] = data_encoded['DateOfDeparture'].dt.month
data_encoded['day'] = data_encoded['DateOfDeparture'].dt.day
data_encoded['weekday'] = data_encoded['DateOfDeparture'].dt.weekday
data_encoded['week'] = data_encoded['DateOfDeparture'].dt.isocalendar().week.astype(int)  # dt.week was removed in pandas 2.0
data_encoded['n_days'] = data_encoded['DateOfDeparture'].apply(lambda date: (date - pd.to_datetime("1970-01-01")).days)
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['year'], prefix='y'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['month'], prefix='m'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['day'], prefix='d'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['weekday'], prefix='wd'))
data_encoded = data_encoded.join(pd.get_dummies(data_encoded['week'], prefix='w'))
data_encoded.tail(5)
We drop the target column and the original date column.
features = data_encoded.drop(['log_PAX','DateOfDeparture'], axis=1)
X_columns = data_encoded.columns.drop(['log_PAX','DateOfDeparture'])
X = features.values
y = data_encoded['log_PAX'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0)
A simple linear regression gives us a first baseline:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
scores = cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print("log RMSE: {:.4f} +/-{:.4f}".format(
np.mean(np.sqrt(-scores)), np.std(np.sqrt(-scores))))
Exercise: visualize the coefficients and try to make sense of them.
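One possible starting point for the exercise (a minimal sketch reusing the objects defined above; showing the 30 largest-magnitude coefficients is an arbitrary choice):

# Sketch for the exercise: fit the linear baseline and plot the
# coefficients with the largest absolute values against the column names.
reg = LinearRegression()
reg.fit(X_train, y_train)
coef_ordering = np.argsort(np.abs(reg.coef_))[::-1][:30]
plt.figure(figsize=(15, 5))
plt.bar(np.arange(len(coef_ordering)), reg.coef_[coef_ordering])
plt.xticks(np.arange(len(coef_ordering)), X_columns[coef_ordering], rotation=90);

A random forest gives us a pretty nice improvement over this linear baseline: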
%%time
from sklearn.ensemble import RandomForestRegressor
n_estimators = 10
max_depth = 10
max_features = 10
reg = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
scores = cross_val_score(reg, X_train, y_train, cv=5, scoring='neg_mean_squared_error', n_jobs=3)
print("log RMSE: {:.4f} +/-{:.4f}".format(
np.mean(np.sqrt(-scores)), np.std(np.sqrt(-scores))))
reg.fit(X_train, y_train)
len(X_columns)
plt.figure(figsize=(15, 5))
ordering = np.argsort(reg.feature_importances_)[::-1][:50]
importances = reg.feature_importances_[ordering]
feature_names = X_columns[ordering]
x = np.arange(len(feature_names))
plt.bar(x, importances)
plt.xticks(x, feature_names, rotation=90, fontsize=15);
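The 80/20 split made earlier has not been used yet; a quick sanity check (again a sketch, reusing the reg fitted above) is to score the forest on the held-out rows:

# Evaluate the fitted random forest on the held-out test split.
from sklearn.metrics import mean_squared_error
y_pred = reg.predict(X_test)
print("held-out RMSE: {:.4f}".format(np.sqrt(mean_squared_error(y_test, y_pred))))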
The feature extractor implements a single transform function. It receives the full pandas object X_df (without the labels) and should produce a numpy array representing the extracted features. If you want to use the (training) labels to save some state of the feature extractor, you can do it in the fit function.
The starting kit feature extractor shows you how to join your data to external data. You will have the possibility to submit a single external csv with each of your submissions (so if you have several data sets, you first have to do the join offline and save the result as a csv). In this case it is weather data, joined to the training data on the DateOfDeparture and Arrival fields. Attention: when you join the data, make sure that the order of the rows in the data frame does not change.
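A left merge with sort=False preserves the row order of the left frame, but duplicated keys in the external table would silently add rows. A cheap defensive pattern (a sketch using the hypothetical frames X_encoded and X_weather of the starting kit below):

# Guard against the join changing the number (and hence order) of rows.
n_rows = len(X_encoded)
X_encoded = pd.merge(
    X_encoded, X_weather, how='left',
    on=['DateOfDeparture', 'Arrival'], sort=False)
assert len(X_encoded) == n_rows  # fails if external keys are duplicated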
The following cell is not executed in the notebook; rather, it saves its content into the file named on its first line, so you can edit your submission before running the local test below and submitting it at the RAMP site.
%%file submissions/starting_kit/feature_extractor.py
import pandas as pd
import os
class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = X_df
        # Load the external (weather) data shipped with the submission.
        path = os.path.dirname(__file__)
        data_weather = pd.read_csv(os.path.join(path, 'external_data.csv'))
        X_weather = data_weather[['Date', 'AirPort', 'Max TemperatureC']]
        X_weather = X_weather.rename(
            columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival'})
        # A left join keeps the row order of X_encoded unchanged.
        X_encoded = pd.merge(
            X_encoded, X_weather, how='left',
            left_on=['DateOfDeparture', 'Arrival'],
            right_on=['DateOfDeparture', 'Arrival'],
            sort=False)
        # One-hot encode the airports, then drop the original columns.
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Departure'], prefix='d'))
        X_encoded = X_encoded.join(
            pd.get_dummies(X_encoded['Arrival'], prefix='a'))
        X_encoded = X_encoded.drop('Departure', axis=1)
        X_encoded = X_encoded.drop('Arrival', axis=1)
        X_encoded = X_encoded.drop('DateOfDeparture', axis=1)
        X_array = X_encoded.values
        return X_array
The regressor should implement an sklearn-like regressor with fit and predict functions.
%%file submissions/starting_kit/regressor.py
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
class Regressor(BaseEstimator):
    def __init__(self):
        self.clf = RandomForestRegressor(
            n_estimators=10, max_depth=10, max_features=10)

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict(self, X):
        return self.clf.predict(X)
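To see how the two pieces fit together outside the RAMP harness, you can chain them by hand once the two cells above have written the files (a rough sketch: it assumes the notebook runs from the starting kit root so that submissions/ is importable, and that external_data.csv is in place):

# Local smoke test mimicking what ramp_test_submission does:
# extract the features, then fit the regressor and predict.
from submissions.starting_kit.feature_extractor import FeatureExtractor
from submissions.starting_kit.regressor import Regressor

X_df = data.drop('log_PAX', axis=1)
y = data['log_PAX'].values
fe = FeatureExtractor()
fe.fit(X_df, y)
X_array = fe.transform(X_df)
reg = Regressor()
reg.fit(X_array, y)
print(reg.predict(X_array)[:5])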
It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit.
First, pip install ramp-workflow or install it from the github repo. Make sure that the files feature_extractor.py, regressor.py, and external_data.csv are in the submissions/starting_kit folder, and that the data files train.csv.bz2 and test.csv.bz2 are in data. Then run

ramp_test_submission

If it runs and prints training and test errors on each fold, then you can submit the code.
!ramp_test_submission
Alternatively, load and execute rampwf.utils.testing.py and call assert_submission. This may be useful if you would like to understand how we instantiate the workflow, the scores, the data connectors, and the cross-validation scheme defined in problem.py, and how we insert and train/test your submission.
# %load https://raw.githubusercontent.com/paris-saclay-cds/ramp-workflow/master/rampwf/utils/testing.py
# assert_submission()
Once you have found a good model, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then find an open event on this particular problem, for example the event DSSP 6 for this RAMP, and sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.
Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) feature_extractor.py, regressor.py, and external_data.csv from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you will get an email, and your submission will show up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions locally (adding feature transformations, selecting models, perhaps doing some AutoML/hyperopt, etc.) and to check them with ramp_test_submission. The script prints the mean cross-validation scores:
----------------------------
train rmse = 0.748 ± 0.0117
valid rmse = 0.858 ± 0.0111
test rmse = 0.881 ± 0.005
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the root mean squared error ("rmse"), so the relevant line in the output of ramp_test_submission is valid rmse = 0.858 ± 0.0111. When the score is good enough, you can submit it at the RAMP.
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.