Balázs Kégl (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)
The data set was donated to us by an unnamed company handling flight ticket reservations. The data is thin: it contains a column log_PAX, which is related to the number of passengers (the actual numbers were changed for privacy reasons).
The goal is to predict the log_PAX column. The prediction quality is measured by RMSE.
The data is obviously limited, but since date and location information is available, it can be joined to external data sets. The challenge in this RAMP is to find good data that can be correlated to flight traffic.
%matplotlib inline
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
The training and testing data are located in the folder data. They are compressed csv files (i.e. csv.bz2). We can load the dataset using pandas.
data = pd.read_csv(
os.path.join('data', 'train.csv.bz2')
)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
DateOfDeparture     8902 non-null object
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
log_PAX             8902 non-null float64
std_wtd             8902 non-null float64
dtypes: float64(3), object(3)
memory usage: 417.4+ KB
As stated earlier, the column log_PAX is the target for our regression problem, and the other columns are the features that will be used for prediction. If we focus on the data types of the columns, we can see that DateOfDeparture, Departure, and Arrival are of object dtype, meaning they are strings.
data[['DateOfDeparture', 'Departure', 'Arrival']].head()
 | DateOfDeparture | Departure | Arrival |
---|---|---|---|
0 | 2012-06-19 | ORD | DFW |
1 | 2012-09-10 | LAS | DEN |
2 | 2012-10-05 | DEN | LAX |
3 | 2011-10-09 | ATL | ORD |
4 | 2012-02-21 | DEN | SFO |
While it makes sense that Departure and Arrival are airport codes (and hence strings), DateOfDeparture should be a date rather than a string. We can use pandas to convert this column.
data.loc[:, 'DateOfDeparture'] = pd.to_datetime(data.loc[:, 'DateOfDeparture'])
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
DateOfDeparture     8902 non-null datetime64[ns]
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
log_PAX             8902 non-null float64
std_wtd             8902 non-null float64
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 417.4+ KB
When you create a submission, ramp-workflow will load the data for you and split it into a data matrix X and a target vector y. It will also take care of splitting the data into training and testing sets. These utilities are available in the module problem.py, which we will now load.
import problem
The function get_train_data() loads the training data and returns a pandas dataframe X and a numpy vector y.
X, y = problem.get_train_data()
type(X)
pandas.core.frame.DataFrame
type(y)
numpy.ndarray
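For reference, here is a minimal sketch of what such a loader could look like. This is an illustration only (the helper name and its internals are assumptions), not the actual implementation in problem.py, which also takes care of the train/test splitting used by RAMP.
import os
import pandas as pd

_target_column_name = 'log_PAX'  # target column, as described above

def _get_train_data_sketch(path='.'):
    # Hypothetical loader: read the compressed training csv and pop the target.
    data = pd.read_csv(os.path.join(path, 'data', 'train.csv.bz2'))
    y = data[_target_column_name].values          # numpy vector
    X = data.drop(columns=[_target_column_name])  # pandas dataframe
    return X, y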
We can check the information of the dataframe X:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 5 columns):
DateOfDeparture     8902 non-null object
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
std_wtd             8902 non-null float64
dtypes: float64(2), object(3)
memory usage: 347.9+ KB
It is important to note that ramp-workflow does not convert the DateOfDeparture column into a datetime format. Keep in mind that you might need to make this conversion when prototyping your machine learning pipeline later on. Let's check some statistics regarding our dataset.
print(min(X['DateOfDeparture']))
print(max(X['DateOfDeparture']))
2011-09-01
2013-03-05
X['Departure'].unique()
array(['ORD', 'LAS', 'DEN', 'ATL', 'SFO', 'EWR', 'IAH', 'LAX', 'DFW', 'SEA', 'JFK', 'PHL', 'MIA', 'DTW', 'BOS', 'MSP', 'CLT', 'MCO', 'PHX', 'LGA'], dtype=object)
_ = plt.hist(y, bins=50)
_ = X.hist('std_wtd', bins=50)
_ = X.hist('WeeksToDeparture', bins=50)
X.describe()
 | WeeksToDeparture | std_wtd |
---|---|---|
count | 8902.000000 | 8902.000000 |
mean | 11.446469 | 8.617773 |
std | 2.787140 | 2.139604 |
min | 2.625000 | 2.160247 |
25% | 9.523810 | 7.089538 |
50% | 11.300000 | 8.571116 |
75% | 13.240000 | 10.140521 |
max | 23.163265 | 15.862216 |
X.shape
(8902, 5)
print(y.mean())
print(y.std())
10.99904767212102 0.9938894125318564
Getting dates into numerical columns is a common operation when time series data is analyzed with non-parametric predictors. The code below parses DateOfDeparture as a datetime and extracts the year, month, day, weekday, and week, as well as the number of days since 1970-01-01 (n_days).
# Make a copy of the original data to avoid writing on the original data
X_encoded = X.copy()
# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
X_encoded['year'] = X_encoded['DateOfDeparture'].dt.year
X_encoded['month'] = X_encoded['DateOfDeparture'].dt.month
X_encoded['day'] = X_encoded['DateOfDeparture'].dt.day
X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
X_encoded['week'] = X_encoded['DateOfDeparture'].dt.week
X_encoded['n_days'] = X_encoded['DateOfDeparture'].apply(lambda date: (date - pd.to_datetime("1970-01-01")).days)
X_encoded.tail(5)
 | DateOfDeparture | Departure | Arrival | WeeksToDeparture | std_wtd | year | month | day | weekday | week | n_days |
---|---|---|---|---|---|---|---|---|---|---|---|
8897 | 2011-10-02 | DTW | ATL | 9.263158 | 7.316967 | 2011 | 10 | 2 | 6 | 39 | 15249 |
8898 | 2012-09-25 | DFW | ORD | 12.772727 | 10.641034 | 2012 | 9 | 25 | 1 | 39 | 15608 |
8899 | 2012-01-19 | SFO | LAS | 11.047619 | 7.908705 | 2012 | 1 | 19 | 3 | 3 | 15358 |
8900 | 2013-02-03 | ORD | PHL | 6.076923 | 4.030334 | 2013 | 2 | 3 | 6 | 5 | 15739 |
8901 | 2011-11-26 | DTW | ATL | 9.526316 | 6.167733 | 2011 | 11 | 26 | 5 | 47 | 15304 |
We will perform all preprocessing steps within a scikit-learn pipeline, which chains together transformation and estimator steps. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with cross_val_score.
To perform the above encoding within a scikit-learn pipeline, we will write a function and wrap it in a FunctionTransformer to make it compatible with the scikit-learn API.
from sklearn.preprocessing import FunctionTransformer
def _encode_dates(X):
# With pandas < 1.0, we wil get a SettingWithCopyWarning
# In our case, we will avoid this warning by triggering a copy
# More information can be found at:
# https://github.com/scikit-learn/scikit-learn/issues/16191
X_encoded = X.copy()
# Make sure that DateOfDeparture is of datetime format
X_encoded.loc[:, 'DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
# Encode the DateOfDeparture
X_encoded.loc[:, 'year'] = X_encoded['DateOfDeparture'].dt.year
X_encoded.loc[:, 'month'] = X_encoded['DateOfDeparture'].dt.month
X_encoded.loc[:, 'day'] = X_encoded['DateOfDeparture'].dt.day
X_encoded.loc[:, 'weekday'] = X_encoded['DateOfDeparture'].dt.weekday
X_encoded.loc[:, 'week'] = X_encoded['DateOfDeparture'].dt.week
X_encoded.loc[:, 'n_days'] = X_encoded['DateOfDeparture'].apply(
lambda date: (date - pd.to_datetime("1970-01-01")).days
)
# Once we did the encoding, we will not need DateOfDeparture
return X_encoded.drop(columns=["DateOfDeparture"])
date_encoder = FunctionTransformer(_encode_dates)
date_encoder.fit_transform(X).head()
 | Departure | Arrival | WeeksToDeparture | std_wtd | year | month | day | weekday | week | n_days |
---|---|---|---|---|---|---|---|---|---|---|
0 | ORD | DFW | 12.875000 | 9.812647 | 2012 | 6 | 19 | 1 | 25 | 15510 |
1 | LAS | DEN | 14.285714 | 9.466734 | 2012 | 9 | 10 | 0 | 37 | 15593 |
2 | DEN | LAX | 10.863636 | 9.035883 | 2012 | 10 | 5 | 4 | 40 | 15618 |
3 | ATL | ORD | 11.480000 | 7.990202 | 2011 | 10 | 9 | 6 | 40 | 15256 |
4 | DEN | SFO | 11.450000 | 9.517159 | 2012 | 2 | 21 | 1 | 8 | 15391 |
Tree-based algorithms require less complex preprocessing than linear models. We will first present a machine-learning pipeline that uses a random forest. In this pipeline, we will need to:

* encode DateOfDeparture into numerical date features;
* encode the categorical columns Arrival and Departure (an ordinal encoding is sufficient for a tree-based model);
* pass the numerical columns WeeksToDeparture and std_wtd through unchanged.

Thus, we want to perform three different processes on different columns of the original data X. In scikit-learn, we can use make_column_transformer to perform such processing.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]
categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]
preprocessor = make_column_transformer(
(date_encoder, date_cols),
(categorical_encoder, categorical_cols),
remainder='passthrough', # passthrough numerical columns as they are
)
We can combine our preprocessor with an estimator (RandomForestRegressor in this case), allowing us to make predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
n_estimators = 10
max_depth = 10
max_features = 10
regressor = RandomForestRegressor(
n_estimators=n_estimators, max_depth=max_depth, max_features=max_features
)
pipeline = make_pipeline(preprocessor, regressor)
We can cross-validate our pipeline using cross_val_score. Below we specify cv=5, meaning that KFold cross-validation with 5 folds is used. The (negated) mean squared error is computed on each split, so scores is an array of 5 values, one per fold. We convert them to RMSE and print the mean and standard deviation of the 5 scores at the end.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)
print(
f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6277 +/- 0.0186
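If you want more control over the splitting when experimenting locally, you can pass an explicit splitter object to cross_val_score instead of an integer. Below is a small sketch using KFold with shuffling (for local experimentation only; the RAMP evaluation relies on the splitting defined in problem.py):
from sklearn.model_selection import KFold

# Explicit 5-fold splitter with shuffling, for local experiments.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    pipeline, X, y, cv=cv, scoring='neg_mean_squared_error'
)
print(f"RMSE: {np.mean(np.sqrt(-scores)):.4f}")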
When dealing with a linear model, we need to one-hot encode the categorical variables (instead of using an ordinal encoding) and to standardize the numerical variables. Thus we will:

* encode DateOfDeparture into date features as before;
* one-hot encode Arrival, Departure, and the extracted date columns (year, month, day, weekday, week, n_days);
* standardize the numerical columns WeeksToDeparture and std_wtd.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]
categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = [
"Arrival", "Departure", "year", "month", "day",
"weekday", "week", "n_days"
]
numerical_scaler = StandardScaler()
numerical_cols = ["WeeksToDeparture", "std_wtd"]
preprocessor = make_column_transformer(
(categorical_encoder, categorical_cols),
(numerical_scaler, numerical_cols)
)
We can now combine our preprocessor with the LinearRegression estimator in a Pipeline:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
pipeline = make_pipeline(date_encoder, preprocessor, regressor)
And we can evaluate our linear-model pipeline:
scores = cross_val_score(
pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)
print(
f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6117 +/- 0.0149
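Since the pipeline is modular, you can swap the estimator without touching the preprocessing. For example, here is a sketch using the regularized Ridge regressor instead of LinearRegression (Ridge is a standard scikit-learn estimator, not part of the starting kit, and the alpha value below is arbitrary):
from sklearn.linear_model import Ridge

# Same date encoding and preprocessing as above, only the estimator changes.
regressor = Ridge(alpha=1.0)
pipeline = make_pipeline(date_encoder, preprocessor, regressor)
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
print(f"RMSE: {np.mean(np.sqrt(-scores)):.4f}")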
The objective in this RAMP data challenge is to find good data that can be correlated to flight traffic. We will use some weather data (saved in submissions/starting_kit) to provide an example of how to merge external data within a scikit-learn pipeline.
Your external data will need to be included in your submissions folder - see RAMP submissions for more details.
First we will define a function that merges the external data to our feature data.
# when submitting a kit, the `__file__` variable will correspond to the
# path to `estimator.py`. However, this variable is not defined in the
# notebook, so we define `__file__` here to imitate how a submission
# `.py` file would behave.
__file__ = os.path.join('submissions', 'starting_kit', 'estimator.py')
filepath = os.path.join(os.path.dirname(__file__), 'external_data.csv')
filepath
'submissions/starting_kit/external_data.csv'
pd.read_csv(filepath).head()
 | Date | AirPort | Max TemperatureC | Mean TemperatureC | Min TemperatureC | Dew PointC | MeanDew PointC | Min DewpointC | Max Humidity | Mean Humidity | Min Humidity | Max Sea Level PressurehPa | Mean Sea Level PressurehPa | Min Sea Level PressurehPa | Max VisibilityKm | Mean VisibilityKm | Min VisibilitykM | Max Wind SpeedKm/h | Mean Wind SpeedKm/h | Max Gust SpeedKm/h | Precipitationmm | CloudCover | Events | WindDirDegrees |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-09-01 | ATL | 35 | 29 | 24 | 21 | 18 | 14 | 79 | 56 | 32 | 1022 | 1019 | 1017 | 16 | 16 | 11 | 19 | 6 | 26.0 | 0.00 | 3 | NaN | 129 |
1 | 2011-09-02 | ATL | 36 | 29 | 22 | 17 | 15 | 14 | 61 | 46 | 30 | 1019 | 1016 | 1014 | 16 | 16 | 16 | 24 | 7 | 34.0 | 0.00 | 2 | NaN | 185 |
2 | 2011-09-03 | ATL | 35 | 29 | 23 | 17 | 16 | 14 | 64 | 47 | 30 | 1015 | 1013 | 1011 | 16 | 16 | 16 | 19 | 7 | 26.0 | 0.00 | 4 | NaN | 147 |
3 | 2011-09-04 | ATL | 27 | 24 | 22 | 22 | 19 | 16 | 93 | 72 | 51 | 1014 | 1012 | 1011 | 16 | 14 | 4 | 21 | 9 | 26.0 | 6.10 | 6 | Rain | 139 |
4 | 2011-09-05 | ATL | 26 | 24 | 22 | 23 | 22 | 20 | 94 | 91 | 87 | 1010 | 1005 | 999 | 16 | 13 | 3 | 32 | 16 | 45.0 | 16.00 | 8 | Rain-Thunderstorm | 149 |
def _merge_external_data(X):
filepath = os.path.join(
os.path.dirname(__file__), 'external_data.csv'
)
    X = X.copy()  # to avoid raising a SettingWithCopyWarning
# Make sure that DateOfDeparture is of dtype datetime
X.loc[:, "DateOfDeparture"] = pd.to_datetime(X['DateOfDeparture'])
# Parse date to also be of dtype datetime
data_weather = pd.read_csv(filepath, parse_dates=["Date"])
X_weather = data_weather[['Date', 'AirPort', 'Max TemperatureC']]
X_weather = X_weather.rename(
columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival'})
X_merged = pd.merge(
X, X_weather, how='left', on=['DateOfDeparture', 'Arrival'], sort=False
)
return X_merged
data_merger = FunctionTransformer(_merge_external_data)
Double check that our function works:
data_merger.fit_transform(X).head()
 | DateOfDeparture | Departure | Arrival | WeeksToDeparture | std_wtd | Max TemperatureC |
---|---|---|---|---|---|---|
0 | 2012-06-19 | ORD | DFW | 12.875000 | 9.812647 | 34 |
1 | 2012-09-10 | LAS | DEN | 14.285714 | 9.466734 | 33 |
2 | 2012-10-05 | DEN | LAX | 10.863636 | 9.035883 | 22 |
3 | 2011-10-09 | ATL | ORD | 11.480000 | 7.990202 | 27 |
4 | 2012-02-21 | DEN | SFO | 11.450000 | 9.517159 | 16 |
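The starting-kit merger keeps only Max TemperatureC. To experiment with more of the weather information, you can select additional columns in the same way; below is a sketch (the function name is hypothetical and the column names are taken from the external_data.csv preview above):
def _merge_more_weather(X):
    # Hypothetical variant of _merge_external_data that keeps a few more
    # weather columns from external_data.csv.
    filepath = os.path.join(os.path.dirname(__file__), 'external_data.csv')
    X = X.copy()
    X.loc[:, 'DateOfDeparture'] = pd.to_datetime(X['DateOfDeparture'])
    data_weather = pd.read_csv(filepath, parse_dates=["Date"])
    cols = ['Date', 'AirPort', 'Max TemperatureC', 'Mean TemperatureC',
            'Precipitationmm', 'Mean Wind SpeedKm/h']
    X_weather = data_weather[cols].rename(
        columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival'}
    )
    return pd.merge(
        X, X_weather, how='left', on=['DateOfDeparture', 'Arrival'], sort=False
    )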
We used FunctionTransformer above to make our merging function compatible with the scikit-learn API. We can now assemble the full pipeline, using this data_merger together with the same preprocessor and random forest as before:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer
date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]
categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]
preprocessor = make_column_transformer(
(date_encoder, date_cols),
(categorical_encoder, categorical_cols),
remainder='passthrough', # passthrough numerical columns as they are
)
n_estimators = 10
max_depth = 10
max_features = 10
regressor = RandomForestRegressor(
n_estimators=n_estimators, max_depth=max_depth, max_features=max_features
)
pipeline = make_pipeline(data_merger, preprocessor, regressor)
scores = cross_val_score(
pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)
print(
f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6330 +/- 0.0108
We can check the feature importances using the function sklearn.inspection.permutation_importance. Since the first step of our pipeline adds the external feature Max TemperatureC, we want to compute the importances after this merging step, so that Max TemperatureC is included. Indeed, permutation_importance can be applied at any stage of the pipeline, as we will see later on.
The code below:

* splits the data into training and testing sets with train_test_split;
* calls transform on the first step of the pipeline (pipeline[0]), producing the augmented train (X_train_augmented) and test (X_test_augmented) data;
* fits the remaining steps of the pipeline (pipeline[1:]) on the augmented training data and scores them on the augmented test data.

Note that pipelines can be sliced. Indexing with an integer, e.g. pipeline[0], returns the estimator of the corresponding step (here the data_merger transformer), while a slice such as pipeline[1:] returns a new Pipeline made of the remaining steps. The underlying (name, estimator) tuples are available via pipeline.steps.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=42
)
merger = pipeline[0]
X_train_augmented = merger.transform(X_train)
X_test_augmented = merger.transform(X_test)
predictor = pipeline[1:]
predictor.fit(X_train_augmented, y_train).score(X_test_augmented, y_test)
0.5939918926708109
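As an aside, the (name, estimator) pairs mentioned in the note above can be inspected directly; a small illustrative snippet (the step names are generated automatically by make_pipeline):
# List the automatically generated step names and their estimators.
for name, step in pipeline.steps:
    print(name, type(step).__name__)

# Slicing returns a new Pipeline containing only the selected steps.
print(type(pipeline[1:]).__name__)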
With the fitted pipeline, we can now use permutation_importance to calculate feature importances:
from sklearn.inspection import permutation_importance
feature_importances = permutation_importance(
predictor, X_train_augmented, y_train, n_repeats=10
)
Here, we plot the permutation importances computed on the training set. The higher the value, the more important the feature.
sorted_idx = feature_importances.importances_mean.argsort()
fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
vert=False, labels=X_train_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()
We can replicate the same processing on the test set and see if we can observe the same trend.
from sklearn.inspection import permutation_importance
feature_importances = permutation_importance(
predictor, X_test_augmented, y_test, n_repeats=10
)
sorted_idx = feature_importances.importances_mean.argsort()
fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
vert=False, labels=X_test_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
With the current version of scikit-learn, it is not as convenient, but still possible, to check the feature importances at the latest stage of the pipeline (once all features have been preprocessed). The difficult part is to get the names of the features.
preprocessor = pipeline[:-1]
predictor = pipeline[-1]
X_train_augmented = preprocessor.transform(X_train)
X_test_augmented = preprocessor.transform(X_test)
Let's find out the feature names (in the future, scikit-learn will provide a get_feature_names function to handle this case).
date_cols_name = (date_encoder.transform(X_train[date_cols])
.columns.tolist())
categorical_cols_name = categorical_cols
numerical_cols_name = (pipeline[0].transform(X_train)
.columns[pipeline[1].transformers_[-1][-1]]
.tolist())
feature_names = np.array(
date_cols_name + categorical_cols_name + numerical_cols_name
)
feature_names
array(['year', 'month', 'day', 'weekday', 'week', 'n_days', 'Arrival', 'Departure', 'WeeksToDeparture', 'std_wtd', 'Max TemperatureC'], dtype='<U16')
We can repeat the previous processing at this finer grain, where the transformed date columns are included.
from sklearn.inspection import permutation_importance
feature_importances = permutation_importance(
predictor, X_train_augmented, y_train, n_repeats=10
)
Here, we plot the permutation importances computed on the training set. Again, the higher the value, the more important the feature.
sorted_idx = feature_importances.importances_mean.argsort()
fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()
We can replicate the same processing on the test set and see if we can observe the same trend.
from sklearn.inspection import permutation_importance
feature_importances = permutation_importance(
predictor, X_test_augmented, y_test, n_repeats=10
)
sorted_idx = feature_importances.importances_mean.argsort()
fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()
To submit your code, you can refer to the online documentation.