Paris Saclay Center for Data Science¶

RAMP on predicting the number of air passengers

Balázs Kégl (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)

Introduction¶

The dataset was donated to us by an unnamed company handling flight ticket reservations. The data is thin; it contains

  • the date of departure
  • the departure airport
  • the arrival airport
  • the mean and standard deviation of the number of weeks before the departure date at which the reservations were made
  • a field called log_PAX which is related to the number of passengers (the actual numbers were changed for privacy reasons)

The goal is to predict the log_PAX column. The prediction quality is measured by RMSE.
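For reference, the RMSE is the root of the mean squared difference between the predicted and true values of log_PAX. A minimal NumPy sketch (the rmse helper name is just for illustration):

import numpy as np

def rmse(y_true, y_pred):
    # root of the mean squared residual between predictions and targets
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))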

The data is obviously limited, but since date and location information is available, it can be joined to external data sets. The challenge in this RAMP is to find good data that can be correlated to flight traffic.

In [1]:
%matplotlib inline
import os
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

Load the dataset using pandas¶

The training and testing data are located in the folder data. They are compressed csv files (i.e. csv.bz2). We can load the dataset using pandas.

In [2]:
data = pd.read_csv(
    os.path.join('data', 'train.csv.bz2')
)
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
DateOfDeparture     8902 non-null object
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
log_PAX             8902 non-null float64
std_wtd             8902 non-null float64
dtypes: float64(3), object(3)
memory usage: 417.4+ KB

So, as stated earlier, the column log_PAX is the target of our regression problem. The other columns are the features we will use for prediction. If we look at the data types of the columns, we can see that DateOfDeparture, Departure, and Arrival are of object dtype, meaning they are strings.

In [4]:
data[['DateOfDeparture', 'Departure', 'Arrival']].head()
Out[4]:
DateOfDeparture Departure Arrival
0 2012-06-19 ORD DFW
1 2012-09-10 LAS DEN
2 2012-10-05 DEN LAX
3 2011-10-09 ATL ORD
4 2012-02-21 DEN SFO

While it makes sense that Departure and Arrival are airport codes (strings), DateOfDeparture should be a date rather than a string. We can use pandas to convert this column.

In [5]:
data.loc[:, 'DateOfDeparture'] = pd.to_datetime(data.loc[:, 'DateOfDeparture'])
In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 6 columns):
DateOfDeparture     8902 non-null datetime64[ns]
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
log_PAX             8902 non-null float64
std_wtd             8902 non-null float64
dtypes: datetime64[ns](1), float64(3), object(2)
memory usage: 417.4+ KB

When you create a submission, ramp-workflow will load the data for you and split it into a data matrix X and a target vector y. It will also take care of splitting the data into training and testing sets. These utilities are available in the module problem.py, which we will load.

In [7]:
import problem

The function get_train_data() loads the training data and returns a pandas dataframe X and a numpy vector y.

In [8]:
X, y = problem.get_train_data()
In [9]:
type(X)
Out[9]:
pandas.core.frame.DataFrame
In [10]:
type(y)
Out[10]:
numpy.ndarray

We can check the information of the dataframe X.

In [11]:
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8902 entries, 0 to 8901
Data columns (total 5 columns):
DateOfDeparture     8902 non-null object
Departure           8902 non-null object
Arrival             8902 non-null object
WeeksToDeparture    8902 non-null float64
std_wtd             8902 non-null float64
dtypes: float64(2), object(3)
memory usage: 347.9+ KB

It is important to note that ramp-workflow does not convert the DateOfDeparture column into a datetime format, so keep in mind that you might need to make this conversion when prototyping your machine learning pipeline later on. Let's check some statistics regarding our dataset.

In [12]:
print(min(X['DateOfDeparture']))
print(max(X['DateOfDeparture']))
2011-09-01
2013-03-05
In [13]:
X['Departure'].unique()
Out[13]:
array(['ORD', 'LAS', 'DEN', 'ATL', 'SFO', 'EWR', 'IAH', 'LAX', 'DFW',
       'SEA', 'JFK', 'PHL', 'MIA', 'DTW', 'BOS', 'MSP', 'CLT', 'MCO',
       'PHX', 'LGA'], dtype=object)
In [14]:
_ = plt.hist(y, bins=50)
In [15]:
_ = X.hist('std_wtd', bins=50)
In [16]:
_ = X.hist('WeeksToDeparture', bins=50)
In [17]:
X.describe()
Out[17]:
WeeksToDeparture std_wtd
count 8902.000000 8902.000000
mean 11.446469 8.617773
std 2.787140 2.139604
min 2.625000 2.160247
25% 9.523810 7.089538
50% 11.300000 8.571116
75% 13.240000 10.140521
max 23.163265 15.862216
In [18]:
X.shape
Out[18]:
(8902, 5)
In [19]:
print(y.mean())
print(y.std())
10.99904767212102
0.9938894125318564

Preprocessing dates¶

Getting dates into numerical columns is a common operation when time series data is analyzed with non-parametric predictors. The code below makes the following transformations:

  • numerical columns for year (2011-2013), month (1-12), day of the month (1-31), day of the week (0-6), and week of the year (1-52)
  • number of days since 1970-01-01
In [20]:
# Make a copy of the original data to avoid writing on the original data
X_encoded = X.copy()

# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
X_encoded['year'] = X_encoded['DateOfDeparture'].dt.year
X_encoded['month'] = X_encoded['DateOfDeparture'].dt.month
X_encoded['day'] = X_encoded['DateOfDeparture'].dt.day
X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
X_encoded['week'] = X_encoded['DateOfDeparture'].dt.week
X_encoded['n_days'] = X_encoded['DateOfDeparture'].apply(lambda date: (date - pd.to_datetime("1970-01-01")).days)
In [21]:
X_encoded.tail(5)
Out[21]:
DateOfDeparture Departure Arrival WeeksToDeparture std_wtd year month day weekday week n_days
8897 2011-10-02 DTW ATL 9.263158 7.316967 2011 10 2 6 39 15249
8898 2012-09-25 DFW ORD 12.772727 10.641034 2012 9 25 1 39 15608
8899 2012-01-19 SFO LAS 11.047619 7.908705 2012 1 19 3 3 15358
8900 2013-02-03 ORD PHL 6.076923 4.030334 2013 2 3 6 5 15739
8901 2011-11-26 DTW ATL 9.526316 6.167733 2011 11 26 5 47 15304

We will perform all preprocessing steps within a scikit-learn pipeline, which chains together transformation and estimator steps. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with cross_val_score.

To perform the above encoding within a scikit-learn pipeline, we will write a function and use FunctionTransformer to make it compatible with the scikit-learn API.

In [22]:
from sklearn.preprocessing import FunctionTransformer

def _encode_dates(X):
    # With pandas < 1.0, we will get a SettingWithCopyWarning
    # In our case, we will avoid this warning by triggering a copy
    # More information can be found at:
    # https://github.com/scikit-learn/scikit-learn/issues/16191
    X_encoded = X.copy()

    # Make sure that DateOfDeparture is of datetime format
    X_encoded.loc[:, 'DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])
    # Encode the DateOfDeparture
    X_encoded.loc[:, 'year'] = X_encoded['DateOfDeparture'].dt.year
    X_encoded.loc[:, 'month'] = X_encoded['DateOfDeparture'].dt.month
    X_encoded.loc[:, 'day'] = X_encoded['DateOfDeparture'].dt.day
    X_encoded.loc[:, 'weekday'] = X_encoded['DateOfDeparture'].dt.weekday
    # note: .dt.week was removed in pandas 2.0; use .dt.isocalendar().week there
    X_encoded.loc[:, 'week'] = X_encoded['DateOfDeparture'].dt.week
    X_encoded.loc[:, 'n_days'] = X_encoded['DateOfDeparture'].apply(
        lambda date: (date - pd.to_datetime("1970-01-01")).days
    )
    # Once the encoding is done, we no longer need DateOfDeparture
    return X_encoded.drop(columns=["DateOfDeparture"])

date_encoder = FunctionTransformer(_encode_dates)
In [23]:
date_encoder.fit_transform(X).head()
Out[23]:
Departure Arrival WeeksToDeparture std_wtd year month day weekday week n_days
0 ORD DFW 12.875000 9.812647 2012 6 19 1 25 15510
1 LAS DEN 14.285714 9.466734 2012 9 10 0 37 15593
2 DEN LAX 10.863636 9.035883 2012 10 5 4 40 15618
3 ATL ORD 11.480000 7.990202 2011 10 9 6 40 15256
4 DEN SFO 11.450000 9.517159 2012 2 21 1 8 15391

Random Forests¶

Tree-based algorithms require less complex preprocessing than linear models. We will first present a machine-learning pipeline using a random forest. In this pipeline, we will need to:

  • encode the date to numerical values (as presented in the section above);
  • ordinal-encode the other categorical columns to get numerical values;
  • keep numerical features as they are.

Thus, we want to perform three different processes on different columns of the original data X. In scikit-learn, we can use make_column_transformer to perform such processing.

In [24]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]

categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]

preprocessor = make_column_transformer(
    (date_encoder, date_cols),
    (categorical_encoder, categorical_cols),
    remainder='passthrough',  # passthrough numerical columns as they are
)

We can combine our preprocessor with an estimator (RandomForestRegressor in this case), allowing us to make predictions.

In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

n_estimators = 10
max_depth = 10
max_features = 10

regressor = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth, max_features=max_features
)

pipeline = make_pipeline(preprocessor, regressor)

We can cross-validate our pipeline using cross_val_score. Below we specify cv=5, meaning 5-fold cross-validation splitting will be used. The (negated) mean squared error is computed for each split, so the output is an array of 5 scores, one per fold. We take the square root to obtain the RMSE and print its mean and standard deviation over the 5 folds at the end.

In [26]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6277 +/- 0.0186

Linear regressor¶

When dealing with a linear model, we need to one-hot encode categorical variables instead of ordinal-encoding them, and to standardize numerical variables. Thus we will:

  • encode the date;
  • then, one-hot encode all categorical columns, including the encoded date columns;
  • standardize the numerical columns.
In [27]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = [
    "Arrival", "Departure", "year", "month", "day",
    "weekday", "week", "n_days"
]

numerical_scaler = StandardScaler()
numerical_cols = ["WeeksToDeparture", "std_wtd"]

preprocessor = make_column_transformer(
    (categorical_encoder, categorical_cols),
    (numerical_scaler, numerical_cols)
)

We can now combine our preprocessor with the LinearRegression estimator in a Pipeline:

In [28]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

pipeline = make_pipeline(date_encoder, preprocessor, regressor)

And we can evaluate our linear-model pipeline:

In [29]:
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6117 +/- 0.0149

Merging external data¶

The objective in this RAMP data challenge is to find good data that can be correlated to flight traffic. We will use some weather data (saved in submissions/starting_kit) to provide an example of how to merge external data in a scikit-learn pipeline.

Your external data will need to be included in your submissions folder - see RAMP submissions for more details.

First we will define a function that merges the external data to our feature data.

In [30]:
# when submitting a kit, the `__file__` variable will correspond to the
# path to `estimator.py`. However, this variable is not defined in the
# notebook and thus we must define the `__file__` variable to imitate
# how a submission `.py` would work.
__file__ = os.path.join('submissions', 'starting_kit', 'estimator.py')
filepath = os.path.join(os.path.dirname(__file__), 'external_data.csv')
filepath
Out[30]:
'submissions/starting_kit/external_data.csv'
In [31]:
pd.read_csv(filepath).head()
Out[31]:
Date AirPort Max TemperatureC Mean TemperatureC Min TemperatureC Dew PointC MeanDew PointC Min DewpointC Max Humidity Mean Humidity Min Humidity Max Sea Level PressurehPa Mean Sea Level PressurehPa Min Sea Level PressurehPa Max VisibilityKm Mean VisibilityKm Min VisibilitykM Max Wind SpeedKm/h Mean Wind SpeedKm/h Max Gust SpeedKm/h Precipitationmm CloudCover Events WindDirDegrees
0 2011-09-01 ATL 35 29 24 21 18 14 79 56 32 1022 1019 1017 16 16 11 19 6 26.0 0.00 3 NaN 129
1 2011-09-02 ATL 36 29 22 17 15 14 61 46 30 1019 1016 1014 16 16 16 24 7 34.0 0.00 2 NaN 185
2 2011-09-03 ATL 35 29 23 17 16 14 64 47 30 1015 1013 1011 16 16 16 19 7 26.0 0.00 4 NaN 147
3 2011-09-04 ATL 27 24 22 22 19 16 93 72 51 1014 1012 1011 16 14 4 21 9 26.0 6.10 6 Rain 139
4 2011-09-05 ATL 26 24 22 23 22 20 94 91 87 1010 1005 999 16 13 3 32 16 45.0 16.00 8 Rain-Thunderstorm 149
In [32]:
def _merge_external_data(X):
    filepath = os.path.join(
        os.path.dirname(__file__), 'external_data.csv'
    )
    
    X = X.copy()  # to avoid raising SettingWithCopyWarning
    # Make sure that DateOfDeparture is of dtype datetime
    X.loc[:, "DateOfDeparture"] = pd.to_datetime(X['DateOfDeparture'])
    # Parse date to also be of dtype datetime
    data_weather = pd.read_csv(filepath, parse_dates=["Date"])

    X_weather = data_weather[['Date', 'AirPort', 'Max TemperatureC']]
    X_weather = X_weather.rename(
        columns={'Date': 'DateOfDeparture', 'AirPort': 'Arrival'})
    X_merged = pd.merge(
        X, X_weather, how='left', on=['DateOfDeparture', 'Arrival'], sort=False
    )
    return X_merged

data_merger = FunctionTransformer(_merge_external_data)

Double check that our function works:

In [33]:
data_merger.fit_transform(X).head()
Out[33]:
DateOfDeparture Departure Arrival WeeksToDeparture std_wtd Max TemperatureC
0 2012-06-19 ORD DFW 12.875000 9.812647 34
1 2012-09-10 LAS DEN 14.285714 9.466734 33
2 2012-10-05 DEN LAX 10.863636 9.035883 22
3 2011-10-09 ATL ORD 11.480000 7.990202 27
4 2012-02-21 DEN SFO 11.450000 9.517159 16

We already used FunctionTransformer to make our merging function compatible with the scikit-learn API, so we can now assemble our pipeline using this data_merger together with the same preprocessor as above:

In [34]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer

date_encoder = FunctionTransformer(_encode_dates)
date_cols = ["DateOfDeparture"]

categorical_encoder = OrdinalEncoder()
categorical_cols = ["Arrival", "Departure"]

preprocessor = make_column_transformer(
    (date_encoder, date_cols),
    (categorical_encoder, categorical_cols),
    remainder='passthrough',  # passthrough numerical columns as they are
)
In [35]:
n_estimators = 10
max_depth = 10
max_features = 10

regressor = RandomForestRegressor(
    n_estimators=n_estimators, max_depth=max_depth, max_features=max_features
)

pipeline = make_pipeline(data_merger, preprocessor, regressor)
In [36]:
scores = cross_val_score(
    pipeline, X, y, cv=5, scoring='neg_mean_squared_error'
)
rmse_scores = np.sqrt(-scores)

print(
    f"RMSE: {np.mean(rmse_scores):.4f} +/- {np.std(rmse_scores):.4f}"
)
RMSE: 0.6330 +/- 0.0108

Feature importances¶

We can check the feature importances using the function sklearn.inspection.permutation_importance. Since the first step of our pipeline adds the new external feature Max TemperatureC, we want to compute the importances on the data obtained after this merging step, so that Max TemperatureC is included. Indeed, we can apply permutation_importance at any stage of the pipeline, as we will see later on.

The code below:

  • applies the transform of the first step of the pipeline (pipeline[0]), producing the transformed train (X_train_augmented) and test (X_test_augmented) data;
  • uses the transformed data to fit the rest of the pipeline (pipeline[1:]), from the second step onwards.

Note that pipelines can be sliced: indexing with an integer, e.g. pipeline[0], returns the transformer or estimator of that step, while indexing with a slice, e.g. pipeline[1:], returns a new Pipeline containing the remaining steps. The underlying list of (name, transformer) tuples is available as pipeline.steps.
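As a quick check of this behaviour (a small sketch run against the pipeline built above):

# step names assigned automatically by make_pipeline
print([name for name, _ in pipeline.steps])
# an integer index returns the transformer of that step (here, the data merger)
print(type(pipeline[0]))
# a slice returns a new Pipeline made of the remaining steps
print(type(pipeline[1:]))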

In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

merger = pipeline[0]
X_train_augmented = merger.transform(X_train)
X_test_augmented = merger.transform(X_test)

predictor = pipeline[1:]
predictor.fit(X_train_augmented, y_train).score(X_test_augmented, y_test)
Out[37]:
0.5939918926708109
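Note that the score method of a regression pipeline reports the coefficient of determination R², not the RMSE used above. To compare with the cross-validated RMSE, a small sketch using the held-out test set:

from sklearn.metrics import mean_squared_error

y_pred = predictor.predict(X_test_augmented)
print(np.sqrt(mean_squared_error(y_test, y_pred)))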

With the fitted pipeline, we can now use permutation_importance to calculate feature importances:

In [38]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_train_augmented, y_train, n_repeats=10
)

Here, we plot the permutation importances computed on the training set. The higher the value, the more important the feature.

In [39]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=X_train_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()

We can replicate the same processing on the test set and see if we can observe the same trend.

In [40]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_test_augmented, y_test, n_repeats=10
)
In [41]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=X_test_augmented.columns[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

With the current version of scikit-learn, it is not convenient, but still possible, to check the feature importances at the last stage of the pipeline (once all features have been preprocessed).

The difficult part is to get the names of the features.

In [42]:
preprocessor = pipeline[:-1]
predictor = pipeline[-1]

X_train_augmented = preprocessor.transform(X_train)
X_test_augmented = preprocessor.transform(X_test)

Let's find out the feature names by hand (more recent versions of scikit-learn provide get_feature_names_out to handle this case).

In [43]:
date_cols_name = (date_encoder.transform(X_train[date_cols])
                              .columns.tolist())
categorical_cols_name = categorical_cols
numerical_cols_name = (pipeline[0].transform(X_train)
                                  .columns[pipeline[1].transformers_[-1][-1]]
                                  .tolist())
In [44]:
feature_names = np.array(
    date_cols_name + categorical_cols_name + numerical_cols_name
)
feature_names
Out[44]:
array(['year', 'month', 'day', 'weekday', 'week', 'n_days', 'Arrival',
       'Departure', 'WeeksToDeparture', 'std_wtd', 'Max TemperatureC'],
      dtype='<U16')

We can repeat the previous processing at this finer grain, where the transformed date columns are included.

In [45]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_train_augmented, y_train, n_repeats=10
)

Here, we plot the permutation importances computed on the training set. Again, the higher the value, the more important the feature.

In [46]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (train set)")
fig.tight_layout()
plt.show()

We can replicate the same processing on the test set and see if we can observe the same trend.

In [47]:
from sklearn.inspection import permutation_importance

feature_importances = permutation_importance(
    predictor, X_test_augmented, y_test, n_repeats=10
)
In [48]:
sorted_idx = feature_importances.importances_mean.argsort()

fig, ax = plt.subplots()
ax.boxplot(feature_importances.importances[sorted_idx].T,
           vert=False, labels=feature_names[sorted_idx])
ax.set_title("Permutation Importances (test set)")
fig.tight_layout()
plt.show()

Submission¶

To submit your code, you can refer to the online documentation.
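As a rough sketch of what such a submission file might contain (this assumes the workflow expects a get_estimator() function in submissions/<your_submission>/estimator.py, as in the provided starting kit; check the online documentation for the exact interface), the random-forest pipeline built in this notebook could be assembled as follows:

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder


def _encode_dates(X):
    # same date encoding as defined earlier in this notebook
    X = X.copy()
    X.loc[:, 'DateOfDeparture'] = pd.to_datetime(X['DateOfDeparture'])
    X.loc[:, 'year'] = X['DateOfDeparture'].dt.year
    X.loc[:, 'month'] = X['DateOfDeparture'].dt.month
    X.loc[:, 'day'] = X['DateOfDeparture'].dt.day
    X.loc[:, 'weekday'] = X['DateOfDeparture'].dt.weekday
    X.loc[:, 'week'] = X['DateOfDeparture'].dt.week
    X.loc[:, 'n_days'] = X['DateOfDeparture'].apply(
        lambda date: (date - pd.to_datetime("1970-01-01")).days
    )
    return X.drop(columns=["DateOfDeparture"])


def get_estimator():
    # assemble the preprocessing and the random forest into one pipeline
    preprocessor = make_column_transformer(
        (FunctionTransformer(_encode_dates), ["DateOfDeparture"]),
        (OrdinalEncoder(), ["Arrival", "Departure"]),
        remainder='passthrough',
    )
    regressor = RandomForestRegressor(
        n_estimators=10, max_depth=10, max_features=10
    )
    return make_pipeline(preprocessor, regressor)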