RAMP on predicting cyclist traffic in Paris¶

Authors: Roman Yurchak (Symerio). This notebook is partially inspired by the air_passengers starting kit.

Introduction¶

The dataset was collected with cyclist counters installed by the Paris city council at multiple locations. It contains hourly cyclist traffic counts, as well as the following features,

  • counter name
  • counter site name
  • date
  • counter installation date
  • latitude and longitude

The available features are quite scarce. However, we are allowed to use any external data that can help us predict the target variable.
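
For example, hourly weather or calendar data could be joined on the date column. The following is a hypothetical sketch, not part of the starting kit: the file name and columns of the external dataset are assumptions, and pandas.merge_asof is used to align it on the nearest timestamp.

# Hypothetical sketch: merging an external hourly dataset (e.g. weather) on the
# nearest timestamp. "external_data.csv" and its columns are assumptions.
import pandas as pd


def _merge_external_data(X, external):
    # merge_asof requires both frames to be sorted by the join key
    X = X.sort_values("date")
    external = external.sort_values("date")
    return pd.merge_asof(X, external, on="date", direction="nearest")


# weather = pd.read_csv("external_data.csv", parse_dates=["date"])
# data = _merge_external_data(data, weather)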

In [3]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

Loading the data with pandas¶

First, download the data files,

  • train.parquet
  • test.parquet

and put them into the data folder.

Data is stored in Parquet format, an efficient columnar data format. We can load the train set with pandas,

In [4]:
data = pd.read_parquet(Path("data") / "train.parquet")
In [5]:
data.head()
Out[5]:
counter_id counter_name site_id site_name bike_count date counter_installation_date counter_technical_id latitude longitude log_bike_count
48321 100007049-102007049 28 boulevard Diderot E-O 100007049 28 boulevard Diderot 0.0 2020-09-01 02:00:00 2013-01-18 Y2H15027244 48.846028 2.375429 0.000000
48324 100007049-102007049 28 boulevard Diderot E-O 100007049 28 boulevard Diderot 1.0 2020-09-01 03:00:00 2013-01-18 Y2H15027244 48.846028 2.375429 0.693147
48327 100007049-102007049 28 boulevard Diderot E-O 100007049 28 boulevard Diderot 0.0 2020-09-01 04:00:00 2013-01-18 Y2H15027244 48.846028 2.375429 0.000000
48330 100007049-102007049 28 boulevard Diderot E-O 100007049 28 boulevard Diderot 4.0 2020-09-01 15:00:00 2013-01-18 Y2H15027244 48.846028 2.375429 1.609438
48333 100007049-102007049 28 boulevard Diderot E-O 100007049 28 boulevard Diderot 9.0 2020-09-01 18:00:00 2013-01-18 Y2H15027244 48.846028 2.375429 2.302585

We can check general information about different columns,

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 455163 entries, 48321 to 928462
Data columns (total 11 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   counter_id                 455163 non-null  category      
 1   counter_name               455163 non-null  category      
 2   site_id                    455163 non-null  int64         
 3   site_name                  455163 non-null  category      
 4   bike_count                 455163 non-null  float64       
 5   date                       455163 non-null  datetime64[ns]
 6   counter_installation_date  455163 non-null  datetime64[ns]
 7   counter_technical_id       455163 non-null  category      
 8   latitude                   455163 non-null  float64       
 9   longitude                  455163 non-null  float64       
 10  log_bike_count             455163 non-null  float64       
dtypes: category(4), datetime64[ns](2), float64(4), int64(1)
memory usage: 29.5 MB

and in particular the number of unique entries in each column,

In [7]:
data.nunique(axis=0)
Out[7]:
counter_id                     56
counter_name                   56
site_id                        30
site_name                      30
bike_count                    977
date                         8230
counter_installation_date      22
counter_technical_id           30
latitude                       30
longitude                      30
log_bike_count                977
dtype: int64

We have 30 counting sites, some of which have several counters installed. Let's look at the most frequented stations,

In [8]:
data.groupby(["site_name", "counter_name"])["bike_count"].sum().sort_values(
    ascending=False
).head(10).to_frame()
Out[8]:
bike_count
site_name counter_name
Totem 73 boulevard de Sébastopol Totem 73 boulevard de Sébastopol S-N 1809231.0
Totem 64 Rue de Rivoli Totem 64 Rue de Rivoli O-E 1406900.0
Totem 73 boulevard de Sébastopol Totem 73 boulevard de Sébastopol N-S 1357868.0
67 boulevard Voltaire SE-NO 67 boulevard Voltaire SE-NO 1036575.0
Totem 64 Rue de Rivoli Totem 64 Rue de Rivoli E-O 914089.0
27 quai de la Tournelle 27 quai de la Tournelle SE-NO 888717.0
Quai d'Orsay Quai d'Orsay E-O 849724.0
Totem Cours la Reine Totem Cours la Reine O-E 806149.0
Face au 48 quai de la marne Face au 48 quai de la marne SO-NE 806071.0
Face au 48 quai de la marne NE-SO 759194.0

Visualizing the data¶

Let's visualize the data, starting with the spatial distribution of the counters on a map,

In [9]:
import folium

m = folium.Map(location=data[["latitude", "longitude"]].mean(axis=0), zoom_start=13)

for _, row in (
    data[["counter_name", "latitude", "longitude"]]
    .drop_duplicates("counter_name")
    .iterrows()
):
    folium.Marker(
        row[["latitude", "longitude"]].values.tolist(), popup=row["counter_name"]
    ).add_to(m)

m
Out[9]:
[Interactive folium map showing the locations of the counters]

Note that in this RAMP problem we consider only the 30 most frequented counting sites, to limit data size.

Next we will look into the temporal distribution of the most frequented bike counter. If we plot it directly we will not see much, because there are too many hourly data points to see anything at this scale,

In [10]:
mask = data["counter_name"] == "Totem 73 boulevard de Sébastopol S-N"

data[mask].plot(x="date", y="bike_count")
Out[10]:
<AxesSubplot: xlabel='date'>

Instead we aggregate the data, for instance, by week to have a clearer overall picture,

In [11]:
mask = data["counter_name"] == "Totem 73 boulevard de Sébastopol S-N"

data[mask].groupby(pd.Grouper(freq="1w", key="date"))[["bike_count"]].sum().plot()
Out[11]:
<AxesSubplot: xlabel='date'>

At the same time, we can zoom in on a particular week for a more short-term visualization,

In [12]:
fig, ax = plt.subplots(figsize=(10, 4))

mask = (
    (data["counter_name"] == "Totem 73 boulevard de Sébastopol S-N")
    & (data["date"] > pd.to_datetime("2021/03/01"))
    & (data["date"] < pd.to_datetime("2021/03/08"))
)

data[mask].plot(x="date", y="bike_count", ax=ax)
Out[12]:
<AxesSubplot: xlabel='date'>

The hourly pattern shows a clear difference between work days and the weekend (6 and 7 March 2021).
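
This weekday/weekend contrast can also be checked by averaging the hourly profile of the same counter separately over weekdays and the weekend. A minimal sketch, not part of the starting kit,

# Sketch: average hourly profile of the same counter, split between weekdays
# and the weekend.
mask = data["counter_name"] == "Totem 73 boulevard de Sébastopol S-N"

hourly_profile = (
    data[mask]
    .assign(
        hour=lambda df: df["date"].dt.hour,
        is_weekend=lambda df: df["date"].dt.weekday >= 5,
    )
    .groupby(["is_weekend", "hour"])["bike_count"]
    .mean()
    .unstack(level=0)
)
hourly_profile.plot(ylabel="mean bike_count")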

If we look at the distribution of the target variable, it is skewed and far from normal,

In [13]:
import seaborn as sns


ax = sns.histplot(data, x="bike_count", kde=True, bins=50)

A least squares loss would not be appropriate to model it, since it is designed for normally distributed errors. One way to proceed is to transform the variable with a logarithmic transformation,

data['log_bike_count'] = np.log(1 + data['bike_count'])
In [14]:
ax = sns.histplot(data, x="log_bike_count", kde=True, bins=50)

which has a more pronounced central mode but is still asymmetric. In the following, we use log_bike_count as the target variable, since bike_count spans roughly three orders of magnitude and a least squares loss on the raw counts would be dominated by the few large values.
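
Note that a model trained on log_bike_count predicts log-transformed counts, so the inverse transform has to be applied to recover counts (as done later when visualizing the predictions). As a quick sanity check, np.expm1 (the inverse of np.log(1 + x)) recovers the original counts,

# np.expm1 inverts the log(1 + x) transform used to build log_bike_count
assert np.allclose(np.expm1(data["log_bike_count"]), data["bike_count"])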

Feature extraction¶

To account for the temporal aspects of the data, we cannot feed the date field directly into the model. Instead, we extract features at different time scales from the date field,

In [15]:
def _encode_dates(X):
    X = X.copy()  # modify a copy of X
    # Encode the date information from the date column
    X.loc[:, "year"] = X["date"].dt.year
    X.loc[:, "month"] = X["date"].dt.month
    X.loc[:, "day"] = X["date"].dt.day
    X.loc[:, "weekday"] = X["date"].dt.weekday
    X.loc[:, "hour"] = X["date"].dt.hour

    # Finally we can drop the original columns from the dataframe
    return X.drop(columns=["date"])
In [16]:
data["date"].head()
Out[16]:
48321   2020-09-01 02:00:00
48324   2020-09-01 03:00:00
48327   2020-09-01 04:00:00
48330   2020-09-01 15:00:00
48333   2020-09-01 18:00:00
Name: date, dtype: datetime64[ns]
In [17]:
_encode_dates(data[["date"]].head())
Out[17]:
year month day weekday hour
48321 2020 9 1 1 2
48324 2020 9 1 1 3
48327 2020 9 1 1 4
48330 2020 9 1 1 15
48333 2020 9 1 1 18

To use this function with scikit-learn estimators we wrap it with FunctionTransformer,

In [18]:
from sklearn.preprocessing import FunctionTransformer

date_encoder = FunctionTransformer(_encode_dates, validate=False)
date_encoder.fit_transform(data[["date"]]).head()
Out[18]:
year month day weekday hour
48321 2020 9 1 1 2
48324 2020 9 1 1 3
48327 2020 9 1 1 4
48330 2020 9 1 1 15
48333 2020 9 1 1 18

Since it is unlikely that, for instance, the hour is linearly correlated with the target variable, we additionally need to encode such features as categorical for linear models. This is classically done with OneHotEncoder, though other encoding strategies exist; one alternative is sketched below.

In [19]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)

enc.fit_transform(_encode_dates(data[["date"]])[["hour"]].head())
Out[19]:
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])
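
As an illustration of one such alternative, not used in the rest of this notebook, periodic features such as the hour can be encoded with sine/cosine transforms, which keep 23h and 0h close to each other,

# Sketch of a cyclical (sine/cosine) encoding of the hour, an alternative to
# one-hot encoding for periodic features.
def _encode_hour_cyclic(X):
    X = X.copy()
    hour = X["date"].dt.hour
    X["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    X["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    return X.drop(columns=["date"])


_encode_hour_cyclic(data[["date"]].head())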

Linear model¶

Let's now construct our first linear model with Ridge. We use a few helper functions defined in problem.py of the starting kit to load the public train and test data:

In [20]:
import problem

X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()
In [21]:
X_train.head(2)
Out[21]:
counter_id counter_name site_id site_name date counter_installation_date counter_technical_id latitude longitude
400125 100049407-353255860 152 boulevard du Montparnasse E-O 100049407 152 boulevard du Montparnasse 2020-09-01 01:00:00 2018-12-07 Y2H19070373 48.840801 2.333233
408305 100049407-353255859 152 boulevard du Montparnasse O-E 100049407 152 boulevard du Montparnasse 2020-09-01 01:00:00 2018-12-07 Y2H19070373 48.840801 2.333233

and

In [22]:
y_train
Out[22]:
array([1.60943791, 1.38629436, 0.        , ..., 2.48490665, 1.60943791,
       1.38629436])

where y_train contains the log_bike_count variable.

The test set is in the future as compared to the train set,

In [23]:
print(
    f'Train: n_samples={X_train.shape[0]},  {X_train["date"].min()} to {X_train["date"].max()}'
)
print(
    f'Test: n_samples={X_test.shape[0]},  {X_test["date"].min()} to {X_test["date"].max()}'
)
Train: n_samples=455163,  2020-09-01 01:00:00 to 2021-08-09 23:00:00
Test: n_samples=41608,  2021-08-10 01:00:00 to 2021-09-09 23:00:00
In [24]:
_encode_dates(X_train[["date"]]).columns.tolist()
Out[24]:
['year', 'month', 'day', 'weekday', 'hour']
In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

date_encoder = FunctionTransformer(_encode_dates)
date_cols = _encode_dates(X_train[["date"]]).columns.tolist()

categorical_encoder = OneHotEncoder(handle_unknown="ignore")
categorical_cols = ["counter_name", "site_name"]

preprocessor = ColumnTransformer(
    [
        ("date", OneHotEncoder(handle_unknown="ignore"), date_cols),
        ("cat", categorical_encoder, categorical_cols),
    ]
)

regressor = Ridge()

pipe = make_pipeline(date_encoder, preprocessor, regressor)
pipe.fit(X_train, y_train)
Out[25]:
Pipeline(steps=[('functiontransformer',
                 FunctionTransformer(func=<function _encode_dates at 0x7f5e41c58550>)),
                ('columntransformer',
                 ColumnTransformer(transformers=[('date',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['year', 'month', 'day',
                                                   'weekday', 'hour']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['counter_name',
                                                   'site_name'])])),
                ('ridge', Ridge())])

We then evaluate this model with the RMSE metric,

In [26]:
from sklearn.metrics import mean_squared_error

print(
    f"Train set, RMSE={mean_squared_error(y_train, pipe.predict(X_train), squared=False):.2f}"
)
print(
    f"Test set, RMSE={mean_squared_error(y_test, pipe.predict(X_test), squared=False):.2f}"
)
Train set, RMSE=0.80
Test set, RMSE=0.73

The model doesn't have enough capacity to fit the train set closely, since we have lots of data and relatively few parameters. It happens to perform somewhat better on the test set. We can compare these results with a baseline that predicts the mean value,

In [27]:
print("Baseline mean prediction.")
print(
    f"Train set, RMSE={mean_squared_error(y_train, np.full(y_train.shape, y_train.mean()), squared=False):.2f}"
)
print(
    f"Test set, RMSE={mean_squared_error(y_test, np.full(y_test.shape, y_test.mean()), squared=False):.2f}"
)
Baseline mean prediction.
Train set, RMSE=1.68
Test set, RMSE=1.44

which illustrates that we are performing better than the baseline.

Let's visualize the predictions for one of the stations,

In [28]:
mask = (
    (X_test["counter_name"] == "Totem 73 boulevard de Sébastopol S-N")
    & (X_test["date"] > pd.to_datetime("2021/09/01"))
    & (X_test["date"] < pd.to_datetime("2021/09/08"))
)

df_viz = X_test.loc[mask].copy()
# Invert the log(1 + x) transform to recover raw bike counts
df_viz["bike_count"] = np.exp(y_test[mask.values]) - 1
df_viz["bike_count (predicted)"] = np.exp(pipe.predict(X_test[mask])) - 1
In [29]:
fig, ax = plt.subplots(figsize=(12, 4))

df_viz.plot(x="date", y="bike_count", ax=ax)
df_viz.plot(x="date", y="bike_count (predicted)", ax=ax, ls="--")
ax.set_title("Predictions with Ridge")
ax.set_ylabel("bike_count")
Out[29]:
Text(0, 0.5, 'bike_count')

We start to see the daily trend, and some of the weekday differences are accounted for; however, we still miss the finer details, and the evening spikes are under-estimated.

A useful way to visualize the error is to plot y_pred as a function of y_true,

In [30]:
fig, ax = plt.subplots()

df_viz = pd.DataFrame({"y_true": y_test, "y_pred": pipe.predict(X_test)}).sample(
    10000, random_state=0
)

df_viz.plot.scatter(x="y_true", y="y_pred", s=8, alpha=0.1, ax=ax)
Out[30]:
<AxesSubplot: xlabel='y_true', ylabel='y_pred'>

It is recommended to use cross-validation for hyper-parameter tuning with GridSearchCV, or for more reliable model evaluation with cross_val_score. In this case, because we want the validation data to always be in the future compared to the train data, we can use TimeSeriesSplit,

The disadvantage is that the training set size then differs between folds (the default TimeSeriesSplit behavior), which is not ideal for hyper-parameter tuning; the alternative is a constant-sized but small training set, which is not ideal either given the periodicity of the data. This explains why cross-validation scores are generally worse than the test score,

In [31]:
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cv = TimeSeriesSplit(n_splits=6)

# scikit-learn scorers follow the convention that greater is better, hence the negated RMSE (and the minus signs below).
scores = cross_val_score(
    pipe, X_train, y_train, cv=cv, scoring="neg_root_mean_squared_error"
)
print("RMSE: ", scores)
print(f"RMSE (all folds): {-scores.mean():.3} ± {(-scores).std():.3}")
RMSE:  [-0.96320864 -0.87311411 -0.85383867 -0.87000261 -1.0607788  -0.97297134]
RMSE (all folds): 0.932 ± 0.0738
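
For hyper-parameter tuning, the same time-based splitter can be passed to GridSearchCV. A minimal sketch, where the alpha grid is an arbitrary illustration rather than a tuned choice,

# Sketch: tuning the Ridge regularization strength with GridSearchCV and the
# same time-based cross-validation. The alpha grid is an arbitrary example.
from sklearn.model_selection import GridSearchCV

param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=6),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f"Best CV RMSE: {-search.best_score_:.3f}")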