Boston housing price regression

Paris Saclay Center for Data Science

Test RAMP on Boston housing: http://www.ramp.studio/problems/boston_housing

Balázs Kégl (LAL/CNRS)

Introduction

Boston housing is a small standard regression data set from the UCI Machine Learning Repository.

In [1]:
from __future__ import print_function

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Fetch the data and load it in pandas

In [2]:
local_filename = 'data/train.csv'

# Open the file and print its first 3 lines
# (end='' avoids doubling the newline that each line already carries)
with open(local_filename) as fid:
    for line in fid.readlines()[:3]:
        print(line, end='')
crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0.05646,0.0,12.83,0,0.437,6.232,53.7,5.0141,5,398,18.7,386.4,12.34,21.2
0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9

In [3]:
data = pd.read_csv(local_filename)
In [4]:
data.head()
Out[4]:
      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
0  0.05646   0.0  12.83     0  0.437  6.232  53.7  5.0141    5  398     18.7  386.40  12.34  21.2
1  0.06076   0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
2  0.01870  85.0   4.15     0  0.429  6.516  27.7  8.5353    4  351     17.9  392.43   6.36  23.1
3  4.64689   0.0  18.10     0  0.614  6.980  67.6  2.5329   24  666     20.2  374.68  11.66  29.8
4  0.08244  30.0   4.93     0  0.428  6.481  18.5  6.1899    6  300     16.6  379.41   6.36  23.7
In [5]:
data.shape
Out[5]:
(323, 14)
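
Before modeling, the table has to be separated into a feature matrix and a target vector. The sketch below does this on the two rows shown in the file preview, so that it runs without data/train.csv. It assumes that medv, the last column, is the regression target; the names csv_text, sample, and target_column are illustrative and not part of the starting kit.

```python
import io

import pandas as pd

# Two rows copied from the file preview; a real run would read data/train.csv.
csv_text = """crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0.05646,0.0,12.83,0,0.437,6.232,53.7,5.0141,5,398,18.7,386.4,12.34,21.2
0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
"""
sample = pd.read_csv(io.StringIO(csv_text))

target_column = 'medv'  # assumption: the last column is the target
X = sample.drop(columns=[target_column]).values  # 13 feature columns
y = sample[target_column].values                 # one target value per row

print(X.shape, y.shape)
```

On the full training file the same two lines would presumably give a feature matrix with 323 rows and 13 columns.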
In [6]:
data.describe()
Out[6]:
             crim          zn       indus        chas         nox          rm         age         dis         rad         tax     ptratio       black       lstat        medv
count  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000  323.000000
mean     3.294199   11.673375   11.005356    0.077399    0.555259    6.272378   68.941486    3.902841    9.287926  406.448916   18.401548  358.280898   12.691393   22.306502
std      7.276906   23.892603    6.897771    0.267639    0.119381    0.697429   27.979315    2.222404    8.570537  164.833568    2.210779   88.044671    7.078999    8.986193
min      0.009060    0.000000    0.460000    0.000000    0.385000    3.561000    6.000000    1.137000    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.075700    0.000000    5.130000    0.000000    0.448000    5.879500   45.750000    2.079450    4.000000  282.500000   16.900000  374.555000    7.190000   16.700000
50%      0.239120    0.000000    8.560000    0.000000    0.538000    6.195000   79.200000    3.317500    5.000000  337.000000   18.800000  391.340000   11.500000   20.800000
75%      3.043800   12.500000   18.100000    0.000000    0.624000    6.597000   94.100000    5.408500    8.000000  666.000000   20.200000  396.175000   17.025000   24.800000
max     67.920800  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
In [7]:
data.hist(figsize=(10, 20), bins=50, layout=(7, 3));

Let's look at scatter plots between pairs of variables

In [8]:
sns.pairplot(data.iloc[:, :5]);  # take only 5 to make it fast enough
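
A numeric complement to the scatter plots is the correlation of each feature with the target. The sketch below computes it for three columns, using the five rows printed by data.head() so that it is self-contained. It assumes that medv is the target; five rows are far too few to draw conclusions, so on the real data the same calls would be applied to data itself.

```python
import pandas as pd

# Values copied from the data.head() output (rm, lstat, crim, medv only).
sample = pd.DataFrame({
    'rm': [6.232, 6.976, 6.516, 6.980, 6.481],
    'lstat': [12.34, 5.64, 6.36, 11.66, 6.36],
    'crim': [0.05646, 0.06076, 0.01870, 4.64689, 0.08244],
    'medv': [21.2, 23.9, 23.1, 29.8, 23.7],
})

# Pearson correlation matrix between all pairs of columns
corr = sample.corr()

# Features ranked by absolute correlation with the (assumed) target
ranked = corr['medv'].drop('medv').abs().sort_values(ascending=False)
print(ranked)
```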

Building predictive models

The initial regressor in your sandbox.

In [ ]:
# %load submissions/starting_kit/regressor.py
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestRegressor


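class Regressor(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        # Deliberately tiny forest (2 trees, 2 leaves each): a fast baseline
        self.reg = RandomForestRegressor(
            n_estimators=2, max_leaf_nodes=2, random_state=61)
        self.reg.fit(X, y)
        return self  # scikit-learn convention, allows chaining fit and predict

    def predict(self, X):
        # Returns one predicted value per row of X
        return self.reg.predict(X)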
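
To try a variant of this class in the notebook before copying it into submissions/starting_kit, something like the following sketch can be used. It keeps the same fit/predict interface but grows a larger forest. The value n_estimators=50 is an illustrative, untuned choice, and the data is synthetic (X_demo and y_demo are made-up stand-ins for the 13 feature columns and the target), so the numbers it produces say nothing about the real problem.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestRegressor


class Regressor(BaseEstimator):
    """Same interface as the starting kit, with a somewhat larger forest."""

    def fit(self, X, y):
        # n_estimators=50 is an illustrative choice, not a tuned value
        self.reg = RandomForestRegressor(n_estimators=50, random_state=61)
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        return self.reg.predict(X)


# Synthetic stand-in for the real data: 100 rows, 13 features
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(100, 13))
y_demo = (20 + 10 * X_demo[:, 5] - 5 * X_demo[:, 12]
          + rng.normal(scale=0.5, size=100))

# Fit on the first 80 rows, predict the remaining 20
model = Regressor().fit(X_demo[:80], y_demo[:80])
y_pred = model.predict(X_demo[80:])
print(y_pred.shape)
```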

Local testing (before submission)

It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.

First, pip install ramp-workflow or install it from the GitHub repo. Make sure that the Python file regressor.py is in the submissions/starting_kit folder, and that the data files train.csv and test.csv are in data. Then run

ramp_test_submission

If it runs and prints training and test errors on each fold, then you can submit the code.

In [1]:
!ramp_test_submission
Testing Boston housing price regression
Reading train and test files from ./data ...
Training ./submissions/starting_kit ...
CV fold 0
	train rmse = 6.09
	valid rmse = 7.45
	test rmse = 7.86
	train rel_rmse = 0.45
	valid rel_rmse = 0.42
	test rel_rmse = 0.55
CV fold 1
	train rmse = 6.12
	valid rmse = 6.71
	test rmse = 7.76
	train rel_rmse = 0.45
	valid rel_rmse = 0.37
	test rel_rmse = 0.58
CV fold 2
	train rmse = 5.82
	valid rmse = 5.18
	test rmse = 6.85
	train rel_rmse = 0.4
	valid rel_rmse = 0.29
	test rel_rmse = 0.49
CV fold 3
	train rmse = 5.84
	valid rmse = 7.4
	test rmse = 7.73
	train rel_rmse = 0.42
	valid rel_rmse = 0.57
	test rel_rmse = 0.58
CV fold 4
	train rmse = 6.36
	valid rmse = 6.78
	test rmse = 8.22
	train rel_rmse = 0.46
	valid rel_rmse = 0.44
	test rel_rmse = 0.66
CV fold 5
	train rmse = 6.09
	valid rmse = 6.62
	test rmse = 7.39
	train rel_rmse = 0.43
	valid rel_rmse = 0.47
	test rel_rmse = 0.54
CV fold 6
	train rmse = 5.65
	valid rmse = 4.91
	test rmse = 6.53
	train rel_rmse = 0.4
	valid rel_rmse = 0.29
	test rel_rmse = 0.45
CV fold 7
	train rmse = 5.63
	valid rmse = 6.91
	test rmse = 7.72
	train rel_rmse = 0.46
	valid rel_rmse = 0.26
	test rel_rmse = 0.54
----------------------------
train rmse = 5.95 ± 0.24
train rel_rmse = 0.43 ± 0.023
valid rmse = 6.49 ± 0.886
valid rel_rmse = 0.39 ± 0.099
test rmse = 7.51 ± 0.523
test rel_rmse = 0.55 ± 0.061
----------------------------
Testing if the notebook can be converted to html
[NbConvertApp] Converting notebook ./boston_housing_starting_kit.ipynb to html
[NbConvertApp] Writing 432277 bytes to ./boston_housing_starting_kit.html
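
The per-fold lines and the "mean ± std" summary above come from repeated train/validation splits. The sketch below imitates that loop with NumPy only. The eight folds match the output; the 20% validation fraction, the random-permutation splitting, the constant mean predictor, and the synthetic targets are all assumptions made for illustration, not the actual ramp-workflow implementation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def shuffle_split(n_samples, n_splits=8, valid_fraction=0.2, seed=0):
    """Yield (train_idx, valid_idx) pairs from independent permutations."""
    rng = np.random.RandomState(seed)
    n_valid = int(round(valid_fraction * n_samples))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        yield perm[n_valid:], perm[:n_valid]


# Synthetic targets standing in for medv (323 rows, as in train.csv)
rng = np.random.RandomState(1)
y = rng.uniform(5.0, 50.0, size=323)

valid_scores = []
for fold, (train_idx, valid_idx) in enumerate(shuffle_split(len(y))):
    # Baseline: predict the training mean for every validation row
    constant_prediction = np.full(len(valid_idx), y[train_idx].mean())
    score = rmse(y[valid_idx], constant_prediction)
    valid_scores.append(score)
    print('CV fold {}: valid rmse = {:.2f}'.format(fold, score))

# Summary line: mean and standard deviation over the folds
print('valid rmse = {:.2f} +/- {:.3f}'.format(
    np.mean(valid_scores), np.std(valid_scores)))
```

A constant predictor like this gives a floor that any real submission should beat.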

Submitting to ramp.studio

Once you have found a good regressor, you can submit it to ramp.studio. If this is your first time using RAMP, sign up; otherwise, log in. Then find an open event on the particular problem, for example, the event iris_test for this RAMP. Sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking to sign up and being able to submit.

Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) regressor.py from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way that ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get an email, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., locally, and check them with ramp_test_submission. The script prints mean cross-validation scores:

----------------------------
train rmse = 5.95 ± 0.24
train rel_rmse = 0.43 ± 0.023
valid rmse = 6.49 ± 0.886
valid rel_rmse = 0.39 ± 0.099
test rmse = 7.51 ± 0.523
test rel_rmse = 0.55 ± 0.061

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is root mean squared error ("rmse"), so the line that is relevant in the output of ramp_test_submission is valid rmse = 6.49 ± 0.886. When the score is good enough, you can submit your code to the RAMP.
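
For reference, the two score types in the output can be sketched as follows. The rmse function is the standard root mean squared error. The rel_rmse definition, where each error is divided by the true value before squaring, is an assumption about how the relative score is computed and should be checked against the ramp-workflow source. The three target values are invented for the example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rel_rmse(y_true, y_pred):
    """Assumed relative RMSE: errors are scaled by the true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)))


# Invented example values, not taken from the data set
y_true = np.array([20.0, 25.0, 50.0])
y_pred = np.array([22.0, 20.0, 50.0])

print('rmse = {:.2f}'.format(rmse(y_true, y_pred)))
print('rel_rmse = {:.2f}'.format(rel_rmse(y_true, y_pred)))
```

Both scores are errors, so lower is better.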

More information

You can find more information in the README of the ramp-workflow library.

Contact

Don't hesitate to contact us.