Mechanics classification

Current events on this problem

- Saclay M2 Data Camp 2018/19,
number of participants =
**145**, number of submissions =**39**, combined score =**1.16**, click here for score vs time plot

Keywords

The goal of this challenge is to find ways to simulate the theoretical model finding process in physics or similar sciences. A good example is the paradigm shift between the Ptolemy and Copernican models of the solar system.

In the challenge we start by putting ourselves in Ptolemy's shoes, approach the same problem in imaginary systems in which the motion of the objects in the sky could have completely different underlying mechanics that we don't know of.

In the beginning, the task was simply to predict the future. Considering the early history of the birth of science, the capability of predicting the future was something that gave influential power to individuals; hence the investment of rulers to astrologers, and eventually to astronomers.

Scientific method was invented through this effort, marked by the Copernican revolution, which significantly boosted the predictive power of the models. While Ptolemy had a model that was able to (with sufficient amount of parameters) explain what has been observed, Copernican model was able to predict what came next.

In the terminology of ML, Ptolemy was able to achieve a good training score, but Copernicus was able to optimize training and test scores simultaneously. This observation led to the triumph of scientific method.

Inspired by this, we first focus on the regression problem, and then discuss how this leads to the formula classification to help solve it better.

The challenge consists of two parts: Classification and Regression. If the two tasks are wished to be solved simultaneously, the FeatureExtractor element can be used. The workflow can be briefly sketched as:

In this tutorial, we will discuss the workflow elements in reverse order: Regressor, Classifier, FeatureExtractor, in order to understand the motivations better. We consider the regression to be the absolute problem, and every step added before is a tool that eases the task of its successor.

We will observe the angular position (phi) of the planet, and try to predict what this value will be in the future, 50 steps after the last column. The phi value takes values between [-$\pi$,$\pi$] and the error is not supposed to scale with the value, therefore we have chosen RMS error as the score metric.

The main data is the time series of angles of a planet, observed from the point of reference (not necessarily an inertial frame). Although the motion of the planet might be something very different than what we know of, the motion is restricted to the two dimensional plane for sake of simplicity.

In [3]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

In [4]:

```
# Main data format
data_df = pd.read_csv("data/train_small.csv")
print("shape : ", data_df.shape)
data_df.iloc[:, [0, 1, 2, 3, 197, 198, 199, -3, -2, -1]].head(10)
```

Out[4]:

The indices on the horizontal axis represent time steps with equal intervals, can be considered as the unit of time. The values of phi for these times are listed. The "future" column represents the value of phi at 50 steps ahead. This is what we will try to predict in the regression problem. The "system" column represents one of the five mechanical systems (set of formulas) that was used to generate the time series. This will be the classification target. Each system comes with a bunch of "hyperparameters" which we draw randomly from a distribution. We will use different distributions which are indexed in the "distribution" column. This variable will be available both at training and test time, but it may have values in the test set not present in the training set. So this is not a target but not really a feature either, more like a "hint" (see cross-validation below).

In [5]:

```
y_future = data_df['future'].values
y_system = data_df['system'].values
x_distribution = data_df['distribution'].values
x_series = data_df.drop(['future', 'system', 'distribution'], axis=1).values
```

In [6]:

```
x_series
```

Out[6]:

In [7]:

```
x_distribution
```

Out[7]:

In [8]:

```
y_system
```

Out[8]:

In [9]:

```
y_future
```

Out[9]:

The starting kit only includes the small data files, 5 rows only. The full public train and test datasets are to be downloaded with the script:

In [10]:

```
!python download_data.py
```

Let's now pick one row and try to predict the future position of the planet by modeling the motion. To illustrate the challenge better, we have some additional set of files with extra information, namely the 2D-positions of the planets involved.

In [11]:

```
file = "illustration"
data = pd.read_csv("data/" + file + "_phi.csv")
# Additional data items that are only for illustration:
data_x = pd.read_csv("data/" + file + "_x.csv")
data_y = pd.read_csv("data/" + file + "_y.csv")
ref_x = pd.read_csv("data/" + file + "_ref_x.csv")
ref_y = pd.read_csv("data/" + file + "_ref_y.csv")
```

In [12]:

```
def plot_observation(i_sample, t_min=0, t_max=200):
phis = data.iloc[i_sample, range(t_max)].values[t_min:t_max]
ax = plt.gca()
ax.cla()
ax.set_xlim([0, len(phis)])
ax.set_ylim([-np.pi,np.pi])
plt.xlabel('time')
plt.ylabel('$\phi$')
plt.scatter(range(len(phis)), phis)
plt.show()
```

In [13]:

```
def plot_orbit(i_sample, t_min=0, t_max=200):
x = data_x.iloc[i_sample, range(t_max)].values[t_min:t_max]
y = data_y.iloc[i_sample, range(t_max)].values[t_min:t_max]
x0 = ref_x.iloc[i_sample, range(t_max)].values[t_min:t_max]
y0 = ref_y.iloc[i_sample, range(t_max)].values[t_min:t_max]
ax = plt.gca()
ax.cla()
xmax = np.amax(np.abs([x, y, x0, y0])) + 20
ax.set_xlim([-xmax, xmax])
ax.set_ylim([-xmax, xmax])
plt.axes().set_aspect('equal')
plt.scatter(x0, y0, color='r')
plt.plot(x0, y0, color='r')
plt.scatter(x, y)
for i, t in enumerate(x):
xl = np.array([x0[i], x[i]])
yl = np.array([y0[i], y[i]])
plot = plt.Line2D(xl, yl, color='k', linewidth=0.5, linestyle='dotted')
ax.add_artist(plot)
plt.show()
```

In [14]:

```
plot_observation(0)
```

In [15]:

```
plot_orbit(0)
```

Notice that this is a single row of the dataset. With only the information in a single row, we can already make a prediction of the future, if we make some assumptions. For example...

In [14]:

```
# Function that returns the final angle of several epicycles joined successively
def f_phi(a, w, p, t):
x = np.sum(a * np.cos(w * t + p))
y = np.sum(a * np.sin(w * t + p))
return np.arctan2(y, x)
```

In [15]:

```
# Look at the 0th row
phis = data.iloc[0, range(200)].values
```

In [16]:

```
# Some initial guess of parameters by eye (and some knowledge of physics):
c_initial = np.array([51., 2.*np.sqrt(1./51.), np.pi, 105., np.sqrt(1./105.), 0.])
c = c_initial
a = c[0::3]
w = c[1::3]
p = c[2::3]
times = np.array(range(250))
phi_pred = [f_phi(a, w, p, i) for i in times]
plt.scatter(times[:len(phis)], phis)
plt.plot(times, phi_pred)
plt.show()
```

Now, let's fit the parameters to describe the function better:

In [17]:

```
def epicycle_error(c):
a = c[0::3]
w = c[1::3]
p = c[2::3]
phi_epi = [f_phi(a, w, p, i) for i in range(200)]
return np.sum((np.unwrap(phis - phi_epi))**2)
```

In [18]:

```
c = np.zeros(6)
use_cma = False
if(use_cma):
import cma
es = cma.CMAEvolutionStrategy(x0=c_initial, sigma0 = 0.001)
es.optimize(epicycle_error)
es.result_pretty()
c = np.array(es.result.xbest).reshape(-1, 1)
else:
from scipy.optimize import minimize
res = minimize(fun=epicycle_error, x0=c_initial,
method='Nelder-Mead', tol=1e-10,
options={'xtol': 0.001, 'maxiter': 100000})
c = res.x
a = c[0::3]
w = c[1::3]
p = c[2::3]
times = np.array(range(len(phis)))
phi_pred = [f_phi(a, w, p, i) for i in range(251)]
plt.plot(range(251), phi_pred)
plt.scatter(times, phis)
plt.show()
print("Fitted parameters : ", c)
```

The above is already what Copernican model would correctly describe. Let's now insist that the Mars is rotating in a bunch of epicycles around the earth, which are sucessively smaller. In order to do that, add more epicycles, and L1 regularization to kill those that are not needed. Below is a time-expensive computation, not recommended for the analysis of the full dataset. Since the functions involved are differentiable, it is perhaps possible to implement a faster method for optimization, e.g. gradient descent. An example is in the example: submissions/tf_ptolemy/

In [19]:

```
def epicycle_error_l1reg(c):
a = c[0::3]
w = c[1::3]
p = c[2::3]
phi_epi = [f_phi(a, w, p, i) for i in range(200)]
return np.sum((np.unwrap(phis - phi_epi))**2) + 0.1 * np.sum(np.abs(a) / np.abs(a[0]))
```

In [20]:

```
c_overfit = np.array([51., 0.3, 0.,
10., 0.1, 0.,
10., 0.5, 0.,
5., 0.1, 0.,
5., 0.5, 0.,
])
c = np.zeros(18)
use_cma = False
if(use_cma):
import cma
es = cma.CMAEvolutionStrategy(x0=c_overfit, sigma0 = 0.001)
es.optimize(epicycle_error)
es.result_pretty()
c = np.array(es.result.xbest).reshape(-1, 1)
else:
from scipy.optimize import minimize
res = minimize(fun=epicycle_error_l1reg, x0=c_overfit,
method='Nelder-Mead', tol=1e-10,
options={'xtol': 0.001, 'maxiter': 10000})
c = res.x
a = c[0::3]
w = c[1::3]
p = c[2::3]
times = np.array(range(len(phis)))
phi_pred = [f_phi(a, w, p, i) for i in range(251)]
plt.scatter(times, phis)
plt.plot(range(251), phi_pred)
plt.show()
print("Fitted parameters : ", c)
```

Let's see whether LSTM can predict that far ahead...

In [28]:

```
from keras.models import Sequential
from keras.models import Model
from keras.layers import Input, Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
```

In [121]:

```
n_time = 50
target_future = 50
predict_future = 50
n_train = 100
sample_step = 1
n_steps = len(phis)
X_ts = np.ndarray(shape=(n_train, n_time, 1))
for i in range(0, n_train):
X_ts[i, :, 0] = phis[i * sample_step:i * sample_step + n_time] / np.pi
target = phis[n_time + predict_future: n_train * sample_step + n_time + predict_future: sample_step] / np.pi
inputs = Input(shape=(n_time, 1), dtype='float', name='main_input')
layer = LSTM(12)(inputs)
predictions = Dense(1)(layer)
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='adam',
loss='mean_squared_error')
model.fit(X_ts, target, epochs=8, batch_size=1, verbose=0)
```

Out[121]:

In [122]:

```
x_pred = np.ndarray(shape=(n_steps + target_future, n_time, 1))
for i in range(0, n_steps - n_time):
x_pred[i, :, 0] = phis[i:i + n_time] / np.pi
for i in range(n_steps - n_time, n_steps + target_future):
x_pred[i, :, 0] = model.predict(x_pred[i - n_time - predict_future].reshape(1, n_time, 1))[0]
plt.scatter(times, phis)
plt.plot(range(n_steps + target_future), x_pred[:, 0, 0] * np.pi)
plt.show()
```

Notice that this was a single row in a dataset. The challenge is to do this analysis for each row separately. The dataset now consists of alternative physics/mechanics scenarios, in which some formulas/dynamics are made up. Each row in the dataset is another physics system with unknown dynamics, and randomized initial state, for which the participants are to determine the relevant model first and re-fit for row-specific parameters.

The participant can follow many different strategies. Some examples are provided, which are explained below, but any solution that returns the right classification and regression results is fine.

The above process could be performed either inside the predict() function, or, it can be performed in a transform() function in FeatureExtractor and the predict() simply returns the corresponding result.

Although so far the classification part of the problem has been described as a helpful side task, it is essential to the main motivation of this challenge: Invent theoretical models in science.

Ideally, we would like to find the best functions that describe the data. In this challenge, we simplified this effort by assuming that the first step in generating the formula is to be able to classify a dataset into a set of given formulas.

As the formula space is extended, the problem gets similar to theory generation.

The starting kit is basically a template for all the functions to be implemented. It attempts to solve the classification and regression problems without any knowledge of the context and the discussion above. You can use this example to start implementing your own approach from complete scratch.

In [ ]:

```
!ramp_test_submission --submission starting_kit
```

Three different LSTM-based examples are provided, distinct by their scope of training.

The simplest version basically takes every row in the dataset, and trains the model, then predicts the result when a new row is given.

In [ ]:

```
!ramp_test_submission --submission lstm_all_sample
```

One can first split the dataset into the different systems and train separate models for each. At the prediction time, we first predict the system type and then use the correspondingly trained model for regression.

In [ ]:

```
!ramp_test_submission --submission lstm_by_class
```

As each row is a time series, one can re-train a new LSTM within each row and predict the 50 steps ahead by running the model recursively. This is already what was illustrated in the above fitting exercise with a single row. This approach, however, will be costly in computing, therefore it should be tested carefully before submitting.

In [ ]:

```
!ramp_test_submission --submission lstm_by_row
```

Although there are two scores involved, we want to set the regression as the main score and treat the classification problem as a side product of the optimization for regression. We want, however, the participants to at least submit a non-trivial solution for classification therefore we assign a small but non-zero proportion for the combined score. The classification score is the fraction of false identifications (printed as "err"), and has a value between 0 and 1. A random assignment is expected to have an error of 0.80.
For the regression, RMS-error is used (printed as "rmse"), which is wrapped around [$\pi$] and [-$\pi$]. It's maximum value is [$\pi$] and a random assignment is expected to have an error around 1.8.
The final error (being lower the better) is 0.1 * err + 0.9 * rmse, printed as "combined".

Each row in the data sample is generated with random values of parameters that go into the formula. As these parameters cannot cover the infinite phase space, they come from certain distributions which may be different for training and testing datasets. To test the robustness of the learning approach, one can already test the dependency of parameter values, by using different input distributions in traning and validation. In order to do that, we have provided a "distribution" variable in the dataset, by selection of which one can have an handle of sub-datasets with different parameter distributions. The the get_cv function in problem.py implements this already, hence the 3 CV folds in the starting kit reflect the above. The "distribution" variable appears in the data X, however make sure you do not train your models with this variable, as the values of this variable in the test sample will be different.

It is **important that you test your submission files before submitting them**. For this we provide a unit test. Note that the test runs on your files in `submissions/starting_kit`

, not on the classes defined in the cells of this notebook.

Check list

- Make sure you have installed
`ramp-workflow`

locally (see above) - Make sure that the python files
`object_detector.py`

is in the`submissions/your_submission`

folder, and the train and test data are in`data`

- If you haven't yet, download the images by executing
`python download_data.py`

Finally, make sure the local processing goes through by running the

`ramp_test_submission --submission your_submission`

If you want to quickly test the that there are no obvious code errors, use the `--quick-test`

keyword to only use data from the first 30 images.

`ramp_test_submission --submission your_submission --quick-test`

In [ ]:

```
!ramp_test_submission --quick-test
```

If you get to see the train and test scores, and no errors, then you can submit your model to the ramp.studio.

Once you found a good model, you can submit them to ramp.studio. First, if it is your first time using RAMP, sign up, otherwise log in. Then find an open event on the particular problem, the event mars_craters for this RAMP. Sign up for the event. Both signups are controled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.

Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) your files. The submission is trained and tested on our backend in the same way as `ramp_test_submission`

does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with `ramp_test_submission`

), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., *locally*, and checking them with `ramp_test_submission`

. The script prints scores for each cross-validation fold

```
score combined err rmse
train 1.322 0.423 1.422
valid 1.740 0.730 1.852
test 1.775 0.745 1.889
```

and the mean cross-validation score at the end

```
----------------------------
Mean CV scores
----------------------------
score combined err rmse
train 1.34 ± 0.0264 0.455 ± 0.024 1.438 ± 0.0305
valid 1.668 ± 0.1004 0.726 ± 0.0605 1.773 ± 0.1066
test 1.722 ± 0.0278 0.706 ± 0.025 1.835 ± 0.0293
----------------------------
Bagged scores
----------------------------
score combined
valid 1.669
test 1.686
```

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the mean average precision AP, that is the area under the precision/recall curve. When the score is good enough, you can submit it at the RAMP.

You can find more information in the README of the ramp-workflow library.

Don't hesitate to contact us.

In [16]:

```
file = "illustration"
data = pd.read_csv("data/" + file + "_phi.csv")
# Additional data items that are only for illustration:
data_x = pd.read_csv("data/" + file + "_x.csv")
data_y = pd.read_csv("data/" + file + "_y.csv")
ref_x = pd.read_csv("data/" + file + "_ref_x.csv")
ref_y = pd.read_csv("data/" + file + "_ref_y.csv")
```

In [17]:

```
def plot_data(i, t_min=0, t_max=200):
plot_orbit(i, t_min, t_max)
plot_observation(i, t_min, t_max)
```

In [ ]:

```
for i in range(len(data.index)):
plot_data(i, 0, 200)
```