Mechanics classification

RAMP on Mechanics determination

Yetkin Yilmaz (LAL), Balazs Kegl (CDS, LAL)

The goal of this challenge is to find ways to simulate the process of theoretical model finding in physics and similar sciences. A good example is the paradigm shift between the Ptolemaic and Copernican models of the solar system.

In the challenge we start by putting ourselves in Ptolemy's shoes and approach the same problem in imaginary systems, in which the motion of the objects in the sky may follow completely different underlying mechanics, unknown to us.


In the beginning, the task was simply to predict the future. In the early history of science, the capability of predicting the future gave individuals influential power; hence the investment of rulers in astrologers, and eventually in astronomers.

The scientific method was invented through this effort, marked by the Copernican revolution, which significantly boosted the predictive power of the models. While Ptolemy had a model that could, with a sufficient number of parameters, explain what had been observed, the Copernican model was able to predict what came next.

In the terminology of ML, Ptolemy was able to achieve a good training score, but Copernicus was able to optimize training and test scores simultaneously. This observation led to the triumph of the scientific method.

Inspired by this, we first focus on the regression problem, and then discuss how it leads to a formula classification task that helps solve the regression better.

The challenge consists of two parts: classification and regression. If one wishes to solve the two tasks simultaneously, the FeatureExtractor element can be used. The workflow can be briefly sketched as: FeatureExtractor → Classifier → Regressor.

In this tutorial, we discuss the workflow elements in reverse order (Regressor, Classifier, FeatureExtractor) in order to understand the motivations better. We consider regression to be the ultimate problem; every step added before it is a tool that eases the task of its successor.

Regression problem

We will observe the angular position (phi) of a planet and try to predict its value in the future, 50 steps after the last column. Phi takes values in [-$\pi$, $\pi$], and the error is not supposed to scale with the value, therefore we have chosen the RMS error as the score metric.
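Since phi wraps around at ±$\pi$, the difference between two nearby angles can show a spurious jump of almost 2$\pi$. numpy's unwrap removes such jumps, and it is used below when computing fit residuals. A small illustration:

In [ ]:
import numpy as np

# phi crosses the +pi/-pi boundary; np.unwrap removes the artificial jump
phi_demo = np.array([3.0, 3.1, -3.1, -3.0])
print(np.unwrap(phi_demo))  # [3.     3.1    3.1832 3.2832] (approximately)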

Data format

The main data is the time series of angles of a planet, observed from a point of reference (not necessarily an inertial frame). Although the motion of the planet may be very different from what we know, the motion is restricted to a two-dimensional plane for the sake of simplicity.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [4]:
# Main data format
data_df = pd.read_csv("data/train_small.csv")

print("shape : ", data_df.shape)
data_df.iloc[:, [0, 1, 2, 3, 197, 198, 199, -3, -2, -1]].head(10)
shape :  (200, 203)
Out[4]:
0 1 2 3 197 198 199 system distribution future
0 0.46730 0.47928 0.49104 0.50261 2.71572 2.72576 2.73603 C 12 1.55645
1 0.77730 0.85519 0.91081 0.93460 0.66001 0.79071 0.92095 B 12 -1.99754
2 -2.26211 -2.37106 -2.48448 -2.60144 1.50304 1.35650 1.20993 E 14 -0.27651
3 -1.61153 -1.65357 -1.69793 -1.74406 -1.33709 -1.30292 -1.27312 C 14 -1.30723
4 -0.11265 -0.10658 -0.10057 -0.09412 -0.10147 -0.06264 -0.02405 B 10 1.86705
5 1.65886 1.70642 1.73964 1.75591 -1.06408 -0.88441 -0.71081 A 13 -3.11365
6 -0.95944 -0.87065 -0.77946 -0.68566 0.66309 0.78207 0.90190 A 11 -0.20693
7 2.52243 2.55029 2.58098 2.61402 -2.42054 -2.35892 -2.29736 D 14 1.52666
8 -0.68677 -0.64949 -0.61257 -0.57598 2.25664 2.31062 2.36461 D 11 -1.02371
9 -0.56130 -0.52931 -0.51508 -0.51869 -2.09188 -2.19330 -2.29531 A 10 1.48291

The indices on the horizontal axis represent time steps with equal intervals; the interval can be considered the unit of time. The values of phi at these times are listed. The "future" column holds the value of phi 50 steps ahead; this is what we will try to predict in the regression problem. The "system" column indicates which of the five mechanical systems (sets of formulas) was used to generate the time series; this is the classification target. Each system comes with a set of "hyperparameters" which we draw randomly from a distribution. We use several different distributions, indexed in the "distribution" column. This variable is available both at training and test time, but it may take values in the test set that are not present in the training set. So it is not a target, but not really a feature either; it is more like a "hint" (see cross-validation below).

In [5]:
y_future = data_df['future'].values
y_system = data_df['system'].values
x_distribution = data_df['distribution'].values
x_series = data_df.drop(['future', 'system', 'distribution'], axis=1).values
In [6]:
x_series
Out[6]:
array([[ 0.4673 ,  0.47928,  0.49104, ...,  2.71572,  2.72576,  2.73603],
       [ 0.7773 ,  0.85519,  0.91081, ...,  0.66001,  0.79071,  0.92095],
       [-2.26211, -2.37106, -2.48448, ...,  1.50304,  1.3565 ,  1.20993],
       ...,
       [-0.39613, -0.37815, -0.3584 , ...,  0.27087,  0.30834,  0.34622],
       [-3.10665, -2.98594, -2.86553, ..., -0.5604 , -0.48937, -0.41904],
       [-1.97592, -1.93239, -1.88605, ..., -1.24425, -1.19202, -1.13827]])
In [7]:
x_distribution
Out[7]:
array([12, 12, 14, 14, 10, 13, 11, 14, 11, 10, 12, 13, 12, 12, 14, 13, 11,
       12, 12, 11, 12, 14, 12, 13, 11, 11, 14, 14, 12, 10, 12, 14, 12, 11,
       13, 12, 13, 10, 13, 14, 13, 14, 14, 12, 12, 13, 11, 12, 11, 13, 10,
       14, 14, 13, 14, 14, 13, 11, 11, 14, 13, 13, 13, 11, 12, 11, 11, 12,
       14, 13, 13, 12, 11, 13, 14, 14, 10, 10, 12, 11, 13, 13, 11, 11, 10,
       10, 14, 14, 14, 11, 11, 11, 10, 12, 11, 11, 14, 10, 14, 11, 10, 11,
       13, 12, 13, 12, 11, 14, 12, 12, 12, 11, 10, 13, 13, 11, 10, 14, 14,
       11, 10, 12, 10, 10, 13, 10, 10, 12, 13, 10, 14, 10, 12, 14, 11, 14,
       11, 10, 10, 12, 10, 13, 14, 13, 13, 10, 10, 10, 14, 10, 14, 12, 13,
       13, 14, 12, 12, 14, 10, 13, 12, 14, 14, 12, 11, 10, 11, 11, 12, 13,
       13, 11, 10, 14, 12, 11, 14, 12, 13, 14, 13, 10, 10, 13, 14, 12, 10,
       11, 11, 10, 10, 13, 12, 11, 12, 11, 14, 12, 13, 10])
In [8]:
y_system
Out[8]:
array(['C', 'B', 'E', 'C', 'B', 'A', 'A', 'D', 'D', 'A', 'A', 'C', 'E',
       'D', 'D', 'B', 'B', 'A', 'B', 'A', 'E', 'B', 'C', 'C', 'A', 'A',
       'E', 'D', 'B', 'C', 'C', 'A', 'D', 'B', 'A', 'E', 'B', 'D', 'D',
       'A', 'D', 'D', 'B', 'E', 'C', 'D', 'B', 'A', 'E', 'D', 'C', 'D',
       'D', 'B', 'D', 'A', 'B', 'A', 'C', 'C', 'D', 'A', 'C', 'A', 'B',
       'A', 'D', 'D', 'B', 'E', 'D', 'B', 'C', 'A', 'B', 'C', 'B', 'D',
       'D', 'A', 'A', 'E', 'E', 'C', 'E', 'E', 'D', 'A', 'C', 'D', 'A',
       'D', 'C', 'E', 'A', 'E', 'E', 'E', 'D', 'A', 'C', 'B', 'E', 'C',
       'E', 'B', 'A', 'C', 'E', 'A', 'A', 'B', 'C', 'B', 'A', 'B', 'A',
       'D', 'A', 'E', 'C', 'B', 'C', 'E', 'C', 'C', 'E', 'A', 'B', 'C',
       'D', 'B', 'B', 'A', 'E', 'E', 'B', 'E', 'C', 'D', 'A', 'C', 'C',
       'D', 'B', 'E', 'D', 'E', 'E', 'D', 'B', 'E', 'C', 'E', 'C', 'B',
       'B', 'C', 'D', 'C', 'C', 'E', 'C', 'B', 'A', 'A', 'B', 'D', 'A',
       'B', 'D', 'E', 'E', 'A', 'B', 'B', 'C', 'E', 'B', 'B', 'E', 'E',
       'C', 'B', 'E', 'D', 'D', 'D', 'C', 'E', 'D', 'E', 'B', 'D', 'B',
       'E', 'A', 'D', 'B', 'B'], dtype=object)
In [9]:
y_future
Out[9]:
array([ 1.55645, -1.99754, -0.27651, -1.30723,  1.86705, -3.11365,
       -0.20693,  1.52666, -1.02371,  1.48291,  0.68662, -1.58258,
       -0.60436,  2.52805,  0.88146, -0.05055,  1.01023, -1.24577,
        1.75874,  1.19494, -0.07263,  2.07198,  1.49171, -0.14395,
        1.90662,  0.87529, -0.83032,  0.93406, -0.95521,  1.12551,
       -2.44419, -2.7345 ,  2.15235, -0.97979, -0.17747, -1.33575,
       -1.21158,  1.8016 , -1.098  , -2.8022 ,  0.34522,  2.21185,
       -1.02968, -0.20183, -1.20952, -2.63829,  1.05192,  0.21538,
        2.95818, -2.29037,  2.2144 ,  2.47155, -0.88815,  1.45926,
       -1.81677,  1.66056,  1.40231, -0.04069,  0.60029,  2.10289,
       -0.26635, -1.7032 ,  0.35601, -0.56147, -2.87334,  2.47941,
        0.34363,  2.49176,  0.49962,  3.06193, -0.9236 ,  1.11707,
       -1.474  , -1.49481, -0.02415, -0.76735,  1.95333,  2.44304,
        2.30825,  2.3635 , -0.44243,  0.23529,  2.78502,  1.63556,
       -2.10955, -0.75596, -2.4211 , -0.455  , -0.86509, -2.99509,
        0.63001, -2.81298, -2.15483,  0.79475, -1.74288, -2.51311,
       -0.63911, -0.96516,  2.73464,  0.4561 , -0.64905, -2.652  ,
       -0.94161, -0.92161, -1.46436,  1.8359 ,  2.3728 ,  1.30986,
        2.56531,  2.70044, -0.63167, -2.87286, -1.21701,  2.9063 ,
       -1.61884,  3.01047,  1.34218,  0.5721 ,  2.59282,  0.32958,
       -1.43949,  0.33654,  1.38442, -1.33579,  0.30157, -1.428  ,
       -1.0595 ,  1.41994, -0.56992,  1.01703, -1.35289, -2.24516,
       -2.42246, -0.77507, -2.66018,  0.0083 ,  2.70029, -2.15505,
        2.2558 ,  1.66627, -2.13916,  0.58818,  2.83787, -1.30919,
        1.27403,  0.77031,  2.37107,  0.25684, -0.63147, -0.70992,
       -0.48562, -0.25293,  2.12373, -0.68955, -0.13685,  1.16542,
       -2.0822 ,  1.79931, -2.2413 ,  1.4738 ,  2.21478,  1.50745,
        0.46313,  1.69537, -0.2336 , -2.55909,  0.24271, -1.41317,
       -0.14806, -2.04283, -0.54231, -1.71161, -1.22565,  2.88291,
        2.18891,  2.62623,  0.88163,  2.99639, -1.7229 ,  1.09224,
       -1.37788,  0.6025 , -2.93865,  0.64708,  0.58693, -2.68705,
        1.96022, -1.17503,  0.83633,  1.41745, -1.09345,  2.762  ,
        0.14216, -0.60697, -2.38461, -0.42178, -1.11095,  2.7235 ,
        1.67746,  1.67797])

The starting kit includes only the small data files. The full public train and test datasets can be downloaded with the script:

In [10]:
!python download_data.py

Fitting a single time series

Let's now pick one row and try to predict the future position of the planet by modeling the motion. To better illustrate the challenge, we provide an additional set of files with extra information, namely the 2D positions of the planets involved.

In [11]:
file = "illustration"
data = pd.read_csv("data/" + file + "_phi.csv")

# Additional data items that are only for illustration:
data_x = pd.read_csv("data/" + file + "_x.csv")
data_y = pd.read_csv("data/" + file + "_y.csv")
ref_x = pd.read_csv("data/" + file + "_ref_x.csv")
ref_y = pd.read_csv("data/" + file + "_ref_y.csv")
In [12]:
def plot_observation(i_sample, t_min=0, t_max=200):
    phis = data.iloc[i_sample, range(t_max)].values[t_min:t_max]
    ax = plt.gca()
    ax.cla()
    ax.set_xlim([0, len(phis)])
    ax.set_ylim([-np.pi,np.pi])
    plt.xlabel('time')
    plt.ylabel(r'$\phi$')

    plt.scatter(range(len(phis)), phis)
    plt.show()
In [13]:
def plot_orbit(i_sample, t_min=0, t_max=200):
    x = data_x.iloc[i_sample, range(t_max)].values[t_min:t_max]
    y = data_y.iloc[i_sample, range(t_max)].values[t_min:t_max]
    x0 = ref_x.iloc[i_sample, range(t_max)].values[t_min:t_max]
    y0 = ref_y.iloc[i_sample, range(t_max)].values[t_min:t_max]

    ax = plt.gca()
    ax.cla()
    xmax = np.amax(np.abs([x, y, x0, y0])) + 20

    ax.set_xlim([-xmax, xmax])
    ax.set_ylim([-xmax, xmax])
    ax.set_aspect('equal')
    plt.scatter(x0, y0, color='r')
    plt.plot(x0, y0, color='r')
    plt.scatter(x, y)
    for i, t in enumerate(x):
        xl = np.array([x0[i], x[i]])
        yl = np.array([y0[i], y[i]])
        plot = plt.Line2D(xl, yl, color='k', linewidth=0.5, linestyle='dotted')
        ax.add_artist(plot)
    plt.show()
In [14]:
plot_observation(0)
In [15]:
plot_orbit(0)

Notice that this is a single row of the dataset. With only the information in a single row, we can already make a prediction of the future, provided we make some assumptions. For example:

In [14]:
# Function that returns the final angle of several epicycles joined successively
def f_phi(a, w, p, t):
    x = np.sum(a * np.cos(w * t + p))
    y = np.sum(a * np.sin(w * t + p))
    return np.arctan2(y, x)
In [15]:
# Look at the 0th row
phis = data.iloc[0, range(200)].values
In [16]:
# Some initial guess of parameters by eye (and some knowledge of physics);
# the parameters come in (amplitude a, angular velocity w, phase p) triplets,
# one triplet per epicycle:
c_initial = np.array([51., 2. * np.sqrt(1. / 51.), np.pi,
                      105., np.sqrt(1. / 105.), 0.])

c = c_initial
a = c[0::3]
w = c[1::3]
p = c[2::3]

times = np.array(range(250))
phi_pred = [f_phi(a, w, p, i) for i in times]

plt.scatter(times[:len(phis)], phis)
plt.plot(times, phi_pred)
plt.show()

Now, let's fit the parameters to describe the function better:

In [17]:
def epicycle_error(c):
    a = c[0::3]
    w = c[1::3]
    p = c[2::3]
    phi_epi = [f_phi(a, w, p, i) for i in range(200)]
    # unwrap the residuals so that a 2*pi jump is not penalized
    return np.sum((np.unwrap(phis - phi_epi))**2)
In [18]:
c = np.zeros(6)

use_cma = False
if use_cma:
    import cma
    es = cma.CMAEvolutionStrategy(x0=c_initial, sigma0=0.001)
    es.optimize(epicycle_error)
    es.result_pretty()
    c = np.array(es.result.xbest).reshape(-1, 1)
else:
    from scipy.optimize import minimize
    res = minimize(fun=epicycle_error, x0=c_initial,
                   method='Nelder-Mead', tol=1e-10,
                   options={'xtol': 0.001, 'maxiter': 100000})
    c = res.x

a = c[0::3]
w = c[1::3]
p = c[2::3]

times = np.array(range(len(phis)))
phi_pred = [f_phi(a, w, p, i) for i in range(251)]

plt.plot(range(251), phi_pred)
plt.scatter(times, phis)
plt.show()

print("Fitted parameters : ", c)
Fitted parameters :  [5.21092589e+01 2.69584554e-01 3.15340607e+00 1.07120115e+02
 9.13450445e-02 5.12794314e-03]

The above is already what the Copernican model would correctly describe. Let's now insist that Mars rotates around the Earth in a set of successively smaller epicycles. To do that, we add more epicycles, plus L1 regularization to kill those that are not needed. Below is a time-expensive computation, not recommended for the analysis of the full dataset. Since the functions involved are differentiable, it is possible to implement a faster optimization, e.g. gradient descent; an example is provided in submissions/tf_ptolemy/.

In [19]:
def epicycle_error_l1reg(c):
    a = c[0::3]
    w = c[1::3]
    p = c[2::3]
    phi_epi = [f_phi(a, w, p, i) for i in range(200)]
    return np.sum((np.unwrap(phis - phi_epi))**2) + 0.1 * np.sum(np.abs(a) / np.abs(a[0]))
In [20]:
c_overfit = np.array([51., 0.3, 0., 
                      10., 0.1, 0.,
                      10., 0.5, 0.,
                      5., 0.1, 0.,
                      5., 0.5, 0.,
                     ])

c = np.zeros(len(c_overfit))

use_cma = False
if use_cma:
    import cma
    es = cma.CMAEvolutionStrategy(x0=c_overfit, sigma0=0.001)
    es.optimize(epicycle_error_l1reg)
    es.result_pretty()
    c = np.array(es.result.xbest).reshape(-1, 1)
else:
    from scipy.optimize import minimize
    res = minimize(fun=epicycle_error_l1reg, x0=c_overfit,
                   method='Nelder-Mead', tol=1e-10,
                   options={'xtol': 0.001, 'maxiter': 10000})
    c = res.x


a = c[0::3]
w = c[1::3]
p = c[2::3]

times = np.array(range(len(phis)))
phi_pred = [f_phi(a, w, p, i) for i in range(251)]

plt.scatter(times, phis)
plt.plot(range(251), phi_pred)
plt.show()

print("Fitted parameters : ", c)
Fitted parameters :  [ 6.04332050e+01  9.10313146e-02  2.42081947e-03  5.98938042e+00
  1.12758734e-01  3.85993346e-03  4.45081474e-01  5.09210615e-01
 -2.79211269e-03  5.08937046e+00  6.54358343e-02 -6.40902640e-04
  1.44952153e+00  4.73680250e-01  2.19746931e-03]
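
As noted above, the epicycle model is differentiable, so the parameters can also be fitted by gradient descent. Below is a rough sketch using TensorFlow 2-style eager execution; the actual tf_ptolemy submission may be implemented differently, and for simplicity this sketch ignores the ±$\pi$ wrapping that epicycle_error handles with np.unwrap:

In [ ]:
import tensorflow as tf

# Assumes phis and c_initial from the cells above
t_grid = tf.constant(np.arange(200.0))   # float64 time grid
phis_tf = tf.constant(phis)
c_var = tf.Variable(c_initial)           # (a, w, p) triplets

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
for step in range(2000):
    with tf.GradientTape() as tape:
        a, w, p = c_var[0::3], c_var[1::3], c_var[2::3]
        x = tf.reduce_sum(a[:, None] * tf.cos(w[:, None] * t_grid + p[:, None]), axis=0)
        y = tf.reduce_sum(a[:, None] * tf.sin(w[:, None] * t_grid + p[:, None]), axis=0)
        loss = tf.reduce_mean((phis_tf - tf.atan2(y, x)) ** 2)  # no unwrap here
    grads = tape.gradient(loss, [c_var])
    optimizer.apply_gradients(zip(grads, [c_var]))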

Let's see whether an LSTM can predict that far ahead...

In [28]:
from keras.models import Model
from keras.layers import Input, Dense, LSTM
Using TensorFlow backend.
In [121]:
n_time = 50          # length of each input window
target_future = 50   # how far beyond the series we will extrapolate
predict_future = 50  # how far ahead of each window the target lies
n_train = 100        # number of training windows
sample_step = 1
n_steps = len(phis)

X_ts = np.ndarray(shape=(n_train, n_time, 1))

# Overlapping windows of the normalized series
for i in range(0, n_train):
    X_ts[i, :, 0] = phis[i * sample_step:i * sample_step + n_time] / np.pi

# Target: the value predict_future steps after each window ends
target = phis[n_time + predict_future: n_train * sample_step + n_time + predict_future: sample_step] / np.pi
inputs = Input(shape=(n_time, 1), dtype='float', name='main_input')
layer = LSTM(12)(inputs)
predictions = Dense(1)(layer)
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='adam',
              loss='mean_squared_error')

model.fit(X_ts, target, epochs=8, batch_size=1, verbose=0)
Epoch 1/8
 - 5s - loss: 0.3100
Epoch 2/8
 - 3s - loss: 0.1826
Epoch 3/8
 - 3s - loss: 0.1300
Epoch 4/8
 - 3s - loss: 0.0851
Epoch 5/8
 - 3s - loss: 0.0779
Epoch 6/8
 - 3s - loss: 0.0850
Epoch 7/8
 - 3s - loss: 0.0745
Epoch 8/8
 - 3s - loss: 0.0555
Out[121]:
<keras.callbacks.History at 0x140fb6fd0>
In [122]:
x_pred = np.ndarray(shape=(n_steps + target_future, n_time, 1))

# Windows fully covered by observed data
for i in range(0, n_steps - n_time):
    x_pred[i, :, 0] = phis[i:i + n_time] / np.pi

# Predict the value at time i from the window that starts
# n_time + predict_future steps earlier; the scalar prediction is
# broadcast over the window, but only x_pred[:, 0, 0] is plotted below
for i in range(n_steps - n_time, n_steps + target_future):
    x_pred[i, :, 0] = model.predict(x_pred[i - n_time - predict_future].reshape(1, n_time, 1))[0]

plt.scatter(times, phis)
plt.plot(range(n_steps + target_future), x_pred[:, 0, 0] * np.pi)
plt.show()

The challenge

Notice that this was a single row of the dataset. The challenge is to carry out this analysis for each row separately. The dataset consists of alternative physics/mechanics scenarios in which the formulas/dynamics are made up. Each row is another physical system with unknown dynamics and a randomized initial state; participants must first determine the relevant model and then re-fit the row-specific parameters.

The participant can follow many different strategies. Some examples are provided and explained below, but any solution that returns the right classification and regression results is fine.

Using feature extractor to simplify data

The above process can be performed either inside the predict() function, or in a transform() function in the FeatureExtractor, in which case predict() simply returns the corresponding result.
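
A minimal sketch of the latter option, assuming the usual RAMP FeatureExtractor interface (fit and transform; see the provided submissions for the authoritative signatures):

In [ ]:
class FeatureExtractor:
    def fit(self, X_df, y):
        # nothing to fit in this trivial sketch
        return self

    def transform(self, X_df):
        # e.g. do the per-row model fitting here and return the fitted
        # parameters as features; this sketch just drops the 'distribution'
        # hint and passes the raw series through
        return X_df.drop(['distribution'], axis=1).values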

Classification problem and score

Although the classification part of the problem has so far been described as a helpful side task, it is essential to the main motivation of this challenge: inventing theoretical models in science.

Ideally, we would like to find the best functions that describe the data. In this challenge, we simplified this effort by assuming that the first step in generating the formula is the ability to classify a dataset into a set of given formulas.

As the formula space is extended, the problem becomes similar to theory generation.
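
For instance, a crude classification baseline (a sketch, not the starting kit's actual code; the feature choices here are illustrative) could compress each series into a few summary statistics and feed them to a standard classifier:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

def summary_features(series_matrix):
    # dominant frequency of the unwrapped angle plus overall variability
    spectra = np.abs(np.fft.rfft(np.unwrap(series_matrix, axis=1), axis=1))
    dominant = np.argmax(spectra[:, 1:], axis=1)
    return np.column_stack([dominant, np.std(series_matrix, axis=1)])

clf = RandomForestClassifier(n_estimators=100)
clf.fit(summary_features(x_series), y_system)
print(clf.score(summary_features(x_series), y_system))  # training accuracy only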

starting_kit : Simple ML model

The starting kit is basically a template for all the functions to be implemented. It attempts to solve the classification and regression problems without any knowledge of the context discussed above. You can use this example to start implementing your own approach from scratch.

In [ ]:
!ramp_test_submission --submission starting_kit

LSTM approaches

Three different LSTM-based examples are provided, distinct by their scope of training.

LSTM trained on all samples

The simplest version trains a single model on every row of the dataset, then predicts the result when a new row is given.
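
A sketch of how the training windows could be pooled across rows (window length, stride, and horizon here are illustrative choices, not necessarily those of the submission):

In [ ]:
n_time = 50
X_all, y_all = [], []
for row in x_series:  # one time series per row, loaded above
    # slide a window over the series; the target sits 50 steps after
    # the last point of each window
    for start in range(0, row.shape[0] - n_time - 50 + 1, 10):
        X_all.append(row[start:start + n_time] / np.pi)
        y_all.append(row[start + n_time + 50 - 1] / np.pi)
X_all = np.array(X_all)[:, :, np.newaxis]  # shape (n_windows, n_time, 1)
y_all = np.array(y_all)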

In [ ]:
!ramp_test_submission --submission lstm_all_sample

LSTM trained for each system

One can first split the dataset by system and train a separate model for each. At prediction time, we first predict the system type and then use the correspondingly trained model for regression.

In [ ]:
!ramp_test_submission --submission lstm_by_class

LSTM trained for each row

As each row is a time series, one can train a new LSTM on each row and predict 50 steps ahead by running the model recursively; this is what was illustrated in the fitting exercise on a single row above. This approach, however, is computationally costly, so it should be tested carefully before submitting.

In [ ]:
!ramp_test_submission --submission lstm_by_row

Scoring

Although there are two scores involved, we set regression as the main score and treat classification as a side product of the optimization for regression. We do, however, want participants to submit at least a non-trivial solution for classification, so we assign it a small but non-zero proportion of the combined score. The classification score is the fraction of false identifications (printed as "err") and takes values between 0 and 1; a random assignment is expected to have an error of 0.80. For the regression, the RMS error is used (printed as "rmse"), with differences wrapped around at $\pi$ and -$\pi$. Its maximum value is $\pi$, and a random assignment is expected to have an error around 1.8. The final error (lower is better) is 0.1 err + 0.9 rmse, printed as "combined".
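
A minimal sketch of these metrics (the authoritative definitions live in problem.py):

In [ ]:
def wrapped_rmse(y_true, y_pred):
    # wrap angular differences into [-pi, pi) before squaring
    diff = (y_pred - y_true + np.pi) % (2 * np.pi) - np.pi
    return np.sqrt(np.mean(diff ** 2))

def classification_err(y_true, y_pred):
    return np.mean(y_true != y_pred)

def combined_error(err, rmse):
    return 0.1 * err + 0.9 * rmse  # lower is better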

Cross-Validation

Each row in the data sample is generated with random values of the parameters that go into the formula. As these parameters cannot cover the infinite phase space, they come from certain distributions, which may differ between the training and testing datasets. To test the robustness of the learning approach, one can already probe the dependency on parameter values by using different input distributions in training and validation. For this purpose we provide the "distribution" variable in the dataset, which gives a handle on sub-datasets with different parameter distributions. The get_cv function in problem.py implements this already; the 3 CV folds in the starting kit reflect it. The "distribution" variable appears in the data X, but make sure you do not train your models on it, as its values in the test sample will be different.
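
A sketch of what such a distribution-aware split might look like (see get_cv in problem.py for the actual implementation):

In [ ]:
def distribution_cv(x_distribution):
    # hold out one parameter distribution at a time as the validation fold
    for d in np.unique(x_distribution):
        train_idx = np.where(x_distribution != d)[0]
        valid_idx = np.where(x_distribution == d)[0]
        yield train_idx, valid_idx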

Local testing/exploration

It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.

Check list

  • Make sure you have installed ramp-workflow locally (see above)
  • Make sure that the required python submission files (the feature extractor, classifier, and regressor) are in the submissions/your_submission folder, and that the train and test data are in data
  • If you haven't yet, download the data by executing
    python download_data.py

Finally, make sure the local processing goes through by running the

ramp_test_submission --submission your_submission

If you want to quickly test that there are no obvious code errors, use the --quick-test switch to run on only a small subset of the data.

ramp_test_submission --submission your_submission --quick-test
In [ ]:
!ramp_test_submission --quick-test

If you see the train and test scores with no errors, you can submit your model to ramp.studio.

Submitting to ramp.studio

Once you have found a good model, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then find the open event for this particular problem and sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) your files. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions locally (add feature transformations, select models, perhaps do some AutoML/hyperopt, etc.) and check them with ramp_test_submission. The script prints scores for each cross-validation fold

        score  combined    err   rmse
        train     1.322  0.423  1.422
        valid     1.740  0.730  1.852
        test      1.775  0.745  1.889

and the mean cross-validation score at the end

----------------------------
Mean CV scores
----------------------------
        score        combined         err            rmse
        train   1.34 ± 0.0264   0.455 ± 0.024  1.438 ± 0.0305
        valid  1.668 ± 0.1004   0.726 ± 0.0605 1.773 ± 0.1066
        test   1.722 ± 0.0278   0.706 ± 0.025  1.835 ± 0.0293
----------------------------
Bagged scores
----------------------------
        score  combined
        valid     1.669
        test      1.686

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the combined error described in the Scoring section above. When the score is good enough, you can submit your model at the RAMP.

More information

You can find more information in the README of the ramp-workflow library.

Contact

Don't hesitate to contact us.

Appendix: Visualize some example systems

In [16]:
file = "illustration"
data = pd.read_csv("data/" + file + "_phi.csv")

# Additional data items that are only for illustration:
data_x = pd.read_csv("data/" + file + "_x.csv")
data_y = pd.read_csv("data/" + file + "_y.csv")
ref_x = pd.read_csv("data/" + file + "_ref_x.csv")
ref_y = pd.read_csv("data/" + file + "_ref_y.csv")
In [17]:
def plot_data(i, t_min=0, t_max=200):
    plot_orbit(i, t_min, t_max)
    plot_observation(i, t_min, t_max)
In [ ]:
for i in range(len(data.index)):
    plot_data(i, 0, 200)