El Nino forecast
el_nino_starting_kit

Paris Saclay Center for Data Science

<a href=https://www.ramp.studio/problems/el_nino>RAMP</a> on El Nino prediction

Balázs Kégl (CNRS), Claire Monteleoni (GWU), Mahesh Mohan (GWU), Timothy DelSole (COLA), Kathleen Pegion (COLA), Julie Leloup (UPMC), Alex Gramfort (LTCI), Mehdi Cherti (CNRS), Camille Marini (CNRS)

Introduction

A climate index is a real-valued time series that has been designated of interest in the climate literature. For example, the El Niño–Southern Oscillation (ENSO) index is widely used for predicting regional and seasonal conditions, as it tends to have a strong (positive or negative) correlation with a variety of weather conditions and <a href=http://www.ipcc-wg2.gov/SREX/images/uploads/SREX-SPMbrochure_FINAL.pdf>extreme events</a> throughout the globe. The ENSO index is just one of many climate indices studied. However, there is currently significant room for improvement in predicting even this extremely well-studied index with such high global impact. For example, most statistical and climatological models erred significantly in their predictions of the 2015 El Niño event: their forecasts were off by several months. Better tools to predict such indices are critical for seasonal and regional climate prediction, and would thus address grand challenges in the study of climate change (<a href=https://www.wcrp-climate.org/grand-challenges/grand-challenges-overview>World Climate Research Programme: Grand Challenges, 2013</a>).

El Niño

El Niño (La Niña) is a phenomenon in the equatorial Pacific Ocean characterized by five consecutive overlapping 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of $+0.5^\circ$C ($-0.5^\circ$C). This standard of measure is known as the Oceanic Niño Index (ONI).
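The ONI definition above can be sketched in a few lines of pandas. The anomaly series here is made up (12 warm months followed by 12 neutral months) purely to illustrate the rolling-mean and 5-consecutive-seasons logic; it is not real Niño 3.4 data.

```python
import pandas as pd

# Synthetic monthly Niño 3.4 SST anomalies in °C (illustrative only).
anom = pd.Series([0.9] * 12 + [-0.1] * 12)

# ONI is the 3-month running mean of the SST anomaly.
oni = anom.rolling(window=3, center=True).mean()

# El Niño conditions: ONI above +0.5 °C for 5 consecutive overlapping seasons.
warm = oni > 0.5
el_nino = warm.rolling(window=5).sum() == 5  # True where a 5-season warm run completes
print(int(el_nino.sum()))
```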

More information can be found <a href=https://www.ncdc.noaa.gov/teleconnections/enso/indicators/sst.php>here</a> on why it is an important region, and what the history of the index is.

Here are the <a href = http://iri.columbia.edu/our-expertise/climate/forecasts/enso/current/>current ENSO predictions</a>, updated monthly.

The CCSM4 simulator

Our data comes from the <a href=http://www.cesm.ucar.edu/models/ccsm4.0/>CCSM4.0</a> model (simulator). This gives us access to a full regular temperature map over a 500+ year period, which makes the evaluation of the predictor more robust than if we used real measurements.

The data

The data is a time series of "images" $z_t$, consisting of temperature measurements (for technical reasons we work with air temperature rather than SST) on a regular grid on the Earth, indexed by lon(gitude) and lat(itude) coordinates. Latitude and longitude are sampled at a resolution of $5^\circ$, giving 37 latitude and 72 longitude grid points, that is, 2664 temperature measurements at every time step. The average temperatures are recorded every month: 119 years (1428 time points) in the public training data (available in the starting kit), 155 years (1860 time points) in the training data behind the public leaderboard, and 500 years (6000 time points) in the test data behind the private leaderboard. The goal is to predict the average temperature in the El Niño region, 6 months ahead.
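As a quick sanity check, the sizes quoted above all follow from the $5^\circ$ grid and monthly sampling:

```python
# Grid: 5° resolution; both poles are included, longitude wraps around.
n_lat = 180 // 5 + 1   # -90° to +90° inclusive
n_lon = 360 // 5       # no duplicated meridian
grid_points = n_lat * n_lon
print(n_lat, n_lon, grid_points)     # 37 72 2664

# Monthly time points for the three data sets.
print(119 * 12, 155 * 12, 500 * 12)  # 1428 1860 6000
```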

Note that the data set given in the starting kit is different from the one used to evaluate your submissions (the data structures and the generative model (simulator) are, of course, identical), so your submission should be generic (for example, it should be able to handle time series of different lengths).

The prediction task

The pipeline consists of a feature extractor and a predictor. Since the task is regression, the predictor will be a regressor, and the score (to minimize) will be the <a href=http://en.wikipedia.org/wiki/Root-mean-square_deviation>root mean square error</a>. The feature extractor will have access to the whole data. It will construct a "classical" feature matrix where each row corresponds to a time point. You should collect into these features all the information you find relevant for the regressor. The feature extractor can use anything from the past; that is, it will implement a function $x_t = g(z_1, \ldots, z_t)$. Since you will have access to the full data, in theory you can cheat (even inadvertently) by using information from the future. We have implemented a randomized test to detect such submissions, but please do your best to avoid this, since it would make the results irrelevant.
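The causal constraint $x_t = g(z_1, \ldots, z_t)$ can be sketched as follows. The class and method names here are illustrative, not the exact interface RAMP expects (the starting kit defines that); the point is only that each row of the feature matrix is computed from data up to time $t$ and nothing later.

```python
import numpy as np

# A minimal causal feature extractor: x_t depends only on z_1, ..., z_t.
# Class name, method name, and the two features are illustrative choices.
class FeatureExtractor:
    def transform(self, temperatures):
        """temperatures: array of shape (n_times, n_lat, n_lon)."""
        n_times = temperatures.shape[0]
        features = np.empty((n_times, 2))
        for t in range(n_times):
            past = temperatures[: t + 1]       # only data up to time t
            features[t, 0] = past[-1].mean()   # current global mean temperature
            features[t, 1] = past[-12:].mean() # mean over (up to) the last year
        return features

X = FeatureExtractor().transform(np.zeros((24, 37, 72)))
print(X.shape)  # (24, 2)
```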

Domain-knowledge suggestions

You are of course free to explore any regression technique to improve the prediction. Since the input dimension is relatively large (2000+ dimensions per time point even after subsampling), sparse regression techniques (e.g., LASSO) may be the best way to go, but this is just an a priori suggestion. The following list provides further hints to start with, based on domain knowledge.

  • There is a strong seasonal cycle that must be taken into account.

  • There is little scientific/observational evidence that regions outside the Pacific play a role in NINO3.4 variability, so it is probably best to focus on Pacific SST for predictions.

  • The relation between tropical and extra-tropical Pacific SST is very unclear, so please explore!

  • The NINO3.4 index has an oscillatory character (cold followed by warm followed by cold), but this pattern does not repeat exactly. It would be useful to be able to predict periods when the oscillation is “strong” and when it “breaks down.”

  • A common shortcoming of empirical predictions is that they under-predict the amplitude of warm and cold events. Can this be improved?

  • There is evidence that the predictability is low when forecasts start in, or cross over, March and April (the so-called “spring barrier”). Improving predictions through the spring barrier would be important.
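The first suggestion above, handling the seasonal cycle, is often done by subtracting the per-calendar-month climatology. Here is a sketch on a synthetic series (a pure annual sine), so the residual anomaly should be essentially zero:

```python
import numpy as np
import pandas as pd

# Synthetic monthly temperature: a pure annual cycle around 300 K.
time = pd.date_range('1/1/1700', periods=240, freq='M')
temp = pd.Series(300 + 2 * np.sin(2 * np.pi * np.arange(240) / 12), index=time)

# Climatology: the mean for each calendar month, then subtract it.
climatology = temp.groupby(temp.index.month).mean()
anomaly = temp - temp.index.month.map(climatology).values

print(anomaly.abs().max())  # ~0: the whole signal here is seasonal
```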

Exploratory data analysis

Packages to install:

conda install xarray dask netCDF4 basemap

In [2]:
%matplotlib inline
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

Let's start by reading the data into an xarray Dataset object. You can find all information on how to access and manipulate Dataset and DataArray objects at the <a href=http://xray.readthedocs.org/en/stable/>xarray site</a>.

In [3]:
X_ds = xr.open_dataset('data/train.nc')
y_array = np.load('data/train.npy')

xarray uses datetime64[ns] as its time type, which means that dates must be between 1678 and 2262. We convert whatever time type we have into datetime64[ns] starting at 1700. This only works if the monthly time series is shorter than 562 years, which is the case for all train and test time series, both in the starting kit and in the backend. This is important so that, e.g., grouping by month works correctly.

In [9]:
X_ds['time'] = pd.date_range(
    '1/1/1700', periods=X_ds['time'].shape[0], freq='M')\
    - np.timedelta64(15, 'D')

Printing it, you can see that it contains all the data, indices, and other metadata.

In [10]:
X_ds
Out[10]:
<xarray.Dataset>
Dimensions:  (lat: 37, lon: 72, time: 708)
Coordinates:
    height   float64 2.0
  * lat      (lat) float64 -90.0 -85.0 -80.0 -75.0 -70.0 -65.0 -60.0 -55.0 ...
  * lon      (lon) float64 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1700-01-16 1700-02-13 1700-03-16 ...
Data variables:
    tas      (time, lat, lon) float64 246.0 246.9 245.9 246.8 246.8 246.3 ...
Attributes:
    n_burn_in:    120
    n_lookahead:  6

y_array on the disk is already shifted by n_lookahead = 6 months. n_burn_in = 120 (months) is the length of the prefix for which no prediction is required. If your feature extractor uses at most these ten years of the past, you don't need to worry about missing data at the beginning of the sequence; otherwise you should handle the missing data at the beginning of the sequence "manually".
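To see how the burn-in and lookahead line up, here is one possible construction on a stand-in series. In the starting kit y_array is already shifted on disk, so this is purely illustrative of the alignment, not of how the kit builds its files:

```python
import numpy as np

n_burn_in, n_lookahead = 120, 6
raw = np.arange(1428, dtype=float)  # stand-in for the El Niño region mean series

# A prediction made at time t targets the value at t + n_lookahead;
# no prediction is required for the first n_burn_in time points.
valid_times = np.arange(n_burn_in, len(raw) - n_lookahead)
y = raw[valid_times + n_lookahead]
print(len(y), y[0])  # 1302 126.0
```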

In [12]:
y_array, y_array.shape
Out[12]:
(array([ 299.68909098,  299.62266947,  299.38398844,  299.41364237,
         299.74432983,  300.16651611,  300.3601237 ,  300.29920146,
         ..., 
         298.75179749,  298.79724223,  299.03482157,  299.15433655]), (708,))

The main data is in the tas ("temperature at surface") DataArray.

In [52]:
X_ds['tas']
Out[52]:
<xarray.DataArray 'tas' (time: 708, lat: 37, lon: 72)>
[1886112 values with dtype=float64]
Coordinates:
    height   float64 2.0
  * lat      (lat) float64 -90.0 -85.0 -80.0 -75.0 -70.0 -65.0 -60.0 -55.0 ...
  * lon      (lon) float64 2.5 7.5 12.5 17.5 22.5 27.5 32.5 37.5 42.5 47.5 ...
  * time     (time) datetime64[ns] 1700-01-16 1700-02-13 1700-03-16 ...

You can index it in the same way as a pandas or numpy array. The result is always a DataArray.

In [53]:
t = 123
lat = 13
lon = 29
X_ds['tas'][t]
X_ds['tas'][t, lat]
X_ds['tas'][t, lat, lon]
X_ds['tas'][:, lat, lon]
X_ds['tas'][t, :, lon]
X_ds['tas'][:, :, lon]
Out[53]:
<xarray.DataArray 'tas' (time: 708, lat: 37)>
array([[ 246.34985352,  253.71881104,  256.89190674, ...,  244.21122742,
         239.45343018,  237.40930176],
       [ 235.5670166 ,  243.0508728 ,  245.73396301, ...,  233.93948364,
         236.7585144 ,  236.36204529],
       [ 220.1075592 ,  228.37557983,  230.78536987, ...,  238.11174011,
         235.22380066,  234.21894836],
       ..., 
       [ 220.74520874,  225.72584534,  226.19856262, ...,  247.64707947,
         243.89697266,  242.41767883],
       [ 231.03648376,  237.40322876,  240.15187073, ...,  243.68629456,
         242.98490906,  243.19689941],
       [ 240.68638611,  247.16184998,  250.32196045, ...,  244.52952576,
         242.49557495,  240.67288208]])
Coordinates:
    height   float64 2.0
  * lat      (lat) float64 -90.0 -85.0 -80.0 -75.0 -70.0 -65.0 -60.0 -55.0 ...
    lon      float64 147.5
  * time     (time) datetime64[ns] 1700-01-16 1700-02-13 1700-03-16 ...
In [54]:
X_ds['tas'][:, lat, lon]
Out[54]:
<xarray.DataArray 'tas' (time: 708)>
array([ 298.28366089,  297.16641235,  293.81262207,  290.67932129,
        288.5574646 ,  286.10189819,  282.77111816,  287.3918457 ,
        ..., 
        288.62652588,  291.78424072,  293.10675049,  294.28988647])
Coordinates:
    height   float64 2.0
    lat      float64 -25.0
    lon      float64 147.5
  * time     (time) datetime64[ns] 1700-01-16 1700-02-13 1700-03-16 ...

You can convert any of these objects into a numpy array.

In [55]:
X_ds['tas'].values
X_ds['tas'][t].values
X_ds['tas'][t, lat].values
X_ds['tas'][t, lat, lon].values
Out[55]:
array(293.3496398925781)

You can also use slices, and slice bounds don't even have to be in the index arrays. The following function computes the target at time $t$. The input is a DataArray (3D panel) that contains the temperatures. We select the El Niño 3.4 region and take the mean temperatures, specifying that we are taking the mean over the spatial (lat and lon) coordinates. The output is a vector with the same length as the original time series.

In [56]:
en_lat_bottom = -5
en_lat_top = 5
en_lon_left = 360 - 170
en_lon_right = 360 - 120


def get_area_mean(tas, lat_bottom, lat_top, lon_left, lon_right):
    """The array of mean temperatures in a region at all time points."""
    return tas.loc[:, lat_bottom:lat_top, lon_left:lon_right].mean(dim=('lat','lon'))


def get_enso_mean(tas):
    """The array of mean temperatures in the El Nino 3.4 region at all time points."""
    return get_area_mean(tas, en_lat_bottom, en_lat_top, en_lon_left, en_lon_right)
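To make the box selection concrete, here is a plain-numpy sketch of what get_area_mean computes, on a hypothetical toy grid (the real code relies on xarray's label-based .loc slicing instead of boolean masks):

```python
import numpy as np

# Hypothetical toy grid (not the real data): 4 time points, lat in 5-degree
# steps, lon in 10-degree steps (degrees east, as in the dataset).
lats = np.arange(-10, 11, 5)      # [-10, -5, 0, 5, 10]
lons = np.arange(180, 251, 10)    # [180, 190, ..., 250]
rng = np.random.RandomState(0)
tas = rng.uniform(280, 300, size=(4, len(lats), len(lons)))

def area_mean_np(tas, lats, lons, lat_bottom, lat_top, lon_left, lon_right):
    """Mean over a lat/lon box at every time point (index-based analogue of
    get_area_mean; xarray's .loc does the label-based selection for us)."""
    lat_mask = (lats >= lat_bottom) & (lats <= lat_top)
    lon_mask = (lons >= lon_left) & (lons <= lon_right)
    box = tas[:, lat_mask, :][:, :, lon_mask]
    return box.mean(axis=(1, 2))

# El Nino 3.4 box: 5S-5N, 170W-120W, i.e. 190E-240E.
enso_np = area_mean_np(tas, lats, lons, -5, 5, 360 - 170, 360 - 120)
print(enso_np.shape)  # one mean temperature per time point: (4,)
```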

The following function plots the temperatures at a given $t$ (time_index).

In [57]:
el_nino_lats = [en_lat_bottom, en_lat_top, en_lat_top, en_lat_bottom]
el_nino_lons = [en_lon_right, en_lon_right, en_lon_left, en_lon_left]

from matplotlib.patches import Polygon

def plot_map(X_ds, time_index):
    def draw_screen_poly(lats, lons, m):
        x, y = m(lons, lats)
        xy = list(zip(x, y))
        poly = Polygon(xy, edgecolor='black', fill=False)
        plt.gca().add_patch(poly)

    lons, lats = np.meshgrid(X_ds['lon'], X_ds['lat'])

    fig = plt.figure()
    ax = fig.add_axes([0.05, 0.05, 0.9,0.9])
    map = Basemap(llcrnrlon=0, llcrnrlat=-89, urcrnrlon=360, urcrnrlat=89, projection='mill')
    # draw coastlines, country boundaries, fill continents.
    map.drawcoastlines(linewidth=0.25)
    #map.drawcountries(linewidth=0.25)
    #map.fillcontinents(color='coral',lake_color='aqua')
    # draw the edge of the map projection region (the projection limb)
    #map.drawmapboundary(fill_color='aqua')
    im = map.pcolormesh(
        lons, lats, X_ds[time_index] - 273.15, shading='flat', cmap=plt.cm.jet, latlon=True)
    cb = map.colorbar(im,"bottom", size="5%", pad="2%")
    draw_screen_poly(el_nino_lats, el_nino_lons, map)

    time_str = str(pd.to_datetime(str(X_ds['time'].values[time_index])))[:7]
    ax.set_title("Temperature map " + time_str)
    #plt.savefig("test_plot.pdf")
    plt.show()

Let's plot the temperature at a given time point. Feel free to change the time index and the season to explore visually the variability of the temperature map.


In [58]:
t = 16
plot_map(X_ds['tas'], t)

The target

The goal is to predict the temperature in the El Niño region, $\ell = 6$ months ahead

\begin{equation} y_t = \frac{1}{D} \sum_{(lat, lon) \in \cal{A}_{\rm{El\ Niño}}} z_{t + \ell}^{(lat, lon)} \end{equation}

where $\cal{A}_{\rm{El\ Niño}} = \{(lat, lon) : -5^\circ < lat < 5^\circ \wedge -170^\circ < lon < -120^\circ\}$ (i.e., $190^\circ$E to $240^\circ$E) is the index set of the El Niño 3.4 region, $D$ is the number of temperature measurements in the region, and $z_t^{(lat, lon)}$ is the temperature measured at time $t$, longitude $lon$ and latitude $lat$.
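Concretely, once the regional mean has been taken, the target is just the series shifted $\ell$ months ahead. A minimal sketch on a toy 1-D regional-mean series (synthetic data, for illustration only):

```python
import numpy as np

ell = 6  # lookahead in months

# Toy regional-mean series: 24 "months" of a seasonal cycle.
z = 290 + 3 * np.sin(2 * np.pi * np.arange(24) / 12)

# y_t = z_{t + ell}: the last ell time points have no target.
y = z[ell:]
assert y[0] == z[ell]  # the target at t=0 is the regional mean at t=6
print(len(y))  # → 18
```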

The first-order variation of the temperature comes from its regular yearly fluctuation. Climate scientists usually look at the temperature anomaly, not the raw temperature itself: more precisely, they subtract the monthly average from the temperature. We set up the RAMP to predict the raw temperature because 1) subtracting the monthly average is equivalent to adding the monthly average as an input feature to the regressor, and 2) we wanted to avoid computing the average on the full data since it would have violated causality (remember: you are not allowed to use the future, not even for computing averages). Nevertheless, it is interesting to look at the anomaly since you can compare it to plots produced by climate scientists.

The snippet also shows some powerful features of xarray (grouping by months, taking the groupby mean).

In [59]:
enso = get_enso_mean(X_ds['tas'])
enso_anomaly = enso.groupby('time.month') - enso.groupby('time.month').mean(dim='time')
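Under the hood, the groupby anomaly amounts to subtracting each month's climatological mean. A plain-numpy sketch of the same computation on synthetic data (for illustration; xarray's groupby handles the alignment for us):

```python
import numpy as np

# Synthetic monthly series: 10 years of a seasonal cycle plus noise.
rng = np.random.RandomState(42)
n_months = 120
months = np.arange(n_months) % 12
z = 290 + 3 * np.sin(2 * np.pi * months / 12) + rng.randn(n_months)

# Monthly climatology: the mean of all Januaries, all Februaries, ...
climatology = np.array([z[months == m].mean() for m in range(12)])

# Anomaly: each temperature minus the climatological mean of its month;
# by construction it averages to zero within every calendar month.
anomaly = z - climatology[months]
```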

We plot the anomaly over a fifty-year period. Conventionally, the El Niño/La Niña threshold is $0.5^\circ$ Celsius. The plot displays the warm periods (El Niño) in red and the cold periods (La Niña) in blue. To be precise, El Niño requires a period of three consecutive months with mean temperatures $0.5^\circ$ Celsius above the seasonal average, so not all colored periods qualify.

In [61]:
plt.clf()
xrange = np.arange(0, 600)
y = enso_anomaly[100:700]
x_limits = [min(xrange), max(xrange)]
y_limits = [-3.0, 3.0]
el_nino_threshold = 0.5
plt.xlim(x_limits)
plt.ylim(y_limits)
plt.plot(xrange, y, c='black')
plt.plot(x_limits, [el_nino_threshold, el_nino_threshold], c='r')
plt.plot(x_limits, [-el_nino_threshold, -el_nino_threshold], c='b')
plt.fill_between(xrange, el_nino_threshold, y, color='red', where=y>=el_nino_threshold, interpolate=True)
plt.fill_between(xrange, -el_nino_threshold, y, color='blue', where=y<=-el_nino_threshold)
# plt.savefig("anomaly.png")
Out[61]:
<matplotlib.collections.PolyCollection at 0x11a071610>

The cross-validation object

Cross-validating time-series predictors is tricky. We can't simply shuffle the observations $z_t =$ X_ds['tas'][t], since we would lose both causality and the correlation structure that follows the natural order.

To formalize the issue, let us first define formally the predictor that we will produce in the RAMP. Let the time series be $z_1, \ldots, z_T$ and let the target to predict at time $t$ be $y_t$. The target is usually (and in our case) a function of the future $z_{t+1}, \ldots$, but it can be anything else. We want to learn a function that predicts $y$ from the past, that is

\begin{equation} \hat{y}_t = f(z_1, ..., z_t) = f(Z_t) \end{equation}

where $Z_t = (z_1, ..., z_t)$ is the past. Now, the sample $(Z_t, y_t)$ is a regular (although non-iid) sample from the point of view of shuffling, so we could train on $\{Z_t, y_t\}_{t \in \cal{I}_{\text{train}}}$ and test on $(Z_t, y_t)_{t \in \cal{I}_{\text{test}}}$, where $\cal{I}_{\text{train}}$ and $\cal{I}_{\text{test}}$ are arbitrary but disjoint train and test index sets, respectively (typically produced by sklearn's ShuffleSplit). Shuffling would nevertheless allow a second-order leakage from training points to test points that precede them, e.g., by aggregating the training set and adding the aggregate back as a feature. To avoid this, we use block-CV: on each fold, all $t \in \cal{I}_{\text{test}}$ are larger than all $t \in \cal{I}_{\text{train}}$. We also make sure that all training and test sets contain consecutive observations, so recurrent nets and similar predictors, which rely on this, can be trained.
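An illustrative sketch of one block-CV scheme consistent with this description (the actual splitter lives in problem.py; the block sizes below mirror those reported by ramp_test_submission, but this is not the exact backend code):

```python
import numpy as np

def block_cv(n_common, n_validation, n_block):
    """Yield (train_is, test_is) pairs: each test block is a run of
    consecutive months, and every test index comes after every train index."""
    n_folds = n_validation // n_block
    for k in range(n_folds):
        test_start = n_common + k * n_block
        train_is = np.arange(0, test_start)                 # everything before
        test_is = np.arange(test_start, test_start + n_block)
        yield train_is, test_is

# 300-month common block, 288-month validation block, 36-month cv blocks
# -> 8 folds, as in the ramp_test_submission output below.
folds = list(block_cv(n_common=300, n_validation=288, n_block=36))
print(len(folds))  # → 8
```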

The training algorithm thus maps $(Z_t, y_t)_{t \in \cal{I}_{\text{train}}}$ to $f$. The point $Z_t$ contains the target for all training points $Z_{t'}$ with $t' \le t - 6$, so it is technically possible to cheat: when you receive a test set $z_1, ..., z_T$, you could look up the target of $z_t$ in $z_{t+6}$. To detect this (often inadvertent) cheating, we will check that your feature extractor is invariant to the future.

To allow a reasonably long past before making the first prediction, we strip the first $b = 120$ months (burn-in). You can of course use a longer window in your feature extractor, but in that case you will have to handle the missing time points at the beginning of the sequence.

The pipeline

We have factorized the pipeline into two steps. The first feature extractor $g$ transforms the past into a classical feature vector $x_t = g(Z_t)$, and the classical regressor $h$ predicts the target from the feature vector $\hat{y}_t = h(x_t)$. To summarize, the full predictor is a composition $f(Z_t) = h(g(Z_t))$. If you have a complex solution where this factorization does not make sense (e.g., RNNs), you can do all the work in the (optional) fit function of the feature extractor, send the prediction as a single feature $x_t$ to the regressor, and simply use an identity function in the regressor $\hat{y}_t = x_t$.
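The factorization $f(Z_t) = h(g(Z_t))$ can be sketched on a toy series. Everything here is hypothetical illustration (a lag-feature extractor for $g$ and plain least squares for $h$, not the starting-kit code):

```python
import numpy as np

rng = np.random.RandomState(0)
z = np.cumsum(rng.randn(200))   # toy time series standing in for the past
ell = 6                         # lookahead
y = z[ell:]                     # y_t = z_{t + ell}

def g(z):
    """Hypothetical feature extractor: current value and a 12-step lag."""
    return np.column_stack([z[12:-ell], z[:-12 - ell]])

def fit_h(X, y):
    """Hypothetical regressor: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ coef

X = g(z)                 # x_t = g(Z_t), one row per usable time point
h = fit_h(X, y[12:])     # align the targets with the feature rows
y_hat = h(X)             # the full predictor: f(Z_t) = h(g(Z_t))
```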

The feature extractor

The feature extractor implements a single transform function. As we explained above, it receives the full X_ds, including the burn-in period $b$, as an attribute. It should produce a feature matrix of length $T - b$, of type numpy array, representing the past vectors $(Z_{b+1}, \ldots, Z_T)$. For constructing/computing $x_t$, it can only use the past $Z_t = (z_1, \ldots, z_t) = $ X_ds['tas'][:t].

Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [24]:
%%file submissions/starting_kit/ts_feature_extractor.py
import numpy as np

en_lat_bottom = -5
en_lat_top = 5
en_lon_left = 360 - 170
en_lon_right = 360 - 120


def get_area_mean(tas, lat_bottom, lat_top, lon_left, lon_right):
    """The array of mean temperatures in a region at all time points."""
    return tas.loc[:, lat_bottom:lat_top, lon_left:lon_right].mean(
        dim=('lat', 'lon'))


def get_enso_mean(tas):
    """The array of mean temperatures in the El Nino 3.4 region.

    At all time points.
    """
    return get_area_mean(
        tas, en_lat_bottom, en_lat_top, en_lon_left, en_lon_right)


class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, X_ds):
        """Compute the El Nino mean at time t - (12 - X_ds.n_lookahead).

        Corresponding the month to be predicted.
        """
        # This is the range for which features should be provided. Strip
        # the burn-in from the beginning.
        valid_range = np.arange(X_ds.n_burn_in, len(X_ds['time']))
        enso = get_enso_mean(X_ds['tas'])
        # Roll the input series back so it corresponds to the month to be
        # predicted
        enso_rolled = np.roll(enso, 12 - X_ds.n_lookahead)
        # Strip burn in.
        enso_valid = enso_rolled[valid_range]
        # Reshape into a matrix of one column
        X_array = enso_valid.reshape((-1, 1))
        return X_array
Overwriting submissions/starting_kit/ts_feature_extractor.py

The regressor

The regressor should implement a scikit-learn-like regressor with fit and predict functions. The starting kit uses a linear model.

In [13]:
%%file submissions/starting_kit/regressor.py
from sklearn.base import BaseEstimator
from sklearn import linear_model


class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = linear_model.BayesianRidge()

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)
Overwriting submissions/starting_kit/regressor.py

Local testing (before submission)

It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit.

First, pip install ramp-workflow, or install it from the GitHub repo. Make sure that the python files ts_feature_extractor.py and regressor.py are in the submissions/starting_kit folder, and that the data files train.nc and test.nc are in data. Then run

ramp_test_submission

If it runs and prints training and test errors on each fold, then you can submit the code.

In [14]:
!ramp_test_submission
Testing El Nino forecast
Reading train and test files from ./data ...
Reading cv ...
length of common block: 300 months = 25 years
length of validation block: 288 months = 24 years
length of each cv block: 36 months = 3 years
Training ./submissions/starting_kit ...
CV fold 0
	train rmse = 0.952
	valid rmse = 0.79
	test rmse = 0.774
CV fold 1
	train rmse = 0.94
	valid rmse = 0.778
	test rmse = 0.775
CV fold 2
	train rmse = 0.917
	valid rmse = 0.799
	test rmse = 0.774
CV fold 3
	train rmse = 0.887
	valid rmse = 0.842
	test rmse = 0.774
CV fold 4
	train rmse = 0.92
	valid rmse = 0.714
	test rmse = 0.776
CV fold 5
	train rmse = 0.902
	valid rmse = 0.746
	test rmse = 0.776
CV fold 6
	train rmse = 0.877
	valid rmse = 0.854
	test rmse = 0.775
CV fold 7
	train rmse = 0.875
	valid rmse = 0.844
	test rmse = 0.776
----------------------------
train rmse = 0.909 ± 0.0267
valid rmse = 0.796 ± 0.0466
test rmse = 0.775 ± 0.0009
----------------------------
Testing if the notebook can be converted to html
[NbConvertApp] Converting notebook ./el_nino_starting_kit.ipynb to html
[NbConvertApp] Writing 430875 bytes to ./el_nino_starting_kit.html

Alternatively, load and execute rampwf.utils.testing.py, and call assert_submission. This may be useful if you would like to understand how we instantiate the workflow, the scores, the data connectors, and the cross validation scheme defined in problem.py, and how we insert and train/test your submission.

In [21]:
# %load https://raw.githubusercontent.com/paris-saclay-cds/ramp-workflow/master/rampwf/utils/testing.py
In [ ]:
# assert_submission()

Submitting to ramp.studio

Once you have found a good model, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up, otherwise log in. Then find an open event on the particular problem, for example the event el nino for this RAMP, and sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) ts_feature_extractor.py and regressor.py from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions locally: add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., and check them with ramp_test_submission. The script prints the mean cross-validation scores

----------------------------
train rmse = 0.909 ± 0.0267
valid rmse = 0.796 ± 0.0466
test rmse = 0.775 ± 0.0009

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is root mean squared error ("rmse"), so the line that is relevant in the output of ramp_test_submission is valid rmse = 0.796 ± 0.0466. When the score is good enough, you can submit it at the RAMP.

Other models in the starting kit

To get you started, we made several other example submissions.

This one uses the whole temperature field at time $t$ as the feature vector.

In [27]:
%%file submissions/whole_field/ts_feature_extractor.py
import numpy as np


class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, X_ds):
        """Compute the whole field at time t."""
        # This is the range for which features should be provided. Strip
        # the burn-in from the beginning.
        valid_range = np.arange(X_ds.n_burn_in, len(X_ds['time']))
        # Take the whole temperature field.
        all = X_ds['tas'].values
        # Vectorize it to obtain a single feature vector at time t.
        vectorized = all.reshape(len(all), -1)
        # Strip burn-in.
        X_array = vectorized[valid_range]
        return X_array
Overwriting submissions/whole_field/ts_feature_extractor.py

You can test this by

In [26]:
!ramp_test_submission --submission whole_field
Testing El Nino forecast
Reading train and test files from ./data ...
Reading cv ...
length of common block: 300 months = 25 years
length of validation block: 288 months = 24 years
length of each cv block: 36 months = 3 years
Training ./submissions/whole_field ...
CV fold 0
	train rmse = 0.199
	valid rmse = 0.629
	test rmse = 0.624
CV fold 1
	train rmse = 0.213
	valid rmse = 0.63
	test rmse = 0.586
CV fold 2
	train rmse = 0.227
	valid rmse = 0.608
	test rmse = 0.553
CV fold 3
	train rmse = 0.224
	valid rmse = 0.624
	test rmse = 0.534
CV fold 4
	train rmse = 0.249
	valid rmse = 0.578
	test rmse = 0.542
CV fold 5
	train rmse = 0.165
	valid rmse = 0.649
	test rmse = 0.638
CV fold 6
	train rmse = 0.237
	valid rmse = 0.592
	test rmse = 0.542
CV fold 7
	train rmse = 0.243
	valid rmse = 0.705
	test rmse = 0.533
----------------------------
train rmse = 0.22 ± 0.0256
valid rmse = 0.627 ± 0.0363
test rmse = 0.569 ± 0.0393
----------------------------
Testing if the notebook can be converted to html
[NbConvertApp] Converting notebook ./el_nino_starting_kit.ipynb to html
[NbConvertApp] Writing 444906 bytes to ./el_nino_starting_kit.html

This one computes two features: the anomaly with respect to the seasonal average corresponding to the current month, and the seasonal average corresponding to the month to be predicted.

In [15]:
%%file submissions/seasonal_anomalies/ts_feature_extractor.py
import numpy as np

en_lat_bottom = -5
en_lat_top = 5
en_lon_left = 360 - 170
en_lon_right = 360 - 120


def get_area_mean(tas, lat_bottom, lat_top, lon_left, lon_right):
    """The array of mean temperatures in a region at all time points."""
    return tas.loc[:, lat_bottom:lat_top, lon_left:lon_right].mean(
        dim=('lat', 'lon'))


def get_enso_mean(tas):
    """The array of mean temperatures in the El Nino 3.4 region.

    At all time points.
    """
    return get_area_mean(
        tas, en_lat_bottom, en_lat_top, en_lon_left, en_lon_right)


class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, X_ds):
        """Anomaly and running mean.

        Compute the El Nino running mean at time t - (12 - X_ds.n_lookahead),
        corresponding to the month to be predicted, and the anomaly at time t
        (the difference between the temperature and the monthly running mean).

        The code is short but inefficient.
        """
        n_lookahead = X_ds.n_lookahead
        n_burn_in = X_ds.n_burn_in
        enso = get_enso_mean(X_ds['tas'])
        # The running monthly mean at time t (for every month).
        # It is using xarray's groupby at every t,
        # so its running time is O(T^2). In principle it can be computed in
        # O(T).
        running_monthly_means = np.array([
            enso.isel(time=slice(None, t)).groupby('time.month').mean(
                dim='time')
            for t in range(X_ds.n_burn_in, len(X_ds['tas']))])
        # The running monthly mean at time t
        # (corresponding to the current month).
        running_monthly_means_t0 = np.array(
            [running_monthly_means[t, t % 12]
             for t in range(len(running_monthly_means))])
        # The running monthly mean at time t
        # (corresponding to the month to be predicted).
        running_monthly_means_lookahead = np.array(
            [running_monthly_means[t, (t + n_lookahead) % 12]
             for t in range(len(running_monthly_means))])
        # The temperature anomaly at t0
        anomalys = enso.values[n_burn_in:] - running_monthly_means_t0
        X_array = np.array([running_monthly_means_lookahead, anomalys]).T
        return X_array
Overwriting submissions/seasonal_anomalies/ts_feature_extractor.py

You can test this by

In [65]:
!ramp_test_submission --submission seasonal_anomalies
Testing El Nino forecast
Reading train and test files from ./data ...
Reading cv ...
length of common block: 300 months = 25 years
length of validation block: 288 months = 24 years
length of each cv block: 36 months = 3 years
Training ./submissions/seasonal_anomalies ...
CV fold 0
	train rmse = 0.76
	valid rmse = 0.653
	test rmse = 0.643
CV fold 1
	train rmse = 0.754
	valid rmse = 0.638
	test rmse = 0.638
CV fold 2
	train rmse = 0.74
	valid rmse = 0.647
	test rmse = 0.638
CV fold 3
	train rmse = 0.714
	valid rmse = 0.69
	test rmse = 0.638
CV fold 4
	train rmse = 0.744
	valid rmse = 0.575
	test rmse = 0.637
CV fold 5
	train rmse = 0.728
	valid rmse = 0.599
	test rmse = 0.637
CV fold 6
	train rmse = 0.708
	valid rmse = 0.697
	test rmse = 0.637
CV fold 7
	train rmse = 0.703
	valid rmse = 0.746
	test rmse = 0.638
----------------------------
train rmse = 0.731 ± 0.0201
valid rmse = 0.656 ± 0.0516
test rmse = 0.638 ± 0.002
----------------------------
Testing if the notebook can be converted to html
[NbConvertApp] Converting notebook ./el_nino_starting_kit.ipynb to html
[NbConvertApp] Writing 434022 bytes to ./el_nino_starting_kit.html

The illegal lookahead check

Since the feature extractor receives the full data series, which contains the target, there is no technical constraint against cheating, that is, using information coming after the present (six months before the target). We can nevertheless check whether the feature $x_t$ changes if we alter the future $(z_{t+1}, \ldots)$. The check is implemented both in the unit test (so you can verify that your submission is valid) and at our backend (so you will receive a similar error message on the leaderboard if your model doesn't pass the test).
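One way such an invariance check can be implemented (an illustrative sketch, not the exact rampwf code): perturb the series beyond each checked time point, re-extract the features, and verify that the features up to that point are unchanged.

```python
import numpy as np

def transform(z, n_burn_in):
    """A legal feature extractor: x_t is just z_t (no future access)."""
    return z[n_burn_in:].reshape(-1, 1)

def cheating_transform(z, n_burn_in):
    """A buggy extractor that looks 6 steps into the future."""
    return np.roll(z, -6)[n_burn_in:].reshape(-1, 1)

def check_no_lookahead(transform, z, n_burn_in, n_lookahead=6):
    """Corrupt the future after each time point t and verify that the
    features up to t do not change."""
    X_ref = transform(z, n_burn_in)
    for t in range(n_burn_in, len(z) - n_lookahead):
        z_altered = z.copy()
        z_altered[t + 1:] += 100.0        # corrupt the future
        X_alt = transform(z_altered, n_burn_in)
        if not np.allclose(X_ref[:t - n_burn_in + 1],
                           X_alt[:t - n_burn_in + 1]):
            return False                  # the extractor looked ahead
    return True

rng = np.random.RandomState(0)
z = rng.randn(50)
print(check_no_lookahead(transform, z, n_burn_in=10))           # → True
print(check_no_lookahead(cheating_transform, z, n_burn_in=10))  # → False
```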

Of course, you have no reason to deliberately cheat, but it is possible that you accidentally introduce a bug. For example, the code below rolls the features in the wrong way (forward instead of backward):

enso_rolled = np.roll(enso, -X_ds.n_lookahead)

In [67]:
%%file submissions/illegal_lookahead/ts_feature_extractor.py
import numpy as np

en_lat_bottom = -5
en_lat_top = 5
en_lon_left = 360 - 170
en_lon_right = 360 - 120


def get_area_mean(tas, lat_bottom, lat_top, lon_left, lon_right):
    """The array of mean temperatures in a region at all time points."""
    return tas.loc[:, lat_bottom:lat_top, lon_left:lon_right].mean(
        dim=('lat', 'lon'))


def get_enso_mean(tas):
    """The array of mean temperatures in the El Nino 3.4 region.

    At all time points.
    """
    return get_area_mean(
        tas, en_lat_bottom, en_lat_top, en_lon_left, en_lon_right)


class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, X_ds):
        """Compute the El Nino mean at time t - (12 - X_ds.n_lookahead).

        Corresponding the month to be predicted.
        """
        # This is the range for which features should be provided. Strip
        # the burn-in from the beginning.
        valid_range = np.arange(X_ds.n_burn_in, len(X_ds['time']))
        enso = get_enso_mean(X_ds['tas'])
        # Roll the input series forward ("by accident" :)) so it uses the
        # target as input.
        enso_rolled = np.roll(enso, -X_ds.n_lookahead)
        # Strip burn in.
        enso_valid = enso_rolled[valid_range]
        # Reshape into a matrix of one column
        X_array = enso_valid.reshape((-1, 1))
        return X_array
Overwriting submissions/illegal_lookahead/ts_feature_extractor.py

Check what happens when the unit test is run.

In [66]:
!ramp_test_submission --submission illegal_lookahead
Testing El Nino forecast
Reading train and test files from ./data ...
Reading cv ...
length of common block: 300 months = 25 years
length of validation block: 288 months = 24 years
length of each cv block: 36 months = 3 years
Training ./submissions/illegal_lookahead ...
Traceback (most recent call last):
  File "/Users/kegl/anaconda/bin/ramp_test_submission", line 11, in <module>
    load_entry_point('ramp-workflow', 'console_scripts', 'ramp_test_submission')()
  File "/Users/kegl/Research/RAMP/ramp-workflow/rampwf/utils/command_line.py", line 51, in ramp_test_submission
    submission=sub)
  File "/Users/kegl/Research/RAMP/ramp-workflow/rampwf/utils/testing.py", line 83, in assert_submission
    module_path, X_train, y_train, train_is=train_is)
  File "/Users/kegl/Research/RAMP/ramp-workflow/rampwf/workflows/el_nino.py", line 32, in train_submission
    ts_fe, X_train_ds)
  File "/Users/kegl/Research/RAMP/ramp-workflow/rampwf/workflows/ts_feature_extractor.py", line 59, in test_submission
    raise AssertionError(message)
AssertionError: The feature extractor looks into the future by at least 6 time steps

More information

You can find more information in the README of the ramp-workflow library.

Contact

Don't hesitate to contact us.