El Niño forecast, single-day RAMP at Climate Informatics Workshop 2015; Saclay Data Camp 2016/17


A climate index is a real-valued time series that has been designated as being of interest in the climate literature. For example, the El Niño–Southern Oscillation (ENSO) index is widely used for predicting regional and seasonal conditions, as it tends to have a strong (positive or negative) correlation with a variety of weather conditions and extreme events throughout the globe. The ENSO index is just one of many climate indices studied. However, there is currently significant room for improvement in predicting even this extremely well-studied index with such high global impact. For example, most statistical and climatological models erred significantly in their predictions of the 2015 El Niño event; their predictions were off by several months. Better tools to predict such indices are critical for seasonal and regional climate prediction, and would thus address grand challenges in the study of climate change (World Climate Research Programme: Grand Challenges, 2013).

El Niño

El Niño (La Niña) is a phenomenon in the equatorial Pacific Ocean characterized by five consecutive overlapping 3-month running means of sea surface temperature (SST) anomalies in the Niño 3.4 region that are above (below) the threshold of $+0.5^\circ$C ($-0.5^\circ$C). This standard of measure is known as the Oceanic Niño Index (ONI).
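As an illustration, the ONI definition above can be sketched in a few lines of pandas. The SST anomaly series here is synthetic, purely for demonstration; the rolling mean and run-length logic are the part that mirrors the definition:

```python
import numpy as np
import pandas as pd

# Synthetic monthly SST anomalies (deg C) in the Nino 3.4 region:
# a slow oscillation plus small noise, for illustration only.
rng = np.random.default_rng(0)
sst_anom = pd.Series(np.sin(np.arange(60) * 2 * np.pi / 48) * 1.2
                     + rng.normal(0, 0.1, 60))

# ONI: 3-month running mean of the SST anomalies.
oni = sst_anom.rolling(window=3, center=True).mean()

# El Nino conditions: ONI above +0.5 deg C for at least 5 consecutive
# overlapping 3-month periods (La Nina would use oni < -0.5).
warm = oni > 0.5
run_len = warm.groupby((warm != warm.shift()).cumsum()).transform("size")
el_nino_months = warm & (run_len >= 5)
```

The same run-length trick with `oni < -0.5` flags La Niña months.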

More information on why this region is important, and on the history of the index, can be found here. The current ENSO predictions, updated monthly, are available here.

The CCSM4 simulator

Our data comes from the CCSM4.0 model (simulator). This gives us access to a full, regular temperature map for a 500+ year period, which makes the evaluation of the predictor more robust than if we used real measurements.

The data

The data is a time series of "images" $z_t$, consisting of temperature measurements (for a technical reason we will work with air temperature rather than SST) on a regular grid on the Earth, indexed by lon(gitude) and lat(itude) coordinates. Latitude and longitude are sampled at a resolution of $5^\circ$, giving 37 latitude and 72 longitude grid points, i.e., $37 \times 72 = 2664$ temperature measurements at every time step. The average temperatures are recorded every month: 119 years (1428 time points) in the public training data (available in the starting kit), 155 years (1860 time points) in the training data (public leaderboard), and 500 years (6000 time points) in the test data (private leaderboard). The goal is to predict the average temperature in the El Niño region, 6 months ahead. Note that the data set given in the starting kit is different from the one used to evaluate your submissions (of course, the data structures and the generative model (simulator) will be identical), so your submission should be generic (for example, it should be able to handle a time series of a different length).
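To make the setup concrete, here is a minimal sketch of the data layout and the 6-months-ahead target. The array contents are random placeholders, and the Niño 3.4 grid indices (`enso_lat`, `enso_lon`) are illustrative assumptions, not the actual challenge layout:

```python
import numpy as np

# Stand-in for the monthly air-temperature series on the 5-degree grid:
# shape (n_time, n_lat, n_lon) = (1428, 37, 72). Values are random here.
n_time, n_lat, n_lon = 1428, 37, 72
rng = np.random.default_rng(42)
z = rng.normal(300.0, 5.0, size=(n_time, n_lat, n_lon))

# The El Nino region maps to some block of grid indices; these slices
# are hypothetical placeholders chosen only for illustration.
enso_lat = slice(17, 20)
enso_lon = slice(38, 49)

# Regional mean temperature at each time step, and the 6-months-ahead
# target: time point t must predict the regional mean at t + 6.
lookahead = 6
regional_mean = z[:, enso_lat, enso_lon].mean(axis=(1, 2))
usable_t = np.arange(n_time - lookahead)   # time points with a target
y_future = regional_mean[lookahead:]       # aligned 6-month-ahead targets
```

The alignment is the important part: the last 6 time points have no target, so 1428 monthly fields yield 1422 training pairs.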

The prediction task

The pipeline consists of a feature extractor and a predictor. Since the task is regression, the predictor will be a regressor, and the score (to minimize) will be the root mean square error. The feature extractor will have access to the whole data. It will construct a "classical" feature matrix in which each row corresponds to a time point. You should collect into these features all information that you find relevant to the regressor. The feature extractor can take anything from the past, that is, it will implement a function $x_t = f(z_1, \ldots, z_t)$. Since you will have access to the full data, in theory you can cheat (even inadvertently) by using information from the future. We have implemented a randomized test to detect such submissions, but please do your best to avoid this, since it would make the results irrelevant.
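A minimal causal feature extractor, i.e., one that implements $x_t = f(z_1, \ldots, z_t)$ using only present and past fields, could look like the sketch below. The lag scheme and the padding of the first few rows are illustrative choices, not part of the challenge specification:

```python
import numpy as np

def make_features(z, n_lags=3):
    """Build a feature matrix whose row t uses only z[0..t].

    z : array of shape (n_time, n_lat, n_lon)
    Returns X of shape (n_time, n_lags * n_lat * n_lon); early rows
    are padded by repeating the earliest available field, so no row
    ever reads information from the future.
    """
    n_time = z.shape[0]
    flat = z.reshape(n_time, -1)
    rows = []
    for t in range(n_time):
        # Indices t, t-1, ..., t-n_lags+1, clipped at 0: all <= t.
        idx = [max(0, t - k) for k in range(n_lags)]
        rows.append(flat[idx].ravel())
    return np.asarray(rows)

z = np.random.default_rng(0).normal(size=(10, 4, 5))
X = make_features(z, n_lags=3)
```

Row 0 can only repeat the first field three times, which is exactly the causality constraint the randomized test checks for.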

Domain-knowledge suggestions

You are of course free to explore any regression technique to improve the prediction. Since the input dimension is relatively large (2000+ dimensions per time point even after subsampling), sparse regression techniques (e.g., LASSO) may be the best way to go, but this is just an a priori suggestion. The following list provides further hints to start with, based on domain knowledge.
  • There is a strong seasonal cycle that must be taken into account.
  • There is little scientific/observational evidence that regions outside the Pacific play a role in NINO3.4 variability, so it is probably best to focus on Pacific SST for predictions.
  • The relation between tropical and extra-tropical Pacific SST is very unclear, so please explore!
  • The NINO3.4 index has an oscillatory character (cold followed by warm followed by cold), but this pattern does not repeat exactly. It would be useful to be able to predict periods when the oscillation is “strong” and when it “breaks down.”
  • A common shortcoming of empirical predictions is that they under-predict the amplitude of warm and cold events. Can this be improved?
  • There is evidence that the predictability is low when forecasts start in, or cross over, March and April (the so-called “spring barrier”). Improving predictions through the spring barrier would be important.
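Putting two of the hints together, here is a sketch of removing the seasonal cycle via a per-month climatology and then fitting a sparse LASSO regressor on the anomalies. The data are synthetic and the hyperparameter `alpha=0.05` is an arbitrary illustrative choice, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_months, n_features = 240, 50

# Synthetic monthly predictors, plus a target built from a seasonal
# cycle, one truly informative feature, and noise.
X = rng.normal(size=(n_months, n_features))
month = np.arange(n_months) % 12
seasonal = 2.0 * np.sin(2 * np.pi * month / 12)
y = seasonal + 1.5 * X[:, 0] + rng.normal(0, 0.1, n_months)

# Remove the seasonal cycle: subtract each month's climatological mean.
clim = np.array([y[month == m].mean() for m in range(12)])
y_anom = y - clim[month]

# Fit a sparse regressor on the de-seasonalized target; most of the
# 50 coefficients should be driven to exactly zero.
model = Lasso(alpha=0.05).fit(X, y_anom)
n_active = int((model.coef_ != 0).sum())
```

In a real submission the climatology would be estimated on training data only and the remaining anomalies would be fed to the 6-months-ahead regressor.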
Submission rules

  • Submissions will open at (UTC) 2000-01-01 00:00:00.
  • When you submit, your submission is sent to be trained automatically. The jobs may wait some time in a queue before being run, so be patient.
  • Pending (untrained) and failing submissions can be resubmitted under the same name at an arbitrary frequency.
  • Once your submission is trained, it cannot be deleted or replaced.
  • After each successfully trained submission, you have to wait 900 s before resubmitting.
  • The leaderboard is in "hidden" mode until (UTC) 2016-12-16 08:00:00, which means that all scores are visible, but the links pointing to the participants' code are hidden. After (UTC) 2016-12-16 08:00:00, all submitted code is public. You are encouraged to look at and reuse each other's code.