Arctic sea ice forecast, Polytechnique MAP583 2016/17
Balázs Kégl (CNRS), Camille Marini (CNRS), Andy Rhines (UW), Jennifer Dy (NEU), Arindam Banerjee (UMN)


Arctic sea ice cover is one of the most variable features of Earth's climate. Its annual cycle peaks at around 15 million square kilometers in early spring, melting back to a minimum of about 6 million square kilometers in September. These seasonal swings are important for Earth's energy balance, as ice reflects the majority of sunlight while open water absorbs it. Changes in ice cover are also important for marine life and navigation for shipping.

In recent years, Arctic sea ice cover has declined rapidly, particularly during the September minimum. These changes have outpaced the predictions of climate models, and forecasting sea ice extent remains a formidable challenge. Typically, skillful predictions are limited to about 2-5 months in advance (Stroeve et al., "Improving Predictions of Arctic Sea Ice Extent"), while idealized experiments suggest that predictions up to two years in advance should be possible (Guemas et al., 2014).

Better tools to predict ice cover are critical for seasonal and regional climate prediction, and would thus address grand challenges in the study of climate change (World Climate Research Programme: Grand Challenges, 2013).

The CCSM4 simulator

As a surrogate for observational data, we will use output from a 1300-year simulation using the NCAR CCSM4.0 climate model. The model was run in fully-coupled mode with interactive ocean, atmosphere, and sea ice. The simulation was also performed in an idealized "Pre-Industrial" mode, where greenhouse gas concentrations and other external forcings are held fixed at 1850 levels. This gives us access to a stationary climate over a 1000+ year period, which makes the evaluation of the predictor more robust than if we used real measurements, which are both non-stationary and limited to several decades.

The data

The data is a time series of "images" $z_t$, consisting of different physical variables on a regular grid on the Earth, indexed by lon(gitude) and lat(itude) coordinates. The variables we have made available are:
  • ice_area --- the Northern Hemisphere sea ice area, in millions of square kilometers.
  • ts --- surface temperature, most important over the oceans which have a very high heat capacity.
  • taux --- zonal (x-direction) surface wind stress. This is the frictional effect of winds on the sea surface and sea ice.
  • tauy --- meridional (y-direction) surface wind stress.
  • ps --- surface pressure.
  • psl --- equivalent sea-level surface pressure. This corrects ps for the effects of topography, though the two should be very similar.
  • shflx --- surface sensible heat flux, the amount of heat transferred from the surface to the atmosphere.
  • cldtot --- total cloud cover (fractional), which has strong effects on radiative energy balance at the surface.
The fields are recorded every month for 1300 years, giving 15,600 time points. The goal is to predict the Northern Hemisphere sea ice area 4 months ahead. Since the most important prediction is the minimum area in September, we will also display the RMSE over the predictions made in May, which predict that year's (minimum) ice area in September.
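
The two scores described above can be sketched as follows. This is a minimal illustration, not the organizers' scoring code: the synthetic ice_area series, the helper names rmse and make_targets, the assumption that time index 0 is a January, and the "same month one year before the target" baseline are all hypothetical choices made for the example.

```python
import numpy as np

N_LOOKAHEAD = 4   # forecast horizon in months, as stated in the text
MAY = 4           # ASSUMPTION: index 0 is a January, so May has month index 4


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def make_targets(ice_area, n_lookahead=N_LOOKAHEAD):
    """Return forecast times t and the targets ice_area[t + n_lookahead]."""
    ice_area = np.asarray(ice_area, dtype=float)
    t = np.arange(len(ice_area) - n_lookahead)
    return t, ice_area[t + n_lookahead]


# Synthetic stand-in for ice_area: a seasonal cycle plus noise (not real data).
rng = np.random.default_rng(0)
months = np.arange(1300 * 12)
ice_area = (10.5 + 4.5 * np.cos(2 * np.pi * (months - 2) / 12)
            + rng.normal(0.0, 0.3, months.size))

t, y_true = make_targets(ice_area)

# Hypothetical baseline: predict the value observed in the same calendar
# month one year before the target, i.e. ice_area[t + 4 - 12] = ice_area[t - 8].
# This only uses data from before t, and needs t >= 8.
shift = 12 - N_LOOKAHEAD
valid = t >= shift
y_pred = ice_area[t[valid] - shift]

score_all = rmse(y_true[valid], y_pred)

# Forecasts issued in May target September (month index 8).
is_may = (t[valid] % 12) == MAY
score_may = rmse(y_true[valid][is_may], y_pred[is_may])

print("RMSE over all months:", score_all)
print("RMSE for May forecasts of September:", score_may)
```

With monthly data, a 4-month horizon leaves 15,600 - 4 = 15,596 usable (forecast time, target) pairs, and the May subset contains one forecast per simulated year.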

The prediction task

The pipeline will consist of a time series feature extractor and a predictor. Since the task is regression, the predictor will be a regressor, and the score (to minimize) will be the root mean square error. The feature extractor will have access to the whole data. It will construct a "classical" feature matrix where each row corresponds to a time point. You should collect into these features all the information that you find relevant to the regressor. The feature extractor can use anything from the past, that is, it will implement a function $x_t = f(z_1, \ldots, z_t)$. Since you will have access to the full data, in theory you can cheat (even inadvertently) by using information from the future. We have implemented a randomized test to find such "bugs", but please do your best to avoid this, since it would make the results irrelevant.
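
To make the constraint $x_t = f(z_1, \ldots, z_t)$ concrete, here is a minimal sketch of a past-only feature extractor together with one possible randomized check for lookahead. It is not the challenge's actual interface or test: the function names extract_features, leaky_features and uses_future are hypothetical, and $z_t$ is simplified to a single scalar series for the illustration.

```python
import numpy as np


def extract_features(series, n_lags=3):
    """Past-only features: row t holds series[t], series[t-1], ..., series[t-n_lags+1].

    Entries that would need data from before the start of the series are NaN.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.full((n, n_lags), np.nan)
    for lag in range(n_lags):
        X[lag:, lag] = series[:n - lag]
    return X


def leaky_features(series):
    """Deliberately wrong extractor: row t also contains series[t + 1]."""
    series = np.asarray(series, dtype=float)
    return np.column_stack([series, np.roll(series, -1)])


def uses_future(extractor, series, n_trials=20, seed=0):
    """Randomized lookahead check (an illustrative sketch).

    Pick a random time t, perturb every value after t, and recompute the
    features. If any row up to and including t changes, the extractor
    depends on the future.
    """
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    reference = extractor(series)
    for _ in range(n_trials):
        t = int(rng.integers(0, len(series) - 1))
        perturbed = series.copy()
        perturbed[t + 1:] += rng.normal(size=len(series) - t - 1)
        changed = extractor(perturbed)
        if not np.array_equal(reference[:t + 1], changed[:t + 1], equal_nan=True):
            return True
    return False


demo = np.sin(np.arange(240) * 2 * np.pi / 12)  # synthetic monthly series
print("lagged features use the future:", uses_future(extract_features, demo))
print("leaky features use the future:", uses_future(leaky_features, demo))
```

A check of this kind can only detect leaks, not prove their absence, so it complements careful indexing rather than replacing it.
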
  • Submissions will open at (UTC) 2000-01-01 00:00:00
  • When you submit, your submission is sent to be trained automatically. Jobs may wait some time in a queue before being run, so be patient.
  • Pending (untrained) and failing submissions can be resubmitted under the same name at an arbitrary frequency.
  • Once your submission is trained, it cannot be deleted or replaced.
  • After each successfully trained submission, you have to wait 900s to resubmit.
  • The leaderboard is in "hidden" mode until (UTC) 2017-02-13 19:00:00, which means that all scores are visible, but the links pointing to the code of the participants are hidden. After (UTC) 2017-02-13 19:00:00, all submitted code is public. You will be encouraged to look at and reuse each other's code.