RAMP on detecting Solar storms¶

Joris van den Bossche, Gautier Nguyen, Nicolas Aunai & Balazs Kegl

Table of Contents¶

1  RAMP on detecting Solar storms
1.1  Introduction
1.2  Getting started with the RAMP starting kit
1.2.1  Software prerequisites
1.2.2  Getting the data
1.3  The data
1.3.1  An example ICME "solar storm" event
1.3.2  The duration of events
1.3.3  Number of events show a cycle
1.3.4  Imbalance between solar wind and solar storm
1.3.5  Testing data
1.4  Workflow
1.4.1  The model to submit
1.4.2  Evaluation
1.4.3  Evaluation with Cross-Validation
1.5  Submitting to the online challenge: ramp.studio
1.6  More information
1.7  Questions

Introduction¶

Interplanetary Coronal Mass Ejections (ICMEs) result from magnetic instabilities occurring in the Sun's atmosphere. They interact with the planetary environment and may trigger intense internal activity such as strong particle acceleration, so-called geomagnetic storms, and geomagnetically induced currents. These effects have serious consequences for space- and ground-based technologies, and understanding them is part of the discipline known as space weather.

ICME signatures measured by in-situ spacecraft appear as patterns in time series of the magnetic field, particle density, bulk velocity, temperature, etc. Although easily visible to expert eyes, these patterns have quite variable characteristics, which makes naive automation of their detection difficult.

The goal of this RAMP is to detect Interplanetary Coronal Mass Ejections (ICMEs) in the data measured by in-situ spacecraft.

ICMEs are the interplanetary counterpart of Coronal Mass Ejections (CMEs): expulsions of large quantities of plasma and magnetic field resulting from magnetic instabilities in the Sun's atmosphere (Kilpua et al. (2017) and references therein). They travel at several hundred to a few thousand kilometers per second and, if Earth lies in their path, can reach it in 2-4 days.

To overcome the difficulty of automating their detection, Lepping et al. (2005) proposed an automatic detection method based on manually set thresholds on a set of physical parameters. However, the method detected only 60% of the ICMEs, with a high rate of false positives (60%). Moreover, because of the subjectivity introduced by the manually set thresholds, the method struggled to produce a reproducible and consistent ICME catalog.

This challenge proposes to design the best possible algorithm to detect ICMEs, starting from the most complete ICME catalog available, which contains 657 events. Participants receive a subset of this large dataset to develop and calibrate their algorithms: in-situ measurements from the WIND spacecraft between 1997 and 2016, resampled to a 10-minute resolution, to which we added three features that have proved useful in the visual identification of ICMEs. Using an appropriate metric, the predictions are compared to the true catalog. The goal is to produce an ICME catalog containing less than 10% false positives while recovering as many of the existing events as possible.

Formally, each instance consists of measurements of various physical parameters in the interplanetary medium. The training set contains the measurements from 1997 to 2010, together with the beginning and ending dates, $t_{start}$ and $t_{end}$, of the 438 ICMEs observed in this period.

To download and run this notebook: download the full starting kit, with all the necessary files.

Getting started with the RAMP starting kit¶

Software prerequisites¶

This starting kit requires the following dependencies:

  • numpy
  • pandas
  • pyarrow
  • scikit-learn
  • matplotlib
  • jupyter
  • imbalanced-learn

We recommend installing these using conda (with the Anaconda distribution).
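
For example, assuming you use the conda-forge channel (the package names below are the usual conda-forge names, but the channel and names may differ in your setup), the dependencies can be installed with:

conda install -c conda-forge numpy pandas pyarrow scikit-learn matplotlib jupyter imbalanced-learn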

In addition, ramp-workflow is needed. This can be installed from the master branch on GitHub:

python -m pip install https://api.github.com/repos/paris-saclay-cds/ramp-workflow/zipball/master

Getting the data¶

The public train and test data can be downloaded by running the following command from the root of the starting kit:

python download_data.py
In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The data¶

We start by inspecting the training data:

In [2]:
from problem import get_train_data

data_train, labels_train = get_train_data()
In [3]:
data_train.head()
Out[3]:
B Bx Bx_rms By By_rms Bz Bz_rms Na_nl Np Np_nl ... Range F 8 Range F 9 V Vth Vx Vy Vz Beta Pdyn RmsBob
1997-10-01 00:00:00 6.584763 3.753262 2.303108 0.966140 2.602693 -5.179685 2.668414 2.290824 23.045732 24.352797 ... 2.757919e+09 2.472087e+09 378.313934 80.613098 -351.598389 -138.521454 6.956387 7.641340 5.487331e-15 0.668473
1997-10-01 00:10:00 6.036456 0.693559 1.810752 -0.904843 2.165570 -1.944006 2.372931 2.119593 23.000492 20.993362 ... 3.365612e+09 3.087122e+09 350.421021 69.919327 -331.012146 -110.970787 -21.269474 9.149856 4.783776e-15 0.753848
1997-10-01 00:20:00 5.653682 -4.684786 0.893058 -2.668830 0.768677 1.479302 1.069266 2.876815 20.676191 17.496399 ... 1.675611e+09 1.558640e+09 328.324493 92.194435 -306.114899 -117.035202 -13.018987 11.924199 3.719768e-15 0.282667
1997-10-01 00:30:00 5.461768 -4.672382 1.081638 -2.425630 0.765681 1.203713 0.934445 2.851195 20.730188 16.747108 ... 1.589037e+09 1.439569e+09 319.436859 94.230705 -298.460938 -110.403969 -20.350492 16.032987 3.525211e-15 0.304713
1997-10-01 00:40:00 6.177846 -5.230110 1.046126 -2.872561 0.635256 1.505010 0.850657 3.317076 20.675701 17.524536 ... 1.812308e+09 1.529260e+09 327.545929 89.292595 -307.303070 -111.865845 -12.313167 10.253789 3.694283e-15 0.244203

5 rows × 33 columns

The data consist of 30 primary input variables: the bulk velocity and its components $V, V_{x}, V_{y}, V_{z}$, the thermal velocity $V_{th}$, the magnetic field, its components and their RMS: $B, B_{x}, B_{y}, B_{z}, \sigma_{B_x}, \sigma_{B_y}, \sigma_{B_z}$, the proton and $\alpha$-particle densities obtained from both moment and non-linear analyses: $N_{p}$, $N_{p,nl}$ and $N_{a,nl}$, as well as 15 channels of proton flux between 0.3 and 10 keV.

The data are resampled to a 10 minute resolution.
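
In the dataframe, the 15 proton flux channels mentioned above appear as the columns "Range F 0" through "Range F 14". A quick way to select them (a small illustrative snippet, not part of the original workflow):

In [ ]:
# Select the 15 proton flux channel columns by their common name prefix
flux_columns = [col for col in data_train.columns if col.startswith("Range F")]
print(len(flux_columns), flux_columns)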

In addition to the 30 primary input variables, we computed 3 additional features that also serve as input variables: the plasma parameter $\beta$, defined as the ratio between the thermal and the magnetic pressure, the dynamic pressure $P_{dyn} = N_{p}V^{2}$, and the normalized magnetic fluctuations $\sigma_{B} = \sqrt{\sigma_{B_x}^{2}+\sigma_{B_y}^{2}+\sigma_{B_z}^{2}}/B$.
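
As an illustration of how these derived quantities relate to the primary variables, the sketch below recomputes the dynamic pressure and the normalized magnetic fluctuations from the formulas above. Physical constants and unit conventions are ignored here (they are assumptions of this sketch); in practice, simply use the Beta, Pdyn and RmsBob columns provided in data_train.

In [ ]:
# Illustrative recomputation of two derived features from the primary variables
# (up to constant factors and unit conventions; not the organizers' exact code).
pdyn_proxy = data_train["Np"] * data_train["V"] ** 2

rms_bob_proxy = np.sqrt(
    data_train["Bx_rms"] ** 2 + data_train["By_rms"] ** 2 + data_train["Bz_rms"] ** 2
) / data_train["B"]

# The normalized fluctuations should be close to the provided column (up to rounding)
print(rms_bob_proxy.head())
print(data_train["RmsBob"].head())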

In [4]:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 509834 entries, 1997-10-01 00:00:00 to 2007-12-31 23:50:00
Data columns (total 33 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   B           509834 non-null  float32
 1   Bx          509834 non-null  float32
 2   Bx_rms      509834 non-null  float32
 3   By          509834 non-null  float32
 4   By_rms      509834 non-null  float32
 5   Bz          509834 non-null  float32
 6   Bz_rms      509834 non-null  float32
 7   Na_nl       509834 non-null  float32
 8   Np          509834 non-null  float32
 9   Np_nl       509834 non-null  float32
 10  Range F 0   509834 non-null  float32
 11  Range F 1   509834 non-null  float32
 12  Range F 10  509834 non-null  float32
 13  Range F 11  509834 non-null  float32
 14  Range F 12  509834 non-null  float32
 15  Range F 13  509834 non-null  float32
 16  Range F 14  509834 non-null  float32
 17  Range F 2   509834 non-null  float32
 18  Range F 3   509834 non-null  float32
 19  Range F 4   509834 non-null  float32
 20  Range F 5   509834 non-null  float32
 21  Range F 6   509834 non-null  float32
 22  Range F 7   509834 non-null  float32
 23  Range F 8   509834 non-null  float32
 24  Range F 9   509834 non-null  float32
 25  V           509834 non-null  float32
 26  Vth         509834 non-null  float32
 27  Vx          509834 non-null  float32
 28  Vy          509834 non-null  float32
 29  Vz          509834 non-null  float32
 30  Beta        509834 non-null  float64
 31  Pdyn        509834 non-null  float64
 32  RmsBob      509834 non-null  float32
dtypes: float32(31), float64(2)
memory usage: 72.0 MB

The target labels consist of an indicator for each time step (0 for background solar wind, 1 for solar storm, the event to detect):

In [5]:
labels_train.head()
Out[5]:
1997-10-01 00:00:00    0
1997-10-01 00:10:00    0
1997-10-01 00:20:00    0
1997-10-01 00:30:00    0
1997-10-01 00:40:00    0
Name: label, dtype: int64
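
ICME intervals cover only a small fraction of the time series (hence the imbalanced-learn dependency and the dedicated section below). A quick way to check the class proportions, for example:

In [ ]:
# Fraction of time steps labelled as solar storm (1) vs background wind (0)
labels_train.value_counts(normalize=True)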

An example ICME "solar storm" event¶

ICME signatures measured by in-situ spacecraft come as patterns in time series of the magnetic field, particle density, bulk velocity, temperature, etc.

Let's visualize a typical event to inspect the patterns.

In [6]:
def plot_event(start, end, data, delta=36):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    subset = data[
        (start - pd.Timedelta(hours=delta)) : (end + pd.Timedelta(hours=delta))
    ]

    fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(10, 15), sharex=True)

    # plot 1
    axes[0].plot(subset.index, subset["B"], color="gray", linewidth=2.5)
    axes[0].plot(subset.index, subset["Bx"])
    axes[0].plot(subset.index, subset["By"])
    axes[0].plot(subset.index, subset["Bz"])
    axes[0].legend(
        ["B", "Bx", "By", "Bz (nT)"], loc="center left", bbox_to_anchor=(1, 0.5)
    )
    axes[0].set_ylabel("Magnetic Field (nT)")

    # plot 2
    axes[1].plot(subset.index, subset["Beta"], color="gray")
    axes[1].set_ylim(-0.05, 1.7)
    axes[1].set_ylabel("Beta")

    # plot 3
    axes[2].plot(subset.index, subset["V"], color="gray")
    axes[2].set_ylabel("V(km/s)")
    # axes[2].set_ylim(250, 500)

    # plot 4
    axes[3].plot(subset.index, subset["Vth"], color="gray")
    axes[3].set_ylabel("$V_{th}$(km/s)")
    # axes[3].set_ylim(5, 60)

    # add vertical lines
    for ax in axes:
        ax.axvline(start, color="k")
        ax.axvline(end, color="k")
        ax.xaxis.grid(True, which="minor")

    return fig, axes
In [7]:
plot_event(
    pd.Timestamp("2001-10-31 22:00:00"), pd.Timestamp("2001-11-02 05:30:00"), data_train
);

Not all events are "text-book" examples, and they do not always exhibit all of the typical characteristics.

Let's visualize a few more, randomly drawn events:

In [8]:
from problem import turn_prediction_to_event_list
In [9]:
events = turn_prediction_to_event_list(labels_train)
In [10]:
rng = np.random.RandomState(1234)

for i in rng.randint(0, len(events), 3):
    plot_event(events[i].begin, events[i].end, data_train)
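
As a sanity check on the event list returned by turn_prediction_to_event_list (a minimal sketch, assuming only the begin and end attributes used above), you can count the events and inspect one of them:

In [ ]:
# Number of labelled ICME events in the training labels
print(len(events))
# Start and end times of the first event
print(events[0].begin, events[0].end)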