RAMP on variable star type prediction

Balázs Kégl (LAL/CNRS), Marc Moniez (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)

Introduction

Variable stars

Most stars emit light steadily in time, but a small fraction of them has a variable light curve: light emission versus time. We call them variable stars. The light curves are usually periodic and highly regular. There are essentially two reasons why light emission can vary. First, the star itself can be oscillating, so its light emission varies in time. Second, the star that seems a single point at Earth (because of our large distance) is actually a binary system: two stars that orbit around their common center of gravity. When the orbital plane is parallel to our line of view, the stars eclipse each other periodically, creating a light curve with a charateristic signature. Identifying, classifying, and analyzing variable stars are hugely important for calibrating distances, and making these analyses automatic will be crucial in the upcoming sky survey projects such as LSST.

The EROS1 database, a catalog of light curves

Our data is coming from the EROS1 project that took data between 1990 and 1994. 400 photographic plates were taken of the Large Magellanic Cloud with the ESO-Schmidt 1m telescope. Each plate covers the same $5^\circ \times 5^\circ$ field, centered at $\alpha = 5h18m43s$, $\delta = −69d42m17s$ in the celestial coordinate system. Photos were taken in two frequency bands (red: 630nm; blue: 385nm), digitized at the Observatoire de Paris, and analyzed at the IN2P3 Computing Center.

Selecting variable stars

The full catalog contains 8 million objects. We estimated the probability of a star being stable using an in-house algorithm and selected the $\simeq 1\%$ least stable stars. The light curve of each selected star was then visually inspected, and the star was either declared stable or assigned a variability type and a quality index. The variability types are eclipsing binary, Cepheid, RR-Lyrae, Mira, and other (the variability is clearly established, but the type is unclear). The quality index was visually estimated from 1 (lowest signal to background ratio) to 3 (best signal to background ratio). A total of 22802 variable objects were found that include 9046 RR-Lyrae, 2758 Cepheids, 1596 eclipsing binaries, 890 Miras, and 8512 unclassified objects. About 10% of the data was lost in the data archeology step that included converting the measurements from a native PAW format to csv, giving us a total of 19429 stars.

Selecting data

For the RAMP, we decided to drop all instances in the "unclassified object" category since the interpretation of this type was unclear. We then randomly selected 30% of the data for training, giving us 3641 training instances.

The data consists of two files, train.csv contains "static" features in a classical row-wise csv table, and train_varlength_features.csv.gz is a table that contains all the time series. The contents is obtained with get_train_data function below.

Exploratory data analysis

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import chain
pd.set_option('display.max_columns', None)

Get access to the training data

In [2]:
import problem

X_df, y = problem.get_train_data()

The static features

Red variables are ids, labels, or other human-annotated features, so they should not be used as input in the classification. Light blue variables are legal but not likely to contribute information to the classification. Black variables are definitely discriminative.

  • patch_id: The catalogue is organized by tiles corresponding to $1$cm$^2$ patches on the plates, this is their id.
  • star_id_b, star_id_r: The id of the star within the patch. patch_id and star_id_b or star_id_r identify the stars uniquely, and we use [patch_id]_[star_id_b] for indexing the stars in the pandas table.
  • magnitude_b, magnitude_r: The average apparent luminosity of the star (in two frequency bands). Magnitude is a logarithmic measure, and the higher it is, the lower the apparent luminosity is.
  • asc_d, asc_m, asc_s: Celestial right ascension (coordinate) of the star, measured in degrees, minutes, and seconds, respectively.
  • dec_d, dec_m, dec_s: Celestial declination (coordinate) of the star, measured in degrees, minutes, and seconds, respectively.
  • period: the estimated period of the light curve. For the correct period, it should be divided by div_period
  • frequency: $1/$period, so it is redundant.
  • num_points_good_b, num_points_good_r: Number of good light curve measurements (some measurements can be corrupted).
  • asym_b, asym_r: Unknown semantics.
  • log_p_not_variable: Logarithm of the estimated probability that the star is stable.
  • sigma_flux_b, sigma_flux_r: The square root of the total variance of the light measurements (indicating the amplitude of the variability).
  • type: The label to predict.
  • quality: Human-annotated measure of the quality of the time curve. The higher the better.
  • div_period: The algorithm that estimates the period sometimes finds a multiple of the period. These cases were human-detected, and the divisor was recorded. In principle this variable is not available automatically as an observable, but we are confident that it could be obtained automatically, so we allow it as an input.
In [3]:
X_df.head()
Out[3]:
patch_id star_id_b star_id_r magnitude_b magnitude_r asc_d asc_m asc_s dec_d dec_m dec_s period frequency num_points_good_b num_points_good_r asym_b asym_r log_p_not_variable sigma_flux_b sigma_flux_r quality div_period time_points_b time_points_r light_points_b light_points_r error_points_b error_points_r bkg_points_b bkg_points_r polltn_points_b polltn_points_r
0 135.0 9613.0 10062.0 19.1458 18.8044 5.0 40.0 51.37 -70.0 13.0 58.78 2.35495 0.424637 123.0 124.0 1.277780 0.653333 -2.80984 36.1471 46.4916 1.0 7.0 [290.3, 291.35, 322.25, 326.24, 345.18, 347.23... [290.34, 291.31, 322.29, 326.28, 345.23, 347.1... [19.15, 18.96, 18.86, 19.17, 18.96, 19.03, 19.... [18.75, 18.66, 18.5, 18.68, 18.77, 18.69, 18.7... [0.13, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.1... [0.16, 0.25, 0.14, 0.17, 0.18, 0.23, 0.14, 0.1... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.13, 0.13, 0.11, 0.11, 0.14, 0.19, 0.08, 0.1...
1 271.0 4304.0 4708.0 17.6799 15.3077 5.0 28.0 51.55 -69.0 23.0 19.17 280.00000 0.003571 123.0 125.0 0.921875 1.777780 -14.51460 136.1020 1269.7400 2.0 2.0 [290.3, 291.35, 322.25, 326.24, 345.18, 347.23... [290.34, 291.31, 322.29, 326.28, 345.23, 347.1... [17.7, 17.97, 17.83, 17.71, 17.6, 17.4, 17.46,... [15.43, 15.66, 15.66, 15.5, 15.28, 15.28, 15.3... [0.12, 0.1, 0.08, 0.09, 0.11, 0.09, 0.11, 0.12... [0.1, 0.11, 0.09, 0.09, 0.06, 0.07, 0.06, 0.07... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.02, 0.02, 0.0, 0.0, 0.02, 0.02, 0.0, 0... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2 295.0 8200.0 8739.0 19.3872 18.8606 5.0 27.0 7.89 -68.0 43.0 14.47 1.42264 0.702919 122.0 124.0 2.388890 1.883720 -6.50514 32.1775 48.2365 2.0 3.0 [290.3, 291.35, 322.25, 326.24, 345.18, 347.23... [290.34, 291.31, 322.29, 326.28, 345.23, 347.1... [20.16, 19.37, 19.28, 19.53, 19.53, 19.95, 19.... [19.19, 18.81, 18.86, 19.06, 19.05, 18.89, 18.... [0.25, 0.14, 0.13, 0.16, 0.22, 0.26, 0.19, 0.2... [0.27, 0.25, 0.17, 0.21, 0.24, 0.25, 0.15, 0.2... [0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.0,... [0.01, 0.01, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.... [0.03, 0.03, 0.02, 0.02, 0.06, 0.11, 0.05, 0.0... [0.06, 0.03, 0.03, 0.03, 0.05, 0.06, 0.02, 0.0...
3 223.0 530.0 557.0 16.4751 15.5033 5.0 35.0 41.69 -70.0 56.0 13.25 193.54800 0.005167 124.0 124.0 0.252525 1.431370 -6.12045 368.7470 1026.6800 0.0 3.0 [290.3, 291.35, 322.25, 326.24, 345.18, 347.23... [290.34, 291.31, 322.29, 326.28, 345.23, 347.1... [16.73, 16.3, 16.25, 16.38, 16.22, 16.07, 16.0... [15.24, 15.55, 15.74, 15.69, 15.55, 15.37, 15.... [0.14, 0.12, 0.11, 0.18, 0.23, 0.16, 0.2, 0.16... [0.25, 0.21, 0.13, 0.13, 0.12, 0.13, 0.12, 0.1... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4 354.0 10866.0 12022.0 17.8301 16.7286 5.0 21.0 50.48 -69.0 17.0 32.07 1.81622 0.550594 123.0 116.0 1.510200 0.633803 -2.07771 121.7490 318.9770 1.0 5.0 [290.3, 291.35, 322.25, 326.24, 345.18, 347.23... [290.34, 291.31, 322.29, 326.28, 345.23, 347.1... [17.67, 17.51, 17.78, 17.73, 17.6, 17.87, 18.0... [16.97, 16.88, 16.72, 16.67, 16.47, 17.12, 16.... [0.13, 0.1, 0.1, 0.1, 0.12, 0.1, 0.12, 0.14, 0... [0.13, 0.14, 0.1, 0.1, 0.08, 0.18, 0.08, 0.08,... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... [0.0, 0.0, 0.0, 0.0, 0.02, 0.02, 0.02, 0.0, 0.... [0.19, 0.22, 0.13, 0.11, 0.11, 0.35, 0.05, 0.0...

The labels

In [4]:
label_names = {1: 'binary', 2: 'cepheid', 3: 'rr_lyrae', 4: 'mira'}
labels = list(label_names.keys())
y_series = pd.Series(y).replace(label_names)
y_series.head()
Out[4]:
0    rr_lyrae
1        mira
2    rr_lyrae
3      binary
4    rr_lyrae
dtype: object
In [5]:
_ = y_series.value_counts().plot(kind="bar")

Some classwise histograms and scatterplots

In [6]:
colors = ['r', 'b', 'g', 'm']
def plot_classwise_normalized(feature, bins=None):
    if bins is None:
        bins = np.linspace(X_df[feature].min(), X_df[feature].max(), 15)
    for label, color in zip(labels, colors):
        plt.hist(X_df[y == label][feature].values, density=True, bins=bins, 
                 alpha=0.5, color=color)
In [7]:
plot_classwise_normalized('period')
In [8]:
plot_classwise_normalized('period', bins=np.linspace(0, 50, 15))

Observe the aliasing below.

In [9]:
plot_classwise_normalized('period', bins=np.linspace(0, 5, 15))
In [10]:
X_df['real_period'] = X_df['period'] / X_df['div_period']
In [11]:
plot_classwise_normalized('real_period')
In [12]:
plot_classwise_normalized('real_period', bins=np.linspace(0, 50, 15))
In [13]:
plot_classwise_normalized('real_period', bins=np.linspace(0, 5, 15))
In [14]:
plot_classwise_normalized('magnitude_b')
In [15]:
plot_classwise_normalized('magnitude_r')
In [16]:
plot_classwise_normalized('asym_b')
In [17]:
plot_classwise_normalized('asym_r')
In [18]:
plot_classwise_normalized('sigma_flux_b')
In [19]:
plot_classwise_normalized('sigma_flux_b', bins=np.linspace(0, 1000, 15))
In [20]:
plot_classwise_normalized('sigma_flux_b', bins=np.linspace(0, 100, 15))
In [21]:
plot_classwise_normalized('sigma_flux_r')
In [22]:
plot_classwise_normalized('sigma_flux_r', bins=np.linspace(0, 1000, 15))
In [23]:
plot_classwise_normalized('sigma_flux_r', bins=np.linspace(0, 100, 15))
In [24]:
plot_classwise_normalized('log_p_not_variable')
In [25]:
colors = ['r', 'b', 'g', 'm']
def plot_classwise_scatter(feature1, feature2, range1=None, range2=None):
    if range1 is None:
        range1 = [X_df[feature1].min(), X_df[feature1].max()]
    if range2 is None:
        range2 = [X_df[feature2].min(), X_df[feature2].max()]
    for label, color in zip(labels, colors):
        plt.xlim(range1[0], range1[1])
        plt.ylim(range2[0], range2[1])
        plt.scatter(X_df[y == label][feature1], 
                    X_df[y == label][feature2],
            alpha=0.3, s=80, c=color, marker='.');
In [26]:
plot_classwise_scatter('magnitude_b', 'magnitude_r')