Balázs Kégl (LAL/CNRS), Marc Moniez (LAL/CNRS), Alex Gramfort (Inria), Djalel Benbouzid (UPMC), Mehdi Cherti (LAL/CNRS)
Most stars emit light steadily in time, but a small fraction of them have a variable light curve: light emission versus time. We call them variable stars. The light curves are usually periodic and highly regular. There are essentially two reasons why light emission can vary. First, the star itself can be oscillating, so its light emission varies in time. Second, the star, which appears as a single point from Earth (because of our large distance), is actually a binary system: two stars orbiting around their common center of gravity. When the orbital plane is parallel to our line of sight, the stars periodically eclipse each other, creating a light curve with a characteristic signature. Identifying, classifying, and analyzing variable stars is hugely important for calibrating distances, and making these analyses automatic will be crucial in upcoming sky survey projects such as LSST.
Our data come from the EROS1 project, which took data between 1990 and 1994. Four hundred photographic plates of the Large Magellanic Cloud were taken with the ESO-Schmidt 1m telescope. Each plate covers the same $5^\circ \times 5^\circ$ field, centered at $\alpha = 5h18m43s$, $\delta = −69d42m17s$ in the celestial coordinate system. Photos were taken in two frequency bands (red: 630nm; blue: 385nm), digitized at the Observatoire de Paris, and analyzed at the IN2P3 Computing Center.
The full catalog contains 8 million objects. We estimated the probability of each star being stable using an in-house algorithm and selected the $\simeq 1\%$ least stable stars. The light curve of each selected star was then visually inspected, and the star was either declared stable or assigned a variability type and a quality index. The variability types are eclipsing binary, Cepheid, RR-Lyrae, Mira, and other (the variability is clearly established, but the type is unclear). The quality index was visually estimated from 1 (lowest signal-to-background ratio) to 3 (best signal-to-background ratio). A total of 22802 variable objects were found, including 9046 RR-Lyrae, 2758 Cepheids, 1596 eclipsing binaries, 890 Miras, and 8512 unclassified objects. About 10% of the data was lost in the data archeology step, which included converting the measurements from a native PAW format to csv, giving us a total of 19429 stars.
For the RAMP, we decided to drop all instances in the "unclassified object" category since the interpretation of this type was unclear. We then randomly selected 30% of the data for training, giving us 3641 training instances.
The data consists of two files: train.csv contains "static" features in a classical row-wise csv table, and train_varlength_features.csv.gz is a table that contains all the time series. The contents are obtained with the get_train_data function below.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from itertools import chain
pd.set_option('display.max_columns', None)
import problem
X_df, y = problem.get_train_data()
Red variables are ids, labels, or other human-annotated features, so they should not be used as input in the classification. Light blue variables are legal but not likely to contribute information to the classification. Black variables are definitely discriminative.
patch_id
: The catalogue is organized by tiles corresponding to $1$cm$^2$ patches on the plates; this is their id.
star_id_b, star_id_r
: The id of the star within the patch. patch_id and star_id_b or star_id_r identify the stars uniquely, and we use [patch_id]_[star_id_b] for indexing the stars in the pandas table.
magnitude_b, magnitude_r
: The average apparent luminosity of the star (in two frequency bands). Magnitude is a logarithmic measure of the flux, and the higher it is, the lower the apparent luminosity (see the sketch after this list).
asc_d, asc_m, asc_s
: Celestial right ascension (coordinate) of the star, measured in degrees, minutes, and seconds, respectively.
dec_d, dec_m, dec_s
: Celestial declination (coordinate) of the star, measured in degrees, minutes, and seconds, respectively.
period
: The estimated period of the light curve. For the correct period, it should be divided by div_period (see the sketch after this list).
frequency
: $1/$period, so it is redundant.
num_points_good_b, num_points_good_r
: Number of good light curve measurements (some measurements can be corrupted).
asym_b, asym_r
: Unknown semantics.
log_p_not_variable
: Logarithm of the estimated probability that the star is stable.
sigma_flux_b, sigma_flux_r
: The square root of the total variance of the light measurements (indicating the amplitude of the variability).
type
: The label to predict.
quality
: Human-annotated measure of the quality of the time curve. The higher the better.
div_period
: The algorithm that estimates the period sometimes finds a multiple of the period. These cases were human-detected, and the divisor was recorded. In principle this variable is not available automatically as an observable, but we are confident that it could be obtained automatically, so we allow it as an input.
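To make these definitions concrete, here is a minimal sanity-check sketch using the column names above: it derives the corrected period, verifies that frequency is the inverse of the raw period (the printed difference should be near zero, up to rounding), and illustrates the logarithmic magnitude convention (the flux is only relative, since the catalogue's zero point is not specified).
# corrected period: the raw estimate divided by the human-recorded divisor
real_period_check = X_df['period'] / X_df['div_period']
# frequency should be (up to rounding) the inverse of the raw period
print(np.abs(X_df['frequency'] - 1 / X_df['period']).max())
# magnitude is logarithmic: relative flux ~ 10**(-0.4 * magnitude), so a
# higher magnitude means a lower apparent luminosity
relative_flux_b = 10 ** (-0.4 * X_df['magnitude_b'])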
X_df.head()
label_names = {1: 'binary', 2: 'cepheid', 3: 'rr_lyrae', 4: 'mira'}
labels = list(label_names.keys())
y_series = pd.Series(y).replace(label_names)
y_series.head()
_ = y_series.value_counts().plot(kind="bar")
colors = ['r', 'b', 'g', 'm']
def plot_classwise_normalized(feature, bins=None):
    if bins is None:
        bins = np.linspace(X_df[feature].min(), X_df[feature].max(), 15)
    for label, color in zip(labels, colors):
        plt.hist(X_df[y == label][feature].values, density=True, bins=bins,
                 alpha=0.5, color=color)
plot_classwise_normalized('period')
plot_classwise_normalized('period', bins=np.linspace(0, 50, 15))
Observe the aliasing below.
plot_classwise_normalized('period', bins=np.linspace(0, 5, 15))
X_df['real_period'] = X_df['period'] / X_df['div_period']
plot_classwise_normalized('real_period')
plot_classwise_normalized('real_period', bins=np.linspace(0, 50, 15))
plot_classwise_normalized('real_period', bins=np.linspace(0, 5, 15))
plot_classwise_normalized('magnitude_b')
plot_classwise_normalized('magnitude_r')
plot_classwise_normalized('asym_b')
plot_classwise_normalized('asym_r')
plot_classwise_normalized('sigma_flux_b')
plot_classwise_normalized('sigma_flux_b', bins=np.linspace(0, 1000, 15))
plot_classwise_normalized('sigma_flux_b', bins=np.linspace(0, 100, 15))
plot_classwise_normalized('sigma_flux_r')
plot_classwise_normalized('sigma_flux_r', bins=np.linspace(0, 1000, 15))
plot_classwise_normalized('sigma_flux_r', bins=np.linspace(0, 100, 15))
plot_classwise_normalized('log_p_not_variable')
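The overlapping histograms can be hard to tell apart; here is a minimal variant of the helper above (same logic, plus a legend built from the label_names mapping defined earlier):
def plot_classwise_normalized_legend(feature, bins=None):
    # identical to plot_classwise_normalized, plus a legend naming each class
    if bins is None:
        bins = np.linspace(X_df[feature].min(), X_df[feature].max(), 15)
    for label, color in zip(labels, colors):
        plt.hist(X_df[y == label][feature].values, density=True, bins=bins,
                 alpha=0.5, color=color, label=label_names[label])
    plt.legend()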
colors = ['r', 'b', 'g', 'm']
def plot_classwise_scatter(feature1, feature2, range1=None, range2=None):
    if range1 is None:
        range1 = [X_df[feature1].min(), X_df[feature1].max()]
    if range2 is None:
        range2 = [X_df[feature2].min(), X_df[feature2].max()]
    plt.xlim(range1[0], range1[1])
    plt.ylim(range2[0], range2[1])
    for label, color in zip(labels, colors):
        plt.scatter(X_df[y == label][feature1],
                    X_df[y == label][feature2],
                    alpha=0.3, s=80, c=color, marker='.')
plot_classwise_scatter('magnitude_b', 'magnitude_r')
plot_classwise_scatter('magnitude_b', 'real_period', range1=None, range2=[0,10])
Each time-series column of train_varlength_features.csv.gz contains a list of floating point numbers.
time_points_b, time_points_r
: The time (in units of days) when the photos were taken. Note that the filters had to be changed, so the time points of the blue and red frequency bands are slightly different.
light_points_b, light_points_r
: The light points measured at the time points.
error_points_b, error_points_r
: Uncertainties (error bars) on the light measurements.
bkg_points_b, bkg_points_r
: Background noise measured at the time points.
polltn_points_b, polltn_points_r
: Pollution noise measured at the time points.
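Since the two bands were observed at slightly different times, the blue and red series of a star need not have the same length; a quick check on the first star:
x = X_df.iloc[0]
# each band has its own time points, so the blue and red lengths can differ
print(len(x['time_points_b']), len(x['light_points_b']))
print(len(x['time_points_r']), len(x['light_points_r']))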
X_df.hist('num_points_good_r')
print(min(X_df['num_points_good_b']))
print(max(X_df['num_points_good_b']))
print(min(X_df['num_points_good_r']))
print(max(X_df['num_points_good_r']))
Set the patch id and star id below.
patch_id = 98
star_id_b = 477
def star_key(patch_id, star_id_b):
    return str(patch_id) + '_' + str(star_id_b)
X_df = X_df.set_index(X_df.apply(lambda row: "%d_%d" % (row['patch_id'], row['star_id_b']), axis=1))
time_points = X_df.loc[star_key(patch_id, star_id_b)]['time_points_b']
light_points = X_df.loc[star_key(patch_id, star_id_b)]['light_points_b']
error_points = X_df.loc[star_key(patch_id, star_id_b)]['error_points_b']
plt.errorbar(time_points, light_points, yerr=error_points, fmt='o');
The raw measurements seem rather messy. The scatter of the points is visibly larger than the measurement uncertainties (which makes it, by definition, a variable star), but there is no visible periodicity. We can use the estimated period to overplot several periods of the curve ("fold" the time series) using the following function.
def fold_time_series(time_point, period, div_period):
    real_period = period / div_period
    return time_point % real_period  # modulo real_period
period = X_df.loc[star_key(patch_id, star_id_b)]['period']
div_period = X_df.loc[star_key(patch_id, star_id_b)]['div_period']
print(period, div_period)
time_points_folded = [fold_time_series(time_point, period, div_period)
                      for time_point in time_points]
The resulting curve has a characteristic signature.
plt.gca().invert_yaxis()
plt.errorbar(time_points_folded, light_points, yerr=error_points, fmt='o');
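To compare stars with different periods on a common scale, one can also plot the folded curve in phase units; a small sketch using the variables defined above:
real_period = period / div_period
phases = np.array(time_points_folded) / real_period  # phase in [0, 1)
plt.gca().invert_yaxis()
plt.errorbar(phases, light_points, yerr=error_points, fmt='o')
plt.xlabel('phase')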
The goal of the RAMP is to classify the stars into one of the four types. In your code you will have access both to the static features and the time series. The submission site will have several examples that you can start from.
The input data are stored in a dataframe. To go from a dataframe to a numpy array, we will use a scikit-learn column transformer. The first example will simply consist of selecting a subset of columns we want to work with.
cols = [
    'magnitude_b',
    'magnitude_r',
    'period',
    'asym_b',
    'asym_r',
    'log_p_not_variable',
    'sigma_flux_b',
    'sigma_flux_r',
    'quality',
    'div_period',
]
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
    ('passthrough', cols)
)
X_array = transformer.fit_transform(X_df)
X_array
Let's look at how to transform light curves.
The following feature extractor takes the light curve, bins it into num_bins bins, and returns the bin means. It works with one band at a time.
def fold_time_series(time_point, period, div_period):
    return (time_point -
            (time_point // (period / div_period)) * period / div_period)

def get_bin_means(X_df, num_bins, band):
    feature_array = np.empty((len(X_df), num_bins))
    for k, (_, x) in enumerate(X_df.iterrows()):
        period = x['period']
        div_period = x['div_period']
        real_period = period / div_period
        bins = [i * real_period / num_bins for i in range(num_bins + 1)]
        time_points = np.array(x['time_points_' + band])
        light_points = np.array(x['light_points_' + band])
        time_points_folded = \
            np.array([fold_time_series(time_point, period, div_period)
                      for time_point in time_points])
        time_points_folded_digitized = \
            np.digitize(time_points_folded, bins) - 1
        for i in range(num_bins):
            this_light_points = light_points[time_points_folded_digitized == i]
            if len(this_light_points) > 0:
                feature_array[k, i] = np.mean(this_light_points)
            else:
                feature_array[k, i] = np.nan  # missing
    return feature_array
get_bin_means(X_df.iloc[:2], 5, 'r')
Bins with no measurements yield NaN; we will impute these later in the pipeline. To apply the extractor to both the red and the blue curves, we will use a function transformer.
from sklearn.preprocessing import FunctionTransformer
transformer_r = FunctionTransformer(
    lambda X_df: get_bin_means(X_df, 5, 'r')
)

transformer_b = FunctionTransformer(
    lambda X_df: get_bin_means(X_df, 5, 'b')
)

transformer = make_column_transformer(
    (transformer_r, ['period', 'div_period', 'time_points_r', 'light_points_r']),
    (transformer_b, ['period', 'div_period', 'time_points_b', 'light_points_b']),
)
X_array = transformer.fit_transform(X_df)
X_array.shape
Each band contributes num_bins = 5 binned means, so X_array has ten columns. Combined with some static features and plugged into a random forest, it reads:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
def fold_time_series(time_point, period, div_period):
    return (time_point -
            (time_point // (period / div_period)) * period / div_period)

def get_bin_means(X_df, num_bins, band):
    feature_array = np.empty((len(X_df), num_bins))
    for k, (_, x) in enumerate(X_df.iterrows()):
        period = x['period']
        div_period = x['div_period']
        real_period = period / div_period
        bins = [i * real_period / num_bins for i in range(num_bins + 1)]
        time_points = np.array(x['time_points_' + band])
        light_points = np.array(x['light_points_' + band])
        time_points_folded = \
            np.array([fold_time_series(time_point, period, div_period)
                      for time_point in time_points])
        time_points_folded_digitized = \
            np.digitize(time_points_folded, bins) - 1
        for i in range(num_bins):
            this_light_points = light_points[time_points_folded_digitized == i]
            if len(this_light_points) > 0:
                feature_array[k, i] = np.mean(this_light_points)
            else:
                feature_array[k, i] = np.nan  # missing
    return feature_array

transformer_r = FunctionTransformer(
    lambda X_df: get_bin_means(X_df, 5, 'r')
)

transformer_b = FunctionTransformer(
    lambda X_df: get_bin_means(X_df, 5, 'b')
)

cols = [
    'magnitude_b',
    'magnitude_r',
    'period',
    'asym_b',
    'asym_r',
    'log_p_not_variable',
    'sigma_flux_b',
    'sigma_flux_r',
    'quality',
    'div_period',
]

common = ['period', 'div_period']
transformer = make_column_transformer(
    (transformer_r, common + ['time_points_r', 'light_points_r']),
    (transformer_b, common + ['time_points_b', 'light_points_b']),
    ('passthrough', cols)
)

pipe = make_pipeline(
    transformer,
    SimpleImputer(strategy='most_frequent'),
    RandomForestClassifier(max_depth=5, n_estimators=10)
)

def get_estimator():
    return pipe
import problem
from sklearn.model_selection import cross_val_score
X_df, y = problem.get_train_data()
scores = cross_val_score(get_estimator(), X_df, y, cv=2, scoring='accuracy')
print(scores)
To submit your code, you can refer to the online documentation.