James Catmore (UOslo), Imad Chaabane (LRI/UPSud), Sergei Gleyzer (UFlorida), Cécile Germain (LRI/UPSud), Isabelle Guyon (LRI/UPSud), Victor Estrade (LRI/UPSud), Balazs Kegl (LAL/CNRS), Edouard Leicht (LAL/CNRS), Gilles Louppe (NYU), David Rousseau (LAL/CNRS), Jean-Roch Vlimant (CalTech)
Anomaly detection, where we seek to identify events or datasets that deviate from those normally encountered, is a common task in experimental particle physics. For example, two runs recorded on the same day with identical accelerator and detector conditions and the same trigger menu should not be distinguishable statistically. If they are, some unexpected systematic effect must be present which acts to skew each event or a subset of the events, leading to a collective anomaly. There are many ways in which such problems can arise: for instance, the data acquisition or reconstruction software might be misconfigured, or some subcomponent of the detector might be malfunctioning.
Conversely, an otherwise normal dataset may contain individual events which are somehow unusual. These point anomalies may arise from a problem with the detector, data acquisition, trigger or reconstruction that occurs only in very rare circumstances.
For both cases it would be highly desirable to devise a mechanism that could automatically scan all new datasets, detect any anomalous features, and alert a human being to enable detailed investigation. This is the subject of today's RAMP.
The prediction task
The challenge is to devise a classifier that can distinguish the anomalous cases from the bulk of the data in a test dataset, having first trained the classifier on a labelled training dataset. Whilst the anomalous events are labelled in the training set, no distinction is made between the different types of distortion.
In short, the task in this RAMP is to separate skewed data points from original data points.
A version of the HiggsML dataset (used in the Kaggle Challenge in 2014) is provided. It contains a mixture of Higgs particles decaying into tau pairs and the principal background processes. Half of the events are unchanged, but the other half has been artificially distorted or corrupted in some way. The details of these distortions will be revealed during the RAMP.
The full dataset contains approximately 800k events. We are giving you 100k events to build models, and will use the remaining events to test them.
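Since the training events carry point-level anomaly labels, the setup above reduces to ordinary binary classification: fit a model on the labelled subset, then predict on held-out events. The sketch below illustrates this on synthetic data only; the feature shift used as the "distortion" here is a purely hypothetical stand-in, since the real distortions are not disclosed, and the simple logistic-regression baseline is just one possible choice of classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the RAMP data: 1000 "events" with 5 features.
# Half are left unchanged; the other half are "distorted" by a shift on
# one feature (an illustrative assumption, not the real distortion).
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = np.zeros(n, dtype=int)
y[n // 2:] = 1                 # label 1 = distorted event
X[n // 2:, 0] += 1.0           # hypothetical distortion: shift feature 0

# Shuffle, then mimic the RAMP split: a small labelled set for model
# building, the rest held back for testing.
perm = rng.permutation(n)
X, y = X[perm], y[perm]
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# Minimal logistic-regression classifier trained by gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_train @ w + b)))    # sigmoid probabilities
    w -= lr * (X_train.T @ (p - y_train)) / len(y_train)
    b -= lr * np.mean(p - y_train)

# Predict on the held-out events and measure accuracy.
pred = (1.0 / (1.0 + np.exp(-(X_test @ w + b))) > 0.5).astype(int)
accuracy = np.mean(pred == y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With a one-sigma shift on a single feature, even this baseline separates the two classes noticeably better than chance; the actual RAMP distortions may of course be far subtler and call for richer models.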