Drug classification and concentration estimation from Raman spectra
Camille Marini (LTCI/CNRS), Alex Gramfort (LTCI/Télécom ParisTech), Sana Tfaili (Lip(Sys)²/UPSud), Laetitia Le (Lip(Sys)²/UPSud), Mehdi Cherti (LAL/CNRS), Balázs Kégl (LAL/CNRS)


Chemotherapy is one of the most widely used treatments against cancer. It relies on chemical substances (chemotherapeutic agents) that kill cells that divide too quickly. These substances are often diluted in a particular solution and packaged in bags, diffusers, or syringes before being administered. Medication errors (wrong chemotherapeutic agent or wrong concentration) can have serious consequences for patients. To prevent such errors, recent French regulations require that anti-cancer drugs be verified before administration. The goal is to check that they contain the correct chemotherapeutic agent at the correct dosage.

Raman spectroscopy could be used to perform this check, since, theoretically, i) each molecule has a specific spectral fingerprint by which it can be identified; and ii) the Raman intensity increases with the concentration of the molecule. The main advantage of spectroscopy over other methods (for example, liquid chromatography) is that it is non-destructive and non-invasive (measurements are made without opening the drug containers). However, this method is rarely used in hospital environments because the spectral signals are complex to analyze. Automating the analysis of these signals could help significantly. Eventually, a complete analytical system (from measuring Raman spectra to identifying the chemotherapeutic agent and its concentration) could be designed, which would be easy to use and would help prevent medication errors.

In this context, the goal of this project is to develop prediction models able to identify and quantify chemotherapeutic agents from their Raman spectra.

The Lip(Sys)² laboratory measured Raman spectra of 4 types of chemotherapeutic agents (called molecule) in 3 different packages (called vial), diluted in 9 different solutions (called solute groups), at different concentrations. A total of 360 spectra were measured for each agent, except for one (348 spectra).
Part of these data are saved in the file train.csv as follows (n_samples being the number of samples):

  • molecule: Type of chemotherapeutic agent. Four possible values: A for infliximab, B for bevacizumab, Q for ramucirumab, R for rituximab. Dimension: (n_samples,)
  • vial: Vial type. Three possible values: 1, 2, 3. Dimension: (n_samples,)
  • solute: Solute group. Fourteen possible values: 1, 2, ..., 14. Dimension: (n_samples,)
  • concentration: Concentration of the molecule. Dimension: (n_samples, 1)
  • spectra: Intensity of Raman spectrum. Dimension: (n_samples, 1866)
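
As a rough illustration of this layout, the sketch below reads such a table into one list per column, with one entry per sample. It is only a sketch under stated assumptions: the actual encoding of the spectra column in train.csv is not described here, so the code assumes each spectrum is stored as a bracketed, comma-separated list in a single quoted field. The helper name load_samples and the demo values are hypothetical, and the demo spectra have 3 intensities instead of 1866.

```python
import csv
import io
import json


def load_samples(csv_file):
    """Read rows into per-column lists, one entry per sample."""
    columns = {"molecule": [], "vial": [], "solute": [],
               "concentration": [], "spectra": []}
    for row in csv.DictReader(csv_file):
        columns["molecule"].append(row["molecule"])
        columns["vial"].append(int(row["vial"]))
        columns["solute"].append(int(row["solute"]))
        columns["concentration"].append(float(row["concentration"]))
        # Assumption: each spectrum is stored as a "[v1, v2, ...]" string.
        columns["spectra"].append(json.loads(row["spectra"]))
    return columns


# Tiny in-memory stand-in for train.csv. All values are made up.
demo_csv = io.StringIO(
    'molecule,vial,solute,concentration,spectra\n'
    'A,1,3,800.0,"[0.01, 0.02, 0.03]"\n'
    'Q,2,14,2000.0,"[0.04, 0.05, 0.06]"\n'
)
data = load_samples(demo_csv)
n_samples = len(data["molecule"])
n_features = len(data["spectra"][0])
print(n_samples, n_features)
```

With the real file, the same function could be called on an opened file handle, provided the spectra column really uses the assumed encoding; otherwise only the parsing line for spectra would need to change.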

To sum up, there are two objectives:

  • classification: predict which molecule a given spectrum corresponds to.
  • regression: predict the concentration of a molecule. The prediction should not depend on the vial or the solute group. The error metric is the mean absolute relative error (mare): $$\frac{1}{n_{samples}}\sum_{i=1}^{n_{samples}}\left|\frac{y_i-\hat{y}_i}{y_i}\right|$$ with $y_i$ and $\hat{y}_i$ being the true and predicted concentrations of sample $i$.
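
The mare formula can be written out directly in a few lines. The following is a minimal sketch in plain Python, not the official scoring code of the project; the function name mare and the example concentrations are illustrative assumptions. It also assumes that no true concentration is zero, since the metric divides by $y_i$.

```python
def mare(y_true, y_pred):
    """Mean absolute relative error between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    # Each term is |(y_i - y_hat_i) / y_i|; a zero y_i raises ZeroDivisionError.
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Made-up concentrations: relative errors are 0.1, 0.1, and 0.0.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
score = mare(y_true, y_pred)
print(round(score, 3))
```

Because each error is divided by the true concentration, missing a low concentration by a given amount costs more than missing a high concentration by the same amount.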