Predict age from brain grey matter (regression). Aging is associated with is grey matter (GM) atrophy, each year, an adult lose 0.1% of GM. We will try to learn a predictor of the chronological age (true age) using GM measurements on the brain on a population of healthy control participants.

Such a predictor provides the expected **brain age** of a subject. Deviation from
this expected brain age indicates acceleration or slowdown of the aging process
which may be associated with a pathological neurobiological process or protective factor of aging.

There are 357 samples in the training set and 90 samples in the test set.

Voxel-based_morphometry VBM using cat12 software which provides:

Regions Of Interest (

`rois`

) of Grey Matter (GM) scaled for the Total Intracranial Volume (TIV):`[train|test]_rois.csv`

284 features.VBM GM 3D maps or images (

`vbm3d`

) of voxels in the MNI space:`[train|test]_vbm.npz`

contains 3D images of shapes (121, 145, 121). This npz contains the 3D mask and the affine transformation to MNI referential. Masking the brain provide*flat*331 695 input features (voxels) for each participant.

By default `problem.get_[train|test]_data()`

return the concatenation of 284 ROIs of
Grey Matter (GM) features with 331 695 features (voxels) within a brain mask.
Those two blocks are higly redundant.
To select only on ROIs (`rois`

) features do:

`X[:, :284]`

To select only on (`vbm`

) (voxel with the brain) features do:

`X[:, 284:]`

The target can be found in `[test|train]_participants.csv`

files, selecting the
`age`

column for regression problem.

The main Evaluation metrics is the Root-mean-square deviation RMSE. We will also look at the R-squared R2.

This starting kit requires Python and the following dependencies:

`numpy`

`scipy`

`pandas`

`scikit-learn`

`matplolib`

`seaborn`

`jupyter`

`ramp-workflow`

Therefore, we advise you to install Anaconda distribution which include almost all dependencies.

Only `nilearn`

and `ramp-workflow`

are not included by default in the Anaconda
distribution. They will be installed from the execution of the notebook.

To run a submission and the notebook you will need the dependencies listed in requirements.txt. You can install the dependencies with the following command-line:

`pip install -U -r requirements.txt`

If you are using conda, we provide an environment.yml file for similar usage.

`conda env create -f environment.yml`

Then, you can activate the environment using:

`conda activate brain_age`

And desactivate using

`conda deactivate`

- Download the data locally:

`python download_data.py`

- Execute the jupyter notebook, from the root directory using:

`jupyter notebook brain_age_starting_kit.ipynb`

Tune your model using the starting_kit

- Submission (Run locally)

The submissions need to be located in the `submissions`

folder.
For instance for `linear_regression_rois`

, it should be located in
`submissions/submissions/linear_regression_rois`

.

Copy everything required to build your estimator in a submission file:
`submissions/submissions/linear_regression_rois/estimator.py`

.
This file must contain a function `get_estimator()`

.

Run locally:

`ramp-test --submission linear_regression_rois`

- Submission on RAMP:

In [1]:

```
%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
participants_train = pd.read_csv(os.path.join("data", "train_participants.csv" ))
participants_train["set"] = 'train'
participants_test = pd.read_csv(os.path.join("data", "test_participants.csv" ))
participants_test["set"] = 'test'
participants = participants_train.append(participants_test)
sns.violinplot(x="set", y="age", data=participants)
participants[["age", "set"]].groupby("set").describe()
```

Out[1]:

age | ||||||||
---|---|---|---|---|---|---|---|---|

count | mean | std | min | 25% | 50% | 75% | max | |

set | ||||||||

test | 90.0 | 47.848019 | 17.852717 | 20.071184 | 31.819986 | 44.855578 | 62.017112 | 82.187543 |

train | 357.0 | 49.138846 | 16.095719 | 19.980835 | 35.455168 | 50.902122 | 62.061602 | 86.318960 |

Load ROIs data

In [2]:

```
globvol_rois_train = pd.read_csv(os.path.join("data", "train_rois.csv"))
print(globvol_rois_train.iloc[:5, :10])
```

`train_rois.csv`

provides:

- Global volumes of "tissues": CerrebroSpinal Fluid (
`CSF_Vol`

), Grey (`GM_Vol`

) and White Matter (`WM_Vol`

) volume of participants. - ROIs are starting at column
`l3thVen_GM_Vol`

. Note that`rois_train.loc[:, 'l3thVen_GM_Vol':]`

matches`problem.get_train_data()[:, :284]`

.

In [3]:

```
rois_train = globvol_rois_train.loc[:, 'l3thVen_GM_Vol':]
vols_train = globvol_rois_train.loc[:, ['CSF_Vol', 'GM_Vol', 'WM_Vol']]
```

In [4]:

```
from sklearn.decomposition import PCA
pca_rois = PCA(n_components=2)
PCs_rois = pca_rois.fit_transform(rois_train)
print(pca_rois.explained_variance_ratio_)
df = pd.DataFrame(dict(age=participants_train['age'], PC1_ROIs=PCs_rois[:, 0], PC2_ROIs=PCs_rois[:, 1]))
```

[0.54771351 0.10711543]

Main sources of variability ?

In [5]:

```
sns.pairplot(df)
print(df.corr())
```

PC1_ROIs is higlhy correlated with age Thus age seems to be the main source of variability.

Question: what is Neurobiological effect of age ? Lets explore the hypothesis of brain atrophy.

We compute brain atrophy indices:

- GM_ratio: GM_Vol / (GM_Vol + WM_Vol + CSF_Vol). It is ratio of GM in the brain.
- WM_ratio: WM_Vol / (GM_Vol + WM_Vol + CSF_Vol). It is ratio of WM in the brain.

In [6]:

```
df["GM_ratio"] = vols_train.loc[:, "GM_Vol"] / vols_train.sum(axis=1)
df["WM_ratio"] = vols_train.loc[:, "WM_Vol"] / vols_train.sum(axis=1)
df["CSF_ratio"] = vols_train.loc[:, "CSF_Vol"] / vols_train.sum(axis=1)
sns.pairplot(df[["PC1_ROIs", "age", "GM_ratio", "WM_ratio","CSF_ratio"]])
print(df[["PC1_ROIs", "age", "GM_ratio", "WM_ratio", "CSF_ratio"]].corr())
```

Very strong effect on GM:

- GM_ratio correlate with PC1_ROIs
- GM_ratio correlate with age

Effect on WM:

- WM_ratio correlate with PC1_ROIs
- WM_ratio correlate with age

Aging (strongly correlates with PC1) and leads to decrease of GM_ratio and increase of CSF_ratio:

**Aging leads to brain atrophy with is the main source of brain variability.
Therefore it should be possible to predict the age form the brain anatomy.**

Import and read data

In [7]:

```
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
import sklearn.metrics as metrics
import problem
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
```

In [8]:

```
X_train, y_train = problem.get_train_data()
X_test, y_test = problem.get_test_data()
assert X_train.shape[1] == 284 + 331695
```

Function to compute scores

In [9]:

```
def cv_train_test_scores(rmse_cv_test, rmse_cv_train, r2_cv_test, r2_cv_train,
y_train, y_pred_train, y_test, y_pred_test):
"""Compute CV score, train and test score from a cv grid search model.
Parameters
----------
rmse_cv_test : array
Test rmse across CV folds.
rmse_cv_train : array
Train rmse across CV folds.
r2_cv_test : array
Test R2 across CV folds.
r2_cv_train : array
Train R2 across CV folds.
y_train : array
True train values.
y_pred_train : array
Predicted train values.
y_test : array
True test values.
y_pred_test : array
Predicted test values.
Returns
-------
info : TYPE
DataFrame(r2_cv, r2_train, mae_train, mse_train).
"""
# CV scores
rmse_cv_test_mean, rmse_cv_test_sd = np.mean(rmse_cv_test), np.std(rmse_cv_test)
rmse_cv_train_mean, rmse_cv_train_sd = np.mean(rmse_cv_train), np.std(rmse_cv_train)
r2_cv_test_mean, r2_cv_test_sd = np.mean(r2_cv_test), np.std(r2_cv_test)
r2_cv_train_mean, r2_cv_train_sd = np.mean(r2_cv_train), np.std(r2_cv_train)
# Test scores
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
r2_test = metrics.r2_score(y_test, y_pred_test)
# Train scores
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train))
r2_train = metrics.r2_score(y_train, y_pred_train)
scores = pd.DataFrame([[rmse_cv_test_mean, rmse_cv_test_sd, rmse_cv_train_mean, rmse_cv_train_sd,
r2_cv_test_mean, rmse_cv_test_sd, r2_cv_train_mean, r2_cv_train_sd,
rmse_test, r2_test, rmse_train, r2_train
]],
columns=('rmse_cv_test_mean', 'rmse_cv_test_sd', 'rmse_cv_train_mean', 'rmse_cv_train_sd',
'r2_cv_test_mean', 'rmse_cv_test_sd', 'r2_cv_train_mean', 'r2_cv_train_sd',
'rmse_test', 'r2_test', 'rmse_train', 'r2_train'
))
return scores
```

Selecting only rois or vbm images:

This can be achieved by a `ROIsFeatureExtractor`

or `VBMFeatureExtractor`

In [10]:

```
class ROIsFeatureExtractor(BaseEstimator, TransformerMixin):
"""Select only the 284 ROIs features:"""
def fit(self, X, y):
return self
def transform(self, X):
return X[:, :284]
class VBMFeatureExtractor(BaseEstimator, TransformerMixin):
"""Select only the 284 ROIs features:"""
def fit(self, X, y):
return self
def transform(self, X):
return X[:, 284:]
fe = ROIsFeatureExtractor()
print(fe.transform(X_train).shape)
fe = VBMFeatureExtractor()
print(fe.transform(X_train).shape)
```

(357, 284) (357, 331695)

The framework is evaluated with a cross-validation approach. The metrics used are the root-mean-square error (RMSE).

First we propose a simple Regression predictor based on ROIs features only:

In [11]:

```
cv = problem.get_cv(X_train, y_train)
estimator = make_pipeline(ROIsFeatureExtractor(), StandardScaler(), LinearRegression())
cv_results = cross_validate(estimator, X_train, y_train, scoring=['neg_root_mean_squared_error', 'r2'], cv=cv,
verbose=1, return_train_score=True, n_jobs=5)
# Refit on all train
estimator.fit(X_train, y_train)
# Apply on test
y_pred_train = estimator.predict(X_train)
y_pred_test = estimator.predict(X_test)
print("Important scores are rmse_cv_test_mean and rmse_test")
cv_train_test_scores(rmse_cv_test=-cv_results['test_neg_root_mean_squared_error'],
rmse_cv_train=-cv_results['train_neg_root_mean_squared_error'],
r2_cv_test=cv_results['test_r2'],
r2_cv_train=cv_results['train_r2'],
y_train=y_train, y_pred_train=y_pred_train, y_test=y_test, y_pred_test=y_pred_test).T.round(3)
```

[Parallel(n_jobs=5)]: Using backend LokyBackend with 5 concurrent workers. [Parallel(n_jobs=5)]: Done 2 out of 5 | elapsed: 2.9s remaining: 4.4s

Important scores are rmse_cv_test_mean and rmse_test

[Parallel(n_jobs=5)]: Done 5 out of 5 | elapsed: 3.3s finished

Out[11]: