Biologically, it is known that while cells carry (almost) the same genomic information, they express only a fraction of their genes, leading to specialization into specific types with different biological functions. The study and classification of cell types is therefore of primary interest for many biological and medical applications. In the past decade, measuring gene expression levels at the scale of a single cell has become possible with the rise of high-throughput technologies called single-cell RNA-seq (scRNA-seq).
The goal of this data challenge is the supervised classification of cell types using the scMARK benchmark dataset from Mendonca et al. The authors compiled the expression profiles of 100,000 cells from 10 different studies to serve as a comparison point for different machine learning approaches, in analogy with the MNIST benchmark dataset for computer vision.
This data challenge uses a small extract of scMARK containing only 4 cell types (the labels to predict):
1. Cancer_cells
2. NK_cells
3. T_cells_CD4+
4. T_cells_CD8+
The public dataset contains 1500 points split into 1000 training points and 500 test points. It will serve as your local benchmark for developing your submissions. On the server side, your submission will use all 1500 public points as the training set, and a private, unavailable test dataset containing 1500 additional test points will be used to rank participants. The label distribution is the same in the public and private training and test datasets.
If marked as code, the two following cells will install the required dependencies. They are disabled by default since you only have to run these commands once (in your dedicated Python environment). You can examine the file requirements.txt, included in the repo, to view the list of dependencies.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Raw data are stored in the h5ad format, which can be read via the scanpy.read_h5ad function, returning an AnnData object.
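For illustration, here is a minimal sketch of how such a file could be loaded directly with scanpy; the file path is a placeholder, not the actual name used by the challenge (the helper functions below take care of loading for you).

import scanpy as sc

# Hypothetical path: replace with the actual .h5ad file location in your setup
adata = sc.read_h5ad("data/some_file.h5ad")

print(adata)           # AnnData summary: n_obs x n_vars plus obs/var annotations
counts = adata.X       # the count matrix (often stored as a scipy sparse matrix)
metadata = adata.obs   # per-cell annotations as a pandas DataFrame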
The problem.py file contains the definition of the data challenge according to the RAMP framework. In addition, it contains helper functions to import the data.
from problem import get_train_data, get_test_data
X_train, y_train = get_train_data()
X_test, y_test = get_test_data()
A first inspection of the labels indicates that the classes are imbalanced.
Note: the same analysis may be conducted for y_test.
lab_df = pd.DataFrame({'label': y_train})
lab_df.value_counts(normalize=True)
label
T_cells_CD8+    0.342
T_cells_CD4+    0.336
Cancer_cells    0.237
NK_cells        0.085
dtype: float64
lab_df.label.hist();
Secondly, let's look at the features.
print(X_train.shape)
print(type(X_train))
(1000, 13551)
<class 'scipy.sparse._csr.csr_matrix'>
We see that we have a fairly high-dimensional problem, with 1000 data points (individual cells) described by 13551 variables (genes). Since we measure expression levels, the data is quite sparse, with many unexpressed genes in each cell. Thus, the get_*_data() functions return $X$ as a scipy sparse matrix stored in compressed sparse row (CSR) format. This is useful for two reasons:
1. It saves memory (see the small sketch below).
2. Some algorithms can work directly with scipy's sparse CSR matrices.
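As a rough illustration of the memory argument, here is a small sketch on a synthetic matrix of similar shape; the density used is an assumption, not a property of the challenge data.

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Synthetic, mostly-zero count matrix of the same shape as X_train (density is made up)
dense = rng.poisson(0.05, size=(1000, 13551)).astype(np.float32)
X_sparse = sparse.csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
csr_mb = (X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.1f} MB, CSR: {csr_mb:.1f} MB")  # CSR is much smaller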
Of course, many existing algorithms, e.g. in scikit-learn, may throw an error when given such an object, requiring an np.array instead. Thankfully, the .toarray() method straightforwardly converts it to a dense NumPy array.
X_train.toarray()
array([[1., 0., 0., ..., 3., 2., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
A particularity of RNA-seq data is that total counts may vary widely between cells and/or genes.
total_genes_counts = X_train.toarray().sum(axis=0)
total_cell_counts = X_train.toarray().sum(axis=1)
# plt.hist(np.log10(total_genes_counts), bins = np.arange(6))
plt.hist(total_genes_counts, bins = 10**np.arange(6))
plt.xscale("log")
plt.title("Histogram of total gene (i.e. column) counts in log-scale.")
plt.xlabel('Total genes count (log-scale)')
plt.show()
plt.hist(np.log10(total_cell_counts), bins = np.arange(1,6))
plt.title("Histogram of log-total cell (i.e. row) counts.")
plt.xlabel('log(cell_count)')
plt.show()
This calls for some normalization of the counts. Many normalizations are possible for RNA-seq data, and one of the goals of this challenge is to experiment with different pre-processing strategies. For simplicity, here we choose to normalize each row (cell) by its total count.
def preprocess_X(X):
X = X.toarray()
return X / X.sum(axis=1)[:, np.newaxis]
X_train_norm = preprocess_X(X_train)
# sanity check
np.allclose(X_train_norm.sum(axis=1), np.ones(X_train_norm.shape[0]))
True
This challenge scores your submissions and ranks participants with a balanced accuracy score, computed via scikit-learn's (unadjusted) balanced_accuracy_score function.
Balanced accuracy is computed as the average of the recall scores of each class (see the implementation for more details). It lies between 0 and 1; the higher, the better.
from sklearn.metrics import balanced_accuracy_score
# this custom class is used by the challenge and calls
# balanced_accuracy_score(y_true, y_pred, adjusted=False)
# under the hood
from problem import BalancedAccuracy
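As a quick sanity check on a toy example (made-up labels, not challenge data), balanced accuracy indeed matches the unweighted mean of the per-class recalls:

import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Toy labels for illustration only
y_true = np.array(["A", "A", "A", "B", "B", "C"])
y_pred = np.array(["A", "B", "A", "B", "B", "A"])

per_class_recall = recall_score(y_true, y_pred, average=None)  # one recall per class
print(per_class_recall.mean())                  # 0.555...
print(balanced_accuracy_score(y_true, y_pred))  # same value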
We now show a first naive attempt at the challenge and proceed in two steps:
1. First, we construct a classifier step by step.
2. Then, we show how to implement this classifier as a proper RAMP submission.
Given the high-dimensional nature of the problem, we build the following classifier: standardize the data, apply a PCA keeping only the first 50 components, and finally fit a random forest classifier on those components.
This can be easily implemented as a scikit-learn Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline(
[
("Scaler", StandardScaler(with_mean=True, with_std=True)),
("PCA with 50 components", PCA(n_components=50)),
(
"Random Forest Classifier",
RandomForestClassifier(
max_depth=5, n_estimators=100, max_features=3
),
),
]
)
pipe
Pipeline(steps=[('Scaler', StandardScaler()), ('PCA with 50 components', PCA(n_components=50)), ('Random Forest Classifier', RandomForestClassifier(max_depth=5, max_features=3))])
# fit on train
pipe.fit(X_train_norm, y_train)
y_tr_pred = pipe.predict(X_train_norm)
# predict on test
X_test_norm = preprocess_X(X_test)
y_te_pred = pipe.predict(X_test_norm)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
# compute balanced accuracy and confusion matrix
print(f"Train balanced accuracy : {balanced_accuracy_score(y_train, y_tr_pred):.3f}")
print(f"Test balanced accuracy : {balanced_accuracy_score(y_test, y_te_pred):.3f}")
cm = confusion_matrix(y_test, y_te_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe.classes_, )
disp.plot()
plt.title("Confusion matrix on test set");
Train balanced accuracy : 0.688
Test balanced accuracy : 0.576
This naive classifier does a better job than a dummy random classifier, which would yield an average balanced accuracy of 1/4. However, it never predicts the NK_cells type and seems to confuse the two T-cell types. There is clearly room for improvement. Good news: that is your job! :)
Next, let's see how to implement this exact same classifier as a valid RAMP submission.
The RAMP challenge is automated, and a submission requires the specific structure described below.
A submission is stored in ./submissions/<submission_foldername>/ and must contain a Python file named classifier.py.
This Python script must itself implement (at least) a custom Classifier class with:
1. a fit(X, y) method,
2. a predict_proba(X) method.
Warning: the X argument must be understood as the sparse CSR count data matrix obtained by get_train_data(). Thus any pre-processing of the count matrix must be done inside these methods.
We illustrate this below with the naive classifier already implemented.
Note: The following class is also implemented in ./submissions/starting_kit/classifier.py.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
def _preprocess_X(X_sparse):
# cast to a dense array
X = X_sparse.toarray()
# normalize each row
return X / X.sum(axis=1)[:, np.newaxis]
class Classifier(object):
def __init__(self):
# Use scikit-learn's pipeline
self.pipe = make_pipeline(
StandardScaler(with_mean=True, with_std=True),
PCA(n_components=50),
RandomForestClassifier(
max_depth=5, n_estimators=100,
max_features=3
),
)
def fit(self, X_sparse, y):
X = _preprocess_X(X_sparse)
self.pipe.fit(X, y)
self.classes_ = self.pipe.classes_
def predict_proba(self, X_sparse):
X = _preprocess_X(X_sparse)
# here we use RandomForest.predict_proba()
return self.pipe.predict_proba(X)
Below is a simplified version of what RAMP does with your submission.
clf = Classifier()
clf.fit(X_train, y_train)
# predict_proba
y_tr_pred_proba = clf.predict_proba(X_train)
y_te_pred_proba = clf.predict_proba(X_test)
# convert to hard classification with argmax
y_tr_pred = clf.classes_[np.argmax(y_tr_pred_proba, axis=1)]
y_te_pred = clf.classes_[np.argmax(y_te_pred_proba, axis=1)]
print('Train balanced accuracy:', balanced_accuracy_score(y_train, y_tr_pred))
print('Test balanced accuracy:', balanced_accuracy_score(y_test, y_te_pred))
Train balanced accuracy: 0.6930744874565415
Test balanced accuracy: 0.5820983426079323
In reality, things are a bit more sophisticated. Locally, the RAMP platform averages your classifier's performance over a 5-fold cross-validation scheme implemented for you in the get_cv function. The good news is that RAMP automates everything for you thanks to ramp-test. The public train, validation and test performances are shown to you for information.
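If you want to reproduce this kind of evaluation by hand, here is a rough sketch of a local cross-validation loop; note that it simply assumes a stratified 5-fold split, whereas the official folds come from get_cv in problem.py.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

# Approximate local CV (assumption: stratified 5-fold, not the official get_cv folds)
y = np.asarray(y_train)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in cv.split(X_train, y):
    clf = Classifier()  # the submission class defined above
    clf.fit(X_train[train_idx], y[train_idx])
    proba = clf.predict_proba(X_train[valid_idx])
    y_pred = clf.classes_[np.argmax(proba, axis=1)]
    scores.append(balanced_accuracy_score(y[valid_idx], y_pred))
print(f"valid bal_acc: {np.mean(scores):.2f} ± {np.std(scores):.2f}")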
Before submitting to RAMP, you can test your solution locally to ensure that trivial errors (e.g. typos, path issues, etc.) are resolved. We can test a given submission using the ramp-test
command that was installed in the virtual environment.
We'll use the following command:
!ramp-test --submission <subm_folder> --quick-test
1. The ! signals that the command should be run on the command line instead of in this notebook.
2. ramp-test is the command to be executed. It signals ramp to perform a local test.
3. --submission <subm_folder> specifies which submission to run. You can have multiple potential submissions in the submissions/ directory; this option prevents ramp from running all of them (by default it runs starting_kit).
!ramp-test --submission starting_kit
Testing Single-cell RNA-seq cell types classification
Reading train and test files from ./data/ ...
Reading cv ...
Training submissions/starting_kit ...
CV fold 0
        score  bal_acc      time
        train     0.71  2.156162
        valid     0.57  0.140732
        test      0.56  0.097918
CV fold 1
        score  bal_acc      time
        train     0.74  1.413743
        valid     0.58  0.146134
        test      0.59  0.096986
CV fold 2
        score  bal_acc      time
        train     0.71  2.122261
        valid     0.60  0.146318
        test      0.57  0.096218
CV fold 3
        score  bal_acc      time
        train     0.73  2.167345
        valid     0.59  0.154218
        test      0.57  0.099463
CV fold 4
        score  bal_acc      time
        train     0.70  1.898126
        valid     0.58  0.144049
        test      0.57  0.085239
----------------------------
Mean CV scores
----------------------------
        score       bal_acc        time
        train  0.72 ± 0.016  2.0 ± 0.29
        valid   0.58 ± 0.01   0.1 ± 0.0
        test   0.57 ± 0.009  0.1 ± 0.01
----------------------------
Bagged scores
----------------------------
        score  bal_acc
        valid     0.58
        test      0.58
We see that the mean CV scores are consistent with the previous result. If you use a classifier with more variance, you would see more variation across CV folds.
On the server, participants are ranked according to the balanced accuracy score on the private test dataset. However, only the ranking will be available, not the score itself, in order to avoid overfitting of this private test set. The score on the public test set will be available as a proxy, but your submission could very well obtain a better ranking on the private set with a worse balanced accuracy on the public one.
You can find more information in the README of the ramp-workflow library.