MNIST classification
Current events on this problem

Paris Saclay Center for Data Science

RAMP MNIST handwritten digit classification

Mehdi Cherti (CNRS), Balázs Kégl (CNRS)



The prediction task

The goal of this RAMP is to classify correctly handwritten digits. For each submission, you will have to provide an image preprocessor (to standardize, resize, crop, augment images) and batch classifier, which will fit a training set and predict the classes (species) on a test set. The images are big so loading them into the memory at once is impossible. The batch classifier therefore will access them through a generator which can be "asked for" a certain number of training and validation images at a time. You will typically run one minibatch of stochastique gradient descent on these images to train a deep convolutional neural networks which are the state of the art in image classification.


First of all, even though 68K images is relatively small compared to industrial level data sets, to achieve state-of-the-art performance, you will need big networks which will take ages (days) to train on a CPU. If you want to have a faster turnaround for tuning your net, you will need a GPU-equipped server of could instance. Setting up an AWS instance is easy, just follow this tutorial. If you want to have the starting kit preinstalled, use the community AMI "pollenating_insects_2_users".

Your main bottleneck is memory. E.g., increasing the resolution to 128x128, you will need to decrease batch size. You should always run on the AWS node before submitting.

For learning the nuts and bolts of convolutional nets, we suggest that you follow Andrej Karpathy’s excellent course.

You have some trivial "classical" options to explore. You should set the epoch size to something more than three (in the starting kit). You should check when the validation error curve flattens because you will also be graded on training and test time. You can change the network architecture, apply different regularization techniques to control overfitting, optimization options to control underfitting.

You can use pretrained nets from here. There are a couple of examples in the starting kit. Your options are the following.

  • Retrain or not the weights. If you do not, you are using the pretrained net as fixed a feature extractor. You can add some layers on the top of the output of the pretrained net, and only train your layers. If you retrain all the layers, you use the pretrained net as an initialization. Again, your goal is not only to increase accuracy but also to be frugal. Retraining the full net is obviously more expensive.
  • You can "read out" the activations from any layer, you do not need to keep the full net, not even the full convolutional stack.
  • The starting kit contains examples with the VGG16 net, but feel free to use any of the other popular nets. Just note that there is no way to change the architecture of these nets. In particular, each net expects images of a given dimension so your image preprocessing needs to resize or crop the images to the right size.

You can also adjust the image preprocessing. Resizing to small (64x64 or even 32x32) will make the training faster so you can explore more hyperparameters, but the details will be lost so your final result will probably be suboptimal. Insects are mostly centered in the images but there are a lot of smaller insects which could be cropped for a better performance. You can also rotate the images or apply other data augmentation tricks (google "convolutional nets data augmentation"). You should also look at the actual images to get some inspiration to find meaningful preprocessing ideas.

You can also get inspired by looking at the submissions of the first edition (after signing up).

In [1]:
import os
import numpy as np
import pandas as pd
from import imread
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid
from matplotlib import cm

%matplotlib inline

pd.set_option('display.max_rows', 500)

The data

If the images are not yet in data/imgs, change the type of the net cell to "Code" and run it.

In [2]:
df = pd.read_csv('data/train.csv')
X_df = df['id']
y_df = df['class']
X = X_df.values
y = y_df.values

The class distribution is balanced.

In [3]:
labels_counts_df = df.groupby('class').count()
0 4742
1 5370
2 4795
3 4914
4 4675
5 4337
6 4709
7 4998
8 4707
9 4753

It is worthwhile to look at some image panels, grouped by label.

In [4]:
nb_rows = 4
nb_cols = 4
nb_elements = nb_rows * nb_cols
label = 8

df_given_label = df[df['class']==label]

subsample = np.random.choice(df_given_label['id'], replace=False, size=nb_elements)

fig = plt.figure(figsize=(4, 4))
grid = AxesGrid(fig, 111, # similar to subplot(141)
                nrows_ncols = (nb_rows, nb_cols),
                axes_pad = 0.05,
                label_mode = "1",
for i, image_id in enumerate(subsample):
    filename = 'data/imgs/{}'.format(image_id)
    image = imread(filename)
    im = grid[i].imshow(image, cmap='Greys', interpolation='nearest')
/Users/kegl/anaconda/lib/python2.7/site-packages/matplotlib/ UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "

All images have size 28 $\times$ 28.

In [43]:
(28, 28)
In [47]:
n_subsample = 1000
shapes = np.empty((n_subsample, 2))
for i, image_id in enumerate(X_df[:n_subsample]):
    filename = 'data/imgs/{}'.format(image_id)
    image = imread(filename)
    shapes[i] = image.shape
In [48]:
shapes_df = pd.DataFrame(shapes, columns=['height', 'width'])
shapes_df['count'] = 0
shapes_df.groupby(['height', 'width']).count().sort_values('count', ascending=False)
height width
28.0 28.0 1000
In [49]:
<matplotlib.axes._subplots.AxesSubplot at 0x115c1cdd0>

Image preprocessing

In the first workflow element you can resize, crop, or rotate the images. This is an important step. Neural nets need standard-size images defined by the dimension of the input layer. MNIST images are centered and resized, so these operations are unlikely to be useful but rotation may help.

In [67]:
filename = 'data/imgs/{}'.format(X_df[161])
image = imread(filename)
plt.imshow(image, cmap='Greys', interpolation='nearest')

Here we resize the images to different resolutions, then blow them up so the difference can be visible.

In [66]:
from skimage.transform import resize

nb_rows = 1
nb_cols = 2
nb_elements = nb_rows * nb_cols

fig = plt.figure(figsize=(4, 4))
grid = AxesGrid(fig, 111, # similar to subplot(141)
                nrows_ncols = (nb_rows, nb_cols),
                axes_pad = 0.05,
                label_mode = "1",
    resize(resize(image, (16, 16)), (224, 224), order=0),
    cmap='Greys', interpolation='nearest')
    resize(resize(image, (8, 8)), (224, 224), order=0),
    cmap='Greys', interpolation='nearest')

Here we rotate the image. Explore options in skimage.

In [69]:
from skimage.transform import rotate

nb_rows = 1
nb_cols = 4
nb_elements = nb_rows * nb_cols

fig = plt.figure(figsize=(10, 10))
grid = AxesGrid(fig, 111, # similar to subplot(141)
                nrows_ncols = (nb_rows, nb_cols),
                axes_pad = 0.05,
                label_mode = "1",
grid[0].imshow(rotate(image, 30), cmap='Greys', interpolation='nearest')
grid[1].imshow(rotate(image, 45), cmap='Greys', interpolation='nearest')
grid[2].imshow(rotate(image, 60), cmap='Greys', interpolation='nearest')
grid[3].imshow(rotate(image, 75), cmap='Greys', interpolation='nearest')

All these tansformations should be implemented in the transform function found in the image_preprocessor workflow element that you will submit.

The pipeline

For submitting at the RAMP site, you will have to write two classes, saved in two different files:

  • the function image_preprocessor, which will be used to transform the images.
  • a class BatchClassifier that will handle the training and prediction using sample generators.

Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

Image preprocessor

MNIST contains well-centered and aligned images so we do nothing in the image preprocessor.

In [70]:
%%file submissions/starting_kit/
def transform(x):
    return x
Overwriting submissions/starting_kit/

The batch classifier

Your convnet models will be implemented in the workflow element. The model is usually implemented in a separate build_model function which is called from init. It does not do any computation, just builds the network architecture. Your main task is to find the right architecture for this data set. Every parameter is "free": the size of the input layer (which should match the preprocessing you do in image_preprocessing.transform), the number and composition of the convolutional blocks, the size and number of filters in each layer, the number and size of the fully connected layers, the regularization that you apply (usually only on the fully connected layers), and the optimizer. In addition, in the fit function you shold decide the size of the minibatches and the number of epochs.

The starting kit is using keras. MNIST could be loaded into memory at once, but in general, image data sets are too big for this, so instead of a set of images, your function will receive a generator. The type of this generator is BatchGeneratorBuilder that you can find in rampwf.workflows.image_classifier. You will interact with it by calling its get_train_valid_generators function. This function will still not return data, since typically you would call the fit function of a keras neural net with it, which would again result in filling the memory too fast. Instead, it returns two other generators that you will pass to the fit_generator function of a keras model. Why two? The first generator will generate training images for keras, and the second generator generates validation images, which you can use to monitor the learning curves when developing the model. Normally you will not need validation samples in your submission (only when you develop your models outside of the RAMP server) unless you implement automatic early stopping based on validation accuracy.

The BatchGeneratorBuilder.get_train_valid_generators function expects two parameters:

  • valid_ratio: the ratio of the minibatch that should be used for validation, typically 0.1, and
  • batch_size: the number of images keras' fit_generator will get for every minibatch. The batch size is an important hyperparameter. Statistically, a larger batch size means more precise gradient estimates but computationally slower steps. More importantly, you will have to carefully set the batch size for staying within the memory, depending on the input image size, your network architecture, and your backend. In the starting kit we use 32 images per minibatch.
In [72]:
%%file submissions/starting_kit/
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Flatten
from keras.optimizers import SGD
from rampwf.workflows.image_classifier import get_nb_minibatches

class BatchClassifier(object):
    def __init__(self):
        self.model = self._build_model()

    def fit(self, gen_builder):
        batch_size = 32
        gen_train, gen_valid, nb_train, nb_valid =\
                batch_size=batch_size, valid_ratio=0.1)
            # Total number of steps (batches of samples) to yield from
            # generator before declaring one epoch finished and starting the
            # next epoch. It should typically be equal to the number of unique
            # samples of your dataset divided by the batch size.
            steps_per_epoch=get_nb_minibatches(nb_train, batch_size),
            # In parallel to training, a CPU process loads and preprocesses
            # data from disk and put it into a queue in the form of
            # mini-batches of size `batch_size`.`max_queue_size` controls the
            # maximum size of that queue. The size of the queue should be big
            # enough so that the training process (GPU) never
            # waits for data (the queue should be never be empty).
            # The CPU process loads chunks of 1024 images each time, and
            # 1024/batch_size mini-batches from that chunk are put into the
            # queue. Assuming training the model on those 1024/batch_size
            # mini-batches is slower than loading a single chunk of 1024
            # images, a good lower bound for `max_queue_size` would be
            # (1024/batch_size). if `batch_size` is 16, you can put
            # `max_queue_size` to 64.
            # WARNING : It is obligatory to set `workers` to 1.
            # This in principle controls the number of workers used
            # by keras to load mini-batches from disk to memory in parallel
            # to GPU training. But I don't like the way it works and their
            # code is not very commented/used, so I dont trust it that much
            # (we might have surprises).
            # The way it works in keras is by launching in parallel `workers`
            # threads or processes which will all use a copy of the generator
            # passed to `fit_generator`. So if nothing is done and `workers`
            # is set to some number > 1, the neural net will be trained with
            # repetitions of the same data, because the workers are independent
            # and they got through the same generator.
            # Hence it is necessary to introduce a shared lock between the the
            # processes so that they load different data, this can become a bit
            # complicated, so I choose to rather load exactly one chunk at a
            # time using 1 worker (so `workers` have to be equal to 1), but
            # do this single chunk loading in parallel with joblib.
            validation_steps=get_nb_minibatches(nb_valid, batch_size),

    def predict_proba(self, X):
        return self.model.predict(X)

    def _build_model(self):
        inp = Input((1, 28, 28))
        x = Flatten(name='flatten')(inp)
        x = Dense(100, activation='relu', name='fc1')(x)
        out = Dense(10, activation='softmax', name='predictions')(x)
        model = Model(inp, out)
        return model
Overwriting submissions/starting_kit/

Local testing (before submission)

It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.

First pip install ramp-workflow or install it from the github repo. Make sure that the python files and are in the submissions/starting_kit folder, and the data train.csv and test.csv are in data. If you haven't yet, downlad the images by executing python Then run


If it runs and print training and test errors on each fold, then you can submit the code.

In [73]:
Testing MNIST classification
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/starting_kit ...
Using TensorFlow backend.
Epoch 1/1
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
1080/1080 [==============================] - 36s - loss: 9.0336 - acc: 0.4248 - val_loss: 7.1904 - val_acc: 0.5393
Warning: model can't be pickled.
can't set attributes of built-in/extension type 'tuple'
CV fold 0
	train acc = 0.55
	valid acc = 0.55
	test acc = 0.55
	train nll = 14.53
	valid nll = 14.65
	test nll = 14.6
train acc = 0.55 ± 0.0
train nll = 14.53 ± 0.0
valid acc = 0.55 ± 0.0
valid nll = 14.65 ± 0.0
test acc = 0.55 ± 0.0
test nll = 14.6 ± 0.0

Submitting to

Once you found a good feature extractor and classifier, you can submit them to First, if it is your first time using RAMP, sign up, otherwise log in. Then find an open event on the particular problem, for example, the event MNIST for this RAMP. Sign up for the event. Both signups are controled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox and copy-paste (or upload) and from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., locally, and checking them with ramp_test_submission. The script prints mean cross-validation scores

train acc = 0.55 ± 0.0
train nll = 14.53 ± 0.0
valid acc = 0.55 ± 0.0
valid nll = 14.65 ± 0.0
test acc = 0.55 ± 0.0
test nll = 14.6 ± 0.0

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is balanced accuracy aka macro-averaged recall, so the line that is relevant in the output of ramp_test_submission is valid acc = 0.55 ± 0.0. When the score is good enough, you can submit it at the RAMP.

More information

You can find more information in the README of the ramp-workflow library.


Don't hesitate to contact us.