Pollinating insect classification (18 classes)
pollenating_insects_starting_kit

Paris Saclay Center for Data Science

RAMP on Pollinating insect classification

Mehdi Cherti (CNRS), Romain Julliard (MNHN), Gregoire Lois (MNHN), Balázs Kégl (CNRS)

Introduction

Pollinating insects play a fundamental role in the stability of ecosystems. An insect is called a pollinator when it transports pollen from one flower to another, helping the plants accomplish fertilization. The vast majority of plants rely on insects for pollination, and in turn these insects depend on plants for their survival. However, because of intensified agriculture, urbanisation, and climate change, these species are threatened. 35% of human food is based on plants pollinated by insects. The diversity of these insects also matters: the more diverse they are, the better the overall pollination service they provide.

The SPIPOLL (Suivi Photographique des Insectes POLLinisateurs) project proposes to quantitatively study pollinating insects in France. For this, they created a crowdsourcing platform where anyone can upload pictures of insects and identify their species through a series of questions. These data are then used by specialists for further analyses.

Data

In this RAMP, we propose a dataset of pictures of insects from different species, gathered from the SPIPOLL project and labeled by specialists. The dataset contains 21004 labeled pictures of insects coming from 18 different insect species. Each picture is a color image. The size of the images (number of pixels) varies.

The prediction task

The goal of this RAMP is to correctly classify the species of the insects. For each submission, you will have to provide an image preprocessor (to standardize, resize, crop, and augment images) and a batch classifier, which will fit a training set and predict the classes (species) on a test set. The images are big, so loading them all into memory at once is impossible. The batch classifier will therefore access them through a generator which can be "asked for" a certain number of training and validation images at a time. You will typically run one minibatch step of stochastic gradient descent on these images to train a deep convolutional neural network, the state of the art in image classification.
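To picture how generator-based access keeps memory usage bounded, here is a minimal sketch. The names `iterate_minibatches`, `load_image`, and `fake_loader` are hypothetical and are not part of the RAMP starting kit, which ships its own generator; the sketch only illustrates that at most `batch_size` images are held in memory at a time.

```python
import numpy as np


def iterate_minibatches(image_ids, labels, batch_size, load_image, rng=None):
    """Yield (images, labels) minibatches forever, reshuffling after each pass.

    Hypothetical helper, not the RAMP API: `load_image` is any callable
    that maps an image id to an array (for example, a file reader).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    image_ids = np.asarray(image_ids)
    labels = np.asarray(labels)
    while True:
        # New random order for every pass over the data.
        order = rng.permutation(len(image_ids))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            # Only the images of the current minibatch are loaded.
            images = [load_image(image_id) for image_id in image_ids[idx]]
            yield images, labels[idx]


def fake_loader(image_id):
    """Stand-in for reading an image from disk (assumption for the demo)."""
    return np.zeros((8, 8, 3), dtype=np.float32)


batches = iterate_minibatches(
    ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'], [0, 1, 0, 1, 2],
    batch_size=2, load_image=fake_loader)
first_images, first_labels = next(batches)
```

A training loop would call `next(batches)` repeatedly and run one gradient step per minibatch.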

Hints

First of all, even though 21K images is relatively small compared to industrial-scale data sets, achieving state-of-the-art performance requires big networks, which take a very long time (days) to train on a CPU. If you want a faster turnaround for tuning your net, you will need a GPU-equipped server or cloud instance. Setting up an AWS instance is easy: just follow this tutorial. If you want to have the starting kit preinstalled, use the community AMI "pollenating_insects_users".

Your main bottleneck is memory. For example, if you increase the resolution to 128x128, you will need to decrease the batch size. You should always run user_test_submission.py on the AWS node before submitting.
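As a rough back-of-the-envelope illustration of the resolution versus batch size trade-off, the sketch below computes the memory taken by the input batch alone, assuming float32 RGB images. The function name and the batch size of 32 are illustrative assumptions. Real usage is dominated by the network's activations, but the feature maps of convolutional layers scale with the same spatial factor.

```python
def batch_input_megabytes(batch_size, height, width, channels=3,
                          bytes_per_value=4):
    """Memory of one input batch in MiB (inputs only, float32 by default)."""
    n_bytes = batch_size * height * width * channels * bytes_per_value
    return n_bytes / 2 ** 20


mb_64 = batch_input_megabytes(32, 64, 64)
mb_128 = batch_input_megabytes(32, 128, 128)
# Doubling the side length quadruples the memory per image.
ratio = mb_128 / mb_64
```

Under these assumptions, going from 64x64 to 128x128 means dividing the batch size by about four to stay within the same memory budget.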

For learning the nuts and bolts of convolutional nets, we suggest that you follow Andrej Karpathy’s excellent course.

You have some trivial "classical" options to explore. You should set the number of epochs to something more than three (the value in the starting kit). You should check when the validation error curve flattens, because you will also be graded on training and test time. You can change the network architecture, apply different regularization techniques to control overfitting, or change the optimization options to control underfitting.
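One simple way to decide when the validation error curve has flattened is a patience rule: stop once the best validation error has not improved for a few consecutive epochs. The sketch below is only an illustration; the function name, the `patience` and `min_delta` parameters, and the error values are made up and are not part of the starting kit.

```python
def epochs_until_plateau(val_errors, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which a patience rule would stop.

    Training stops at the first epoch where the best validation error has
    not improved by more than `min_delta` for `patience` consecutive
    epochs. If that never happens, the total number of epochs is returned.
    """
    best = float('inf')
    stale = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best - min_delta:
            best = err
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_errors)


# Made-up validation errors, one per epoch.
errors = [0.60, 0.45, 0.38, 0.35, 0.35, 0.36, 0.35, 0.34]
stop_epoch = epochs_until_plateau(errors, patience=3)
```

A larger `patience` tolerates noisier curves at the cost of extra training time, which matters here because training time is part of the grading.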

You can use pretrained nets from here. There are a couple of examples in the starting kit. Your options are the following.

  • Retrain the weights or not. If you do not, you are using the pretrained net as a fixed feature extractor. You can add some layers on top of the output of the pretrained net, and only train your layers. If you retrain all the layers, you use the pretrained net as an initialization. Again, your goal is not only to increase accuracy but also to be frugal. Retraining the full net is obviously more expensive.
  • You can "read out" the activations from any layer. You do not need to keep the full net, not even the full convolutional stack.
  • The starting kit contains examples with the VGG16 net, but feel free to use any of the other popular nets. Just note that there is no way to change the architecture of these nets. In particular, each net expects images of a given dimension so your image preprocessing needs to resize or crop the images to the right size.

You can also adjust the image preprocessing. Resizing to a small size (64x64 or even 32x32) will make the training faster so you can explore more hyperparameters, but details will be lost, so your final result will probably be suboptimal. Insects are mostly centered in the images, but there are a lot of smaller insects which could be cropped for better performance. You can also rotate the images or apply other data augmentation tricks (google "convolutional nets data augmentation"). You should also look at the actual images to get inspiration for meaningful preprocessing ideas.
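The cropping and augmentation ideas above can be sketched with plain NumPy. The helper names `center_crop` and `random_flip_rotate` are hypothetical and the random array stands in for a real picture; in practice a resize step (for example with `skimage.transform.resize`) would follow the crop to match the input size the net expects.

```python
import numpy as np


def center_crop(image, crop_height, crop_width):
    """Cut a window of the requested size around the image center."""
    h, w = image.shape[:2]
    top = max((h - crop_height) // 2, 0)
    left = max((w - crop_width) // 2, 0)
    return image[top:top + crop_height, left:left + crop_width]


def random_flip_rotate(image, rng):
    """Apply a random horizontal flip and a random multiple-of-90 rotation."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    k = int(rng.integers(0, 4))
    return np.rot90(image, k=k, axes=(0, 1))


rng = np.random.default_rng(0)
# Random pixels standing in for a 100x120 RGB picture.
image = rng.integers(0, 256, size=(100, 120, 3), dtype=np.uint8)
cropped = center_crop(image, 64, 64)
augmented = random_flip_rotate(cropped, rng)
```

Flips and 90-degree rotations only permute pixels, so they are cheap and lose no detail, which makes them a reasonable first augmentation to try on pictures taken from arbitrary angles.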

In [12]:
import os
import numpy as np
import pandas as pd
from skimage.io import imread
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid
from matplotlib import cm

%matplotlib inline

pd.set_option('display.max_rows', 500)

The data

If the images are not yet in data/imgs, change the type of the next cell to "Code" and run it.

!python download_data.py
In [13]:
df = pd.read_csv('data/train.csv')
X_df = df['id']
y_df = df['class']
X = X_df.values
y = y_df.values

The class distribution is quite heavy-tailed: the largest class (the honey bee, Apis mellifera) contains 23% of the images, and the three largest classes together contain 42%.
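Shares like these can be computed directly from a label array. The sketch below uses a small made-up label vector rather than the real data, and the function name `cumulative_class_shares` is hypothetical; in the notebook, the array `y` loaded above could be passed instead.

```python
import numpy as np


def cumulative_class_shares(labels):
    """Return classes sorted by decreasing count and their cumulative share."""
    classes, counts = np.unique(labels, return_counts=True)
    # Stable sort on negated counts gives decreasing frequency order.
    order = np.argsort(-counts, kind='stable')
    shares = np.cumsum(counts[order]) / counts.sum()
    return classes[order], shares


# Toy labels (not the real dataset): class 3 is the largest.
toy_labels = np.array([3, 3, 3, 3, 0, 0, 0, 2, 2, 1])
classes, shares = cumulative_class_shares(toy_labels)
```

The first entry of `shares` is the share of the largest class, and the third entry is the combined share of the three largest classes.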

In [14]:
labels_counts_df = df.groupby('class').count()
labels_counts_df = labels_counts_df.rename(columns={'id': 'count'})
class_codes_df =  pd.read_csv('data/class_codes.csv', index_col='class')
labels_counts_df = pd.merge(
    class_codes_df, labels_counts_df, left_index=True, right_index=True)
labels_counts_df = labels_counts_df.sort_values('count', ascending=False)
labels_counts_df
Out[14]:
class  taxa_code  taxa_name                                           count
3      565        L'Abeille mellifère (Apis mellifera)                4824
0      970        Les Bourdons noirs à bande(s) jaune(s) et cul ...   2525
2      715        Le Syrphe ceinturé (Episyrphus balteatus)           1463
10     881        Les Oedemères verts (Oedemera)                      914
11     696        Les Mouches aux reflets métalliques (Neomyia, ...   804
16     996        Les Guêpes Polistes (Polistes)                      793
7      978        La Coccinelle à 7 points (Coccinella septempun...   680
5      835        Le Drap mortuaire (Oxythyrea funesta)               658
13     971        Les Bourdons noirs à bande(s) jaune(s) et cul ...   606
9      952        L'Araignée crabe Napoléon (Synema globosum)         584
12     687        Les Hélophiles (Helophilus, Parhelophilus)          455
4      682        L'Eristale des fleurs (Myathropa florea)            424
17     995        Les Guêpes Crabronidae difficiles à déterminer...   398
1      833        Les Dermestes (Dermestidae)                         382
8      654        Les Xylocopes (Xylocopa)                            376
14     1071       Les Cétoines métalliques à marques blanches (C...   373
6      759        Le Pentatome rayé (Graphosoma lineatum)             339
15     1061       Les Araignées crabes sombres (Xysticus, Coriar...   206
In [15]:
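# Python 3 strings are Unicode, so the accented taxa names need no
# encoding workaround (the Python 2 reload(sys) hack is not needed).
plt.figure(figsize=(10, 5))

x = np.arange(len(labels_counts_df))
plt.bar(x, labels_counts_df['count'])
# Bars are centered on x by default, so the tick labels go at x as well.
plt.xticks(x, labels_counts_df['taxa_name'], rotation=90, fontsize=10);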

It is worthwhile to look at some image panels, grouped by label.

In [20]:
nb_rows = 4
nb_cols = 4
nb_elements = nb_rows * nb_cols
# change the label here to see other classes
label = 1

print("{0}".format(labels_counts_df.loc[label]))

X_given_label = X[y==label]

subsample = np.random.choice(X_given_label, replace=False, size=nb_elements)

fig = plt.figure(figsize=(10, 10))
grid = AxesGrid(fig, 111,  # similar to subplot(111)
                nrows_ncols=(nb_rows, nb_cols),
                axes_pad=0.05,
                label_mode="1",
)
for i, image_id in enumerate(subsample):
    filename = 'data/imgs/{}'.format(image_id)
    image = imread(filename)
    im = grid[i].imshow(image/255.)
    grid[i].axis('off')
plt.tight_layout()