FAN revenue prediction challenge


Lucy Liu (CDS), Maria Teleczuk (CDS), Clément Chastagnol (Sidetrade),
Gael Varoquaux (Inria, Parietal), Alex Gramfort (Inria, Parietal), Guillaume Lemaitre (Scikit-learn @ Inria Foundation)

Predicting revenue using French Attribution Notices: RAMP studio challenge

Introduction

The objective of this challenge is to work with 'dirty' data. Most real-world data is dirty, and the availability of high-quality, open-source machine-learning and data-analysis frameworks (such as scikit-learn and pandas) means that the next frontier for tooling and automation lies in preprocessing. This challenge investigates methodologies for performing statistical analysis directly on the original, dirty data.

There are two datasets in this challenge:

  • company_revenue_TRAIN.csv - company revenue declarations.
  • award_notices_RAMP.csv - French Attribution Notices.

Aim

The predictive aim of this challenge is to use company_revenue_TRAIN.csv and award_notices_RAMP.csv to predict the Revenue for each entry in the 'company financial data' dataset. It is advised that you use both datasets, as it improves the prediction (see Score comparison), but using only the company_revenue_TRAIN.csv dataset is also allowed.

Data

Company financial data

This dataset was built from an extract of the National Institute of Statistics and Economic Studies (INSEE) reference database of company revenue declarations from 2013 to 2018. Each row represents the declaration of one company for one year and the following information is provided in the columns:

  • Legal_ID - the reconciled legal ID of the company
  • Name - the name of the company
  • Activity_code (APE) - 'Activité Principale de l'Entreprise', the main activity of the company - more information is available in English or in French
  • Address
  • Zipcode
  • City
  • Revenue - in Euros
  • Headcount
  • Fiscal_year_end_date
  • Fiscal_year_duration_in_months
  • Year

There are a few things to note:

  • there are revenue declarations for the same company but different years
  • there is a large reduction in entries for the years 2017 and 2018, due to the 'Loi Macron' in 2017
  • the 'same company' can have several different entities, resulting in entries where the Legal_ID differs but the Name, Address, City and Zipcode are all the same (see the snippet after this list)
  • Revenue can be negative. A negative revenue may be due to cancelled orders from the previous fiscal year that were recorded in the current fiscal year.
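
To illustrate the last two points, here is a minimal check; it assumes the training data has already been loaded into X_df and y_array, as shown in the Data exploration section below.

In [ ]:
# groups of rows sharing Name/Address/City/Zipcode but with more than one Legal_ID
key_cols = ['Name', 'Address', 'City', 'Zipcode']
n_ids_per_group = X_df.groupby(key_cols)['Legal_ID'].nunique()
print((n_ids_per_group > 1).sum(), "company 'identities' map to more than one Legal_ID")

# rows with a negative Revenue
print((y_array < 0).sum(), "training rows have a negative Revenue")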

Award notices

Every French public organisation has to issue a call for tenders when buying supplies or services (above a minimum threshold). These are called public procurement contracts. Companies then compete anonymously on these contracts and when a bid is awarded, a notice has to be legally published by the public organization on the BOAMP (historical data is hosted by the DILA). These are called French Attribution Notices (FAN). About 25% of awards are actually electronically published.

Each contract can be divided into a maximum of 5 lots, and the same company can win more than one lot of a contract. The award notices dataset comprises award notices from 2017 and 2018. Each row refers to one lot, and there can be up to 5 lots referring to the same contract. The following information is provided for each lot:

  • ID_call - ID of the award notice
  • Publication_date of the award notice
  • End_of_call_date of the award notice
  • Departments_of_publication - the department code(s) of the award notice
  • Department_of_provision - the department code(s) where the contract works/goods/services were provided
  • Call_summary - summary of the award notice
  • Call_title - title of the award notice
  • Complete_call_description - description of the award notice
  • Total_amount - total amount of the contract, from all lots, in euros
  • CPV_classes - Common Procurement Vocabulary (CPV), a classification system for public procurement used to describe the subject of procurement contracts (more information can be found here)

Columns providing details about the organisation issuing the contract:
  • Buyer_name
  • Buyer_address
  • Buyer_zipcode
  • Buyer_city
  • Buyer_email
  • Buyer_URL

  • Contract_awarded - whether or not the contract was awarded

Columns providing details about the winner of each lot:

  • ID - unique lot ID
  • awarded - whether or not the contract was awarded
  • description - description of the lot
  • incumbent_name - name of the lot winner
  • incumbent_address - address of the lot winner
  • incumbent_zipcode - zipcode of the lot winner
  • incumbent_city - city of the lot winner
  • incumbent_country - country of the lot winner
  • number_of_received_bids - number of bids this lot received
  • amount - amount of the lot in euros

Important notes

Both datasets are very dirty. There is a lot of missing data, and the column descriptions provided above are a guide only. Further, the award notices dataset is much smaller than the company revenue declarations dataset. Therefore, it is expected that many companies in the company revenue declarations dataset are not present in the award notices dataset.

Training and test

The company revenue training dataset has been split into 'training' and 'test' subsets. The shapes are:

  • training: (1 495 948, 11)
  • test: (520 966, 11)

Your model will be tested on a completely separate company revenue dataset stored on the RAMP server. This dataset has a shape of (702 181, 11). This test dataset will also be dirty but we can guarantee that the following columns will (only) be of numerical data type:

  • Legal_ID
  • Headcount
  • Fiscal_year_duration_in_months
  • Year

Score metric

A single, custom score is used:

$$\mathrm{score} = \left|\max\left(5,\ \log_{10}(\max(1, y\_true))\right) - \max\left(5,\ \log_{10}(\max(1, y\_pred))\right)\right|$$

Score interpretation:

  • A lower score is better.
  • Any y_true or y_pred value below 1 is clipped to 1.
  • If both y_true and y_pred are at most 100 000 (i.e. 10^5), the score is 0, because both clipped log values equal 5.
  • The score is symmetric: swapping y_true and y_pred in the equation gives the same value.
  • For a fixed absolute difference between y_true and y_pred, the score is larger when the values themselves are smaller (see the worked example below).
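
A small worked example of the score (the helper fan_score below is purely illustrative; the scoring function actually used in this notebook is defined in the Predictions section):

In [ ]:
import numpy as np

def fan_score(y_true, y_pred):
    # clip values below 1, take log10, then clip the log values below 5
    true = np.maximum(5., np.log10(np.maximum(1., y_true)))
    pred = np.maximum(5., np.log10(np.maximum(1., y_pred)))
    return np.abs(true - pred)

print(fan_score(5e4, 9e4))       # both at most 100 000 -> 0.0
print(fan_score(1e6, 1e7))       # one order of magnitude apart -> 1.0
print(fan_score(2e6, 1e6))       # difference of 1e6 at small magnitudes -> ~0.30
print(fan_score(101e6, 100e6))   # same difference at larger magnitudes -> ~0.004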

Data exploration

In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn

Company financial data

In [ ]:
from problem import get_train_data

X_df, y_array = get_train_data()
In [ ]:
X_df.head(3)
In [ ]:
X_df['Fiscal_year_end_date'] = pd.to_datetime(X_df['Fiscal_year_end_date'], format='%Y-%m-%d')
In [ ]:
X_df.shape
In [ ]:
# proportion of NaN values
X_df.isna().sum() / X_df.shape[0]
In [ ]:
# number of unique values
X_df.nunique()
In [ ]:
X_df.dtypes
In [ ]:
X_df.describe()

Distribution of Revenue

In [ ]:
logRev = np.log10(np.clip(y_array, a_min=1, a_max=None))
seaborn.kdeplot(logRev)

Award notices

This dataset is present in the data/ directory and is added by default to all submissions when submitting to RAMP studio. When testing your submission locally, however, it is important to copy this file into the directory of the submission you wish to test.

For example, the starting kit submission directory (submissions/starting_kit) contains a copy of this dataset, as it is required for RAMP to work locally. You will not need to upload this data file when you are making a submission. See the 'Submission structure' section below for more information.
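
For instance, before testing a new submission locally, you could copy the file into its folder first (a minimal sketch; 'my_submission' is just a placeholder folder name):

In [ ]:
import shutil

# copy the award notices file into the submission folder you want to test locally
shutil.copy('data/award_notices_RAMP.csv.zip', 'submissions/my_submission/')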

In [ ]:
award = pd.read_csv('data/award_notices_RAMP.csv.zip', compression='zip')
award.head(3)
In [ ]:
award.shape
In [ ]:
award.dtypes
In [ ]:
award.describe()
In [ ]:
# proportion of NA values
award.isna().sum() / award.shape[0]

Predictions

Company revenue only

First, let's predict using only the company revenue dataset (X_df):

  1. Create a transformer that deals with missing values:
In [ ]:
X_df.dtypes
In [ ]:
# list of numerical columns
num_cols = ['Legal_ID', 'Headcount', 'Fiscal_year_duration_in_months', 'Year']
In [ ]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median'))
])
  2. Create a transformer to split Fiscal_year_end_date into separate year, month and day columns:
In [ ]:
from sklearn.preprocessing import FunctionTransformer

def process_date(X):
    date = pd.to_datetime(X['Fiscal_year_end_date'], format='%Y-%m-%d')
    return np.c_[date.dt.year, date.dt.month, date.dt.day]

date_transformer = FunctionTransformer(process_date, validate=False)
  3. Process the Activity_code (APE) column. At the moment, the scikit-learn OrdinalEncoder() does not have a handle_unknown argument, which means it cannot handle values of Activity_code (APE) that appear in 'test' but not in 'train'. Here we simply work around this by keeping only the first 2 characters of the APE code, which are always numbers. These first 2 digits give the broad category the company's activities fall under (e.g. 'AGRICULTURE'). The column is then converted to a numeric data type, so missing values can be imputed with the median (using SimpleImputer()).
In [ ]:
def process_APE(X):
    APE = X['Activity_code (APE)'].str[:2]
    return pd.to_numeric(APE, errors='coerce').values[:, np.newaxis]

APE_transformer = FunctionTransformer(process_APE, validate=False)
  4. Only keep the numbers in the Zipcode column:
In [ ]:
def zipcodes(X):
    zipcode_nums = pd.to_numeric(X['Zipcode'], errors='coerce')
    return zipcode_nums.values[:, np.newaxis]

zipcode_transformer = FunctionTransformer(zipcodes, validate=False)

Finally the full pipeline is assembled.

  • for missing values in numerical columns, the 'median' is used.
  • the date column Fiscal_year_end_date is transformed into separate year, month and day columns.
  • Activity_code (APE) is dealt with as described above.
  • the columns Name, Address and City are all dropped.
In [ ]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

zipcode_col = ['Zipcode']
date_cols = ['Fiscal_year_end_date']
drop_cols = ['Name', 'Address', 'City']
APE_col = ['Activity_code (APE)']


preprocessor_comp = ColumnTransformer(
    transformers=[
        ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
        ('num', numeric_transformer, num_cols),
        ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
        ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
        ('drop cols', 'drop', drop_cols),
    ])
In [ ]:
# make sure it works
preprocessor_comp.fit_transform(X_df)

We will use a Random Forest Regressor model:

In [ ]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=10)

Before we can test our model, we need to define the custom scoring function of this challenge:

In [ ]:
from sklearn.metrics import make_scorer

def loss(y_true, y_pred):
    
    if isinstance(y_true, pd.Series):
        y_true = y_true.values

    true = np.maximum(5., np.log10(np.maximum(1., y_true)))
    pred = np.maximum(5., np.log10(np.maximum(1., y_pred)))
    
    loss = np.mean(np.abs(true - pred))
    
    return loss
    
fan_loss = make_scorer(loss, greater_is_better=False)

Now we can test our model. Note that we use GroupShuffleSplit with Legal_ID as the group, so that a given company ('Legal_ID') only ever appears in 'train' or in 'test', never in both.

This mirrors the conditions of the challenge: the private 'test' data (on RAMP) does not contain any company that also appears in the public 'train' dataset you have access to. This matters because the Revenue of a given company is often very similar from year to year. Also note that, because fan_loss was built with greater_is_better=False, cross_val_score returns negated losses, hence the leading minus sign below.

In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GroupShuffleSplit

clf = Pipeline(steps=[
    ('preprocessing', preprocessor_comp),
    ('classifier', regressor)])

cv = GroupShuffleSplit(n_splits=8, test_size=0.25)

scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=fan_loss, groups=X_df['Legal_ID'], n_jobs=2)

print("mean: %e (+/- %e)" % (scores_Xdf.mean(), scores_Xdf.std()))

Naive merge

Now let us predict using a naive merge of the award and company revenue datasets.

The naive merge will only use the name of the company. To aid the merging we will convert the name to all lower case and remove punctuation and white space.

In [ ]:
award['Name_processed'] = award['incumbent_name'].str.lower()
award['Name_processed'] = award['Name_processed'].str.replace(r'[^\w]', '', regex=True)

For each company, extract 2 features:

  • the number of award lots won
  • the total sum of award lots won
In [ ]:
award_features = award.groupby(['Name_processed'])['amount'].agg(['count','sum'])
award_features.head(3)

Next, we perform a naive merge of X_df and award_features.

Be careful in this step to ensure that the order (and number) of rows of X_df is not changed; a quick check follows the cell below.

In [ ]:
def merge_naive(X):
    X['Name'] = X['Name'].str.lower()
    X['Name'] = X['Name'].str.replace(r'[^\w]', '', regex=True)
    df = pd.merge(X, award_features, left_on='Name', right_on='Name_processed', how='left')
    return df[['count','sum']]
merge_transformer = FunctionTransformer(merge_naive, validate=False)
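
As a quick, purely illustrative sanity check: award_features has exactly one row per processed name (it was built with a groupby), so the left merge should return exactly one output row per input row, preserving length and order.

In [ ]:
merged = merge_naive(X_df.copy())   # work on a copy so X_df itself is not modified
assert len(merged) == len(X_df)
merged.head(3)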

Implement the same feature transformation steps as above and include the 'merge' step.

In [ ]:
zipcode_col = ['Zipcode']
date_cols = ['Fiscal_year_end_date']
drop_cols = ['Address', 'City']
APE_col = ['Activity_code (APE)']
merge_col = ['Name']

preprocessor_merge = ColumnTransformer(
    transformers=[
        ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
        ('num', numeric_transformer, num_cols),
        ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
        ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
        ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='median')), merge_col),
        ('drop cols', 'drop', drop_cols),
    ])
In [ ]:
# check it works
preprocessor_merge.fit_transform(X_df)

This example is implemented in RAMP within the files in the folder submissions/starting_kit.

The transformation steps above are implemented in the file submissions/starting_kit/feature_extractor.py (a copy of this file is shown below). This file needs to define a class called FeatureExtractor, which requires a fit() and a transform() function. The fit() function takes both X_df and y_array as parameters, meaning that you are able to engineer new features using y_array (e.g. target encoding; a minimal sketch is given after the code listing below). The transform() function only takes X_df. We only use the transform() function in our simple example below.

Note that the award dataset is read in from the submission folder (submissions/starting_kit). This means that, when testing locally, each submission folder should contain a copy of the award dataset award_notices_RAMP.csv.zip.

In [ ]:
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline


class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = X_df

        path = os.path.dirname(__file__)
        award = pd.read_csv(os.path.join(path, 'award_notices_RAMP.csv.zip'),
                            compression='zip', low_memory=False)
        # obtain features from award
        award['Name_processed'] = award['incumbent_name'].str.lower()
        award['Name_processed'] = award['Name_processed'].str.replace(r'[^\w]', '', regex=True)
        award_features = award.groupby(['Name_processed'])['amount'].agg(['count','sum'])

        def zipcodes(X):
            zipcode_nums = pd.to_numeric(X['Zipcode'], errors='coerce')
            return zipcode_nums.values[:, np.newaxis]
        zipcode_transformer = FunctionTransformer(zipcodes, validate=False)

        numeric_transformer = Pipeline(steps=[
            ('impute', SimpleImputer(strategy='median'))])

        def process_date(X):
            date = pd.to_datetime(X['Fiscal_year_end_date'], format='%Y-%m-%d')
            return np.c_[date.dt.year, date.dt.month, date.dt.day]
        date_transformer = FunctionTransformer(process_date, validate=False)
        
        def process_APE(X):
            APE = X['Activity_code (APE)'].str[:2]
            return pd.to_numeric(APE, errors='coerce').values[:, np.newaxis]
        APE_transformer = FunctionTransformer(process_APE, validate=False)

        def merge_naive(X):
            X['Name'] = X['Name'].str.lower()
            X['Name'] = X['Name'].str.replace(r'[^\w]', '', regex=True)
            df = pd.merge(X, award_features, left_on='Name', 
                          right_on='Name_processed', how='left')
            return df[['count','sum']]
        merge_transformer = FunctionTransformer(merge_naive, validate=False)

        num_cols = ['Legal_ID', 'Headcount', 
                    'Fiscal_year_duration_in_months', 'Year']
        zipcode_col = ['Zipcode']
        date_cols = ['Fiscal_year_end_date']
        APE_col = ['Activity_code (APE)']
        merge_col = ['Name']
        drop_cols = ['Address', 'City']

        preprocessor = ColumnTransformer(
            transformers=[
                ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
                ('num', numeric_transformer, num_cols),
                ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
                ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
                ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='median')), merge_col),
                ('drop cols', 'drop', drop_cols),
                ])

        X_array = preprocessor.fit_transform(X_encoded)
        return X_array
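
As mentioned above, fit() also receives y_array, so target-based features can be learnt from the training data. Below is a minimal, hypothetical sketch of target encoding the 2-digit APE prefix (not part of the starting kit); in a real FeatureExtractor the resulting column would be concatenated with the other preprocessed features.

In [ ]:
class TargetEncodingExtractor(object):
    """Illustrative sketch: learn the median Revenue per 2-digit APE prefix."""

    def fit(self, X_df, y_array):
        ape = X_df['Activity_code (APE)'].str[:2]
        self.global_median_ = np.median(y_array)
        self.ape_median_ = pd.Series(y_array, index=X_df.index).groupby(ape).median()
        return self

    def transform(self, X_df):
        ape = X_df['Activity_code (APE)'].str[:2]
        # unseen or missing APE prefixes fall back to the global median
        encoded = ape.map(self.ape_median_).fillna(self.global_median_)
        return encoded.values[:, np.newaxis]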

We will use a Random Forest Regressor model again:

In [ ]:
regressor = RandomForestRegressor(n_estimators=10)

This is implemented in the file submissions/starting_kit/regressor.py (a copy of this file is shown below). The Regressor class must have a fit() and a predict() function. If you are using a scikit-learn estimator, this can be done by simply calling its fit() and predict() methods, as seen below.

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator


class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = RandomForestRegressor(n_estimators=5)

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)

We can test our model within this notebook, using preprocessor_merge and regressor defined above, in a Pipeline().

Once you are happy with a solution, you can transfer your solution to feature_extractor.py and regressor.py files and test your submission using RAMP (see 'Submissions').

In [ ]:
clf = Pipeline(steps=[
    ('preprocessing', preprocessor_merge),
    ('classifier', regressor)])

cv = GroupShuffleSplit(n_splits=8, test_size=0.25)

scores_merge = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=fan_loss, groups=X_df['Legal_ID'],
                               n_jobs=2)

print("mean: %e (+/- %e)" % (scores_merge.mean(), scores_merge.std()))

Score comparison

You can see in the boxplot of the scores below that using the merged data consistently results in better scores (smaller is better):

In [ ]:
scores = pd.DataFrame({'Xdf': scores_Xdf, 'merge': scores_merge})
scores.boxplot()

Record linkage

The problem of trying to match 'records' referring to the same entity across different data sources is called record linkage. We used a very naive way to match companies above, but there are much more sophisticated methods, and several Python packages implement them.
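
As a first, very simple step up from exact matching, here is a minimal sketch using the standard library's difflib (illustrative only; looping over every company this way would be far too slow on the full dataset):

In [ ]:
import difflib

award_names = award_features.index.tolist()

def closest_award_name(name, cutoff=0.9):
    # return the most similar processed award-notice name, or None if nothing is close enough
    matches = difflib.get_close_matches(name, award_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

some_name = X_df['Name'].astype(str).str.lower().str.replace(r'[^\w]', '', regex=True).iloc[0]
closest_award_name(some_name)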

Submission structure

Each of your submissions should be in its own folder within the submissions folder (e.g. submissions/my_submission). The submission directory should contain 3 files:

  • a copy of the award_notices_RAMP.csv.zip dataset. Note that you will not need to upload this file when making a submission in RAMP studio, as this is done automatically by RAMP studio.
  • feature_extractor.py - this should merge X_df with the award dataset and perform any feature transformations you wish
  • regressor.py - this should implement a regressor with fit() and predict() functions

See submissions/starting_kit for an example.

You can also test your submissions using RAMP-workflow before submitting:

Local testing (before submission)

It is important that you test your submission files before submitting them. For this we provide a unit test - ramp_test_submission.

First, ensure that ramp-workflow is installed (see the github repo for installation instructions).

Now you can use ramp_test_submission. This command will test on files in submissions/starting_kit by default. To specify testing on a different folder use the flag --submission. For example to run the test on submissions/solution1 use: ramp_test_submission --submission solution1.

If it runs and prints training and test errors on each fold, then you can submit the code.

For example, below we test the starting kit:

In [ ]:
!ramp_test_submission

Submitting to ramp.studio

Once you have found a good model, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then sign up for the 'FAN_revenue_prediction' challenge and the 'Saclay M2 Data Camp 2019/20' event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox. Here you can either edit and save the code files on the left hand side of the page or upload your feature_extractor.py and regressor.py files on the right hand side of the page, then give your submission a name and click 'submit'. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get an email, and your submission shows up on the public leaderboard. If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., locally then check that your submission works locally with ramp_test_submission. The script will print mean cross-validation scores if your submission works.

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is FAN error.

More information

You can find more information in the README of the ramp-workflow library.