Predicting revenue using French Attribution Notices: RAMP studio challenge
The objective of this challenge is to work with 'dirty data'. Most real data is dirty and the availability of high-quality, open-source ML and data analysis frameworks (such as scikit-learn, pandas...) means that the next frontier for tooling and automation lies in preprocessing. This challenge aims to investigate methodologies to perform statistical analysis directly on the original dirty data.
There are two datasets in this challenge:
- company_revenue_TRAIN.csv - company revenue declarations
- award_notices_RAMP.csv - French Attribution Notices

The predictive aim of this challenge is to use company_revenue_TRAIN.csv and award_notices_RAMP.csv to predict the Revenue for each entry in the 'company financial data' dataset. It is advised that you use both datasets, as this improves the prediction (see Score comparison), but using only the company_revenue_TRAIN.csv dataset is also allowed.
The company_revenue_TRAIN.csv dataset was built from an extract of the National Institute of Statistics and Economic Studies (INSEE) reference database of company revenue declarations from 2013 to 2018. Each row represents the declaration of one company for one year, and the following information is provided in the columns:
- Legal_ID - the reconciled legal ID of the company
- Name - the name of the company
- Activity_code (APE) - 'Activité Principale de l'Entreprise', the main activity of the company (more information in English or in French)
- Address
- Zipcode
- City
- Revenue - in euros
- Headcount
- Fiscal_year_end_date
- Fiscal_year_duration_in_months
- Year
There are a few things to note:
- In some cases the Legal_ID is different but the Name, Address, City and Zipcode are all the same.
- Revenue can be negative. A negative revenue may be due to cancelled orders from the previous fiscal year that were recorded in the current fiscal year.

Every French public organisation has to issue a call for tenders when buying supplies or services (above a minimum threshold). These are called public procurement contracts. Companies then compete anonymously for these contracts and, when a bid is awarded, a notice has to be legally published by the public organisation on the BOAMP (historical data is hosted by the DILA). These are called French Attribution Notices (FAN). About 25% of awards are actually electronically published.
Each contract can be divided into a maximum of 5 lots and the same company can win more than one lot of a contract. The award notices dataset comprises award notices from 2017 and 2018. Each row refers to one lot and there can be up to 5 lots referring to the same contract. The following information is provided for each lot:
- ID_call - ID of the award notice
- Publication_date - publication date of the award notice
- End_of_call_date - end-of-call date of the award notice
- Departments_of_publication - the department code(s) of the award notice
- Department_of_provision - the department code(s) where the contract works/goods/services were provided
- Call_summary - summary of the award notice
- Call_title - title of the award notice
- Complete_call_description - description of the award notice
- Total_amount - total amount of the contract, from all lots, in euros
- CPV_classes - Common Procurement Vocabulary (CPV), a classification system for public procurement used to describe the subject of procurement contracts (more information can be found here)

Columns providing details about the company issuing the contract:
- Buyer_name
- Buyer_address
- Buyer_zipcode
- Buyer_city
- Buyer_email
- Buyer_URL
- Contract_awarded - whether or not the contract was awarded

Columns providing details about the winner of each lot:
- ID - unique lot ID
- awarded - whether or not the contract was awarded
- description - description of the lot
- incumbent_name - name of the lot winner
- incumbent_address - address of the lot winner
- incumbent_zipcode - zipcode of the lot winner
- incumbent_city - city of the lot winner
- incumbent_country - country of the lot winner
- number_of_received_bids - number of bids this lot received
- amount - amount of the lot in euros

Important notes
Both data sets are very dirty. There is a lot of missing data and the column descriptions provided above are a guide only. Further, the award notices dataset is much smaller than the company revenue declarations dataset. Therefore, it is expected that many companies in the company revenue declarations dataset are not present in the award notices dataset.
The company revenue training dataset has been split into 'training' and 'test' subsets. The shapes are:
Your model will be tested on a completely separate company revenue dataset stored on the RAMP server. This dataset has a shape of (702 181, 11). This test dataset will also be dirty but we can guarantee that the following columns will (only) be of numerical data type:
- Legal_ID
- Headcount
- Fiscal_year_duration_in_months
- Year
A unique score is used:
$$ score = \left| \max(5, \log_{10}(\max(1, y\_true))) - \max(5, \log_{10}(\max(1, y\_pred))) \right| $$

Score interpretation:
- a y_true or y_pred value less than 1 is 'taken' as 1
- if both y_true and y_pred are less than 100 000, the score would be 0
- this is because log10(100 000) = 5, which is the floor enforced by the outer max(5, ·) applied to both y_true and y_pred in the equation
- if the absolute difference between the y_true and y_pred values is the same, the score is greater for smaller magnitudes of y_true and y_pred
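To make the metric concrete, here is a small, purely illustrative computation of the score for a few hand-picked values (this helper is not part of the starting kit):
import numpy as np

def fan_score(y_true, y_pred):
    # clip values below 1 euro to 1, take log10, then clip below 5 (i.e. 100 000 euros)
    true = np.maximum(5., np.log10(np.maximum(1., y_true)))
    pred = np.maximum(5., np.log10(np.maximum(1., y_pred)))
    return np.abs(true - pred)

print(fan_score(50_000., 90_000.))    # 0.0  -- both under 100 000, so both clipped to 5
print(fan_score(1e6, 1e7))            # 1.0  -- one order of magnitude off
print(fan_score(-500., 200_000.))     # ~0.30 -- negative revenue treated as 1, then clipped to 10**5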
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn
from problem import get_train_data
X_df, y_array = get_train_data()
X_df.head(3)
X_df['Fiscal_year_end_date'] = pd.to_datetime(X_df['Fiscal_year_end_date'], format='%Y-%m-%d')
X_df.shape
# proportion of NaN values
X_df.isna().sum() / X_df.shape[0]
# number of unique values
X_df.nunique()
X_df.dtypes
X_df.describe()
logRev = np.log10(np.clip(y_array, a_min=1, a_max=None))
seaborn.kdeplot(logRev)
The award notices dataset is present in the data/ directory and is added by default to all submissions when submitting to RAMP studio. When testing your submission locally, however, it is important to copy this file into the directory of the submission you wish to test.
For example, the starting kit submission directory (submissions/starting_kit) contains a copy of this dataset, as it is required for RAMP to work locally. You will not need to upload this data file when you are making a submission. See the 'Submissions' section below for more information.
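For instance, when preparing your own submission folder for local testing, you could copy the file with a couple of lines of Python (the folder name my_submission below is just a placeholder):
import os
import shutil

# illustrative only: copy the award notices data into your own submission folder for local testing
os.makedirs('submissions/my_submission', exist_ok=True)
shutil.copy('data/award_notices_RAMP.csv.zip', 'submissions/my_submission/')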
award = pd.read_csv('data/award_notices_RAMP.csv.zip', compression='zip')
award.head(3)
award.shape
award.dtypes
award.describe()
# proportion of NA values
award.isna().sum() / award.shape[0]
First, let's predict using only the comp (company revenue) dataset:
X_df.dtypes
# list of numerical columns
num_cols = ['Legal_ID', 'Headcount', 'Fiscal_year_duration_in_months', 'Year']
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median'))
])
We also define a transformer to split Fiscal_year_end_date into separate year, month and day columns:
from sklearn.preprocessing import FunctionTransformer

def process_date(X):
    date = pd.to_datetime(X['Fiscal_year_end_date'], format='%Y-%m-%d')
    return np.c_[date.dt.year, date.dt.month, date.dt.day]

date_transformer = FunctionTransformer(process_date, validate=False)
Next, we deal with the Activity_code (APE) column. At the moment the scikit-learn OrdinalEncoder() does not have a handle_unknown argument. This means that it would not be able to handle any values of Activity_code (APE) which appear in 'test' but not in 'train'. Here we simply get around this by using the first 2 characters of the APE code, which are always numbers. These give the broad category the company's activities fall under (e.g. 'AGRICULTURE'). The column is then converted to a numeric data type, so missing values can be dealt with by using the median value (with SimpleImputer()).
def process_APE(X):
    APE = X['Activity_code (APE)'].str[:2]
    return pd.to_numeric(APE, errors='coerce').values[:, np.newaxis]

APE_transformer = FunctionTransformer(process_APE, validate=False)
A similar numeric conversion is used for the Zipcode column:
def zipcodes(X):
    zipcode_nums = pd.to_numeric(X['Zipcode'], errors='coerce')
    return zipcode_nums.values[:, np.newaxis]

zipcode_transformer = FunctionTransformer(zipcodes, validate=False)
Finally, the full pipeline is assembled:
- Fiscal_year_end_date is transformed into separate year, month and day columns
- Activity_code (APE) is dealt with as described above
- Name, Address and City are all dropped

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

zipcode_col = ['Zipcode']
date_cols = ['Fiscal_year_end_date']
drop_cols = ['Name', 'Address', 'City']
APE_col = ['Activity_code (APE)']

preprocessor_comp = ColumnTransformer(
    transformers=[
        ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
        ('num', numeric_transformer, num_cols),
        ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
        ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
        ('drop cols', 'drop', drop_cols),
    ])
# make sure it works
preprocessor_comp.fit_transform(X_df)
We will use a Random Forest Regressor model:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10)
Before we can test our model, we need to define our unique scoring function:
from sklearn.metrics import make_scorer
def loss(y_true, y_pred):
    if isinstance(y_true, pd.Series):
        y_true = y_true.values
    true = np.maximum(5., np.log10(np.maximum(1., y_true)))
    pred = np.maximum(5., np.log10(np.maximum(1., y_pred)))
    return np.mean(np.abs(true - pred))
fan_loss = make_scorer(loss, greater_is_better=False)
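Because greater_is_better=False, the scorer returns the negated loss, which is why the cross-validation call below negates the scores back. A quick illustration with a trivial baseline (the DummyRegressor here is just an assumption for demonstration, not part of the starting kit):
from sklearn.dummy import DummyRegressor

# a trivial baseline that always predicts the median revenue (illustration only)
X_num = X_df[num_cols].fillna(0)
dummy = DummyRegressor(strategy='median').fit(X_num, y_array)

print(loss(y_array, dummy.predict(X_num)))   # the raw loss: positive, lower is better
print(fan_loss(dummy, X_num, y_array))       # the scorer value: the negated loss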
Now we can test our model. Note that we use GroupShuffleSplit with Legal_ID as the group, so that the same company ('Legal_ID') appears in either 'train' or 'test' but never in both.
This reflects the conditions of this challenge, where the private 'test' data (on RAMP) does not contain any company that also appears in the public 'train' dataset you have access to. This matters because Revenue for the same company is often very similar from one year to the next.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GroupShuffleSplit
clf = Pipeline(steps=[
    ('preprocessing', preprocessor_comp),
    ('classifier', regressor)])
cv = GroupShuffleSplit(n_splits=8, test_size=0.25)
scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=fan_loss, groups=X_df['Legal_ID'], n_jobs=2)
print("mean: %e (+/- %e)" % (scores_Xdf.mean(), scores_Xdf.std()))
Now let us predict using a naive merge of the award and comp datasets.
The naive merge will only use the name of the company. To aid the merging we will convert the name to all lower case and remove punctuation and white space.
award['Name_processed'] = award['incumbent_name'].str.lower()
award['Name_processed'] = award['Name_processed'].str.replace(r'[^\w]', '', regex=True)
For each company, we extract 2 features from the award notices: the number of lots won ('count') and the total amount won ('sum'):
award_features = award.groupby(['Name_processed'])['amount'].agg(['count','sum'])
award_features.head(3)
Next we perform a naive merge of X_df and award_features.
Be careful in this step to ensure that the row order of X_df is not changed, otherwise the features would no longer be aligned with the target.
def merge_naive(X):
    X['Name'] = X['Name'].str.lower()
    X['Name'] = X['Name'].str.replace(r'[^\w]', '', regex=True)
    df = pd.merge(X, award_features, left_on='Name', right_on='Name_processed', how='left')
    return df[['count', 'sum']]

merge_transformer = FunctionTransformer(merge_naive, validate=False)
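As a quick, purely illustrative sanity check (assuming the merge_naive function above), you can verify that the left merge returns exactly one row per input row, so the features stay aligned with the target:
# illustrative check: one output row per input row (award_features has unique keys,
# so the left merge cannot duplicate rows)
sample = X_df.head(1000).copy()   # copy, because merge_naive modifies the 'Name' column in place
assert len(merge_naive(sample)) == len(sample)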
We implement the same feature transformation steps as above and add the 'merge' step:
zipcode_col = ['Zipcode']
date_cols = ['Fiscal_year_end_date']
drop_cols = ['Address', 'City']
APE_col = ['Activity_code (APE)']
merge_col = ['Name']
preprocessor_merge = ColumnTransformer(
    transformers=[
        ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
        ('num', numeric_transformer, num_cols),
        ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
        ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
        ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='median')), merge_col),
        ('drop cols', 'drop', drop_cols),
    ])
# check it works
preprocessor_merge.fit_transform(X_df)
This example is implemented in RAMP within the files in the folder submissions/starting_kit.
The transformation steps above are implemented in the file submissions/starting_kit/feature_extractor.py (a copy of this file is shown below). This file needs to define a class called FeatureExtractor with a fit() and a transform() function. The fit() function takes both X_df and y_array as parameters, meaning that you are able to engineer new features using y_array (e.g. target encoding). The transform() function only takes X_df. We only use the transform() function in our simple example below.
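As an aside, here is a minimal sketch (not part of the starting kit; the class name and the choice of encoded column are purely illustrative) of how fit() could use y_array, for example to target-encode the 2-digit activity code:
import numpy as np
import pandas as pd

class TargetEncodingExtractor(object):
    """Illustrative only: encode the 2-digit APE prefix by its mean revenue in 'train'."""
    def fit(self, X_df, y_array):
        ape = X_df['Activity_code (APE)'].str[:2]
        self.ape_means_ = pd.Series(y_array, index=X_df.index).groupby(ape).mean()
        self.global_mean_ = float(np.mean(y_array))
        return self

    def transform(self, X_df):
        ape = X_df['Activity_code (APE)'].str[:2]
        # unseen or missing codes fall back to the global mean revenue
        encoded = ape.map(self.ape_means_).fillna(self.global_mean_)
        return encoded.values[:, np.newaxis]

# usage: TargetEncodingExtractor().fit(X_df, y_array).transform(X_df)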
Note that the award dataset is read in from the submission folder (submissions/starting_kit). This means that, when testing locally, each submission folder should contain a copy of the award dataset award_notices_RAMP.csv.
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = X_df
        path = os.path.dirname(__file__)
        # the award notices file is read from the submission directory itself
        award = pd.read_csv(os.path.join(path, 'award_notices_RAMP.csv.zip'),
                            compression='zip', low_memory=False)
        # obtain features from award
        award['Name_processed'] = award['incumbent_name'].str.lower()
        award['Name_processed'] = award['Name_processed'].str.replace(r'[^\w]', '', regex=True)
        award_features = award.groupby(['Name_processed'])['amount'].agg(['count', 'sum'])

        def zipcodes(X):
            zipcode_nums = pd.to_numeric(X['Zipcode'], errors='coerce')
            return zipcode_nums.values[:, np.newaxis]
        zipcode_transformer = FunctionTransformer(zipcodes, validate=False)

        numeric_transformer = Pipeline(steps=[
            ('impute', SimpleImputer(strategy='median'))])

        def process_date(X):
            date = pd.to_datetime(X['Fiscal_year_end_date'], format='%Y-%m-%d')
            return np.c_[date.dt.year, date.dt.month, date.dt.day]
        date_transformer = FunctionTransformer(process_date, validate=False)

        def process_APE(X):
            APE = X['Activity_code (APE)'].str[:2]
            return pd.to_numeric(APE, errors='coerce').values[:, np.newaxis]
        APE_transformer = FunctionTransformer(process_APE, validate=False)

        def merge_naive(X):
            X['Name'] = X['Name'].str.lower()
            X['Name'] = X['Name'].str.replace(r'[^\w]', '', regex=True)
            df = pd.merge(X, award_features, left_on='Name',
                          right_on='Name_processed', how='left')
            return df[['count', 'sum']]
        merge_transformer = FunctionTransformer(merge_naive, validate=False)

        num_cols = ['Legal_ID', 'Headcount',
                    'Fiscal_year_duration_in_months', 'Year']
        zipcode_col = ['Zipcode']
        date_cols = ['Fiscal_year_end_date']
        APE_col = ['Activity_code (APE)']
        merge_col = ['Name']
        drop_cols = ['Address', 'City']

        preprocessor = ColumnTransformer(
            transformers=[
                ('zipcode', make_pipeline(zipcode_transformer, SimpleImputer(strategy='median')), zipcode_col),
                ('num', numeric_transformer, num_cols),
                ('date', make_pipeline(date_transformer, SimpleImputer(strategy='median')), date_cols),
                ('APE', make_pipeline(APE_transformer, SimpleImputer(strategy='median')), APE_col),
                ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='median')), merge_col),
                ('drop cols', 'drop', drop_cols),
            ])

        X_array = preprocessor.fit_transform(X_encoded)
        return X_array
We will use a Random Forest Regressor model again:
regressor = RandomForestRegressor(n_estimators=10)
This is implemented in the file submissions/starting_kit/regressor.py (a copy of this file is shown below). The Regressor class must have a fit() and a predict() function. If you are using a scikit-learn estimator, this can be done by simply calling fit() and predict() on the underlying regressor, as seen below.
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
class Regressor(BaseEstimator):
    def __init__(self):
        self.reg = RandomForestRegressor(n_estimators=5)

    def fit(self, X, y):
        self.reg.fit(X, y)

    def predict(self, X):
        return self.reg.predict(X)
We can test our model within this notebook by combining the preprocessor_merge and regressor defined above in a Pipeline().
Once you are happy with a solution, you can transfer it to the feature_extractor.py and regressor.py files and test your submission using RAMP (see 'Submissions').
clf = Pipeline(steps=[
    ('preprocessing', preprocessor_merge),
    ('classifier', regressor)])
cv = GroupShuffleSplit(n_splits=8, test_size=0.25)
scores_merge = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=fan_loss, groups=X_df['Legal_ID'],
n_jobs=2)
print("mean: %e (+/- %e)" % (scores_merge.mean(), scores_merge.std()))
You can see in the plot of the scores below that using the merged data always results in better scores (smaller is better):
scores = pd.DataFrame({'Xdf': scores_Xdf, 'merge': scores_merge})
scores.boxplot()
The problem of trying to match 'records' referring to the same entity across different data sources is called record linkage. We used a very naive way to match companies above, but there are much more sophisticated methods, and dedicated Python packages implement them.
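For a flavour of fuzzier matching using only the Python standard library (a toy illustration with a made-up query string, not a substitute for a dedicated record-linkage package):
import difflib

# illustrative: find processed award-notice company names close to a query name
candidates = award_features.index.tolist()
difflib.get_close_matches('dupontbatiment', candidates, n=3, cutoff=0.8)  # 'dupontbatiment' is a placeholder query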
Each of your submissions should be in its own folder within the submissions folder (e.g. submissions/my_submission). The submission directory should contain 3 files:
- award_notices_RAMP.csv - the award notices dataset. Note that you will not need to upload this file when making a submission in RAMP studio, as this is done automatically by RAMP studio.
- feature_extractor.py - this should merge X_df with the award dataset and perform any feature transformation you wish
- regressor.py - this should implement a regressor with a fit() and a predict() function

See submissions/starting_kit for an example.
You can also test your submissions using RAMP-workflow before submitting.
It is important that you test your submission files before submitting them. For this we provide a unit test: ramp_test_submission.
First, ensure that ramp-workflow is installed (see the github repo for installation instructions).
Now you can use ramp_test_submission. This command tests the files in submissions/starting_kit by default. To test a different folder, use the --submission flag. For example, to run the test on submissions/solution1, use: ramp_test_submission --submission solution1.
If it runs and prints training and test errors on each fold, then you can submit the code.
For example, below we test the starting kit:
!ramp_test_submission
Once you have found a good model, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up, otherwise log in. Then sign up for the 'FAN_revenue_prediction' challenge and the 'Saclay M2 Data Camp 2019/20' event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.
Once your signup request is accepted, you can go to your sandbox. Here you can either edit and save the code files on the left-hand side of the page, or upload your feature_extractor.py and regressor.py files on the right-hand side of the page; then give your submission a name and click 'submit'. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get an email, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual way to work with RAMP is to explore solutions locally (add feature transformations, select models, perhaps do some AutoML/hyperopt, etc.), then check that your submission works locally with ramp_test_submission. The script will print the mean cross-validation scores if your submission works.
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the FAN error.
You can find more information in the README of the ramp-workflow library.