import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
Every year in France, just over 810,000 students in the fourth and final year of collège ("classe de troisième"), aged around 15, take the brevet des collèges, the diploma that completes the first cycle of secondary education. Since schooling is compulsory up to the age of 16 in France, virtually all teenagers take the brevet exams, making it a universal measure of the level of French students of that age. In 2019, 86.5% of the 813,200 candidates passed and obtained the national brevet diploma. Below is the evolution of the success rate in recent years:
tx = pd.read_csv("data/college/tx_succes.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.plot(tx.annee, tx.taux, '-o', color='orange')
plt.xlabel("Year")
plt.ylabel("Success rate (%)")
plt.title("Evolution of the national success rate in the 'brevet des collèges'", fontsize=14)
plt.ylim(50,100)
plt.show()
# Source : https://www.data.gouv.fr/fr/datasets/le-diplome-national-du-brevet-00000000/#_
However, this rate is not the same for everyone. There are two streams in the examination: the general stream, in which 90% of the students are enrolled, and the vocational stream, the so-called "filière professionnelle". For 2019, the pass rate was 87.8% in the general stream compared to 77.2% in the vocational stream. There is a similarly large gap between girls and boys, with a pass rate of 90% for girls compared to 83% for boys.
Examination outcomes are more finely divided into distinctions ("mentions"): très bien (very good), bien (good), assez bien (fairly good), passed without distinction, and failed.
The distribution of students obtaining each level is shown below:
mentions = pd.read_csv("data/college/tx_mentions.csv", sep=";", decimal=',')
plt.figure(figsize=(10,7))
plt.pie(mentions.taux,
        labels=mentions.mention,
        colors=['darkgreen', 'green', 'lightgreen', 'lightyellow', 'red'],
        autopct='%1.1f%%')
plt.title("Breakdown of students in the 2019 'brevet' according to their mention")
plt.axis('equal')
plt.show()
Education is a major issue and a central element of public policy in France. France spends more than 150 billion euros a year on education. Around 20% of this sum, i.e. 30 billion euros, is allocated to the first cycle of secondary education, in other words to the collège level.
In 2017, the French national education system counted 7,200 public and private collèges, and 79% of secondary school students were enrolled in a public school.
First of all, predicting the brevet pass rate per collège is important because it makes it possible to identify spatial inequalities across the territory and to answer questions such as whether there is a divide between urban and rural collèges. Most importantly, predicting the success rate will make it possible to identify the factors related to the success of classe de troisième pupils in the brevet des collèges and, more generally, in their schooling. This will enable those in charge of education policy to (re)allocate human and financial resources where these essential factors of success are weaker than elsewhere, thus helping the national education system achieve its pedagogical objectives in secondary education. For example, small classes could be set up in a targeted manner in the REP/REP+ zones (a kind of priority education network).
The aim of this challenge is to use the brevet success rate to identify factors associated with high-performing schools, so that investment can be directed towards collèges that lack these factors.
The social impact of this system of allocating teaching and educational resources could help to move closer to the republican principle of equal opportunities nationwide.
If this resource allocation program works well, we can expect to see the lowest-performing collèges catch up. In other words, collèges that are currently, or repeatedly year after year, below the national average should be expected to catch up with the national average.
To capture some of this "catch-up" effect, one can first look at the evolution of the variance of the distribution of brevet success rates. Furthermore, it is also possible to look at the evolution between two dates of the gap between the national average and the average of the group located below it: $$ \frac{\bar{y}^{1} - \bar{y}_{low}^{1}}{\bar{y}^{0} - \bar{y}_{low}^{0}}$$ where $\bar{y}^0$ is the national average success rate at time 0, $\bar{y}_{low}^{0}$ is the average success rate among the schools in the first quartile of the distribution (the 25% of schools with the lowest success rates), and similarly at time 1.
Thus, we expect this metric to be as small as possible, meaning that the bottom of the distribution has become more concentrated around the national average. Conversely, a value close to 1 would mean that nothing has changed between the two dates: even if the average of the lowest quartile has increased, this may only reflect a general upward trend.
Another possible indicator would be to do the same thing but with a fixed group of collèges. For example, at date 0 we identify a group of collèges that are in difficulty (i.e., they have a very low success rate compared to other collèges), and it is within this group that we calculate the low average $\bar{y}_{low}^{0}$. At date 1, the low average $\bar{y}_{low}^{1}$ is calculated not on a newly selected group of collèges but on the same group as at date 0. The metric thus becomes more precise, in the sense that it tracks the specific evolution of a targeted group of collèges.
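As an illustration, here is a minimal sketch of this metric, assuming the success rates at the two dates are stored in two pandas Series indexed by collège (hypothetical names rates_t0 and rates_t1); fixed_group=True implements the second variant:

def catch_up_ratio(rates_t0, rates_t1, fixed_group=False):
    """Gap between the national mean and the 'low' mean at date 1,
    divided by the same gap at date 0 (close to 0 = catch-up,
    close to 1 = no change)."""
    # 'low' group: first quartile of the distribution at date 0
    low_t0 = rates_t0[rates_t0 <= rates_t0.quantile(0.25)]
    if fixed_group:
        # second variant: follow the same group of collèges at date 1
        low_t1 = rates_t1.loc[low_t0.index.intersection(rates_t1.index)]
    else:
        # first variant: recompute the first quartile at date 1
        low_t1 = rates_t1[rates_t1 <= rates_t1.quantile(0.25)]
    return (rates_t1.mean() - low_t1.mean()) / (rates_t0.mean() - low_t0.mean())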
However, one must be aware that such a simplistic metric does not make it possible to assess the effectiveness of a public policy as a whole: on the one hand it uses only two dates, whereas the effects of a public policy must be examined over several years; on the other hand, identifying the real effects of a policy requires estimating a more advanced econometric model.
Aim: Predict the success rate per collège in 2017.
Scoring: "Root mean square error" defined as follows: $$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
In fact we will use the normalized RMSE, which is the RMSE divided by the standard deviation of the target variable $\sigma_y$: $$NRMSE = \frac{\sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}}{\sigma_y}$$ The standard deviation of the target variable can be seen as the RMSE of a model that always predicts the average value. Thus, dividing the classic RMSE by $\sigma_y$ gives a ratio which allows us to easily compare the performance of our model against this baseline.
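To see concretely why $\sigma_y$ is the RMSE of a model that always predicts the mean, here is a quick check on toy values:

# RMSE of a constant "predict the mean" model equals the standard deviation
y = np.array([70., 80., 85., 90., 95.])
rmse_mean_model = np.sqrt(np.mean((y - y.mean()) ** 2))
print(np.isclose(rmse_mean_model, np.std(y)))  # True (np.std uses ddof=0)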
import geopandas as gpd
import folium
pd.set_option('display.max_columns', 500)
pd.set_option("display.max_rows", 500)
import warnings
warnings.filterwarnings("ignore")
First of all, data related to the collèges and their exam pass rates are loaded.
data_college = pd.read_csv('./data/train.csv')
print('shape of the college table:', data_college.shape)
data_college.head()
Notes on the data:
Let's have a look at the main variables that we have:
Important note: this database contains no missing values, so we will not need to worry about imputation for these features later on.
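This can be verified directly with a one-line check:

# total number of missing values in the table (expected: 0)
data_college.isna().sum().sum()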
# number of unique values
data_college.nunique()
# Python type of each feature
data_college.dtypes
In order to add socio-economic context to our analysis, we use a second database, this time with city-level information.
cities_data = pd.read_csv("./submissions/starting_kit/external_data.csv", index_col=0)
print('shape of the cities table', cities_data.shape)
cities_data.head()
Let us briefly present the main variables of this second database.
# proportion of NaN values
cities_data.isna().sum() / cities_data.shape[0]
The variables in this database are quite well populated, except for the poverty rate, which is missing for more than 88% of the cities. We will later fill these missing values with the average value at the department level, so that no missing data remain.
# Python type of each feature
cities_data.dtypes
As mentioned earlier, the objective here is to arrive at an accurate estimate of a collège's pass rate in the brevet exam. The corresponding variable is therefore named 'target'.
First, let's look at the simple distribution of these success rates.
plt.figure(figsize=(10,7))
plt.hist(data_college.target, bins=40)
plt.axvline(x=data_college.target.mean(), color="orange", label='mean')
plt.axvline(x=data_college.target.quantile(.5), color="red", label='median')
plt.xlabel("Success rate (%)")
plt.ylabel("Count")
plt.title("Distribution of the brevet success rates", fontsize=14)
plt.legend(loc='best')
plt.show()
data_college.target.describe()
print("The variance of the success rates in 2019 is %.2f" %data_college.target.std())
In order to integrate the socio-economic data available in our second database, it is possible to join the two databases at the city level, thanks to their unique code.
# merge on the city code
data_college = pd.merge(data_college, cities_data,
                        left_on='Commune et arrondissement code', right_on='insee_code', how='left')
# fill NaN by taking the average value at the department level
city_col_with_na = []
for col in cities_data.columns:
    if cities_data[col].isna().sum() > 0:
        city_col_with_na.append(col)
# print(city_col_with_na)
for col in city_col_with_na:
    data_college[col] = data_college[['Département code', col]].groupby('Département code').transform(
        lambda x: x.fillna(x.mean()))
data_college.head()
plt.figure(figsize=(10,7))
plt.scatter(data_college['unemployment_rate'], data_college['target'])
plt.xlabel("Average unemployment rate in the departement")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the unemployment rate in the departement", fontsize=14)
plt.ylim(50,100)
plt.show()
At first glance, the unemployment rate in the city appears to be negatively associated with the exam pass rate. This supports the idea that socio-economic characteristics of the environment in which the school is located may be crucial explanatory factors.
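To quantify this association, a quick check is to compute the Pearson correlation between the two columns of the merged table:

# Pearson correlation between the unemployment rate and the success rate
data_college[['unemployment_rate', 'target']].corr().iloc[0, 1]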
Let's look at the distribution of the success rate by department.
# Create a dynamic map
def plot_per_department(column, data_college,
                        cmap='OrRd_r',
                        path_geo_data='./data/donnees_geographiques/fichiers_geopandas/'):
    '''
    Return an interactive map of France, colored by department according
    to the average value of the feature "column" (for example the target).

    Parameters:
        column (string): the name of the column to plot per department
        data_college: the data frame with the data per college
        path_geo_data: path to a geojson file of France
    '''
    dep_df = data_college.groupby('Département code').agg({column: 'mean'}).reset_index()
    # rename the department code because it is the key of the geojson file
    dep_df.columns = ['code', column]
    m = folium.Map(location=[46.45, 1], zoom_start=6)  # map centered on France
    # Add the choropleth layer
    folium.Choropleth(
        geo_data=path_geo_data + 'departements.geojson.txt',  # geoJSON file or url to geojson
        name='choropleth',
        data=dep_df,  # pandas dataframe
        columns=['code', column],  # key and value of interest from the dataframe
        key_on='feature.properties.code',  # key linking the json file and the dataframe
        fill_color=cmap,  # colormap
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=column
    ).add_to(m)
    display(m)
    return None
plot_per_department(column='target', data_college=data_college)
Now we can look in more detail, at the city level.
With the function below, you can choose a department and display an interactive map of its collèges. You can click on the little circles (each of them corresponds to a collège) to see some information about it:
# Create a dynamic map for the cities in each department
def plot_cities_in_dep(column, dep_code, dep_name,
                       data_college,
                       url_parent='https://france-geojson.gregoiredavid.fr/repo/departements/',
                       cmap='OrRd_r'):
    '''
    column (string): name of the column to plot
    dep_code (string): the code of the department to plot
    dep_name (string): the name of the department
    data_college: the data frame with data per college
    url_parent: the url of the France GEOJSON repository
    '''
    cities_df = data_college[data_college['Département code'] == dep_code]
    cities_df_group = cities_df.groupby('Commune et arrondissement code').agg({column: 'mean'}).reset_index()
    cities_df_group.columns = ['code', column]
    # url to the geojson of the department
    url = url_parent + dep_code + '-' + dep_name + '/communes-' + dep_code + '-' + dep_name + '.geojson'
    coords = gpd.read_file(url).loc[0].geometry.centroid.coords[0]  # coordinates where the map is centered
    m = folium.Map(location=[coords[1], coords[0]], zoom_start=10)  # map centered on the department
    # Add the choropleth layer
    folium.Choropleth(
        geo_data=url,  # url to the geojson of the department
        name='choropleth',
        data=cities_df_group,  # pandas dataframe
        columns=['code', column],  # key and value of interest from the dataframe
        key_on='feature.properties.code',  # key linking the json file and the dataframe
        fill_color=cmap,  # colormap
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=column
    ).add_to(m)
    # add the college markers
    for ix, row in cities_df.iterrows():
        # Create a popup tab with the college name and its success rate
        popup_df = pd.DataFrame(data=[['College', row['Name']],
                                      ['Success rate', str(row['target']) + '%'],
                                      ['City', row['Commune et arrondissement nom']],
                                      ['Department', row['Département nom']],
                                      ['Appartenance EP', row['Appartenance EP']]])
        popup_html = popup_df.to_html(classes='table table-striped table-hover table-condensed table-responsive',
                                      index=False, header=False)
        # Create a circle marker on the map for each college
        folium.CircleMarker(location=[row['Latitude'], row['Longitude']],
                            radius=2, popup=folium.Popup(popup_html),
                            color='red', fill_color='#0000FF').add_to(m)
    display(m)
    return None
plot_cities_in_dep(column='target',
dep_code='75',
dep_name='paris',
data_college=data_college)
The possible links between the following features and the target are worth investigating:
print('Mean of the median standard of living: %.5f' % cities_data.med_std_living.mean())
plot_per_department(column='med_std_living', data_college=data_college, cmap='Blues_r')
plot_cities_in_dep(column='med_std_living', dep_code='75', dep_name='paris',
data_college=data_college, cmap='Blues_r')
plot_cities_in_dep(column='med_std_living', dep_code='93', dep_name='seine-saint-denis',
data_college=data_college, cmap='Blues_r')
print('Average unemployment rate: %.2f%%' %(cities_data.unemployment_rate.mean()*100))
plot_per_department(column='unemployment_rate', data_college=data_college, cmap='Purples')
plot_cities_in_dep(column='unemployment_rate', dep_code='75', dep_name='paris',
data_college=data_college, cmap='Purples')
print('Average poverty rate: %.02f%%' %cities_data.poverty_rate.mean())
plot_per_department('poverty_rate', data_college, cmap='Greens')
plot_cities_in_dep('poverty_rate', dep_code='62',
dep_name='pas-de-calais', data_college=data_college, cmap="Greens")
fig, ax = plt.subplots(1,3, figsize=(20,5))
sns.scatterplot(x='med_std_living', y='target', data=data_college, ax=ax[0])
sns.scatterplot(x='unemployment_rate', y='target', color='purple',
data=data_college.dropna(subset=['unemployment_rate']), ax=ax[1])
sns.scatterplot(x='poverty_rate', y='target', color='green',
data=data_college.dropna(subset=['poverty_rate']), ax=ax[2])
plt.show()
The priority education policy aims to reduce the achievement gaps between pupils enrolled in priority education and those who are not. Two types of network have been defined: the REP+, which concern neighbourhoods or isolated sectors with the greatest concentration of social difficulties that strongly affect educational success, and the REPs, which are more socially mixed but still face greater social difficulties than collèges and schools outside priority education. Not every collège belongs to a priority education network; those that do not are labelled "HEP" (Hors éducation prioritaire, outside the priority education network).
ax = sns.boxplot(x="Appartenance EP", y="target", data=data_college)
ax.axes.set_title("Success rate according to the type of education network")
ax.set_xlabel("Type of education network")
ax.set_ylabel("Success rate");
It can easily be observed that the success rate is significantly lower when the collège is part of a priority education network (this is all the more the case for the REP+ network). However, no causal link can be drawn from this, since collèges are placed in a priority education system precisely when they have a certain amount of ground to make up in relation to other collèges.
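To put numbers on these gaps, one can simply compare summary statistics of the target by network type:

# summary statistics of the success rate for each type of education network
data_college.groupby('Appartenance EP')['target'].agg(['count', 'mean', 'median', 'std'])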
Alongside the priority education networks, there exists another label for collèges in difficulty: the établissements sensibles. These so-called "sensitive" schools are secondary schools in which a climate of insecurity prevails that seriously compromises pupils' schooling. They are not necessarily in priority education.
The development of violence in schools has led the Ministers of National Education and of the Interior to strengthen their collaboration. Since 1992, this collaboration has led to the classification of certain public secondary schools as "sensitive" schools, without implying, however, that violence is present only in these schools.
Like REPs, sensitive schools benefit from special measures. They are the subject of exceptional efforts in terms of innovative and adapted pedagogy, through a strengthened timetable (class splitting, tutoring, directed studies, etc.), the allocation of additional posts, a stronger presence of adults (more senior educational advisers, boarding school teachers, day school supervisors, etc.) and the appointment of two head teachers per class.
sns.catplot(x="Appartenance EP", y="target", hue="Etablissement sensible", kind="box", data=data_college)
plt.title('Success rate according to the type of education network');
Another determinant of academic success is the number of students per class. It is easy to understand that a smaller class makes it much easier for the teacher to spend time with each student, thus ensuring that all students progress without some being left behind. This is true at all levels of education, but mainly during the first years of schooling.
Unfortunately, the variable "number of students per class" is not present in our database. However, it is possible to create an "average number of pupils per class" variable from the variables "total number of pupils in the school" and "number of classes".
data_college['average_class_size'] = data_college['Nb élèves'] / data_college['Nb divisions']
plt.scatter(data_college['average_class_size'], data_college['target'])
plt.xlabel("Average class size")
plt.ylabel("Success rate (%)")
plt.title("Success rate according to the average class size", fontsize=14)
plt.ylim(50,100)
plt.show()
In this case, the effect of class size on the exam pass rate cannot be determined directly because the rest of the variables must be controlled for. However, it is interesting to keep such a variable, given its importance in the literature.
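As a minimal sketch of what "controlling for the other variables" means, one can regress the target on class size together with a few socio-economic controls and inspect the class-size coefficient (illustrative only, not a causal estimate; it assumes the columns created and merged above):

from sklearn.linear_model import LinearRegression

# quick check: class-size coefficient once a few controls are included
controls = ['average_class_size', 'med_std_living', 'unemployment_rate']
sub = data_college[controls + ['target']].dropna()
ols = LinearRegression().fit(sub[controls], sub['target'])
print(dict(zip(controls, ols.coef_)))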
In the same way, it is possible to create other new variables, such as the share of pupils who are in a general stream or the share of pupils who are in a European or international section.
# percentage of pupils in the general stream
data_college['percent_general_stream'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / data_college['Nb élèves']
# percentage of pupils in a European or international section
data_college['percent_euro_int_section'] = data_college['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / data_college['Nb élèves']
# percentage of pupils doing Latin or Greek
sum_global_5_to_3 = data_college['Nb 5èmes'] + data_college['Nb 4èmes générales'] + data_college['Nb 3èmes générales']
data_college['percent_latin_greek'] = data_college['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
# percentage of pupils that are in a SEGPA class
data_college['percent_segpa'] = data_college['Nb SEGPA'] / data_college['Nb élèves']
quant_features = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb élèves pratiquant langue rare',
'Nb 3ème SEGPA',
'average_class_size', 'percent_general_stream', 'percent_euro_int_section']
data_college[quant_features].describe()
data_college[quant_features].hist(figsize=(16, 20), bins = 50, xlabelsize=8, ylabelsize=8)
plt.show()
We can observe that most of our quantitative features have a roughly Gaussian, bell-shaped distribution.
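One quick numerical check of this claim is the skewness of each feature (values near 0 suggest a symmetric, bell-like shape):

# skewness of the quantitative features (0 for a perfectly symmetric distribution)
data_college[quant_features].skew()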
Finally, we can have a look at the correlation between our different quantitative features and our target.
features_for_corr = ['target', 'average_class_size',
'percent_general_stream', 'percent_euro_int_section',
'med_std_living', 'poverty_rate', 'unemployment_rate']
sns.heatmap(data_college[features_for_corr].corr(), cmap='YlGn')
plt.title('Analysis of correlations via heatmap');
The workflow is composed of two essential elements that make up the submission: the feature extractor and the regressor. The first prepares the initial data and creates new variables. The second trains a supervised learning model so that the success rate in the exam can be correctly predicted. This model is trained on part of the dataset produced by the feature extractor, then evaluated on the remaining part.
We will use a Random Forest Regressor model in order to predict the different success rates.
data_college = pd.read_csv('./data/train.csv')
y_array = data_college['target'].values
X_df = data_college.drop('target', axis=1)
cities_data = pd.read_csv("./data/donnees_geographiques/cities_data_filtered.csv", index_col=0)
keep_col_cities = ['population', 'SUPERF', 'med_std_living', 'poverty_rate', 'unemployment_rate']
def process_students(X):
    """Create new features linked to the pupils"""
    # average class size
    X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
    # percentage of pupils in the general stream
    X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
    # percentage of pupils in a European or international section
    X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
    # percentage of pupils doing Latin or Greek
    sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
    X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
    # percentage of pupils that are in a SEGPA class
    X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
    return np.c_[X['average_class_size'].values,
                 X['percent_general_stream'].values,
                 X['percent_euro_int_section'].values,
                 X['percent_latin_greek'].values,
                 X['percent_segpa'].values]
def merge_naive(X):
    # merge the two databases at the city level
    df = pd.merge(X, cities_data,
                  left_on='Commune et arrondissement code', right_on='insee_code', how='left')
    # fill NaN by taking the average value at the department level
    for col in keep_col_cities:
        if cities_data[col].isna().sum() > 0:
            df[col] = df[['Département code', col]].groupby('Département code').transform(
                lambda x: x.fillna(x.mean()))
    return df[keep_col_cities]
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# Transformers
students_col = ['Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA']
students_transformer = FunctionTransformer(process_students, validate=False)
num_cols = ['Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
"Nb 6èmes provenant d'une école EP"]
numeric_transformer = Pipeline(steps=[('scale', StandardScaler())])
cat_cols = ['Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
'Situation relative à une zone rurale ou autre']
categorical_transformer = Pipeline(steps=[('encode', OneHotEncoder(handle_unknown='ignore'))])
merge_col = ['Commune et arrondissement code', 'Département code']
merge_transformer = FunctionTransformer(merge_naive, validate=False)
drop_cols = ['Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
'Commune et arrondissement code', 'Commune et arrondissement nom',
'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
'Longitude', 'Latitude', 'Position']
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols),
        ('students', make_pipeline(students_transformer, SimpleImputer(strategy='mean'), StandardScaler()), students_col),
        ('merge', make_pipeline(merge_transformer, SimpleImputer(strategy='mean')), merge_col),
        ('drop cols', 'drop', drop_cols),
    ], remainder='drop')  # remainder='drop' or 'passthrough'
# check it works
preprocessor.fit_transform(X_df)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=5, max_depth=50, max_features=10)
from sklearn.metrics import make_scorer, mean_squared_error
def normalized_rmse(y_true, y_pred):
    """Normalized RMSE"""
    if isinstance(y_true, pd.Series):
        y_true = y_true.values
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    return rmse / np.std(y_true)
custom_loss = make_scorer(normalized_rmse, greater_is_better=False)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
clf = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('regressor', regressor)])
cv = ShuffleSplit(n_splits=5, test_size=0.25)
scores_Xdf = -cross_val_score(clf, X_df, y_array, cv=cv, scoring=custom_loss)
print("mean: %.2e (+/- %.2e)" % (scores_Xdf.mean(), scores_Xdf.std()))
To make a RAMP submission you will need to create a new directory within submissions, naming it as you wish, and a file named estimator.py within the new directory, e.g. submissions/my_new_sub/estimator.py. Within estimator.py, define a function named get_estimator that returns a scikit-learn pipeline or estimator that performs the desired feature extraction and regression. For example, the estimator.py file below will perform the workflow detailed above:
import os
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import BaseEstimator
def _process_students(X):
    """Create new features linked to the pupils"""
    # average class size
    X['average_class_size'] = X['Nb élèves'] / X['Nb divisions']
    # percentage of pupils in the general stream
    X['percent_general_stream'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales'] / X['Nb élèves']
    # percentage of pupils in a European or international section
    X['percent_euro_int_section'] = X['Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales'] / X['Nb élèves']
    # percentage of pupils doing Latin or Greek
    sum_global_5_to_3 = X['Nb 5èmes'] + X['Nb 4èmes générales'] + X['Nb 3èmes générales']
    X['percent_latin_greek'] = X['Nb 5èmes 4èmes et 3èmes générales Latin ou Grec'] / sum_global_5_to_3
    # percentage of pupils that are in a SEGPA class
    X['percent_segpa'] = X['Nb SEGPA'] / X['Nb élèves']
    return np.c_[X['average_class_size'].values,
                 X['percent_general_stream'].values,
                 X['percent_euro_int_section'].values,
                 X['percent_latin_greek'].values,
                 X['percent_segpa'].values]
def _merge_naive(X):
    # read the database with the city information
    filepath = os.path.join(
        os.path.dirname(__file__), 'external_data.csv'
    )
    cities_data = pd.read_csv(filepath)
    # merge the two databases at the city level
    df = pd.merge(
        X, cities_data, left_on='Commune et arrondissement code',
        right_on='insee_code', how='left'
    )
    keep_col_cities = [
        'population',
        'SUPERF',
        'med_std_living',
        'poverty_rate',
        'unemployment_rate'
    ]
    # fill NaN by taking the average value at the department level
    for col in keep_col_cities:
        if cities_data[col].isna().sum() > 0:
            df[col] = df[['Département code', col]].groupby('Département code').transform(
                lambda x: x.fillna(x.mean()))
    return df[keep_col_cities]
def get_estimator():
    students_col = [
        'Nb élèves', 'Nb divisions', 'Nb 6èmes 5èmes 4èmes et 3èmes générales',
        'Nb 6èmes 5èmes 4èmes et 3èmes générales sections européennes et internationales',
        'Nb 5èmes', 'Nb 4èmes générales', 'Nb 3èmes générales',
        'Nb 5èmes 4èmes et 3èmes générales Latin ou Grec', 'Nb SEGPA'
    ]
    num_cols = [
        'Nb élèves', 'Nb 3èmes générales', 'Nb 3èmes générales retardataires',
        "Nb 6èmes provenant d'une école EP"
    ]
    cat_cols = [
        'Appartenance EP', 'Etablissement sensible', 'CATAEU2010',
        'Situation relative à une zone rurale ou autre'
    ]
    merge_col = [
        'Commune et arrondissement code', 'Département code'
    ]
    drop_cols = [
        'Name', 'Coordonnée X', 'Coordonnée Y', 'Commune code', 'City_name',
        'Commune et arrondissement code', 'Commune et arrondissement nom',
        'Département nom', 'Académie nom', 'Région nom', 'Région 2016 nom',
        'Longitude', 'Latitude', 'Position'
    ]
    numeric_transformer = Pipeline(steps=[
        ('scale', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('encode', OneHotEncoder(handle_unknown='ignore'))
    ])
    students_transformer = FunctionTransformer(
        _process_students, validate=False
    )
    students_transformer = make_pipeline(
        students_transformer, SimpleImputer(strategy='mean'),
        StandardScaler()
    )
    merge_transformer = FunctionTransformer(_merge_naive, validate=False)
    merge_transformer = make_pipeline(
        merge_transformer, SimpleImputer(strategy='mean')
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, num_cols),
            ('cat', categorical_transformer, cat_cols),
            ('students', students_transformer, students_col),
            ('merge', merge_transformer, merge_col),
            ('drop cols', 'drop', drop_cols),
        ], remainder='passthrough')  # remainder='drop' or 'passthrough'
    regressor = RandomForestRegressor(
        n_estimators=5, max_depth=50, max_features=10
    )
    pipeline = Pipeline(steps=[
        ('preprocessing', preprocessor),
        ('regressor', regressor)
    ])
    return pipeline
To test and submit your code, you can refer to the online documentation.
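For example, assuming the ramp-workflow package is installed, a submission can typically be tested locally with the command ramp-test --submission my_new_sub (where my_new_sub is the hypothetical directory name used above).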