Emanuela Boros (LIMSI/CNRS), Balázs Kégl (LAL/CNRS), Roman Yurchak (Symerio)
This is an initiation project to introduce RAMP and help you get familiar with how it works.
The goal is to develop prediction models able to identify which news statements are fake.
The data we will manipulate comes from http://www.politifact.com. The input consists of short statements made by public figures (and sometimes anonymous bloggers), plus some metadata. The output is a truth level, judged by journalists at Politifact. They use six truth levels, which we coded into integers to obtain an ordinal regression problem:
0: 'Pants on Fire!'
1: 'False'
2: 'Mostly False'
3: 'Half-True'
4: 'Mostly True'
5: 'True'
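For reference, here is the same coding as a Python mapping (purely illustrative; the integer codes are what appear in the 'truth' column):
truth_levels = {
    0: 'Pants on Fire!',
    1: 'False',
    2: 'Mostly False',
    3: 'Half-True',
    4: 'Mostly True',
    5: 'True',
}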
Your goal is to classify each statement (+ metadata) into one of these categories.
Further, an nltk dataset needs to be downloaded:
python -m nltk.downloader popular
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename, sep='\t')
data = data.fillna('')
data['date'] = pd.to_datetime(data['date'])
data.head()
y_array = data['truth'].values
X_df = data.drop(columns='truth')
X_df.shape
y_array.shape
data.dtypes
data.describe()
data.count()
The original training data frame has 13000+ instances. In the starting kit, we give you a subset of 7569 instances for training and 2891 instances for testing.
Most columns are categorical; some have high cardinality.
print(np.unique(data['state']))
print(len(np.unique(data['state'])))
data.groupby('state').count()[['job']].sort_values(
    'job', ascending=False).reset_index().rename(
    columns={'job': 'count'}).plot.bar(
    x='state', y='count', figsize=(16, 10), fontsize=13);
print(np.unique(data['job']))
print(len(np.unique(data['job'])))
data.groupby('job').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
    x='job', y='count', figsize=(16, 10), fontsize=13);
If you want to use the journalists and the editors as input, you will need to split the lists, since there is sometimes more than one of them per instance.
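For instance, a minimal splitting sketch (assuming the names are comma-separated, which you should verify on the actual data):
# Split the multi-valued 'edited_by' field into a list of editor names
editors = data['edited_by'].str.split(',').apply(
    lambda names: [name.strip() for name in names])
editors.head()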
print(np.unique(data['edited_by']))
print(len(np.unique(data['edited_by'])))
data.groupby('edited_by').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().iloc[:60, :].plot.bar(
    x='edited_by', y='count', figsize=(16, 10), fontsize=10);
print(np.unique(data['researched_by']))
print(len(np.unique(data['researched_by'])))
data.groupby('researched_by').count()[['state']].sort_values(
    'state', ascending=False).reset_index().rename(
    columns={'state': 'count'}).iloc[:60, :].plot.bar(
    x='researched_by', y='count', figsize=(16, 10), fontsize=13);
There are 2000+ different sources.
print(np.unique(data['source']))
print(len(np.unique(data['source'])))
data.groupby('source').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().loc[:60].plot.bar(
    x='source', y='count', figsize=(16, 10), fontsize=13);
The goal is to predict the truthfulness of statements. Let us group the data according to the truth column:
data.groupby('truth').count()[['source']].reset_index().plot.bar(x='truth', y='source');
The pipeline for predicting the 'truth' level of each statement requires two main steps: first, transform the text of each statement into numerical features; second, train a classifier on those features.
The sample solution presented in this starting kit uses Term Frequency-Inverse Document Frequency (tf-idf). The Term Frequency (tf) is a count of how many times a word occurs in a given document (as in the bag-of-words model). The Inverse Document Frequency (idf) discounts a word according to how many documents of the corpus it occurs in. tf-idf is used to weight words according to how important they are: words that are used frequently in many documents (e.g., 'the', 'is', 'of') have less importance and thus get a lower weighting, while infrequent ones get a higher weighting.
Built-in scikit-learn classes will be used to implement tf-idf:
CountVectorizer converts a collection of text documents to a matrix of token (word) counts. This implementation produces a sparse representation of the counts to be passed to the TfidfTransformer.
TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation.
TfidfVectorizer performs both of these steps, and is therefore the class used in our sample solution.
See the scikit-learn documentation for a general introduction to text feature extraction.
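As a quick illustration of what the vectorizer produces, here is a sketch on a toy corpus (invented sentences, not the challenge data; assuming a recent scikit-learn with get_feature_names_out): words shared by all documents, like 'the', receive lower tf-idf weights than words unique to one document.
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    'the economy is growing',
    'the senator said our economy is shrinking',
]
toy_vectorizer = TfidfVectorizer(analyzer='word')
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, shape (2, n_words)
print(toy_vectorizer.get_feature_names_out())
print(toy_tfidf.toarray().round(2))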
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
preprocessor = make_column_transformer(
    (TfidfVectorizer(analyzer='word'), 'statement'),
    remainder='drop',  # drop all other columns
)
The scikit-learn RandomForestClassifier will be used in the sample solution.
We will use a scikit-learn pipeline, which chains together preprocessing and estimator steps, to perform all steps in the workflow. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with cross_val_score.
Note that the output of TfidfVectorizer is a sparse matrix.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
clf = RandomForestClassifier()
pipeline = make_pipeline(preprocessor, clf)
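The assembled pipeline can then be fitted and queried like any scikit-learn estimator. A quick sketch, run on the training frame itself just to show the API:
pipeline.fit(X_df, y_array)
# Class probabilities for the first few statements, columns ordered 0..5
print(pipeline.predict_proba(X_df.head()).round(2))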
Before we can evaluate our pipeline, we must first define the score metric. For this challenge, the official score is smoothed accuracy, which gives full credit to the true class and partial credit to the adjacent truth levels.
from sklearn.metrics import make_scorer
from sklearn.preprocessing import OneHotEncoder
def smooth_acc(y_true, y_proba):
    soft_score_matrix = np.array([
        [1, 0.8, 0, 0, 0, 0],
        [0.4, 1, 0.4, 0, 0, 0],
        [0, 0.4, 1, 0.4, 0, 0],
        [0, 0, 0.4, 1, 0.4, 0],
        [0, 0, 0, 0.4, 1, 0.4],
        [0, 0, 0, 0, 0.8, 1],
    ])
    y_true_proba = OneHotEncoder().fit_transform(np.expand_dims(y_true, axis=1))
    # Clip negative probabilities
    y_proba_positive = np.clip(y_proba, 0, 1)
    # Normalize rows
    y_proba_normalized = y_proba_positive / np.sum(
        y_proba_positive, axis=1, keepdims=True)
    # Smooth true probabilities with soft_score_matrix
    y_true_smoothed = y_true_proba.dot(soft_score_matrix)
    # Compute the dot product between the predicted probabilities and
    # the smoothed true "probabilities" ("" because rows do not sum to 1)
    scores = np.sum(y_proba_normalized * y_true_smoothed, axis=1)
    scores = np.nan_to_num(scores)
    score = np.mean(scores)
    # nan_to_num to pick up all-zero probabilities
    score = np.nan_to_num(score)
    return score
smooth_acc_score = make_scorer(smooth_acc, needs_proba=True)
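As a quick sanity check of the metric (toy arrays, chosen to cover all six classes so that OneHotEncoder sees every category): predictions one class away from the truth still receive partial credit from soft_score_matrix.
y_true_toy = np.arange(6)
# Each prediction puts all its mass one class away from the truth
y_proba_toy = np.eye(6)[[1, 0, 1, 2, 3, 4]]
print(smooth_acc(y_true_toy, y_proba_toy))  # mean of the off-diagonal credits, ~0.533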
Next, we have to define a special cross-validation function that splits each train/test iteration using the date the statement was made. For example, for the first iteration, the test set will consist of statements made within the first 1/8 of the time period, while the train set will consist of statements made within the remaining 7/8.
To save processing time, n_splits is set to 2 below.
from datetime import timedelta
def get_cv(X, n_splits=2):
    """Slice folds by equal date intervals."""
    date = pd.to_datetime(X['date'])
    n_days = (date.max() - date.min()).days
    fold_length = n_days // n_splits
    fold_dates = [date.min() + timedelta(days=i * fold_length)
                  for i in range(n_splits + 1)]
    for i in range(n_splits):
        test_is = (date >= fold_dates[i]) & (date < fold_dates[i + 1])
        test_is = test_is.values
        train_is = ~test_is
        yield np.arange(len(date))[train_is], np.arange(len(date))[test_is]
custom_cv = get_cv(X_df)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipeline, X_df, y_array, cv=custom_cv, scoring=smooth_acc_score
)
print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))
This sample solution is implemented in RAMP within estimator.py, which is in the folder submissions/starting_kit. estimator.py defines a function named get_estimator which returns the pipeline detailed above:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
def get_estimator():
    preprocessor = make_column_transformer(
        (TfidfVectorizer(analyzer='word'), 'statement'),
        remainder='drop',  # drop all other columns
    )
    clf = RandomForestClassifier()
    pipeline = make_pipeline(preprocessor, clf)
    return pipeline
There are a number of ways to improve the basic solution presented above.
The document preprocessing can be customized in the document_preprocessor function.
For instance, to transform accented unicode symbols into their plain counterparts (è -> e), the following function can be used:
import unicodedata
def document_preprocessor(doc):
    # Decompose accented characters and drop the non-ASCII combining marks
    doc = unicodedata.normalize('NFD', doc)
    doc = doc.encode('ascii', 'ignore')
    doc = doc.decode('utf-8')
    return str(doc)
See also the strip_accents option of TfidfVectorizer.
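Alternatively, a one-line sketch delegating accent stripping to the vectorizer itself:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', strip_accents='unicode')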
The most frequent words often do not carry much meaning. Examples: the, a, of, for, in, ....
Stop word removal can be enabled by passing the stop_words='english' parameter at the initialization of the TfidfVectorizer.
A custom list of stop words (e.g. from NLTK) can also be used, as in the sketch below.
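A sketch using the NLTK stop word list (assuming the NLTK data was downloaded with the command given earlier):
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk_stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(analyzer='word', stop_words=nltk_stop_words)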
By default, the bag-of-words model is used in the starting kit. To use word or character n-grams, the analyzer and ngram_range parameters of TfidfVectorizer should be changed, as in the sketch below.
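For instance, a character n-gram sketch (the range (2, 5) is purely illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer

# 'char_wb' builds character n-grams only from text inside word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))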
English words like look can be inflected with a morphological suffix to produce looks, looking, looked. They share the same stem look. Often (but not always) it is beneficial to map all inflected forms onto the stem. The most commonly used stemmer is the Porter stemmer, named after its developer, Martin Porter. Here, SnowballStemmer('english') from NLTK is used; this stemmer is called Snowball because Porter created a programming language with this name for developing new stemming algorithms.
Stemming can be enabled with a custom token_processor function, e.g.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

def token_processor(tokens):
    for token in tokens:
        yield stemmer.stem(token)
The document preprocessing and stemmer tokenization defined above can be added to the estimator.py submission like so (note one adaptation: scikit-learn passes the whole preprocessed document string to the tokenizer hook, so the token processor must tokenize the document before stemming each token):
import re
import unicodedata

from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
# Default scikit-learn token pattern: words of 2+ alphanumeric characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def _document_preprocessor(doc):
    doc = unicodedata.normalize('NFD', doc)
    doc = doc.encode('ascii', 'ignore')
    doc = doc.decode('utf-8')
    return str(doc)

def _token_processor(doc):
    # The `tokenizer` hook receives the whole (preprocessed) document
    # string, so tokenize first, then stem each token
    return [stemmer.stem(token) for token in token_pattern.findall(doc)]

def get_estimator():
    vectorizer = TfidfVectorizer(
        analyzer='word',
        preprocessor=_document_preprocessor,
        tokenizer=_token_processor,
    )
    preprocessor = make_column_transformer(
        (vectorizer, 'statement'),
        remainder='drop',  # drop all other columns
    )
    clf = RandomForestClassifier()
    pipeline = make_pipeline(preprocessor, clf)
    return pipeline
The above submission is a more complex version of the basic sample solution presented above (and within the submissions/starting_kit/estimator.py file) and can be used as a guide for your own submission.
Once you have a submission you are happy with, you must test your submission files locally before submitting them.
This can be done by first installing ramp-workflow (pip install ramp-workflow, or install it from the GitHub repo), then running ramp-test. For example, to test the sample solution, make sure that the python file estimator.py is in the submissions/starting_kit folder and that the data files train.csv and test.csv are in data, then run ramp-test, as sketched below. More details about testing RAMP submissions can be found here.
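A minimal command sketch (assuming you run it from the root of the starting kit; the --submission flag selects a folder under submissions/):
pip install ramp-workflow
ramp-test --submission starting_kit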
ramp-test performs exactly the same cross-validation as shown above with cross_val_score (except that n_splits = 8 for ramp-test). The scores from the 8 iterations will be printed to the terminal. Note that 3 different accuracy scores are calculated for this challenge, but the smooth accuracy ('sacc') is the official score.
Finally, you can submit to ramp.studio by following the online documentation.
Don't hesitate to contact us.