Emanuela Boros (LIMSI/CNRS), Balázs Kégl (LAL/CNRS), Roman Yurchak (Symerio)
This is an initiation project to introduce RAMP and help you get familiar with how it works.
The goal is to develop prediction models able to identify which news statements are fake.
The data we will manipulate comes from http://www.politifact.com. The input consists of short statements made by public figures (and sometimes anonymous bloggers), plus some metadata. The output is a truth level, judged by journalists at Politifact. They use six truth levels, which we coded into integers to obtain an ordinal regression problem:
0: 'Pants on Fire!'
1: 'False'
2: 'Mostly False'
3: 'Half-True'
4: 'Mostly True'
5: 'True'
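For reference, here is the same coding as a Python mapping (purely illustrative; the integer codes are what appear in the 'truth' column):
truth_levels = {
    0: 'Pants on Fire!',
    1: 'False',
    2: 'Mostly False',
    3: 'Half-True',
    4: 'Mostly True',
    5: 'True',
}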
Your goal is to classify each statement (+ metadata) into one of these categories.
Further, an nltk dataset needs to be downloaded:
python -m nltk.downloader popular
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename, sep='\t')
data = data.fillna('')
data['date'] = pd.to_datetime(data['date'])
data.head()
y_array = data['truth'].values
X_df = data.drop(columns='truth')
X_df.shape
y_array.shape
data.dtypes
data.describe()
data.count()
The original training data frame has 13000+ instances. In the starting kit, we give you a subset of 7569 instances for training and 2891 instances for testing.
Most columns are categorical; some have high cardinality.
print(np.unique(data['state']))
print(len(np.unique(data['state'])))
data.groupby('state').count()[['job']].sort_values(
    'job', ascending=False).reset_index().rename(
    columns={'job': 'count'}).plot.bar(
    x='state', y='count', figsize=(16, 10), fontsize=13);
print(np.unique(data['job']))
print(len(np.unique(data['job'])))
data.groupby('job').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
    x='job', y='count', figsize=(16, 10), fontsize=13);
If you want to use the journalists and the editors as input, you will need to split the lists, since there is sometimes more than one of them per instance.
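For instance, a minimal splitting sketch (assuming the names are comma-separated, which you should verify on the actual data):
# Split the multi-valued 'edited_by' field into a list of editor names
editors = data['edited_by'].str.split(',').apply(
    lambda names: [name.strip() for name in names])
editors.head()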
print(np.unique(data['edited_by']))
print(len(np.unique(data['edited_by'])))
data.groupby('edited_by').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().iloc[:60, :].plot.bar(
    x='edited_by', y='count', figsize=(16, 10), fontsize=10);
print(np.unique(data['researched_by']))
print(len(np.unique(data['researched_by'])))
data.groupby('researched_by').count()[['state']].sort_values(
    'state', ascending=False).reset_index().rename(
    columns={'state': 'count'}).iloc[:60, :].plot.bar(
    x='researched_by', y='count', figsize=(16, 10), fontsize=13);
There are 2000+ different sources.
print(np.unique(data['source']))
print(len(np.unique(data['source'])))
data.groupby('source').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().loc[:60].plot.bar(
    x='source', y='count', figsize=(16, 10), fontsize=13);
The goal is to predict the truthfulness of statements. Let us group the data according to the truth column:
data.groupby('truth').count()[['source']].reset_index().plot.bar(x='truth', y='source');
The pipeline for predicting the 'truth' level of each statement requires two main steps: first, transform the text of each statement into numerical features; second, train a classifier on those features.
The sample solution presented in this starting kit uses Term Frequency-Inverse Document Frequency (tf-idf). The Term Frequency (tf) is a count of how many times a word occurs in a given document (as in the bag-of-words model). The Inverse Document Frequency (idf) discounts a word according to how many documents of the corpus it occurs in. tf-idf is used to weight words according to how important they are: words that are used frequently in many documents (e.g., 'the', 'is', 'of') have less importance and thus get a lower weighting, while infrequent ones get a higher weighting.
Built-in scikit-learn classes will be used to implement tf-idf:
CountVectorizer converts a collection of text documents to a matrix of token (word) counts. This implementation produces a sparse representation of the counts to be passed to the TfidfTransformer.
TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation.
TfidfVectorizer performs both of these steps, and is therefore the class used in our sample solution.
See the scikit-learn documentation for a general introduction to text feature extraction.
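As a quick illustration of what the vectorizer produces, here is a sketch on a toy corpus (invented sentences, not the challenge data; assuming a recent scikit-learn with get_feature_names_out): words shared by all documents, like 'the', receive lower tf-idf weights than words unique to one document.
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    'the economy is growing',
    'the senator said our economy is shrinking',
]
toy_vectorizer = TfidfVectorizer(analyzer='word')
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, shape (2, n_words)
print(toy_vectorizer.get_feature_names_out())
print(toy_tfidf.toarray().round(2))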
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
preprocessor = make_column_transformer(
    (TfidfVectorizer(analyzer='word'), 'statement'),
    remainder='drop',  # drop all other columns
)
The scikit-learn RandomForestClassifier will be used in the sample solution.
We will use a scikit-learn pipeline, which chains together preprocessing and estimator steps, to perform all steps in the workflow. This offers convenience and safety (it helps avoid leaking statistics from your test data into the trained model during cross-validation), and the whole pipeline can be evaluated with cross_val_score.
Note that the output of TfidfVectorizer is a sparse matrix.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
clf = RandomForestClassifier()
pipeline = make_pipeline(preprocessor, clf)
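The assembled pipeline can then be fitted and queried like any scikit-learn estimator. A quick sketch, run on the training frame itself just to show the API:
pipeline.fit(X_df, y_array)
# Class probabilities for the first few statements, columns ordered 0..5
print(pipeline.predict_proba(X_df.head()).round(2))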
Before we can evaluate our pipeline, we must first define the score metric. For this challenge, the official score is smoothed accuracy, which gives full credit to the true class and partial credit to the adjacent truth levels.
from sklearn.metrics import make_scorer
from sklearn.preprocessing import OneHotEncoder
def smooth_acc(y_true, y_proba):
    soft_score_matrix = np.array([
        [1, 0.8, 0, 0, 0, 0],
        [0.4, 1, 0.4, 0, 0, 0],
        [0, 0.4, 1, 0.4, 0, 0],
        [0, 0, 0.4, 1, 0.4, 0],
        [0, 0, 0, 0.4, 1, 0.4],
        [0, 0, 0, 0, 0.8, 1],
    ])
    y_true_proba = OneHotEncoder().fit_transform(np.expand_dims(y_true, axis=1))
    # Clip negative probabilities
    y_proba_positive = np.clip(y_proba, 0, 1)
    # Normalize rows
    y_proba_normalized = y_proba_positive / np.sum(
        y_proba_positive, axis=1, keepdims=True)
    # Smooth true probabilities with soft_score_matrix
    y_true_smoothed = y_true_proba.dot(soft_score_matrix)
    # Compute the dot product between the predicted probabilities and
    # the smoothed true "probabilities" ("" because rows do not sum to 1)
    scores = np.sum(y_proba_normalized * y_true_smoothed, axis=1)
    scores = np.nan_to_num(scores)
    score = np.mean(scores)
    # nan_to_num to pick up all-zero probabilities
    score = np.nan_to_num(score)
    return score
smooth_acc_score = make_scorer(smooth_acc, needs_proba=True)
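As a quick sanity check of the metric (toy arrays, chosen to cover all six classes so that OneHotEncoder sees every category): predictions one class away from the truth still receive partial credit from soft_score_matrix.
y_true_toy = np.arange(6)
# Each prediction puts all its mass one class away from the truth
y_proba_toy = np.eye(6)[[1, 0, 1, 2, 3, 4]]
print(smooth_acc(y_true_toy, y_proba_toy))  # mean of the off-diagonal credits, ~0.533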
Next, we have to define a special cross-validation function that splits each train/test iteration using the date the statement was made. For example, for the first iteration, the test set will consist of statements made within the first 1/8 of the time period, while the train set will consist of statements made within the remaining 7/8.
To save processing time, n_splits is set to 2 below.
from datetime import timedelta
def get_cv(X, n_splits=2):
    """Slice folds by equal date intervals."""
    date = pd.to_datetime(X['date'])
    n_days = (date.max() - date.min()).days
    fold_length = n_days // n_splits
    fold_dates = [date.min() + timedelta(days=i * fold_length)
                  for i in range(n_splits + 1)]
    for i in range(n_splits):
        test_is = (date >= fold_dates[i]) & (date < fold_dates[i + 1])
        test_is = test_is.values
        train_is = ~test_is
        yield np.arange(len(date))[train_is], np.arange(len(date))[test_is]
custom_cv = get_cv(X_df)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipeline, X_df, y_array, cv=custom_cv, scoring=smooth_acc_score
)
print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))
This sample solution is implemented in RAMP within estimator.py, which is in the folder submissions/starting_kit. estimator.py defines a function named get_estimator which returns the pipeline detailed above:
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
def get_estimator():
    preprocessor = make_column_transformer(
        (TfidfVectorizer(analyzer='word'), 'statement'),
        remainder='drop',  # drop all other columns
    )
    clf = RandomForestClassifier()
    pipeline = make_pipeline(preprocessor, clf)
    return pipeline
There are a number of ways to improve the basic solution presented above.
The document preprocessing can be customized in the document_preprocessor function.
For instance, to transform accented unicode symbols into their plain counterparts (è -> e), the following function can be used:
import unicodedata
def document_preprocessor(doc):
    # Decompose accented characters and drop the non-ASCII combining marks
    doc = unicodedata.normalize('NFD', doc)
    doc = doc.encode('ascii', 'ignore')
    doc = doc.decode('utf-8')
    return str(doc)
See also the strip_accents option of TfidfVectorizer.
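Alternatively, a one-line sketch delegating accent stripping to the vectorizer itself:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', strip_accents='unicode')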
The most frequent words often do not carry much meaning. Examples: the, a, of, for, in, ....
Stop word removal can be enabled by passing the stop_words='english' parameter at the initialization of the TfidfVectorizer.
A custom list of stop words (e.g. from NLTK) can also be used, as in the sketch below.
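A sketch using the NLTK stop word list (assuming the NLTK data was downloaded with the command given earlier):
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk_stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(analyzer='word', stop_words=nltk_stop_words)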
By default, the bag-of-words model is used in the starting kit. To use word or character n-grams, the analyzer and ngram_range parameters of TfidfVectorizer should be changed, as in the sketch below.
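For instance, a character n-gram sketch (the range (2, 5) is purely illustrative):
from sklearn.feature_extraction.text import TfidfVectorizer

# 'char_wb' builds character n-grams only from text inside word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))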
English words like look can be inflected with a morphological suffix to produce looks, looking, looked. They share the same stem look. Often (but not always) it is beneficial to map all inflected forms onto the stem. The most commonly used stemmer is the Porter stemmer, named after its developer, Martin Porter. Here, SnowballStemmer('english') from NLTK is used; this stemmer is called Snowball because Porter created a programming language with this name for developing new stemming algorithms.
Stemming can be enabled with a custom token_processor function, e.g.
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')

def token_processor(tokens):
    for token in tokens:
        yield stemmer.stem(token)
The document preprocessing and stemmer tokenization defined above can be added to the estimator.py submission like so (note one adaptation: scikit-learn passes the whole preprocessed document string to the tokenizer hook, so the token processor must tokenize the document before stemming each token):
import re
import unicodedata

from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
# Default scikit-learn token pattern: words of 2+ alphanumeric characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def _document_preprocessor(doc):
    doc = unicodedata.normalize('NFD', doc)
    doc = doc.encode('ascii', 'ignore')
    doc = doc.decode('utf-8')
    return str(doc)

def _token_processor(doc):
    # The `tokenizer` hook receives the whole (preprocessed) document
    # string, so tokenize first, then stem each token
    return [stemmer.stem(token) for token in token_pattern.findall(doc)]

def get_estimator():
    vectorizer = TfidfVectorizer(
        analyzer='word',
        preprocessor=_document_preprocessor,
        tokenizer=_token_processor,
    )
    preprocessor = make_column_transformer(
        (vectorizer, 'statement'),
        remainder='drop',  # drop all other columns
    )
    clf = RandomForestClassifier()
    pipeline = make_pipeline(preprocessor, clf)
    return pipeline
The above submission is a more complex version of the basic sample solution presented above (and within the submissions/starting_kit/estimator.py file) and can be used as a guide for your own submission.
Once you have a submission you are happy with, you must test your submission files locally before submitting them.
This can be done by first installing ramp-workflow (pip install ramp-workflow, or install it from the GitHub repo), then running ramp-test. For example, to test the sample solution, make sure that the python file estimator.py is in the submissions/starting_kit folder and that the data files train.csv and test.csv are in data, then run ramp-test, as sketched below. More details about testing RAMP submissions can be found here.
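A minimal command sketch (assuming you run it from the root of the starting kit; the --submission flag selects a folder under submissions/):
pip install ramp-workflow
ramp-test --submission starting_kit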
ramp-test performs exactly the same cross-validation as shown above with cross_val_score (except that n_splits = 8 for ramp-test). The scores from the 8 iterations will be printed to the terminal. Note that 3 different accuracy scores are calculated for this challenge, but the smooth accuracy ('sacc') is the official score.
Finally, you can submit to ramp.studio by following the online documentation.
Don't hesitate to contact us.