Fake news: classify statements of public figures
fake_news_starting_kit

Paris Saclay Center for Data Science¶

Fake news RAMP: classify statements of public figures¶

Emanuela Boros (LIMSI/CNRS), Balázs Kégl (LAL/CNRS)

Introduction¶

This is an introductory project to familiarize you with RAMP and how it works.

The goal is to develop prediction models able to identify which news is fake.

The data we will use comes from http://www.politifact.com. The input consists of short statements by public figures (and sometimes anonymous bloggers), plus some metadata. The output is a truth level, judged by journalists at Politifact. They use six truth levels, which we coded as integers to obtain an ordinal regression problem:

0: 'Pants on Fire!'
1: 'False'
2: 'Mostly False'
3: 'Half-True'
4: 'Mostly True'
5: 'True'

Your goal is to classify each statement (+ metadata) into one of the categories.
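The coding of the six truth levels can be written down as a small lookup table. The following is only a sketch: the names TRUTH_LEVELS, LABEL_TO_CODE, CODE_TO_LABEL and ordinal_distance are hypothetical helpers introduced here for illustration, and the starting kit data already contains the integer codes in the truth column.

```python
# Truth levels as listed above, ordered from least to most truthful.
TRUTH_LEVELS = [
    'Pants on Fire!',
    'False',
    'Mostly False',
    'Half-True',
    'Mostly True',
    'True',
]

# Hypothetical helper tables; the data set ships the integer codes directly.
LABEL_TO_CODE = {label: code for code, label in enumerate(TRUTH_LEVELS)}
CODE_TO_LABEL = dict(enumerate(TRUTH_LEVELS))


def ordinal_distance(code_a, code_b):
    """Number of truth levels separating two integer codes."""
    return abs(code_a - code_b)


print(LABEL_TO_CODE['Half-True'])
print(CODE_TO_LABEL[0])
print(ordinal_distance(LABEL_TO_CODE['True'], LABEL_TO_CODE['Mostly True']))
```

Because the codes are ordered, confusing 'True' with 'Mostly True' is a smaller error than confusing 'True' with 'Pants on Fire!', which is what makes the problem ordinal rather than purely categorical.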

Requirements¶

  • numpy>=1.10.0
  • matplotlib>=1.5.0
  • pandas>=0.19.0
  • scikit-learn>=0.17 (different syntaxes for v0.17 and v0.18)
  • seaborn>=0.7.1
  • nltk

Further, an nltk dataset needs to be downloaded:

python -m nltk.downloader popular
In [113]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Exploratory data analysis¶

Loading the data¶

In [114]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename, sep='\t')
data = data.fillna('')
data['date'] = pd.to_datetime(data['date'])
data
Out[114]:
date edited_by job researched_by source state statement subjects truth
0 2013-08-29 Angie Drobnic Holan Republican Jon Greenberg Scott Walker Wisconsin In the Wisconsin health insurance exchange, "t... ['Health Care'] 3
1 2013-08-29 Angie Drobnic Holan Republican Louis Jacobson Mike Huckabee Arkansas "America’s gun-related homicide rate … would b... ['Crime', 'Guns', 'Pundits'] 0
2 2013-08-29 Greg Borowski Tom Kertscher League of Conservation Voters Says U.S. Sen. Ron Johnson voted to let oil an... ['Climate Change', 'Energy', 'Environment', 'T... 5
3 2013-08-28 Aaron Sharockman Rochelle Koff National Republican Congressional Committee "Congressman Patrick Murphy voted to keep the ... ['Health Care'] 2
4 2013-08-28 Aaron Sharockman Angie Drobnic Holan Janet Napolitano The 2010 DREAM Act failed despite "strong bipa... ['Bipartisanship', 'Immigration'] 2
5 2013-08-28 W. Gardner Selby Republican Sue Owen Steve Stockman Texas Says U.N. arms treaty will mandate a "new inte... ['Guns'] 1
6 2013-08-28 Greg Borowski Democrat Dave Umhoefer Mark Harris Wisconsin "I’ve got the spending down, I’ve got the debt... ['County Budget', 'County Government', 'Debt',... 4
7 2013-08-28 Jim Denery, Jim Tharpe Democrat Eric Stirgus Doug Stoner Georgia "I fought hard for that (state Senate) seat. I... ['Campaign Finance', 'Candidate Biography'] 2
8 2013-08-27 Angie Drobnic Holan None Louis Jacobson Moms Demand Action for Gun Sense In America Indiana The book Little Red Riding Hood is something "... ['Education', 'Guns'] 2
9 2013-08-27 Angie Drobnic Holan Julie Kliegman BookerFail "Three years after getting the $100 million (f... ['Education'] 2
10 2013-08-27 John Bridges Republican W. Gardner Selby David Dewhurst Texas "I am every year the No.1 pick of all of the l... ['Candidate Biography', 'Crime', 'Criminal Jus... 1
11 2013-08-27 Angie Drobnic Holan Republican Amy Sherman Rene Garcia Florida Miami-Dade County is the "No. 1 donor county i... ['State Budget', 'Taxes'] 2
12 2013-08-27 Jim Denery, Jim Tharpe Republican Eric Stirgus Jack Kingston Georgia Says an illegal immigrant fraudulently claimed... ['Immigration', 'Taxes'] 4
13 2013-08-26 Angie Drobnic Holan Independent Louis Jacobson Michael Bloomberg New York "New York is the safest big city in the nation... ['Crime', 'Criminal Justice', 'Urban'] 4
14 2013-08-26 Warren Fiske Republican Sean Gorman Ken Cuccinelli Virginia Says Terry McAuliffe is "the person who invent... ['Campaign Finance', 'Ethics', 'History'] 1
15 2013-08-26 Jim Tharpe Janel Davis, Eric Stirgus Kasim Reed "State law says that once the state appraises ... ['City Government', 'Corrections and Updates',... 5
16 2013-08-25 Tom Curran Republican Bill Wichert Steve Lonegan New Jersey Says under Mayor Cory Booker, Newark has seen ... ['Crime'] 3
17 2013-08-25 Greg Borowski Republican Dave Umhoefer Alberta Darling Wisconsin Wisconsin’s criminal threshold for drunken-dri... ['Alcohol', 'Criminal Justice', 'State Budget'... 2
18 2013-08-25 Tim Murphy Activist C. Eugene Emery Jr. Steve Goreham Illinois "Global surface temperatures have been flat fo... ['Climate Change', 'Energy', 'Environment', 'G... 3
19 2013-08-24 Dee Lane Republican Ian K. Kullgren Doug Whitsett Oregon Says "your Legislative Assembly was within one... ['Guns'] 1
20 2013-08-23 Angie Drobnic Holan State official Amy Sherman Pam Stewart Florida Florida’s "high school graduation rates contin... ['Diversity', 'Education'] 3
21 2013-08-22 Angie Drobnic Holan Republican Katie Sanders Marco Rubio Florida Says President Barack Obama could "basically" ... ['Immigration'] 2
22 2013-08-22 Amy Hollyfield Republican Julie Kliegman Carly Fiorina California "There are only four countries in the world th... ['Abortion'] 3
23 2013-08-22 Greg Borowski Democrat Dave Umhoefer Peter Barca Wisconsin Under a new law regulating abortions, "Even if... ['Abortion', 'Health Care', 'Women'] 3
24 2013-08-21 John Bridges Republican Sue Owen Barry Smitherman Texas Says he "sued Obama’s EPA seven times." ['Energy', 'Environment'] 5
25 2013-08-21 Angie Drobnic Holan Becky Bowers Bloggers Obamacare provision will allow "forced home in... ['Civil Rights', 'Health Care', 'Privacy', 'Pu... 0
26 2013-08-21 Angie Drobnic Holan Republican Jon Greenberg Mike Lee Utah Says unions call Obamacare "bad for workers." ['Health Care', 'Unions'] 4
27 2013-08-21 Elizabeth Miniet, Jim Tharpe Janel Davis Karen Handel "Only in Washington would politicians spend $2... ['Federal Budget', 'Government Efficiency'] 2
28 2013-08-20 Angie Drobnic Holan Democrat Jon Greenberg Barack Obama Illinois "We now sell more products made in America to ... ['Corrections and Updates', 'Trade'] 5
29 2013-08-19 Aaron Sharockman Rochelle Koff Marijuana Policy Project Marijuana is "less toxic" than alcohol. ['Drugs', 'Health Care', 'Marijuana'] 4
... ... ... ... ... ... ... ... ... ...
7539 2007-09-14 Scott Montgomery Democrat Jody Kyle Chris Dodd Connecticut "We're spending $1.6 billion for all of Latin ... ['Foreign Policy'] 4
7540 2007-09-13 Scott Montgomery Democrat Lissa August John Edwards North Carolina Poor people go to a "payday lender...and they ... ['Poverty'] 5
7541 2007-09-12 Scott Montgomery Republican Wes Allison Fred Thompson Tennessee "I've cast a couple of 99-1 votes" and been th... ['Candidate Biography'] 5
7542 2007-09-12 Scott Montgomery Republican Ryan Kelly Sam Brownback Kansas "I'm pro-life. He's not." ['Abortion'] 1
7543 2007-09-11 Scott Montgomery Democrat Sasha Bartolf Hillary Clinton New York English "is our national language...if it beco... ['Immigration'] 1
7544 2007-09-11 Scott Montgomery Democrat Miranda Blue Dennis Kucinich Ohio "Our whole food system in this country...most ... ['Consumer Safety'] 3
7545 2007-09-10 Scott Montgomery Republican Bill Adair Duncan Hunter California Guantanamo detainees "get taxpayer-paid-for pr... ['Terrorism'] 5
7546 2007-09-10 Democrat Hillary Clinton New York \u201CThirty-four percent of Hispanics don\u20... [] 5
7547 2007-09-10 Neil Brown, Scott Montgomery Democrat Shirl Kennedy Hillary Clinton New York "About 40 percent of Hispanic homeowners have ... ['Economy'] 5
7548 2007-09-10 Scott Montgomery Democrat Lissa August Hillary Clinton New York "In New York, when she ran for reelection, she... ['Elections'] 5
7549 2007-09-10 Scott Montgomery Democrat Lissa August Hillary Clinton New York "In every single country, she had a majority o... ['Elections'] 4
7550 2007-09-10 Scott Montgomery Democrat David Baumann Bill Richardson New Mexico "Iowa, for good reason, for constitutional rea... ['Elections'] 0
7551 2007-09-09 Scott Montgomery Democrat Angie Drobnic Holan Joe Biden Delaware "Joe Biden is the only candidate with a plan t... ['Iraq'] 3
7552 2007-09-09 Scott Montgomery Democrat Sasha Bartolf, Miranda Blue, Philip Burrowes, ... Barack Obama Illinois "During his tenure in Washington and in the Il... ['Job Accomplishments'] 3
7553 2007-09-09 Scott Montgomery Republican Jody Kyle Ron Paul Texas "We've lost over 5,000 Americans over there in... ['Iraq'] 4
7554 2007-09-08 Scott Montgomery Democrat Sasha Bartolf, Miranda Blue, Philip Burrowes, ... Barack Obama Illinois "Sen. Obama worked on some of the deepest issu... ['Job Accomplishments'] 5
7555 2007-09-07 Bill Adair Democrat Angie Drobnic Holan Bill Richardson New Mexico "Congress only funded half the wall" between M... ['Immigration'] 4
7556 2007-09-07 Bill Adair Democrat Shirl Kennedy Dennis Kucinich Ohio "When NAFTA was passed, there was an accelerat... ['Immigration'] 3
7557 2007-09-07 Bill Adair Democrat Angie Drobnic Holan Chris Dodd Connecticut "John Kerry was at 4 percent in the polls in D... ['Elections'] 3
7558 2007-09-06 Bill Adair Republican Angie Drobnic Holan Sam Brownback Kansas In countries that allow gay marriage, the rate... ['Families', 'Gays and Lesbians'] 2
7559 2007-09-06 Bill Adair Democrat John Martin Dennis Kucinich Ohio "Thirty-four percent of Hispanics don't have a... ['Health Care'] 5
7560 2007-09-06 Scott Montgomery Republican Nell Benton Mike Huckabee Arkansas "I would love to see us have in this country w... ['Abortion'] 2
7561 2007-09-06 Scott Montgomery Angie Drobnic Holan Obama Girl "At least Obama didn't marry his cousin," as G... ['Candidate Biography'] 5
7562 2007-09-06 Scott Montgomery Republican Sasha Bartolf Mitt Romney Massachusetts "The Z-visa that was offered in that Senate bi... ['Immigration'] 1
7563 2007-09-05 Bill Duryea Republican Angie Drobnic Holan Mitt Romney Massachusetts "Senator McCain voted against the Bush tax cut... ['Taxes'] 5
7564 2007-09-04 Scott Montgomery Democrat Angie Drobnic Holan Barack Obama Illinois If African-Americans vote their percentage of ... ['Elections'] 1
7565 2007-09-03 Bill Adair Republican Angie Drobnic Holan Sam Brownback Kansas "Currently, we're at 36 percent of our childre... ['Families'] 5
7566 2007-09-03 Scott Montgomery Republican Miranda Blue Duncan Hunter California "I built that border fence in San Diego...and ... ['Immigration'] 4
7567 2007-09-01 Scott Montgomery Republican Angie Drobnic Holan John McCain Arizona "The failings in our civil service are encoura... ['Federal Budget'] 4
7568 2007-09-01 Scott Montgomery Republican Wes Allison Rudy Giuliani New York "The crime decline in the United States would ... ['Crime'] 1

7569 rows × 9 columns

In [115]:
data.dtypes
Out[115]:
date             datetime64[ns]
edited_by                object
job                      object
researched_by            object
source                   object
state                    object
statement                object
subjects                 object
truth                     int64
dtype: object
In [134]:
data.describe()
Out[134]:
truth
count 7569.000000
mean 2.740917
std 1.588681
min 0.000000
25% 1.000000
50% 3.000000
75% 4.000000
max 5.000000
In [135]:
data.count()
Out[135]:
date             7569
edited_by        7569
job              7569
researched_by    7569
source           7569
state            7569
statement        7569
subjects         7569
truth            7569
dtype: int64

The original training data frame has 13000+ instances. In the starting kit, we give you a subset of 7569 instances for training and 2891 instances for testing.

Most columns are categorical; some have high cardinality.
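A quick way to compare cardinalities across columns is to count the distinct values of each column at once. The sketch below uses a small made-up frame (the values in toy are illustrative, not taken from the data set); the same call should work on the data frame loaded above.

```python
import pandas as pd

# Toy stand-in for the training frame; the values are illustrative only.
toy = pd.DataFrame({
    'state': ['Texas', 'Ohio', 'ohio', 'Texas', ''],
    'job': ['Republican', 'Democrat', 'Democrat', 'Republican', ''],
    'source': ['A', 'B', 'C', 'D', 'E'],
})

# Number of distinct values per column, highest cardinality first.
# The empty string (from fillna('')) counts as a value of its own.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

Note that spelling variants such as 'Ohio' and 'ohio' are counted as separate categories.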

In [148]:
print(np.unique(data['state']))
print(len(np.unique(data['state'])))
data.groupby('state').count()[['job']].sort_values(
    'job', ascending=False).reset_index().rename(
    columns={'job': 'count'}).plot.bar(
    x='state', y='count', figsize=(16, 10), fontsize=18);
['' 'Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'District of Columbia' 'Florida' 'Georgia'
 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas' 'Kentucky'
 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan' 'Minnesota'
 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada' 'New Hampshire'
 'New Jersey' 'New Mexico' 'New York' 'North Carolina' 'North Dakota'
 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhode Island' 'Rhode island'
 'South Carolina' 'South Dakota' 'Tennesse' 'Tennessee' 'Texas'
 'United Kingdom' 'Utah' 'Vermont' 'Virgina' 'Virginia' 'Washington'
 'Washington state' 'Washington, D.C.' 'West Virginia' 'Wisconsin'
 'Wyoming' 'ohio' 'the United States']
60
In [150]:
print(np.unique(data['job']))
print(len(np.unique(data['job'])))
data.groupby('job').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
        x='job', y='count', figsize=(16, 10), fontsize=18);
['' 'Activist' 'Business leader' 'Columnist' 'Constitution Party'
 'Democrat' 'Democratic Farmer-Labor' 'Independent' 'Journalist'
 'Labor leader' 'Libertarian' 'Newsmaker' 'None'
 'Ocean State Tea Party in Action' 'Organization' 'Republican'
 'State official' 'Talk show host' 'Tea Party member' 'county commissioner']
20

If you want to use the journalist and the editor as input, you will need to split the lists, since an instance sometimes has more than one of each.
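One possible way to do the split is sketched below, assuming that names are separated by commas and that repeated spaces inside a name (as in 'Jim  Denery') can be collapsed. The helper split_names and the toy series are hypothetical and only illustrate the idea; they are not part of the starting kit.

```python
import pandas as pd

# Toy stand-in for the 'edited_by' column.
edited_by = pd.Series([
    'Jim  Denery, Jim Tharpe',
    'Angie Drobnic Holan',
    'Bill Adair, Scott Montgomery',
    '',
])


def split_names(cell):
    """Split a comma-separated cell into whitespace-normalized names."""
    names = (' '.join(name.split()) for name in cell.split(','))
    return [name for name in names if name]


# Re-join the cleaned names with an unambiguous separator, then build
# one indicator column per person (multi-hot encoding).
cleaned = edited_by.apply(lambda cell: '|'.join(split_names(cell)))
multi_hot = cleaned.str.get_dummies(sep='|')
print(multi_hot)
```

Each row then has a 1 for every person listed on the instance, and an all-zero row when the field is empty.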

In [151]:
print(np.unique(data['edited_by']))
print(len(np.unique(data['edited_by'])))
data.groupby('edited_by').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
        x='edited_by', y='count', figsize=(16, 10), fontsize=10);
['' 'Aaron Sharockman' 'Adriel Bettelheim, Amy Hollyfield' 'Alexander Lane'
 'Amy Hollyfield' 'Amy Hollyfield, Aaron Sharockman'
 'Amy Hollyfield, Greg Joyce' 'Amy Hollyfield, Scott Montgomery'
 'Amy Sherman' 'Angie Drobnic Holan'
 'Angie Drobnic Holan, Aaron Sharockman'
 'Angie Drobnic Holan, Elizabeth Miniet, Jim Tharpe' 'Bill Adair'
 'Bill Adair, Aaron Sharockman' 'Bill Adair, Amy Hollyfield'
 'Bill Adair, Angie Drobnic Holan' 'Bill Adair, Martha M. Hamilton'
 'Bill Adair, Scott Montgomery' 'Bill Adair, Sergio Bustos'
 'Bill Adair, Steve Ahillen, Zack McMillin' 'Bill Adair, Tom Chester'
 'Bill Adair, Tom Chester, Michael Erskine' 'Bill Adair, W. Gardner Selby'
 'Bill Adair, Zack McMillin' 'Bill Duryea' 'Bob Gee' 'Brenda Bell'
 'Brenda Bell, Jody Seaborn' 'Brenda Bell, W. Gardner Selby'
 'Bridget Hall Grumet' 'Bridget Hall Grumet, Angie Drobnic Holan'
 'Bruce  Hammond' 'C. Eugene Emery Jr.' 'Caryn Shinske' 'Catharine Richert'
 'Charles Gay' 'Charles Gay, Elizabeth Miniet' 'Chris Quinn'
 'Daniel Finnegan' 'Daniel Finnegan, Warren Fiske' 'Dave Umhoefer'
 'Dee Lane' 'Dee Lane, Caryn Shinske' 'Dee Lane, Jody Seaborn'
 'Doug Bennett' 'Edward Epstein, Angie Drobnic Holan' 'Elizabeth Miniet'
 'Elizabeth Miniet, Jim Tharpe' 'Elizabeth Miniet, Lois Norder'
 'Greg Borowski' 'Greg Borowski, Steve Schultze' 'Greg Joyce'
 'Heather Urquides' 'James B. Nelson' 'Jane Kahoun' 'Janel Davis'
 'Janie Har' 'Jim  Denery' 'Jim  Denery, Angela Tuck'
 'Jim  Denery, Angie Drobnic Holan, Jim Tharpe' 'Jim  Denery, Charles Gay'
 'Jim  Denery, Charles Gay, Jim Tharpe' 'Jim  Denery, Jim Tharpe'
 'Jim  Denery, Lois Norder' 'Jim  Denery, Shawn McIntosh' 'Jim Tharpe'
 'Jody Seaborn' 'Jody Seaborn, W. Gardner Selby' 'Joe Guillen'
 'John Bartosek' 'John Bartosek, Amy Hollyfield'
 'John Bartosek, Andrew Jarosh' 'John Bartosek, Doug Bennett'
 'John Bartosek, Heather Urquides' 'John Bartosek, Sheldon Zoldan'
 'John Bridges' 'John Bridges, Bob Gee' 'Jonathan Van Fleet'
 'Kathryn A. Wolfe' 'Leroy Chapman, Charles Gay'
 'Leroy Chapman, Jim Tharpe' 'Lois Norder' 'Louis Jacobson' 'Mark Vosburgh'
 'Martha M. Hamilton' 'Martha M. Hamilton, Jody Seaborn'
 'Martha M. Hamilton, W. Gardner Selby' 'Meghan Ashford-Grooms'
 'Meghan Ashford-Grooms, John Bridges' 'Meghan Ashford-Grooms, Sue Owen'
 'Michelle Brence, Dee Lane' 'Mike Konrad' 'Morris Kennedy' 'Neil Brown'
 'Neil Brown, Scott Montgomery' 'Robert Farley'
 'Robert Farley, Amy Hollyfield' 'Robert Higgs' 'Robert Higgs, Jane Kahoun'
 'Scott Montgomery' 'Sergio Bustos' 'Sergio Bustos, Amy Hollyfield'
 'Shawn McIntosh, Elizabeth Miniet' 'Stephen Koff' 'Steve Ahillen'
 'Steve Ahillen, Tom Chester' 'Steve Ahillen, Zack McMillin'
 'Steve Liebman' 'Steve McQuilkin, Sheldon Zoldan' 'Sue Owen'
 'Susan Areson' 'Susan Areson, Tim Murphy' 'Suzanne Pavkovic'
 'Therese Bottomly' 'Therese Bottomly, Dee Lane' 'Thomas Koetting'
 'Tim Murphy' 'Tom Chester' 'Tom Curran' 'Tom Feran' 'Tom Kertscher'
 'W. Gardner Selby' 'W. Gardner Selby, Aaron Sharockman' 'Warren Fiske'
 'Warren Fiske, Martha M. Hamilton' 'Warren Fiske, Robert Higgs'
 'Zack McMillin']
127
In [143]:
print(np.unique(data['researched_by']))
print(len(np.unique(data['researched_by'])))
['' 'Aaron Marshall' 'Aaron Sharockman' 'Aaron Sharockman, Amy Sherman'
 'Adriel Bettelheim' 'Adriel Bettelheim, Angie Drobnic Holan'
 'Adriel Bettelheim, David DeCamp' 'Adriel Bettelheim, Ryan Kelly'
 'Alaina Berner, Christopher Connors, Louis Jacobson' 'Alex Holt'
 'Alex Holt, Louis Jacobson' 'Alex Holt, Michelle Sutherland'
 'Alex Kuffner' 'Alex Leary' 'Alexander Lane' 'Amy Hollyfield'
 'Amy Sherman' 'Amy Sherman, Bartholomew Sullivan' 'Andra Lim'
 'Angie Drobnic Holan' 'Angie Drobnic Holan, Aaron Sharockman'
 'Angie Drobnic Holan, Alex Leary' 'Angie Drobnic Holan, Alexander Lane'
 'Angie Drobnic Holan, Amy Sherman'
 'Angie Drobnic Holan, Amy Sherman, Dave Umhoefer'
 'Angie Drobnic Holan, Catharine Richert'
 'Angie Drobnic Holan, Craig Pittman'
 'Angie Drobnic Holan, David G. Taylor'
 'Angie Drobnic Holan, Ian K. Kullgren'
 'Angie Drobnic Holan, Jeffrey S.  Solochek'
 'Angie Drobnic Holan, John Martin' 'Angie Drobnic Holan, Katie Sanders'
 'Angie Drobnic Holan, Louis Jacobson'
 'Angie Drobnic Holan, Louis Jacobson, Aaron Sharockman'
 'Angie Drobnic Holan, Louis Jacobson, Amy Sherman'
 "Angie Drobnic Holan, Louis Jacobson, Ciara O'Rourke"
 'Angie Drobnic Holan, Louis Jacobson, Katie Sanders'
 'Angie Drobnic Holan, Louis Jacobson, Molly Moorhead'
 'Angie Drobnic Holan, Molly Moorhead'
 'Angie Drobnic Holan, Molly Moorhead, Aaron Sharockman'
 'Angie Drobnic Holan, Shawn Zeller' 'Angie Drobnic Holan, Shirl Kennedy'
 'Angie Drobnic Holan, Sue Owen'
 'Angie Drobnic Holan, Tom Kertscher, Dave Umhoefer'
 'Angie Drobnic Holan, Willoughby Mariano' 'Asjylyn Loder' 'Bart Jansen'
 'Bartholomew Sullivan' 'Becky Bowers' 'Becky Bowers, Aaron Sharockman'
 'Becky Bowers, Adam Offitzer' 'Becky Bowers, Amy Sherman'
 'Becky Bowers, Angie Drobnic Holan'
 'Becky Bowers, Angie Drobnic Holan, Louis Jacobson, Amy Sherman, Bartholomew Sullivan'
 'Becky Bowers, Angie Drobnic Holan, Louis Jacobson, Stephen Koff, Molly Moorhead'
 'Becky Bowers, Caroline Houck' 'Becky Bowers, Jon Greenberg'
 'Becky Bowers, Jon Greenberg, Louis Jacobson'
 'Becky Bowers, Katie Sanders' 'Becky Bowers, Louis Jacobson'
 "Becky Bowers, Louis Jacobson, Erin O'Neill, Bill Wichert"
 'Becky Bowers, Louis Jacobson, Katie Sanders'
 'Becky Bowers, Maryalice Gill, Angie Drobnic Holan, Louis Jacobson'
 'Becky Bowers, Maryalice Gill, Louis Jacobson'
 'Becky Bowers, Molly Moorhead' 'Becky Bowers, Shirl Kennedy'
 'Becky Bowers, Stephen Koff'
 'Becky Bowers, Tom Feran, Angie Drobnic Holan, Willoughby Mariano, Amy Sherman'
 'Bill Adair' 'Bill Adair, Adriel Bettelheim'
 'Bill Adair, Angie Drobnic Holan' 'Bill Adair, Asjylyn Loder'
 'Bill Adair, Azhar Al Fadl' 'Bill Adair, John Frank'
 'Bill Adair, Louis Jacobson' 'Bill Adair, Nell Benton' 'Bill Thompson'
 'Bill Varian' 'Bill Wichert' 'Brad Schmidt' 'Brandon Blackwell'
 'Brent Walth' 'Brian Liberatore' 'Brittany Alana Davis'
 'Brittany Alana Davis, Katie Sanders' 'C. Eugene Emery Jr.'
 'C. Eugene Emery Jr., Angie Drobnic Holan'
 'C. Eugene Emery Jr., Peter Lord' 'C. Eugene Emery Jr., Tom Mooney'
 'Carol Rosenberg' 'Caroline Houck, Louis Jacobson'
 'Carolyn Edds, Connie Humburg' 'Carolyn Edds, Robert Farley' 'Cary Spivak'
 'Caryn Baird' 'Caryn Baird, Aaron Sharockman'
 'Caryn Baird, Angie Drobnic Holan' 'Caryn Baird, Katie Sanders'
 'Caryn Baird, Shirl Kennedy' 'Caryn Shinske' 'Catharine Richert'
 'Charles Pope' 'Charles Pope, W. Gardner Selby'
 'Charles Rabin, Amy Sherman' 'Chris Joyner'
 'Chris Joyner, Karishma Mehrotra' 'Christian Gaston' "Ciara O'Rourke"
 'Connie Humburg' 'Craig Gilbert, Dave Umhoefer'
 'Craig Pittman, Aaron Sharockman' 'Cristina Silva' 'Cynthia Needham'
 'Dan Gorenstein' 'Dana Tims' 'Danny Valentine'
 'Darla Cameron, Katie Sanders' 'Dave Umhoefer' 'Dave Umhoefer, Don Walker'
 'David Adams' 'David Adams, Robert Farley' 'David Baumann'
 'David Baumann, Angie Drobnic Holan' 'David DeCamp'
 'David DeCamp, Aaron Sharockman' 'Dina Cappiello'
 'Douglas Hanks, Patricia Mazzei, Amy Sherman' 'Edward Epstein'
 'Edward Epstein, Angie Drobnic Holan' 'Edward Fitzpatrick'
 'Elijah Herington' 'Eric Stirgus' 'Eric Stirgus, Jim Tharpe'
 'Erin McNeill' 'Erin Mershon' "Erin O'Neill" "Erin O'Neill, Bill Wichert"
 "Erin O'Neill, Caryn Shinske, Bill Wichert"
 "Erin O'Neill, Charles Pope, Bill Wichert" 'Erin Richards, Dave Umhoefer'
 'Greg Borowski' 'Gregory Smith' 'Gregory Trotter'
 'Guy Boulton, Dave Umhoefer' 'Hadas Gold' 'Hadas Gold, Louis Jacobson'
 'Harry  Esteve' 'Henry J.  Gomez' 'Ian Jannetta' 'Ian K. Kullgren'
 'J.B. Wogan' 'Jacob Geiger' 'Jacob Geiger, Molly Moorhead'
 'Jacob Geiger, Sean Gorman' 'Jake Berry' 'Jake Berry, Louis Jacobson'
 'Jake Berry, Maryalice Gill' 'James B. Nelson'
 'James B. Nelson, Dave Umhoefer' 'James B. Nelson, Don Walker'
 'James Ewinger' 'James McCarty' 'Jane Kahoun, Aaron Marshall'
 'Jane Kahoun, Mark Naymik' 'Janel Davis' 'Janel Davis, Eric Stirgus'
 'Janel Davis, Louis Jacobson' 'Janet Zink' 'Janie Har'
 'Janie Har, Ian K. Kullgren' 'Janie Har, Louis Jacobson' 'Jeffrey Good'
 'Jeffrey S.  Solochek' 'Jennifer Liberto' 'Jim Tharpe'
 'JoEllen Corrigan, Sabrina  Eaton' 'JoEllen Corrigan, Tom Feran'
 'Jodie Tillman' 'Jody Kyle' 'Joe Carey' 'Joe Guillen' 'John Bartosek'
 'John Bartosek, Amy Sherman' 'John Bartosek, Katie Sanders'
 'John Bartosek, Laura  Figueroa' 'John Bridges, W. Gardner Selby'
 'John Frank' 'John Frank, Aaron Sharockman'
 'John Frank, Angie Drobnic Holan' 'John Gregg' 'John Hill' 'John Martin'
 'Jon Greenberg' 'Jon Greenberg, Amy Sherman'
 'Jon Greenberg, Angie Drobnic Holan'
 'Jon Greenberg, Angie Drobnic Holan, Louis Jacobson, Molly Moorhead, Amy Sherman'
 'Jon Greenberg, Caroline Houck' 'Jon Greenberg, J.B. Wogan'
 'Jon Greenberg, Katie Sanders, Aaron Sharockman, Amy Sherman'
 'Jon Greenberg, Louis Jacobson'
 'Jon Greenberg, Louis Jacobson, Amy Sherman'
 'Jon Greenberg, Louis Jacobson, Molly Moorhead'
 'Jon Greenberg, Molly Moorhead'
 'Jon Greenberg, Tom Kertscher, Sue Owen, Bill Wichert' 'Joseph Lewis'
 'Josh Korr' 'Josh Rogers' 'Joshua Gillin' 'Joshua Gillin, Bill Varian'
 'Joshua Gillin, Kim Wilmath' 'Julie Kliegman'
 'Julie Kliegman, Katie Sanders' 'Karen Lee  Ziner' 'Karishma Mehrotra'
 'Kathleen McGrory, Katie Sanders' 'Kathryn A. Wolfe' 'Katie Glueck'
 'Katie Sanders' 'Katie Sanders, Aaron Sharockman'
 'Katie Sanders, Amy Sherman' 'Katie Sanders, Eric Stirgus'
 'Katie Sanders, Jeffrey S.  Solochek' 'Keith Perine'
 'Kevin Crowe, James B. Nelson' 'Kevin Landrigan' 'Kevin Robillard'
 'Kim Wilmath' 'Laura  Figueroa' 'Laura  Figueroa, Eric Stirgus'
 'Lee Bergquist' 'Lee Bergquist, Dave Umhoefer' 'Lee Logan' 'Lesley  Clark'
 'Lesley  Clark, Amy Sherman' 'Lissa August'
 'Lissa August, Angie Drobnic Holan'
 'Lissa August, Angie Drobnic Holan, Shirl Kennedy'
 'Lissa August, Jeffrey S.  Solochek' 'Lissa August, Jody Kyle'
 'Lissa August, John Frank' 'Lissa August, Shirl Kennedy' 'Louis Jacobson'
 'Louis Jacobson, Aaron Sharockman' 'Louis Jacobson, Amy Sherman'
 'Louis Jacobson, Amy Sherman, Eric Stirgus'
 'Louis Jacobson, Carol Rosenberg' "Louis Jacobson, Ciara O'Rourke"
 'Louis Jacobson, Eric Stirgus' 'Louis Jacobson, J.B. Wogan'
 'Louis Jacobson, Julie Kliegman' 'Louis Jacobson, Katie Sanders'
 'Louis Jacobson, Kevin Landrigan' 'Louis Jacobson, Molly Moorhead'
 'Louis Jacobson, Molly Moorhead, Adam Offitzer'
 'Louis Jacobson, Nancy  Madsen, Katie Sanders'
 'Louis Jacobson, Patrick Kennedy' 'Louis Jacobson, Stephen Koff'
 'Louis Jacobson, Sue Owen' 'Louis Jacobson, Tom Mooney'
 'Louis Jacobson, W. Gardner Selby' 'Louis Jacobson, Willoughby Mariano'
 'Lucas Graves' 'Lukas Pleva' 'Lynn Arditi' 'M.B. Pell' 'Mark Naymik'
 'Martha M. Hamilton' 'Martha M. Hamilton, Louis Jacobson, Amy Sherman'
 'Martha M. Hamilton, Lukas Pleva' 'Mary Ellen Klas'
 'Mary Ellen Klas, Katie Sanders' 'Maryalice Gill'
 'Maryalice Gill, Amy Sherman' 'Matt Clary' 'Meghan Ashford-Grooms'
 'Meghan Ashford-Grooms, Angie Drobnic Holan'
 'Meghan Ashford-Grooms, Ben Wear'
 'Meghan Ashford-Grooms, C. Eugene Emery Jr., W. Gardner Selby, Bartholomew Sullivan'
 "Meghan Ashford-Grooms, Ciara O'Rourke"
 "Meghan Ashford-Grooms, Ciara O'Rourke, W. Gardner Selby"
 'Meghan Ashford-Grooms, Janie Har, Louis Jacobson'
 'Meghan Ashford-Grooms, Louis Jacobson'
 'Meghan Ashford-Grooms, Louis Jacobson, Aaron Sharockman, Amy Sherman'
 'Meghan Ashford-Grooms, Molly Moorhead, Katie Sanders'
 'Meghan Ashford-Grooms, Sean Gorman' 'Meghan Ashford-Grooms, Sue Owen'
 'Meghan Ashford-Grooms, Tara Merrigan'
 'Meghan Ashford-Grooms, W. Gardner Selby' 'Michael  McKinney'
 'Michael C. Bender, Aaron Sharockman' 'Michael Collins'
 'Michael Collins, Tom Humphrey' 'Michael Martz' 'Michael Van Sickler'
 'Miranda Blue' 'Miranda Blue, Jody Kyle' 'Miranda Blue, John Frank'
 'Molly Moorhead' 'Molly Moorhead, Charles Pope'
 'Molly Moorhead, J.B. Wogan' 'Molly Moorhead, Katie Sanders'
 'Molly Moorhead, W. Gardner Selby' 'Morris Kennedy' 'Nancy  Madsen'
 'Nancy  Madsen, Dave Umhoefer' 'Nancy  Madsen, Katie Sanders, Amy Sherman'
 'Nancy Madsen' 'Nancy Madsen, Molly Moorhead' 'Natalie Fuelner'
 'Neil Skene' 'Nell Benton' 'Nell Benton, Rachel  Bloom'
 'Nell Benton, Shirl Kennedy' 'Pat Gillespie'
 'Patricia Mazzei, Aaron Sharockman' 'Patricia Mazzei, Amy Sherman'
 'Patrick Marley, Dave Umhoefer' 'Peter Krouse' 'Peter Lord'
 'Rachel  Bloom' 'Rachel Revehl' 'Reginald Fields'
 'Reginald Fields, Louis Jacobson' 'Rich Exner, Stephen Koff'
 'Rich Exner, Tom Feran' 'Richard Danielson'
 'Richard Danielson, Janet Zink' 'Richard Locker' 'Richard Rubin'
 'Richard Salit' 'Rob  Feinberg, Louis Jacobson' 'Robert Farley'
 'Robert Farley, Aaron Sharockman' 'Robert Farley, Angie Drobnic Holan'
 'Robert Farley, Angie Drobnic Holan, Louis Jacobson'
 'Robert Farley, Catharine Richert' "Robert Farley, Ciara O'Rourke"
 'Robert Farley, Erin Mershon' 'Robert Farley, Louis Jacobson'
 'Robert Farley, Lukas Pleva' 'Robert Farley, Shirl Kennedy'
 'Robert Farley, Tom Feran, Jacob Geiger'
 'Robert Farley, Will Short Gorham, Angie Drobnic Holan' 'Robert Higgs'
 'Robert Higgs, Louis Jacobson' 'Rochelle Koff' 'Ryan Kelly'
 'Ryan Kelly, John Martin' 'Sabrina  Eaton'
 'Sabrina  Eaton, Angie Drobnic Holan'
 'Sabrina  Eaton, Angie Drobnic Holan, David G. Taylor'
 'Sabrina  Eaton, Angie Drobnic Holan, Louis Jacobson, Willoughby Mariano, Molly Moorhead'
 'Sabrina  Eaton, Catharine Richert' 'Sabrina  Eaton, Jacob Geiger'
 'Sabrina  Eaton, Jane Kahoun' 'Sabrina  Eaton, Louis Jacobson'
 'Sabrina  Eaton, Louis Jacobson, Stephen Koff, W. Gardner Selby'
 'Sabrina  Eaton, Molly Moorhead, W. Gardner Selby'
 'Sabrina  Eaton, Rich Exner' 'Sabrina  Eaton, Stephen Koff'
 'Sabrina  Eaton, Tom Feran' 'Sara Myers' 'Sasha Bartolf'
 'Sasha Bartolf, Miranda Blue'
 'Sasha Bartolf, Miranda Blue, Philip Burrowes, Angie Drobnic Holan, Ryan Kelly, Jody Kyle'
 'Scott Montgomery' 'Sean Gorman' 'Sean Gorman, Louis Jacobson'
 'Sean Gorman, Molly Moorhead' 'Sean Gorman, Nancy  Madsen'
 'Sean Gorman, Willoughby Mariano, Eric Stirgus' 'Seth Stern'
 'Shawn Zeller' 'Shirl Kennedy' 'Shirl Kennedy, John Martin' 'Stephen Koff'
 'Stephen Koff, Charles Pope' 'Stephen Koff, Molly Moorhead'
 'Steve Ahillen' 'Steve Schultze' 'Steve Schultze, Dave Umhoefer'
 'Sue Owen' 'Sue Owen, Ben Wear' 'Sue Owen, W. Gardner Selby'
 'Thomas Content, Dave Umhoefer' 'Tim Murphy' 'Timothy Engstrom'
 'Toluse Olorunnipa' 'Toluse Olorunnipa, Katie Sanders' 'Tom Feran'
 'Tom Feran, Aaron Marshall' 'Tom Feran, Aaron Sharockman'
 'Tom Feran, Angie Drobnic Holan' 'Tom Feran, Charles Pope'
 'Tom Feran, Dave Umhoefer' 'Tom Feran, Joe Guillen'
 'Tom Feran, Louis Jacobson' 'Tom Feran, Mark Naymik'
 'Tom Feran, Molly Moorhead' 'Tom Feran, Robert Schoenberger'
 'Tom Feran, Sean Gorman' 'Tom Feran, Stephen Koff' 'Tom Humphrey'
 'Tom Humphrey, Zack McMillin' 'Tom Kertscher' 'Tom Kertscher, Amy Sherman'
 'Tom Kertscher, Ben Poston' 'Tom Kertscher, Dave Umhoefer'
 'Tom Kertscher, Don Walker' 'Tom Kertscher, James B. Nelson'
 'Tom Kertscher, James B. Nelson, Dave Umhoefer'
 'Tom Kertscher, Patrick Marley' 'Tom Mooney' 'Tom Tobin' 'Trenton Daniel'
 'W. Gardner Selby' 'W. Gardner Selby, J.B. Wogan' 'Warren Fiske'
 'Warren Fiske, Jacob Geiger' 'Warren Fiske, Louis Jacobson'
 'Warren Fiske, Louis Jacobson, Nancy  Madsen'
 'Warren Fiske, Nancy  Madsen' 'Warren Fiske, Sean Gorman' 'Wes Allison'
 'Wes Allison, Angie Drobnic Holan' 'Wes Allison, Lissa August'
 'Wes Hester' 'Wes Hester, Angie Drobnic Holan' 'Will Short Gorham'
 'Will Short Gorham, Angie Drobnic Holan' 'Will Van Sant'
 'Willoughby Mariano' 'Willoughby Mariano, Eric Stirgus'
 'Willoughby Mariano, Jim Tharpe' 'Yuxing Zheng' 'Zack McMillin']
436
In [152]:
data.groupby('researched_by').count()[['state']].sort_values(
    'state', ascending=False).reset_index().rename(
    columns={'state': 'count'}).plot.bar(
        x='researched_by', y='count', figsize=(16, 10), fontsize=6);

There are 2000+ different sources.

In [159]:
print(np.unique(data['source']))
print(len(np.unique(data['source'])))
data.groupby('source').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().loc[:100].plot.bar(
        x='source', y='count', figsize=(16, 10), fontsize=10);
['13th District GOP slate' '18% of the American public'
 '60 Plus Association' ..., 'Zell Miller' 'Zoe Lofgren' 'billhislam.com']
2124

Predicting truth level¶

The goal is to predict the truthfulness of statements. Let us group the data according to the truth column:

In [160]:
data.groupby('truth').count()[['source']].reset_index().plot.bar(x='truth', y='source');

The pipeline¶

For submitting at the RAMP site, you will have to write two classes, saved in two different files:

  • the class FeatureExtractor, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features).
  • the class Classifier, which predicts the truth level from the extracted features.

Feature extractor¶

The feature extractor implements a transform member function. It is saved in the file submissions/starting_kit/feature_extractor.py. It receives the pandas dataframe X_df defined at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the classification.
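To make the expected input and output concrete, here is a minimal sketch of an object with the same fit/transform interface. ToyFeatureExtractor and its two hand-made features (character and word counts) are assumptions made for illustration only; the starting kit extractor is TF-IDF based.

```python
import numpy as np
import pandas as pd


class ToyFeatureExtractor:
    """Hypothetical extractor: two hand-made features per statement."""

    def fit(self, X_df, y=None):
        # Nothing to learn for these features; return self to allow chaining.
        return self

    def transform(self, X_df):
        statements = X_df['statement']
        n_chars = statements.str.len()
        n_words = statements.str.split().str.len()
        # Shape: (number of samples, number of features).
        return np.column_stack([n_chars, n_words]).astype(float)


X_df = pd.DataFrame({'statement': [
    'Says he "sued Obama\'s EPA seven times."',
    'Marijuana is "less toxic" than alcohol.',
]})
X_array = ToyFeatureExtractor().fit(X_df).transform(X_df)
print(X_array.shape)
```

The important part is the contract: a data frame goes in, and a two-dimensional numpy array with one row per statement comes out.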

Note that the following code cells are not executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [1]:
%%file submissions/starting_kit/feature_extractor.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import numpy as np
import string
import unicodedata

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.utils.validation import check_is_fitted
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler

def clean_str(sentence, stem=True):
    english_stopwords = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    punctuation.update(["``", "`", "..."])
    if stem:
        stemmer = SnowballStemmer('english')
        return list((filter(lambda x: x.lower() not in english_stopwords and
                            x.lower() not in punctuation,
                            [stemmer.stem(t.lower())
                             for t in word_tokenize(sentence)
                             if t.isalpha()])))

    return list((filter(lambda x: x.lower() not in english_stopwords and
                        x.lower() not in punctuation,
                        [t.lower() for t in word_tokenize(sentence)
                         if t.isalpha()])))


def strip_accents_unicode(s):
    try:
        s = unicode(s, 'utf-8')
    except NameError:  # unicode is a default on python 3
        pass
    s = unicodedata.normalize('NFD', s)
    s = s.encode('ascii', 'ignore')
    s = s.decode("utf-8")
    return str(s)

from sklearn.feature_extraction.text import TfidfVectorizer
class FeatureExtractor(TfidfVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features. """

    def __init__(self):
        super(FeatureExtractor, self).__init__(
            input='content', encoding='utf-8',
            decode_error='strict', strip_accents=None, lowercase=True,
            preprocessor=None, tokenizer=None, analyzer='word',
            stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
            ngram_range=(1, 1), max_df=1.0, min_df=1,
            max_features=None, vocabulary=None, binary=False,
            dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
            sublinear_tf=False)

    def fit(self, X_df, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        X_df : pandas.DataFrame
            Dataframe whose `statement` column holds the raw text.
        Returns
        -------
        self
        """
        self._feat = np.array([' '.join(
            clean_str(strip_accents_unicode(dd)))
            for dd in X_df.statement])
        super(FeatureExtractor, self).fit(self._feat)
        return self

    def fit_transform(self, X_df, y=None):
        self.fit(X_df)
        # Transform the dataframe that was passed in (self.X_df does not exist).
        return self.transform(X_df)

    def transform(self, X_df):
        X = np.array([' '.join(clean_str(strip_accents_unicode(dd)))
                      for dd in X_df.statement])
        # Pass msg by keyword: it is keyword-only in recent scikit-learn.
        check_is_fitted(self, '_feat', msg='The tfidf vector is not fitted')
        X = super(FeatureExtractor, self).transform(X)
        return X
Overwriting submissions/starting_kit/feature_extractor.py

Steps¶

  1. Preprocess the text
  2. Create a count vector
  3. Build a tf-idf matrix

Preprocessing¶

First, we preprocess the text. This step is often called text normalization; tokenization is its first part.

Tokenization¶

The first step of preprocessing: split sentences into words.
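
As an illustration, the sketch below tokenizes a made-up sentence with a regular expression. The helper simple_tokenize is a hypothetical stand-in for nltk's word_tokenize (which the starting kit uses and which needs the downloaded nltk data), so its output may differ from nltk's on harder cases such as abbreviations or contractions.

```python
import re


def simple_tokenize(sentence):
    """Hypothetical stand-in for nltk.word_tokenize:
    runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = simple_tokenize("Taxes went up by 10 percent, he said.")
print(tokens)

# As in clean_str, keep only alphabetic tokens and lowercase them.
words = [t.lower() for t in tokens if t.isalpha()]
print(words)
```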

Stopword removal¶

The most frequent words often do not carry much meaning. Examples: the, a, of, for, in, .... The stopword list used here comes from the NLTK library: stopwords.words('english'). We also throw away unwanted tokens such as ["``", "`", "..."] and numbers.
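
A minimal sketch of the filtering step, assuming a small hand-written stopword set instead of the full NLTK list (the set and the token list below are made up for illustration):

```python
# Illustrative subset only; the starting kit uses
# nltk's stopwords.words('english').
english_stopwords = {'the', 'a', 'of', 'for', 'in', 'by', 'he', 'up'}

tokens = ['taxes', 'went', 'up', 'by', 'percent', 'he', 'said']
kept = [t for t in tokens if t not in english_stopwords]
print(kept)
```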

Stemming (Lemmatization)¶

This is optional. English words like look can be inflected with a morphological suffix to produce looks, looking, looked. They share the same stem, look. Often (but not always) it is beneficial to map all inflected forms to the stem. The most commonly used stemmer is the Porter stemmer, named after its developer, Martin Porter. Here we use SnowballStemmer('english') from NLTK. It is called Snowball because Porter created a programming language of that name for writing new stemming algorithms.
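
The example from the paragraph above can be checked directly with the stemmer used in the starting kit. This sketch assumes nltk is installed; the stemmer itself should not need the downloaded nltk datasets.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
words = ['look', 'looks', 'looking', 'looked']
# Map every inflected form to its stem.
stems = [stemmer.stem(w) for w in words]
print(stems)
```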

strip_accents_unicode¶

Transform accented unicode characters into their unaccented counterparts, e.g. è -> e.
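
A minimal sketch of the idea using only the standard library. The helper name strip_accents is hypothetical; it mirrors the normalize/encode/decode steps of strip_accents_unicode without the Python 2 branch.

```python
import unicodedata


def strip_accents(s):
    # NFD splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', s)
    # Dropping every non-ASCII code point removes the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents(u'Bal\u00e1zs K\u00e9gl'))
```

Note that characters without a decomposition (for example ø) are dropped entirely by this approach rather than mapped to a base letter.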

Feature Extractor¶

  1. Create a count vector
  2. Build a tf-idf matrix

Before going through the code, we first need to understand how tf-idf works. The term frequency (tf) counts how many times a word occurs in a given document (the bag-of-words representation). The inverse document frequency (idf) decreases with the number of documents in the corpus in which the word occurs. tf-idf multiplies the two to weight words according to how informative they are: words that occur in many documents get a lower weight, while rare ones get a higher weight.
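
The weighting can be worked out by hand on a tiny corpus. The sketch below assumes the smoothed idf that scikit-learn documents for smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by l2 normalization of each document vector; the three toy documents are made up.

```python
import math

# Made-up, already tokenized documents.
docs = [['tax', 'cut', 'tax'], ['tax', 'rise'], ['health', 'care']]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: in how many documents does each word occur?
df = {w: sum(w in d for d in docs) for w in vocab}
# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {w: math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0 for w in vocab}


def tfidf(doc):
    """Return the l2-normalized tf-idf vector of one document."""
    weights = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]


for w in vocab:
    print(w, df[w], round(idf[w], 3))
print([round(x, 3) for x in tfidf(docs[0])])
```

Here tax occurs in two of the three documents, so its idf is lower than that of cut, which occurs in only one.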

class FeatureExtractor(TfidfVectorizer) inherits from TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.

CountVectorizer converts a collection of text documents to a matrix of token (word) counts. This implementation produces a sparse representation of the counts to be passed to the TfidfTransformer. The TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation.

A TfidfVectorizer does these two steps.
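
This equivalence can be checked on a toy corpus. The sketch below uses default parameters for all three scikit-learn classes and three made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Made-up corpus, for illustration only.
corpus = ['tax cut tax', 'tax rise', 'health care']

# Two explicit steps: token counts, then tf-idf weighting.
counts = CountVectorizer().fit_transform(corpus)
two_steps = TfidfTransformer().fit_transform(counts)

# The same thing in one step.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_steps.toarray(), one_step.toarray())
print(counts.shape, same)
```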

The feature extractor overrides fit and transform so that the TfidfVectorizer receives the statements after the preprocessing steps presented above.

Classifier¶

The classifier follows a classical scikit-learn classifier template. It should be saved in the file submissions/starting_kit/classifier.py. In its simplest form it takes a scikit-learn estimator (or pipeline), assigns it to self.clf in __init__, then calls its fit and predict_proba functions in the corresponding member functions.

In [162]:
%%file submissions/starting_kit/classifier.py
# -*- coding: utf-8 -*-
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = RandomForestClassifier()

    def fit(self, X, y):
        # X is a scipy sparse matrix; convert it to a dense numpy array.
        self.clf.fit(X.toarray(), y)
        return self

    def predict(self, X):
        return self.clf.predict(X.toarray())

    def predict_proba(self, X):
        return self.clf.predict_proba(X.toarray())
Overwriting submissions/starting_kit/classifier.py

Local testing (before submission)¶

It is important that you test your submission files before submitting them. For this we provide a unit test. Note that the test runs on your files in submissions/starting_kit, not on the classes defined in the cells of this notebook.

First, pip install ramp-workflow or install it from the GitHub repo. Make sure that the Python files feature_extractor.py and classifier.py are in the submissions/starting_kit folder, and that the data files train.csv and test.csv are in data. Then run

ramp_test_submission

If it runs and prints the training, validation and test scores on each fold, then you can submit the code.

In [164]:
!ramp_test_submission --quick-test
Testing Fake news: classify statements of public figures
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/starting_kit ...
CV fold 0
	train sacc = 0.767
	valid sacc = 0.325
	test sacc = 0.359
	train acc = 0.966
	valid acc = 0.364
	test acc = 0.18
	train tfacc = 0.828
	valid tfacc = 0.5
	test tfacc = 0.537
CV fold 1
	train sacc = 0.769
	valid sacc = 0.431
	test sacc = 0.361
	train acc = 0.978
	valid acc = 0.0
	test acc = 0.18
	train tfacc = 0.838
	valid tfacc = 0.689
	test tfacc = 0.544
CV fold 2
	train sacc = 0.756
	valid sacc = 0.368
	test sacc = 0.368
	train acc = 0.988
	valid acc = 0.059
	test acc = 0.25
	train tfacc = 0.823
	valid tfacc = 0.594
	test tfacc = 0.555
CV fold 3
	train sacc = 0.763
	valid sacc = 0.367
	test sacc = 0.348
	train acc = 0.989
	valid acc = 0.222
	test acc = 0.22
	train tfacc = 0.822
	valid tfacc = 0.511
	test tfacc = 0.517
CV fold 4
	train sacc = 0.795
	valid sacc = 0.429
	test sacc = 0.372
	train acc = 0.978
	valid acc = 0.182
	test acc = 0.18
	train tfacc = 0.864
	valid tfacc = 0.618
	test tfacc = 0.576
CV fold 5
	train sacc = 0.758
	valid sacc = 0.378
	test sacc = 0.328
	train acc = 1.0
	valid acc = 0.222
	test acc = 0.19
	train tfacc = 0.821
	valid tfacc = 0.756
	test tfacc = 0.509
CV fold 6
	train sacc = 0.777
	valid sacc = 0.31
	test sacc = 0.354
	train acc = 0.989
	valid acc = 0.1
	test acc = 0.19
	train tfacc = 0.843
	valid tfacc = 0.47
	test tfacc = 0.556
CV fold 7
	train sacc = 0.775
	valid sacc = 0.28
	test sacc = 0.354
	train acc = 0.979
	valid acc = 0.0
	test acc = 0.19
	train tfacc = 0.844
	valid tfacc = 0.46
	test tfacc = 0.556
----------------------------
train sacc = 0.77 ± 0.012
train acc = 0.983 ± 0.01
train tfacc = 0.835 ± 0.014
valid sacc = 0.361 ± 0.05
valid acc = 0.144 ± 0.119
valid tfacc = 0.575 ± 0.101
test sacc = 0.355 ± 0.013
test acc = 0.197 ± 0.023
test tfacc = 0.544 ± 0.021

Submitting to ramp.studio¶

Once you have found a good feature extractor and classifier, you can submit them to ramp.studio. First, if it is your first time using RAMP, sign up, otherwise log in. Then find an open event on the particular problem, for example, the event fake_news (Saclay Datacamp, DataFest Tbilisi) for this RAMP. Sign up for the event. Both signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.

Once your signup request is accepted, you can go to your sandbox (Saclay Datacamp, DataFest Tbilisi) and copy-paste (or upload) feature_extractor.py and classifier.py from submissions/starting_kit. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions (Saclay Datacamp, DataFest Tbilisi). Once it is trained, you get an email, and your submission shows up on the public leaderboard (Saclay Datacamp, DataFest Tbilisi). If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions (Saclay Datacamp, DataFest Tbilisi). You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions locally (add feature transformations, select models, perhaps do some AutoML/hyperopt, etc.) and check them with ramp_test_submission. The script prints mean cross-validation scores:

----------------------------
train sacc = 0.77 ± 0.012
train acc = 0.983 ± 0.01
train tfacc = 0.835 ± 0.014
valid sacc = 0.361 ± 0.05
valid acc = 0.144 ± 0.119
valid tfacc = 0.575 ± 0.101
test sacc = 0.355 ± 0.013
test acc = 0.197 ± 0.023
test tfacc = 0.544 ± 0.021

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard (Saclay Datacamp, DataFest Tbilisi)) is smoothed accuracy, so the relevant line in the output of ramp_test_submission is valid sacc = 0.361 ± 0.05. When the score is good enough, you can submit your code at the RAMP site.

More information¶

You can find more information in the README of the ramp-workflow library.

Contact¶

Don't hesitate to contact us.