2016 Election Project

Part 1 of Processing Pipeline

This notebook is intended to document my data processing throughout this project. I'll be poking around and modifying my data in this file. The data I am starting out with are transcripts of the presidential debates from the 2016 US Election. I am processing the Democratic and Republican primary debates, and the debates of the general election between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project, and the citation for each of the transcripts can be found in the README.

%pprint
import nltk
from nltk.corpus import PlaintextCorpusReader
import pandas as pd
import glob
import os
import re
Pretty printing has been turned OFF

Reading In Files

os.chdir('/Users/Paige/Documents/Data_Science/2016-Election-Project/data/Debates/transcripts/')
files = glob.glob("*.txt")
files
['1-14-16_rep.txt', '1-17-16_dem.txt', '1-25-16_dem.txt', '1-28-16_rep.txt', '10-13-15_dem.txt', '10-19-16.txt', '10-28-15_rep.txt', '10-9-16.txt', '11-10-15_rep.txt', '11-14-15_dem.txt', '12-15-15_rep.txt', '12-19-15_dem.txt', '2-11-16_dem.txt', '2-13-16_rep.txt', '2-25-16_rep.txt', '2-4-16_dem.txt', '2-6-16_rep.txt', '3-10-16_rep.txt', '3-3-16_rep.txt', '3-6-16_dem.txt', '3-9-16_dem.txt', '4-14-16_dem.txt', '8-6-15_rep.txt', '9-16-15_rep.txt', '9-26-16.txt']
len(files)
25
#I'm creating a list where each entry in the list is a transcript
transcripts = []
for f in files:
    #Using a context manager so each file is actually closed after reading
    with open(f, 'r') as fi:
        transcripts.append(fi.read())
print(transcripts[0][:200])
PARTICIPANTS:
Former Governor Jeb Bush (FL);
Ben Carson;
Governor Chris Christie (NJ);
Senator Ted Cruz (TX);
Governor John Kasich (OH);
Senator Marco Rubio (FL);
Donald Trump;
MODERATORS:
Maria Barti
words0= nltk.word_tokenize(transcripts[0])
len(words0)
27658
sents0= nltk.sent_tokenize(transcripts[0])
len(sents0)
1498
print(transcripts[1][:200])
PARTICIPANTS:
Former Secretary of State Hillary Clinton;
Former Governor Martin O'Malley (MD);
Senator Bernie Sanders (VT);
MODERATORS:
Lester Holt (NBC News)
Andrea Mitchell (NBC News)

HOLT: Good ev

Cleaning up: I would eventually like to end up with a dataframe with the columns Date, Type (primary or general), Speaker, and Sents, with the sentences in the order they were spoken.

Splitting Transcripts by Speaker

#I want to split large chunks of the transcript based on who is speaking.
#Since the transcript data has a pretty standardized format (the speaker is in all caps followed by a colon),
#I can add a marker to each of these sections, and split the data on that marker

speaker_split = []

for txt in transcripts:
    #To take care of the first one where there is no newline preceding the label..
    txt = txt.replace("PARTICIPANTS:", 'PARTICIPANTS%:')
    #get rid of [through translator] label in 3-9-16 debate
    txt = txt.replace(" [through translator]:", ":")
    #The ' in Martin O'Malley's name was causing some issues so I'm changing his name (for the speaker column)
    #to OMALLEY
    txt = txt.replace("O'MALLEY:", "OMALLEY:")
    txt = re.sub(r"\n([A-Z]+)(\s[A-Z]+)?:", r"#$&\1%:", txt)
    speaker_split.append(txt)

#Split each chunk by the special marker
speaker_split = [txt.strip().split("#$&") for txt in speaker_split]
speaker_split[0][:4]
['PARTICIPANTS%:\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;', 'MODERATORS%:\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n', "CAVUTO%: It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n", 'BARTIROMO%: Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']
#Creating a giant list so I don't have to handle things one at a time
#Splitting each chunk into two elements: speaker, speech
debates = [[txt.split("%:") for txt in split] for split in speaker_split]
debates[0][:4]
[['PARTICIPANTS', '\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;'], ['MODERATORS', '\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n'], ['CAVUTO', " It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n"], ['BARTIROMO', ' Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']]
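To see what the marker trick does in isolation, here is a minimal sketch that runs the same regex on a tiny made-up transcript. The toy text, the speaker names SMITH and JONES, and the helper name split_by_speaker are all invented for illustration and are not part of the corpus or the notebook. The one deliberate difference is maxsplit=1 on the final split, which is my assumption about a safer default rather than what the cell above does.

```python
import re

# Toy transcript, invented for illustration (not taken from the corpus).
toy = (
    "PARTICIPANTS:\nCandidate A;\nMODERATORS:\nHost B\n\n"
    "SMITH: Good evening. Welcome.\nJONES: Thank you.\nSMITH: Let's begin."
)

MARK = "#$&"  # same marker string used in the cell above


def split_by_speaker(txt):
    """Return [speaker, speech] pairs using the label regex from the cell above."""
    # The first label has no preceding newline, so it is tagged by hand.
    txt = txt.replace("PARTICIPANTS:", "PARTICIPANTS%:")
    # Tag every all-caps label that starts a line and prefix it with the marker.
    txt = re.sub(r"\n([A-Z]+)(\s[A-Z]+)?:", MARK + r"\1%:", txt)
    # maxsplit=1 keeps any later "%:" inside the speech intact.
    return [chunk.split("%:", 1) for chunk in txt.strip().split(MARK)]


pairs = split_by_speaker(toy)
speakers = [speaker for speaker, _ in pairs]
print(speakers)
```

One thing the sketch makes easy to check: the replacement string only uses the first capture group, so a two-word label would be reduced to its first word in the Speaker column.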

Tokenizing Each Speaker's Sentences

debate_sents = []
#For each debate, then for each [speaker, speech] chunk in that debate, get a list of tokenized sents to replace the speech
for debate in debates:
    sents_toks = []
    for chunk in debate:
        sents = nltk.sent_tokenize(chunk[1])
        for sent in sents:
            sents_toks.append([chunk[0], sent])
    debate_sents.append(sents_toks)
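The flattening step above can be sketched on toy data as follows. To keep the example runnable without the NLTK punkt data, it swaps in a naive regex splitter (naive_sent_split, a hypothetical stand-in that is much cruder than nltk.sent_tokenize); the function name flatten_by_sentence and the toy chunks are also assumptions for illustration.

```python
import re


def naive_sent_split(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def flatten_by_sentence(debate, split=naive_sent_split):
    """Turn [speaker, speech] chunks into one [speaker, sentence] row per sentence."""
    rows = []
    for speaker, speech in debate:
        for sent in split(speech):
            rows.append([speaker, sent])
    return rows


# Toy chunks, invented for illustration.
toy_debate = [
    ["SMITH", " Good evening. Welcome to the debate."],
    ["JONES", " Thank you."],
]
rows = flatten_by_sentence(toy_debate)
print(rows)
```

Passing split=nltk.sent_tokenize should give the same behavior as the loop above, since the splitter is the only part that differs.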

Mapping to Debate Type

#I am creating a list of 25 dataframes, one for each debate
#Adding columns for the type of debate and the date (the speaker and sent columns get named later)

dataframes = []
for f, sents in zip(files, debate_sents):
    df = pd.DataFrame(sents)
    if f.endswith('_dem.txt'):
        df['Type'] = 'primary_dem'
        df['Date'] = f[:-8]   #strip the 8-character '_dem.txt' suffix
    elif f.endswith('_rep.txt'):
        df['Type'] = 'primary_rep'
        df['Date'] = f[:-8]   #strip the 8-character '_rep.txt' suffix
    else:
        df['Type'] = 'general'
        df['Date'] = f[:-4]   #general election files only have '.txt' to strip
    dataframes.append(df)
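The filename convention carries both the date and the debate type, so the same mapping can be pulled into a small helper. This is a sketch only: parse_debate_filename is a hypothetical name, and parsing the date into a datetime.date (instead of keeping the string, as the cell above does) is my own assumption about what might be convenient later, for example for sorting debates chronologically.

```python
from datetime import datetime


def parse_debate_filename(name):
    """Return (date, debate_type) for names like '1-14-16_rep.txt' or '9-26-16.txt'."""
    stem = name[:-len(".txt")]
    if stem.endswith("_dem"):
        debate_type, date_str = "primary_dem", stem[:-len("_dem")]
    elif stem.endswith("_rep"):
        debate_type, date_str = "primary_rep", stem[:-len("_rep")]
    else:
        debate_type, date_str = "general", stem
    # strptime accepts months and days without zero padding when parsing.
    return datetime.strptime(date_str, "%m-%d-%y").date(), debate_type


# A few names from the file listing above.
sample = ["1-14-16_rep.txt", "10-13-15_dem.txt", "9-26-16.txt", "8-6-15_rep.txt"]
chronological = sorted(sample, key=lambda n: parse_debate_filename(n)[0])
print(chronological)
```

Sorting on the parsed date avoids the string ordering seen in the file listing, where '10-13-15' sorts after '1-28-16'.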
# Every returned Out[] is displayed, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
for df in dataframes:
    df.head()
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 1-14-16
1 MODERATORS \nMaria Bartiromo (Fox Business Network); and\... primary_rep 1-14-16
2 CAVUTO It is 9:00 p.m. here at the North Charleston ... primary_rep 1-14-16
3 CAVUTO Welcome to the sixth Republican presidential o... primary_rep 1-14-16
4 CAVUTO I'm Neil Cavuto, alongside my friend and co-mo... primary_rep 1-14-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 1-17-16
1 MODERATORS \nLester Holt (NBC News)\nAndrea Mitchell (NBC... primary_dem 1-17-16
2 HOLT Good evening and welcome to the NBC News Yout... primary_dem 1-17-16
3 HOLT After all the campaigning, soon, Americans wil... primary_dem 1-17-16
4 HOLT And New Hampshire not far behind. primary_dem 1-17-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 1-25-16
1 MODERATOR \nChris Cuomo, CNN primary_dem 1-25-16
2 CUOMO All right. primary_dem 1-25-16
3 CUOMO We are live at Drake University in Des Moines,... primary_dem 1-25-16
4 CUOMO Welcome to our viewers in the United States an... primary_dem 1-25-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 1-28-16
1 MODERATORS \nBret Baier (Fox News);\nMegyn Kelly (Fox New... primary_rep 1-28-16
2 BAIER Nine p.m. on the East Coast. primary_rep 1-28-16
3 BAIER Eight o'clock here in Des Moines, Iowa. primary_rep 1-28-16
4 BAIER Welcome to the seventh Republican presidential... primary_rep 1-28-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Lincoln Chafee (RI);\nFormer... primary_dem 10-13-15
1 MODERATORS \nAnderson Cooper (CNN);\nDana Bash (CNN);\nDo... primary_dem 10-13-15
2 COOPER I'm Anderson Cooper. primary_dem 10-13-15
3 COOPER Thanks for joining us. primary_dem 10-13-15
4 COOPER We've already welcomed the candidates on stage. primary_dem 10-13-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton (D... general 10-19-16
1 MODERATOR \nChris Wallace (Fox News) general 10-19-16
2 WALLACE Good evening from the Thomas and Mack Center ... general 10-19-16
3 WALLACE I'm Chris Wallace of Fox News, and I welcome y... general 10-19-16
4 WALLACE This debate is sponsored by the Commission on ... general 10-19-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 10-28-15
1 MODERATORS \nJohn Harwood (CNBC);\nBecky Quick (CNBC); an... primary_rep 10-28-15
2 QUINTANILLA Good evening, I'm Carl Quintanilla, with my c... primary_rep 10-28-15
3 QUINTANILLA We'll be joined tonight by some of CNBC's top ... primary_rep 10-28-15
4 QUINTANILLA Let's get through the rules of the road. primary_rep 10-28-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton (D... general 10-9-16
1 MODERATORS \nAnderson Cooper (CNN) and\nMartha Raddatz (A... general 10-9-16
2 RADDATZ Ladies and gentlemen the Republican nominee f... general 10-9-16
3 RADDATZ [applause] general 10-9-16
4 COOPER Thank you very much for being here. general 10-9-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 11-10-15
1 MODERATORS \nGerard Baker (The Wall Street Journal);\nMar... primary_rep 11-10-15
2 CAVUTO It is 9:00 p.m. on the East Coast, 8:00 p.m. ... primary_rep 11-10-15
3 CAVUTO Welcome to the Republican presidential debate ... primary_rep 11-10-15
4 CAVUTO I'm Neil Cavuto, alongside my co-moderators, M... primary_rep 11-10-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 11-14-15
1 MODERATORS \nNancy Cordes (CBS News);\nKevin Cooney (CBS ... primary_dem 11-14-15
2 DICKERSON Before we start the debate here are the rules. primary_dem 11-14-15
3 DICKERSON The candidates have one minute to respond to o... primary_dem 11-14-15
4 DICKERSON Any candidate who is attacked by another candi... primary_dem 11-14-15
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 12-15-15
1 MODERATORS \nWolf Blitzer (CNN);\nDana Bash (CNN); and\nH... primary_rep 12-15-15
2 BLITZER Welcome to the CNN-Facebook Republican presid... primary_rep 12-15-15
3 BLITZER We have a very enthusiastic audience. primary_rep 12-15-15
4 BLITZER Everyone is here. primary_rep 12-15-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 12-19-15
1 MODERATORS \nMartha Raddatz (ABC News)\nDavid Muir (ABC N... primary_dem 12-19-15
2 RADDATZ Good evening to you all. primary_dem 12-19-15
3 RADDATZ The rules for tonight are very basic and have ... primary_dem 12-19-15
4 RADDATZ Candidates can take up to a minute-and-a-half ... primary_dem 12-19-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 2-11-16
1 MODERATORS \nGwen Ifill (PBS);\nJudy Woodruff (PBS) primary_dem 2-11-16
2 WOODRUFF Good evening, and thank you. primary_dem 2-11-16
3 WOODRUFF We are happy to welcome you to Milwaukee for t... primary_dem 2-11-16
4 WOODRUFF We are especially pleased to thank our partner... primary_dem 2-11-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 2-13-16
1 MODERATOR \nJohn Dickerson (CBS News); with primary_rep 2-13-16
2 PANELISTS \nMajor Garrett (CBS News); and\nKimberly Stra... primary_rep 2-13-16
3 DICKERSON Good evening. primary_rep 2-13-16
4 DICKERSON I'm John Dickerson. primary_rep 2-13-16
0 1 Type Date
0 PARTICIPANTS \nBen Carson;\nSenator Ted Cruz (TX);\nGoverno... primary_rep 2-25-16
1 MODERATOR \nWolf Blitzer (CNN); with primary_rep 2-25-16
2 PANELISTS \nMaria Celeste Arrarás (Telemundo);\nDana Bas... primary_rep 2-25-16
3 BLITZER We're live here at the University of Houston ... primary_rep 2-25-16
4 BLITZER [applause]\n\nAn enthusiastic crowd is on hand... primary_rep 2-25-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 2-4-16
1 MODERATORS \nChuck Todd (MSNBC);\nRachel Maddow (MSNBC) primary_dem 2-4-16
2 TODD Good evening, and welcome to the MSNBC Democr... primary_dem 2-4-16
3 MADDOW We are super excited to be here at the Univer... primary_dem 2-4-16
4 MADDOW Tonight, this is the first time that Hillary C... primary_dem 2-4-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 2-6-16
1 MODERATORS \nDavid Muir (ABC News); and\nMartha Raddatz (... primary_rep 2-6-16
2 MUIR Good evening, again, everyone. primary_rep 2-6-16
3 MUIR This is the first time since Iowa and the only... primary_rep 2-6-16
4 MUIR The people of Iowa have been heard. primary_rep 2-6-16
0 1 Type Date
0 PARTICIPANTS \nSenator Ted Cruz (TX);\nGovernor John Kasich... primary_rep 3-10-16
1 MODERATORS \nJake Tapper (CNN);\nDana Bash (CNN);\nHugh H... primary_rep 3-10-16
2 TAPPER Live from the Bank United Center on the campu... primary_rep 3-10-16
3 TAPPER For our viewers in the United States and aroun... primary_rep 3-10-16
4 TAPPER In just five days voters will go to the polls ... primary_rep 3-10-16
0 1 Type Date
0 PARTICIPANTS \nSenator Ted Cruz (TX);\nGovernor John Kasich... primary_rep 3-3-16
1 MODERATORS \nBret Baier (Fox News);\nMegyn Kelly (Fox New... primary_rep 3-3-16
2 KELLY Good evening, and welcome to the fabulous FOX... primary_rep 3-3-16
3 KELLY I'm Megyn Kelly, along with my co-moderators, ... primary_rep 3-3-16
4 BAIER 59 Republican delegates are at stake here in ... primary_rep 3-3-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 3-6-16
1 MODERATORS \nAnderson Cooper (CNN);\nDon Lemon (CNN) primary_dem 3-6-16
2 COOPER And welcome to The Whiting Auditorium on the ... primary_dem 3-6-16
3 COOPER I'm Anderson Cooper. primary_dem 3-6-16
4 COOPER I want to welcome our viewers in the United St... primary_dem 3-6-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 3-9-16
1 MODERATORS \nJorge Ramos (Univision);\nMaría Elena Salina... primary_dem 3-9-16
2 RAMOS [Speaking in Spanish] primary_dem 3-9-16
3 SALINAS This will be the first and only debate the ca... primary_dem 3-9-16
4 RAMOS Here with us tonight is Karen Tumulty, Washin... primary_dem 3-9-16
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton;\n... primary_dem 4-14-16
1 MODERATOR \nWolf Blitzer (CNN); primary_dem 4-14-16
2 PANELISTS \nDana Bash (CNN); and\nErrol Louis (NY1) primary_dem 4-14-16
3 BLITZER Secretary Clinton and Senator Sanders, you ca... primary_dem 4-14-16
4 BLITZER As moderator, I'll guide the discussion, askin... primary_dem 4-14-16
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 8-6-15
1 MODERATORS \nBret Baier (Fox News);\nMegyn Kelly (Fox New... primary_rep 8-6-15
2 KELLY Welcome to the first debate night of the 2016... primary_rep 8-6-15
3 KELLY I'm Megyn Kelly... [applause]... along with my... primary_rep 8-6-15
4 KELLY Tonight... [applause] Nice. primary_rep 8-6-15
0 1 Type Date
0 PARTICIPANTS \nFormer Governor Jeb Bush (FL);\nBen Carson;\... primary_rep 9-16-15
1 MODERATORS \nJake Tapper (CNN);\nDana Bash (CNN); and\nHu... primary_rep 9-16-15
2 TAPPER I'm Jake Tapper. primary_rep 9-16-15
3 TAPPER We're live at the Ronald Reagan Library in Sim... primary_rep 9-16-15
4 TAPPER Round 2 of CNN's presidential debate starts now. primary_rep 9-16-15
0 1 Type Date
0 PARTICIPANTS \nFormer Secretary of State Hillary Clinton (D... general 9-26-16
1 MODERATOR \nLester Holt (NBC News) general 9-26-16
2 HOLT Good evening from Hofstra University in Hemps... general 9-26-16
3 HOLT I'm Lester Holt, anchor of "NBC Nightly News." general 9-26-16
4 HOLT I want to welcome you to the first presidentia... general 9-26-16

Reordering and Naming Columns

#Creating a new giant list of cleaned dataframes where the columns are reordered and cleaned up
dataframes_clean = []
for df in dataframes:
    #Drop the first two rows (the PARTICIPANTS and MODERATOR(S) listings), which are not utterances
    #Note: the 2-13-16, 2-25-16, and 4-14-16 debates also have a PANELISTS row at index 2, which stays in
    df.drop([0, 1], inplace=True)
    #Renaming the columns
    df.columns = ['Speaker', 'Sents', 'Debate Type', 'Date']
    #Strip newlines from Speaker and Sents columns
    df['Speaker'] = df['Speaker'].str.strip('\n')
    df['Sents'] = df['Sents'].str.strip('\n')
    #Reorder columns
    dataframes_clean.append(df[['Date', 'Debate Type', 'Speaker', 'Sents']])
dataframes_clean[0].head()
Date Debate Type Speaker Sents
2 1-14-16 primary_rep CAVUTO It is 9:00 p.m. here at the North Charleston ...
3 1-14-16 primary_rep CAVUTO Welcome to the sixth Republican presidential o...
4 1-14-16 primary_rep CAVUTO I'm Neil Cavuto, alongside my friend and co-mo...
5 1-14-16 primary_rep BARTIROMO Tonight we are working with Facebook to ask t...
6 1-14-16 primary_rep BARTIROMO And according to Facebook, the U.S. election h...
dataframes_clean[-1].head()
Date Debate Type Speaker Sents
2 9-26-16 general HOLT Good evening from Hofstra University in Hemps...
3 9-26-16 general HOLT I'm Lester Holt, anchor of "NBC Nightly News."
4 9-26-16 general HOLT I want to welcome you to the first presidentia...
5 9-26-16 general HOLT The participants tonight are Donald Trump and ...
6 9-26-16 general HOLT This debate is sponsored by the Commission on ...

Now I have a clean dataframe for each debate. Each row is one sentence, labeled with who said it, what kind of debate it was, and when the debate took place. Next I'm going to save the list of dataframes to a pickle file (the CSV export is commented out) and process them with NER annotation in a different notebook.

Saving DataFrames

#i=-1
#for df in dataframes_clean:
#    i+=1
#    df.to_csv('../csv/'+str(files[i][:-4])+'.csv')
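import pickle
#Pickle the whole list of dataframes with the highest available protocol (same as -1)
with open('/Users/Paige/Documents/Data_Science/dataframes_list.p', 'wb') as f:
    pickle.dump(dataframes_clean, f, pickle.HIGHEST_PROTOCOL)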
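For the next notebook, reading the pickle back is the mirror image of the dump. The sketch below does a full round trip on stand-in data so it runs anywhere: the records list (plain dicts instead of dataframes) and the temporary file location are assumptions for illustration, not the project's real data or path.

```python
import os
import pickle
import tempfile

# Stand-in for dataframes_clean: plain dicts, so the sketch runs without pandas.
records = [
    {"Date": "1-14-16", "Debate Type": "primary_rep",
     "Speaker": "CAVUTO", "Sents": "Good evening."},
]

# Hypothetical temporary location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "dataframes_list.p")

# Write in binary mode with the highest available protocol.
with open(path, "wb") as f:
    pickle.dump(records, f, pickle.HIGHEST_PROTOCOL)

# Read back in binary mode; the protocol is detected automatically.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == records)
```

Pickle keeps the column dtypes and index of each dataframe, which a CSV round trip would not, though the file is only safe to load when it comes from a trusted source.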