This notebook is intended to document my data processing throughout this project. I'll be poking around and modifying my data in this file. The data I am starting out with are transcripts of the presidential debates from the 2016 US Election. I am processing the Democratic and Republican primary debates, and the debates of the general election between Hillary Clinton and Donald Trump. The transcripts were taken from UCSB's American Presidency Project, and the citation for each of the transcripts can be found in the README.
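import re
import nltk
import pandas as pd

# `files` is assumed to be a list of transcript filenames defined earlier
# I'm creating a list where each entry in the list is a transcript
transcripts = []
for f in files:
    fi = open(f, 'r')
    txt = fi.read()
    fi.close()
    transcripts.append(txt)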
print(transcripts[0][:200])
PARTICIPANTS:
Former Governor Jeb Bush (FL);
Ben Carson;
Governor Chris Christie (NJ);
Senator Ted Cruz (TX);
Governor John Kasich (OH);
Senator Marco Rubio (FL);
Donald Trump;
MODERATORS:
Maria Barti
PARTICIPANTS:
Former Secretary of State Hillary Clinton;
Former Governor Martin O'Malley (MD);
Senator Bernie Sanders (VT);
MODERATORS:
Lester Holt (NBC News)
Andrea Mitchell (NBC News)
HOLT: Good ev
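The file-reading loop above can also be written with a context manager, which closes each file even if reading fails. The following is only a minimal sketch: the function name `read_transcripts`, the `directory` and `pattern` parameters, and the UTF-8 encoding are assumptions for illustration, not part of the original notebook.

```python
import glob
import os

def read_transcripts(directory, pattern="*.txt"):
    """Return (filenames, texts) for every file in `directory` matching `pattern`.

    Hypothetical helper: the directory layout and encoding are assumptions.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    names, texts = [], []
    for path in paths:
        # The with-block closes the file even if read() raises.
        with open(path, "r", encoding="utf-8") as fi:
            texts.append(fi.read())
        names.append(os.path.basename(path))
    return names, texts
```

Calling `files, transcripts = read_transcripts(".")` would then give the two parallel lists that the rest of the notebook relies on.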
Cleaning up: I would eventually like to end up with a dataframe where the columns are Date, Type (primary or general), Speaker, Sents, where the Sents are in the order that they are said.
Splitting Transcripts by Speaker
# I want to split large chunks of the transcript based on who is speaking.
# Since the transcript data has a pretty standardized format (the speaker is in all caps followed by a colon),
# I can add a marker to each of these sections and split the data on that marker.
speaker_split = []
for txt in transcripts:
    # To take care of the first label, which has no newline preceding it
    txt = txt.replace("PARTICIPANTS:", 'PARTICIPANTS%:')
    # Get rid of the [through translator] label in the 3-9-16 debate
    txt = txt.replace(" [through translator]:", ":")
    # The ' in Martin O'Malley's name was causing some issues, so I'm changing his name (for the speaker column) to OMALLEY
    txt = txt.replace("O'MALLEY:", "OMALLEY:")
    txt = re.sub(r"\n([A-Z]+)(\s[A-Z]+)?:", r"#$&\1%:", txt)
    speaker_split.append(txt)
# Split each chunk by the special marker
speaker_split = [txt.strip().split("#$&") for txt in speaker_split]
speaker_split[0][:4]
['PARTICIPANTS%:\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;', 'MODERATORS%:\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n', "CAVUTO%: It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n", 'BARTIROMO%: Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']
# Creating a giant list so I don't have to handle things one at a time
# Splitting each chunk into two elements: speaker, speech
debates = [[txt.split("%:") for txt in split] for split in speaker_split]
debates[0][:4]
[['PARTICIPANTS', '\nFormer Governor Jeb Bush (FL);\nBen Carson;\nGovernor Chris Christie (NJ);\nSenator Ted Cruz (TX);\nGovernor John Kasich (OH);\nSenator Marco Rubio (FL);\nDonald Trump;'], ['MODERATORS', '\nMaria Bartiromo (Fox Business Network); and\nNeil Cavuto (Fox Business Network)\n'], ['CAVUTO', " It is 9:00 p.m. here at the North Charleston Coliseum and Performing Arts Center in South Carolina. Welcome to the sixth Republican presidential of the 2016 campaign, here on the Fox Business Network. I'm Neil Cavuto, alongside my friend and co-moderator Maria Bartiromo.\n"], ['BARTIROMO', ' Tonight we are working with Facebook to ask the candidates the questions voters want answered. And according to Facebook, the U.S. election has dominated the global conversation, with 131 million people talking about the 2016 race. That makes it the number one issue talked about on Facebook last year worldwide.\n']]
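The marker-based split above can also be done in one pass with `re.split`, which returns the captured speaker labels interleaved with the text between them. This is a sketch of an alternative, not the notebook's own method: the names `LABEL` and `split_by_speaker` are made up for illustration, and the pattern is the same all-caps-label pattern used above (so it likewise keeps only the first word of a two-word label). The `O'MALLEY` and `[through translator]` replacements would still need to be applied first.

```python
import re

# Same label pattern as above: a newline, an all-caps word, an optional second
# all-caps word, then a colon. Only the first word is captured.
LABEL = re.compile(r"\n([A-Z]+)(?:\s[A-Z]+)?:")

def split_by_speaker(txt):
    """Return a list of (speaker, speech) pairs in the order they occur."""
    # Prepend a newline so a label on the very first line also matches.
    parts = LABEL.split("\n" + txt.strip())
    # With one capturing group, re.split yields
    # [text_before_first_label, label1, speech1, label2, speech2, ...]
    labels, speeches = parts[1::2], parts[2::2]
    return list(zip(labels, speeches))
```

Because the labels come back as separate list items, no `#$&` or `%:` marker has to be inserted into the text and split out again.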
Tokenizing Each Speaker's Sentences
debate_sents = []
# For each debate, then for each [speaker, speech] chunk in that debate,
# get a list of tokenized sents to replace the speech
for debate in debates:
    sents_toks = []
    for chunk in debate:
        sents = nltk.sent_tokenize(chunk[1])
        for sent in sents:
            sents_toks.append([chunk[0], sent])
    debate_sents.append(sents_toks)
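The flattening step above, where each `[speaker, speech]` chunk becomes several `[speaker, sentence]` rows, can be isolated in a small function that accepts any sentence tokenizer. The sketch below is illustrative only: `speaker_sentences` and `naive_sent_tokenize` are hypothetical names, and the regex splitter is a deliberately crude stand-in so the example runs without NLTK's models. It would mis-split abbreviations such as "p.m.", which is a good reason to pass `nltk.sent_tokenize` instead, as the notebook does.

```python
import re

def naive_sent_tokenize(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def speaker_sentences(debate, tokenize=naive_sent_tokenize):
    """Flatten [speaker, speech] chunks into [speaker, sentence] rows, keeping order.

    Assumes every chunk has exactly two elements.
    """
    return [[speaker, sent]
            for speaker, speech in debate
            for sent in tokenize(speech)]
```

With NLTK available, `speaker_sentences(debate, tokenize=nltk.sent_tokenize)` should produce the same rows as the loop above for one debate.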
Mapping to Debate Type
# I am creating a list of 25 dataframes, one for each debate
# Adding a column specifying the type of debate, the date, the speaker, and sent
dataframes = []
for f in files:
    index = files.index(f)
    df = pd.DataFrame(debate_sents[index])
    if f.endswith('_dem.txt'):
        df['Type'] = 'primary_dem'
        df['Date'] = f[:-8]
    elif f.endswith('_rep.txt'):
        df['Type'] = 'primary_rep'
        df['Date'] = f[:-8]
    else:
        df['Type'] = 'general'
        df['Date'] = f[:-4]
    dataframes.append(df)
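The filename convention used above (for example `1-14-16_rep.txt`, `1-17-16_dem.txt`, `9-26-16.txt`, as implied by the slicing) can be captured in a helper that avoids hard-coded slice lengths. The sketch below is an illustration under that assumption. `debate_info` and `parse_date` are hypothetical names, and parsing the date is an optional extra that would allow chronological sorting, since the string `10-13-15` otherwise sorts after `1-28-16`.

```python
from datetime import datetime

def debate_info(filename):
    """Map a transcript filename to (debate_type, date_string)."""
    stem = filename[:-len(".txt")] if filename.endswith(".txt") else filename
    if stem.endswith("_dem"):
        return "primary_dem", stem[:-len("_dem")]
    if stem.endswith("_rep"):
        return "primary_rep", stem[:-len("_rep")]
    # No party suffix: treated as a general election debate, as in the loop above.
    return "general", stem

def parse_date(date_string):
    """Parse an M-D-YY string such as '1-14-16' into a datetime.date."""
    return datetime.strptime(date_string, "%m-%d-%y").date()
```

Inside the loop, `df['Type'], df['Date'] = debate_info(f)` would then replace the three-way `if`.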
# Every returned Out[] is displayed, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
for df in dataframes:
    df.head()

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 1-14-16 |
| 1 | MODERATORS | \nMaria Bartiromo (Fox Business Network); and\... | primary_rep | 1-14-16 |
| 2 | CAVUTO | It is 9:00 p.m. here at the North Charleston ... | primary_rep | 1-14-16 |
| 3 | CAVUTO | Welcome to the sixth Republican presidential o... | primary_rep | 1-14-16 |
| 4 | CAVUTO | I'm Neil Cavuto, alongside my friend and co-mo... | primary_rep | 1-14-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 1-17-16 |
| 1 | MODERATORS | \nLester Holt (NBC News)\nAndrea Mitchell (NBC... | primary_dem | 1-17-16 |
| 2 | HOLT | Good evening and welcome to the NBC News Yout... | primary_dem | 1-17-16 |
| 3 | HOLT | After all the campaigning, soon, Americans wil... | primary_dem | 1-17-16 |
| 4 | HOLT | And New Hampshire not far behind. | primary_dem | 1-17-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 1-25-16 |
| 1 | MODERATOR | \nChris Cuomo, CNN | primary_dem | 1-25-16 |
| 2 | CUOMO | All right. | primary_dem | 1-25-16 |
| 3 | CUOMO | We are live at Drake University in Des Moines,... | primary_dem | 1-25-16 |
| 4 | CUOMO | Welcome to our viewers in the United States an... | primary_dem | 1-25-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 1-28-16 |
| 1 | MODERATORS | \nBret Baier (Fox News);\nMegyn Kelly (Fox New... | primary_rep | 1-28-16 |
| 2 | BAIER | Nine p.m. on the East Coast. | primary_rep | 1-28-16 |
| 3 | BAIER | Eight o'clock here in Des Moines, Iowa. | primary_rep | 1-28-16 |
| 4 | BAIER | Welcome to the seventh Republican presidential... | primary_rep | 1-28-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Lincoln Chafee (RI);\nFormer... | primary_dem | 10-13-15 |
| 1 | MODERATORS | \nAnderson Cooper (CNN);\nDana Bash (CNN);\nDo... | primary_dem | 10-13-15 |
| 2 | COOPER | I'm Anderson Cooper. | primary_dem | 10-13-15 |
| 3 | COOPER | Thanks for joining us. | primary_dem | 10-13-15 |
| 4 | COOPER | We've already welcomed the candidates on stage. | primary_dem | 10-13-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton (D... | general | 10-19-16 |
| 1 | MODERATOR | \nChris Wallace (Fox News) | general | 10-19-16 |
| 2 | WALLACE | Good evening from the Thomas and Mack Center ... | general | 10-19-16 |
| 3 | WALLACE | I'm Chris Wallace of Fox News, and I welcome y... | general | 10-19-16 |
| 4 | WALLACE | This debate is sponsored by the Commission on ... | general | 10-19-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 10-28-15 |
| 1 | MODERATORS | \nJohn Harwood (CNBC);\nBecky Quick (CNBC); an... | primary_rep | 10-28-15 |
| 2 | QUINTANILLA | Good evening, I'm Carl Quintanilla, with my c... | primary_rep | 10-28-15 |
| 3 | QUINTANILLA | We'll be joined tonight by some of CNBC's top ... | primary_rep | 10-28-15 |
| 4 | QUINTANILLA | Let's get through the rules of the road. | primary_rep | 10-28-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton (D... | general | 10-9-16 |
| 1 | MODERATORS | \nAnderson Cooper (CNN) and\nMartha Raddatz (A... | general | 10-9-16 |
| 2 | RADDATZ | Ladies and gentlemen the Republican nominee f... | general | 10-9-16 |
| 3 | RADDATZ | [applause] | general | 10-9-16 |
| 4 | COOPER | Thank you very much for being here. | general | 10-9-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 11-10-15 |
| 1 | MODERATORS | \nGerard Baker (The Wall Street Journal);\nMar... | primary_rep | 11-10-15 |
| 2 | CAVUTO | It is 9:00 p.m. on the East Coast, 8:00 p.m. ... | primary_rep | 11-10-15 |
| 3 | CAVUTO | Welcome to the Republican presidential debate ... | primary_rep | 11-10-15 |
| 4 | CAVUTO | I'm Neil Cavuto, alongside my co-moderators, M... | primary_rep | 11-10-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 11-14-15 |
| 1 | MODERATORS | \nNancy Cordes (CBS News);\nKevin Cooney (CBS ... | primary_dem | 11-14-15 |
| 2 | DICKERSON | Before we start the debate here are the rules. | primary_dem | 11-14-15 |
| 3 | DICKERSON | The candidates have one minute to respond to o... | primary_dem | 11-14-15 |
| 4 | DICKERSON | Any candidate who is attacked by another candi... | primary_dem | 11-14-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 12-15-15 |
| 1 | MODERATORS | \nWolf Blitzer (CNN);\nDana Bash (CNN); and\nH... | primary_rep | 12-15-15 |
| 2 | BLITZER | Welcome to the CNN-Facebook Republican presid... | primary_rep | 12-15-15 |
| 3 | BLITZER | We have a very enthusiastic audience. | primary_rep | 12-15-15 |
| 4 | BLITZER | Everyone is here. | primary_rep | 12-15-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 12-19-15 |
| 1 | MODERATORS | \nMartha Raddatz (ABC News)\nDavid Muir (ABC N... | primary_dem | 12-19-15 |
| 2 | RADDATZ | Good evening to you all. | primary_dem | 12-19-15 |
| 3 | RADDATZ | The rules for tonight are very basic and have ... | primary_dem | 12-19-15 |
| 4 | RADDATZ | Candidates can take up to a minute-and-a-half ... | primary_dem | 12-19-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 2-11-16 |
| 1 | MODERATORS | \nGwen Ifill (PBS);\nJudy Woodruff (PBS) | primary_dem | 2-11-16 |
| 2 | WOODRUFF | Good evening, and thank you. | primary_dem | 2-11-16 |
| 3 | WOODRUFF | We are happy to welcome you to Milwaukee for t... | primary_dem | 2-11-16 |
| 4 | WOODRUFF | We are especially pleased to thank our partner... | primary_dem | 2-11-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 2-13-16 |
| 1 | MODERATOR | \nJohn Dickerson (CBS News); with | primary_rep | 2-13-16 |
| 2 | PANELISTS | \nMajor Garrett (CBS News); and\nKimberly Stra... | primary_rep | 2-13-16 |
| 3 | DICKERSON | Good evening. | primary_rep | 2-13-16 |
| 4 | DICKERSON | I'm John Dickerson. | primary_rep | 2-13-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nBen Carson;\nSenator Ted Cruz (TX);\nGoverno... | primary_rep | 2-25-16 |
| 1 | MODERATOR | \nWolf Blitzer (CNN); with | primary_rep | 2-25-16 |
| 2 | PANELISTS | \nMaria Celeste Arrarás (Telemundo);\nDana Bas... | primary_rep | 2-25-16 |
| 3 | BLITZER | We're live here at the University of Houston ... | primary_rep | 2-25-16 |
| 4 | BLITZER | [applause]\n\nAn enthusiastic crowd is on hand... | primary_rep | 2-25-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 2-4-16 |
| 1 | MODERATORS | \nChuck Todd (MSNBC);\nRachel Maddow (MSNBC) | primary_dem | 2-4-16 |
| 2 | TODD | Good evening, and welcome to the MSNBC Democr... | primary_dem | 2-4-16 |
| 3 | MADDOW | We are super excited to be here at the Univer... | primary_dem | 2-4-16 |
| 4 | MADDOW | Tonight, this is the first time that Hillary C... | primary_dem | 2-4-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 2-6-16 |
| 1 | MODERATORS | \nDavid Muir (ABC News); and\nMartha Raddatz (... | primary_rep | 2-6-16 |
| 2 | MUIR | Good evening, again, everyone. | primary_rep | 2-6-16 |
| 3 | MUIR | This is the first time since Iowa and the only... | primary_rep | 2-6-16 |
| 4 | MUIR | The people of Iowa have been heard. | primary_rep | 2-6-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nSenator Ted Cruz (TX);\nGovernor John Kasich... | primary_rep | 3-10-16 |
| 1 | MODERATORS | \nJake Tapper (CNN);\nDana Bash (CNN);\nHugh H... | primary_rep | 3-10-16 |
| 2 | TAPPER | Live from the Bank United Center on the campu... | primary_rep | 3-10-16 |
| 3 | TAPPER | For our viewers in the United States and aroun... | primary_rep | 3-10-16 |
| 4 | TAPPER | In just five days voters will go to the polls ... | primary_rep | 3-10-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nSenator Ted Cruz (TX);\nGovernor John Kasich... | primary_rep | 3-3-16 |
| 1 | MODERATORS | \nBret Baier (Fox News);\nMegyn Kelly (Fox New... | primary_rep | 3-3-16 |
| 2 | KELLY | Good evening, and welcome to the fabulous FOX... | primary_rep | 3-3-16 |
| 3 | KELLY | I'm Megyn Kelly, along with my co-moderators, ... | primary_rep | 3-3-16 |
| 4 | BAIER | 59 Republican delegates are at stake here in ... | primary_rep | 3-3-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 3-6-16 |
| 1 | MODERATORS | \nAnderson Cooper (CNN);\nDon Lemon (CNN) | primary_dem | 3-6-16 |
| 2 | COOPER | And welcome to The Whiting Auditorium on the ... | primary_dem | 3-6-16 |
| 3 | COOPER | I'm Anderson Cooper. | primary_dem | 3-6-16 |
| 4 | COOPER | I want to welcome our viewers in the United St... | primary_dem | 3-6-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 3-9-16 |
| 1 | MODERATORS | \nJorge Ramos (Univision);\nMaría Elena Salina... | primary_dem | 3-9-16 |
| 2 | RAMOS | [Speaking in Spanish] | primary_dem | 3-9-16 |
| 3 | SALINAS | This will be the first and only debate the ca... | primary_dem | 3-9-16 |
| 4 | RAMOS | Here with us tonight is Karen Tumulty, Washin... | primary_dem | 3-9-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton;\n... | primary_dem | 4-14-16 |
| 1 | MODERATOR | \nWolf Blitzer (CNN); | primary_dem | 4-14-16 |
| 2 | PANELISTS | \nDana Bash (CNN); and\nErrol Louis (NY1) | primary_dem | 4-14-16 |
| 3 | BLITZER | Secretary Clinton and Senator Sanders, you ca... | primary_dem | 4-14-16 |
| 4 | BLITZER | As moderator, I'll guide the discussion, askin... | primary_dem | 4-14-16 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 8-6-15 |
| 1 | MODERATORS | \nBret Baier (Fox News);\nMegyn Kelly (Fox New... | primary_rep | 8-6-15 |
| 2 | KELLY | Welcome to the first debate night of the 2016... | primary_rep | 8-6-15 |
| 3 | KELLY | I'm Megyn Kelly... [applause]... along with my... | primary_rep | 8-6-15 |
| 4 | KELLY | Tonight... [applause] Nice. | primary_rep | 8-6-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Governor Jeb Bush (FL);\nBen Carson;\... | primary_rep | 9-16-15 |
| 1 | MODERATORS | \nJake Tapper (CNN);\nDana Bash (CNN); and\nHu... | primary_rep | 9-16-15 |
| 2 | TAPPER | I'm Jake Tapper. | primary_rep | 9-16-15 |
| 3 | TAPPER | We're live at the Ronald Reagan Library in Sim... | primary_rep | 9-16-15 |
| 4 | TAPPER | Round 2 of CNN's presidential debate starts now. | primary_rep | 9-16-15 |

|   | 0 | 1 | Type | Date |
|---|---|---|---|---|
| 0 | PARTICIPANTS | \nFormer Secretary of State Hillary Clinton (D... | general | 9-26-16 |
| 1 | MODERATOR | \nLester Holt (NBC News) | general | 9-26-16 |
| 2 | HOLT | Good evening from Hofstra University in Hemps... | general | 9-26-16 |
| 3 | HOLT | I'm Lester Holt, anchor of "NBC Nightly News." | general | 9-26-16 |
| 4 | HOLT | I want to welcome you to the first presidentia... | general | 9-26-16 |

Reordering and Naming Columns
# Creating a new giant list of cleaned dataframes where the columns are reordered and cleaned up
dataframes_clean = []
for df in dataframes:
    # Drop the first two rows (the PARTICIPANTS and MODERATOR(S) listings), which are not utterances
    # Note: debates that also have a PANELISTS listing keep that row at index 2
    df.drop(0, inplace=True)
    df.drop(1, inplace=True)
    # Name the first two columns and rename Type to Debate Type
    df.columns = ['Speaker', 'Sents', 'Debate Type', 'Date']
    # Strip newlines from Speaker and Sents columns
    df['Speaker'] = df['Speaker'].apply(lambda x: x.strip('\n'))
    df['Sents'] = df['Sents'].apply(lambda x: x.strip('\n'))
    # Reorder columns
    dataframes_clean.append(df[['Date', 'Debate Type', 'Speaker', 'Sents']])
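Dropping rows by position assumes every transcript starts with exactly two header listings, but the previews above show that the 2-13-16, 2-25-16, and 4-14-16 debates also have a PANELISTS row. A label-based filter is one way to handle that. The sketch below is an illustration, not the notebook's method: `NON_SPEAKERS` and `drop_header_rows` are hypothetical names, and it works on plain `[speaker, sentence]` rows so that it runs without pandas.

```python
# Header labels seen in the previews above; none of them is an actual speaker.
NON_SPEAKERS = {"PARTICIPANTS", "MODERATOR", "MODERATORS", "PANELISTS"}

def drop_header_rows(rows, non_speakers=NON_SPEAKERS):
    """Keep only [speaker, sentence] rows whose label is not a header listing.

    Also strips surrounding whitespace from the label and newlines from the sentence.
    """
    return [[speaker.strip(), sent.strip("\n")]
            for speaker, sent in rows
            if speaker.strip() not in non_speakers]
```

On a dataframe, the equivalent filter would be along the lines of `df[~df['Speaker'].isin(NON_SPEAKERS)]`.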
dataframes_clean[0].head()

|   | Date | Debate Type | Speaker | Sents |
|---|---|---|---|---|
| 2 | 1-14-16 | primary_rep | CAVUTO | It is 9:00 p.m. here at the North Charleston ... |
| 3 | 1-14-16 | primary_rep | CAVUTO | Welcome to the sixth Republican presidential o... |
| 4 | 1-14-16 | primary_rep | CAVUTO | I'm Neil Cavuto, alongside my friend and co-mo... |
| 5 | 1-14-16 | primary_rep | BARTIROMO | Tonight we are working with Facebook to ask t... |
| 6 | 1-14-16 | primary_rep | BARTIROMO | And according to Facebook, the U.S. election h... |

dataframes_clean[-1].head()

|   | Date | Debate Type | Speaker | Sents |
|---|---|---|---|---|
| 2 | 9-26-16 | general | HOLT | Good evening from Hofstra University in Hemps... |
| 3 | 9-26-16 | general | HOLT | I'm Lester Holt, anchor of "NBC Nightly News." |
| 4 | 9-26-16 | general | HOLT | I want to welcome you to the first presidentia... |
| 5 | 9-26-16 | general | HOLT | The participants tonight are Donald Trump and ... |
| 6 | 9-26-16 | general | HOLT | This debate is sponsored by the Commission on ... |

Now I have a clean dataframe for each debate. For every sentence in every debate, it records who said it, what kind of debate it was, and when the debate took place. Next, I'm going to export these dataframes to CSV files and process them with NER annotation in a different notebook.
Saving DataFrames
#i=-1
#for df in dataframes_clean:
#    i+=1
#    df.to_csv('../csv/'+str(files[i][:-4])+'.csv')
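The commented-out export above pairs each dataframe with its source filename through a manual counter. The sketch below shows the same idea, one CSV per debate named after the transcript file, using only the standard library so that it runs anywhere. `save_debate` and its parameters are hypothetical, and the rows are assumed to already be in `Date, Debate Type, Speaker, Sents` order.

```python
import csv
import os

def save_debate(rows, source_name, out_dir):
    """Write one debate's rows to <out_dir>/<source stem>.csv and return the path."""
    stem, _ = os.path.splitext(os.path.basename(source_name))
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, stem + ".csv")
    # newline="" lets the csv module control line endings.
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["Date", "Debate Type", "Speaker", "Sents"])
        writer.writerows(rows)
    return out_path
```

With the dataframes from this notebook, the analogous loop would be `for name, df in zip(files, dataframes_clean): df.to_csv(...)`, which removes the need for the counter `i`.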