GitHub - blake1407/Thesis: Repository for A Psycholinguistic Analysis Of Changes In Stereotypes And Hate Speech Associated With Muslim And/Or Arab Individuals And Jewish And/Or Israel Individuals In Political Discourse On “X” In The Year Before And After October 7th

This repository contains a suite of scripts designed for scraping, processing, and analyzing tweets from selected X (formerly Twitter) accounts focusing on the 2023 Israel-Hamas Conflict. The project aims to analyze sentiment and implicit bias in social media discourse. The full thesis is available here. An in-depth overview of the data analysis pipeline and results is available here.

📋 Overview

The workflow consists of several integrated components:

Data Collection - Scraping tweets and supplementary data
Data Processing - Parsing, anonymizing, and tokenizing text
Semantic Analysis - Generating embeddings and analyzing sentiment
Bias Assessment - Calculate implicit associations in discourse

🔧 Components

Supplementary Materials Preparation:

NYT Metadata Collection
- NYT Metadata/NYT API - Metadata Scraper.ipynb - Collects metadata from NYT articles using NYT API
- NYT Metadata/Creating List of Keywords.ipynb - Identifies top mentioned entities for filtering
NYT Metadata Collection
- X Trending Before and After/Selenium for Archive Twitter.ipynb - Scrapes trending terms using Selenium and BeautifulSoup from https://archive.twitter-trending.com

Tweet Collection:

MINVERA_AdvancedScrape_X.ipynb - Scrapes tweets containing specified keywords with minimum engagement metrics
- Outputs structured JSON files for further processing

Data Processing:

Parsing_Script_for_Raw_HTML.ipynb - Extracts and processes tweet data (likes, replies, retweets, views)
Concatenate Parsed Data.ipynb - Anonymizes content and organizes by time period (before/after Oct 7, 2023)
Generating Tokens Details.ipynb - Text preprocessing pipeline:
- Removes usernames, punctuation, and stop words
- Applies sentence markers
- Tokenizes text with unique token and segment IDs

Embedding Generation:

Tokenized Data/BERT - Getting Embeddings.ipynb - Generates BERT embeddings, using the pre-trained model available on Hugging Face.
Tokenized Data/word2vec - Getting Embeddings.ipynb - Generates word2vec embeddings

Sentiment Analysis:

Dictionary Creation
- Create Dictionary/Data compiling.ipynb is a modified SADCAT script (Semi-Automated Dictionary Creation for Analyzing Text; Nicolas et al., 2019; Nicolas et al., 2021; https://github.com/gandalfnicolas/SADCAT) that was adapted from R to Python (Gautam et al., under review).
- Based on Kurdi et al. (2019) stereotype categories: warm, cold, incompetence, competence, Jewish, Muslim, Arabic, Israeli.
Sentiment proportion generation
- Positive and negative words are identified using the Linguistic Inquiry and Word Count toolbox (LIWC; Cohn et al., 2004).
Satistical Modeling
- R Codes/Finalizing_models.R - Creates hierarchical models to analyze sentiment patterns
- R Codes/get_simslopes.R - Decomposes significant interactions (developed by Richa Gautam)

Implicit Association Analysis:

WEAT.ipynb - Implements Word Embedding Association Test based on Caliskan et al. (2017) and Charlesworth et al. (2021)
- Examines implicit biases and stereotype associations in tweet content

📦 Dependencies

Python 3.x
Data colection: Selenium, BeautifulSoup
NLP processing: Transformers (for BERT embeddings)
Statistical analysis: R libraries for hierarchical modeling
WEAT implementation from Charlesworth et al. (2021)

🚀 Usage

Data Collection

Run MINVERA_AdvancedScrape_X.ipynb to collect tweets
Configure X login credentials in the 5th code block (your_email, your_username, your_password)

Data Processing

Use Parsing_Script_for_Raw_HTML.ipynb to extract and anonymize collected data
Process text with Concatenate Parsed Data.ipynb, then Generating Tokens Details.ipynb

Generate Embeddings

Use either BERT or word2vec embedding notebooks

Analysis

Apply WEAT for implicit association analysis
Use R scripts for statistical modeling of sentiment patterns

📁 Available Resources

Pre-scraped tweets related to the 2023 Israel-Hamas Conflict are available in the Raw Data folder
Full list of analyzed influencers available in Supplementary Materials/Followers List & Categories - Accounts Kept.csv

📚 References

Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. https://doi.org/10.1126/science.aal4230
Charlesworth, T. E. S., Yang, V., Mann, T. C., Kurdi, B., & Banaji, M. R. (2021). Gender Stereotypes in Natural Language: Word Embeddings Show Robust Consistency Across Child and Adult Language Corpora of More Than 65 Million Words. Psychological Science, 32(2), 218–240. https://doi.org/10.1177/0956797620963619
Cohn, M. A., Mehl, M. R., & Pennebaker, J. W. (2004). Linguistic Markers of Psychological Change Surrounding September 11, 2001. Psychological Science, 15(10), 687–693. https://doi.org/10.1111/j.0956-7976.2004.00741.x
Nicolas, G., Bai X, & Fiske, S. (2019). Automated Dictionary Creation for Analyzing Text: An Illustration from Stereotype Content. PsyArXiv (OSF Preprints). https://doi.org/10.31234/osf.io/afm8k.
Nicolas, G., Bai, X., & Fiske, S. T. (2021). Comprehensive stereotype content dictionaries using a semi‐automated method. European Journal of Social Psychology, 51(1), 178–196. https://doi.org/10.1002/ejsp.2724.
Kurdi, B., Mann, T. C., Charlesworth, T. E. S., & Banaji, M. R. (2019). The relationship between implicit intergroup attitudes and beliefs. Proceedings of the National Academy of Sciences, 116(13), 5862–5871. https://doi.org/10.1073/pnas.1820240116

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
Cleaned Data		Cleaned Data
Create Dictionary		Create Dictionary
Other		Other
Parsed Data		Parsed Data
R Codes		R Codes
R Scripts		R Scripts
Raw Data		Raw Data
Result Documents		Result Documents
Results		Results
Supplementary Materials		Supplementary Materials
Tokenized Data		Tokenized Data
.DS_Store		.DS_Store
Calculate Stereotype Proportion.ipynb		Calculate Stereotype Proportion.ipynb
Concatenate Parsed Data.ipynb		Concatenate Parsed Data.ipynb
Generating Tokens Details.ipynb		Generating Tokens Details.ipynb
Graphing for Interaction.R		Graphing for Interaction.R
Hate Speech Detection.ipynb		Hate Speech Detection.ipynb
MINVERA_AdvancedScrape_X.ipynb		MINVERA_AdvancedScrape_X.ipynb
Parsing Script for Raw HTML.ipynb		Parsing Script for Raw HTML.ipynb
README.md		README.md
Term frequency proportion.ipynb		Term frequency proportion.ipynb
WEAT.ipynb		WEAT.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📋 Overview

🔧 Components

📦 Dependencies

🚀 Usage

📁 Available Resources

📚 References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

📋 Overview

🔧 Components

📦 Dependencies

🚀 Usage

📁 Available Resources

📚 References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages