This repository contains a comprehensive system for sampling and analyzing Reddit "Am I The Asshole" (AITA) data. The system helps researchers and analysts work with large Reddit datasets by creating manageable, stratified samples.
```
AITA-Data-Analysis/
├── data/                                  # Original data files
│   ├── submission.csv                     # ~31K submissions (30MB+)
│   ├── comment.csv                        # ~9.1M comments (2GB+)
│   └── AmItheAsshole.sqlite               # SQLite database
│
├── config/                                # Configuration files
│   └── sampling_config.yaml               # YAML configuration for sampling parameters
│
├── samples/                               # Generated sample files
│   ├── sampled_submissions.csv            # CSV format for analysis
│   ├── sampled_comments.csv               # CSV format for analysis
│   ├── sampled_review.txt                 # Human-readable TXT format
│   ├── sampled_metadata.yaml              # YAML metadata with statistics
│   ├── *_summary.txt                      # Simple text summaries
│   ├── balanced/                          # Balanced sampling outputs
│   │   ├── balanced_comments.csv          # Balanced comments with placeholder categories
│   │   └── balanced_submissions.csv       # Corresponding submission context
│   └── verdict/                           # Verdict extraction outputs
│       ├── verdict_all_verdicts.csv       # All extracted verdicts
│       ├── verdict_balanced_samples.csv   # Balanced samples based on actual verdicts
│       └── verdict_summary.txt            # Verdict distribution statistics
│
├── favorites/                             # Manually selected favorites
│   ├── engagement/                        # Engagement-based sampling favorites
│   │   ├── engagement_favorite_submissions.csv  # CSV format for analysis
│   │   ├── engagement_favorite_comments.csv     # CSV format for analysis
│   │   └── engagement_favorite_submissions.txt  # Human-readable TXT format
│   ├── balanced/                          # Balanced sampling favorites
│   │   ├── balanced_favorite_comments.csv       # Balanced sample favorites
│   │   ├── balanced_favorite_submissions.csv    # Corresponding submission context
│   │   └── balanced_favorite_comments.txt       # Human-readable balanced favorites
│   └── stratified/                        # Stratified sampling favorites
│       ├── stratified_favorite_submissions.csv  # Stratified sample favorites
│       ├── stratified_favorite_comments.csv     # All comments for stratified favorites
│       └── stratified_favorite_submissions.txt  # Human-readable stratified favorites
│
├── Scripts
│   ├── sample_data.py                     # Engagement-based sampling
│   ├── explore_data.py                    # Data exploration and analysis
│   ├── preview_sample.py                  # Preview sampled data
│   ├── simple_select.py                   # Interactive selection
│   ├── extract_verdicts.py                # Verdict extraction and balanced sampling
│   ├── select_balanced_favorites.py       # Balanced sample selection
│   ├── run_balanced_workflow.py           # Complete balanced workflow
│   ├── stratified_aita_sample.py          # Stratified AITA sampling
│   ├── select_stratified_favorites.py     # Stratified sample selection
│   └── reading_data.ipynb                 # Jupyter notebook for data loading
│
└── README.md                              # This file
```
Automated setup:

```bash
# Run the setup script to create directories and install dependencies
python setup.py
```

Manual setup with a virtual environment:

```bash
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate   # On macOS/Linux
# or: venv\Scripts\activate   # On Windows

# Install dependencies
pip install -r requirements.txt

# Create directories
mkdir -p data samples favorites config
```

Global setup:

```bash
# Install dependencies globally (not recommended)
pip install -r requirements.txt

# Create directories
mkdir -p data samples favorites config
```

Place your data files in the `data/` directory:

- `data/submission.csv`
- `data/comment.csv`
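If your data only exists in the SQLite database, the export to CSV boils down to a few lines of pandas. The sketch below builds a tiny in-memory table so it runs standalone; in practice you would connect to `data/AmItheAsshole.sqlite`, and the table name `submission` is an assumption (list the real tables with `SELECT name FROM sqlite_master WHERE type='table'`):

```python
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for data/AmItheAsshole.sqlite so the sketch runs
# standalone. The "submission" table name and columns are assumptions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE submission (id TEXT, title TEXT, score INTEGER)")
con.execute("INSERT INTO submission VALUES ('13xix2x', 'AITA for ...?', 1540)")
con.commit()

# Read the whole table into a DataFrame and write it out as CSV.
df = pd.read_sql_query("SELECT * FROM submission", con)
df.to_csv("submission_export.csv", index=False)  # data/submission.csv in practice
con.close()
```
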
If you have a SQLite database, you can use the provided notebook to export CSVs:

```bash
jupyter notebook reading_data.ipynb
```

Get a sense of your dataset and decide on filtering parameters:

```bash
python explore_data.py
```

Create a stratified, balanced sample of submissions and comments:

```bash
python sample_data.py
```

You can customize the sample with options (see below).

See examples from each engagement tier and review sample statistics:

```bash
python preview_sample.py
```

Interactively choose your preferred samples for detailed analysis:

```bash
python simple_select.py
```

Your selections will be saved in the `favorites/` directory.
This approach samples based on post popularity and engagement:
```bash
# 1. Explore your data
python explore_data.py

# 2. Generate engagement-based sample
python sample_data.py --sample-type standard

# 3. Preview the sample
python preview_sample.py

# 4. Select your favorites
python simple_select.py
```

Output: `favorites/engagement/engagement_favorite_submissions.csv` and `favorites/engagement/engagement_favorite_comments.csv`
This approach creates balanced samples across verdict categories:
```bash
# 1. Run complete balanced workflow with interactive selection
python run_balanced_workflow.py --interactive

# OR run individual steps:
# 2a. Extract verdicts and create balanced samples
python extract_verdicts.py --sample-size 100000 --samples-per-category 10

# 2b. Select your favorites from balanced samples
python select_balanced_favorites.py
```

Output: `favorites/balanced/balanced_favorite_comments.csv` and `favorites/balanced/balanced_favorite_submissions.csv`
This approach samples AITA submissions stratified by verdict, then lets you select your favorites:
```bash
# 1. Create stratified sample of submissions by verdict
python stratified_aita_sample.py

# 2. Select your favorite submissions from each verdict category
python select_stratified_favorites.py
```

Output: `favorites/stratified/stratified_favorite_submissions.csv` and `favorites/stratified/stratified_favorite_comments.csv`
Key Features:
- Samples submissions (not comments) stratified by dominant verdict
- Filters by length for manageable content
- 5x oversampling for selection flexibility
- Groups submissions by verdict category during selection
- Includes all comments for selected submissions
Engagement-Based Workflow:
- `sample_data.py` → creates `samples/sampled_submissions.csv` and `samples/sampled_comments.csv`
- `simple_select.py` → reads from samples, lets you select submissions
- Selected submissions + their top comments → saved to `favorites/engagement/`
Verdict-Based Workflow:
- `extract_verdicts.py` → creates `samples/verdict/verdict_balanced_samples.csv` + `samples/verdict/verdict_balanced_submissions.csv`
- `select_balanced_favorites.py` → reads balanced samples, lets you select comments
- Selected comments + their full AITA submission context → saved to `favorites/balanced/`
Stratified AITA Workflow:
- `stratified_aita_sample.py` → creates `samples/stratified/*_submissions.csv` + `samples/stratified/*_comments.csv`
- `select_stratified_favorites.py` → reads stratified samples, lets you select submissions
- Selected submissions + all their comments → saved to `favorites/stratified/`
Key Differences:
- Engagement workflow: You select submissions, get their comments
- Verdict workflow: You select comments, get their submission context
- Stratified workflow: You select submissions, get all their comments
Engagement-Based Selection (`simple_select.py`):

```
SUBMISSION 1/20 - HIGH ENGAGEMENT TIER
Title: AITA for refusing to babysit my nephew?
Score: 1540

TEXT: [Full submission text...]

TOP COMMENTS:
1. (Score: 45): NTA, it's not your responsibility...
2. (Score: 32): You're absolutely right to say no...

Select this submission? (y/n/q to quit):
```
Verdict-Based Selection (`select_balanced_favorites.py`):

```
COMMENT 1/6 - NOT THE ASSHOLE
Comment ID: jmjed4l
Submission ID: 13xix2x
Score: 1
Verdict: not the asshole

COMMENT TEXT:
NTA. Its not your child, not your responsibility...

SUBMISSION CONTEXT:
Title: AITA for denying my sister of babysitting my nephew?
Score: 1540
Submission Text: [First 300 characters...]

Select this comment? (y/n/q to quit):
```
Stratified Selection (`select_stratified_favorites.py`):

```
SUBMISSION 1/90 - NOT THE ASSHOLE
Submission ID: 13xix2x
Title: AITA for denying my sister of babysitting my nephew?
Score: 1540
Dominant Verdict: not the asshole
Verdict Count: 45
Length: 1,247 characters

AITA SUBMISSION:
[Full submission text...]

TOP COMMENTS (203 total):
Comment (Score: 45): NTA, it's not your responsibility...
Comment (Score: 32): You're absolutely right to say no...

Select this submission? (y/n/q to quit):
```
- Change sample size or filtering: use command-line options with `sample_data.py`, e.g.:

  ```bash
  python sample_data.py --max-submission-chars 1000 --max-comment-chars 300 --target-n 30
  ```

- Use a preset sample type:

  ```bash
  python sample_data.py --sample-type conservative   # or: --sample-type standard, --sample-type large
  ```

- Change output locations: all paths are managed in `config.py`. Edit this file to change directory names or file locations.
Engagement Sampling Files:

- `samples/sampled_submissions.csv` → Sampled submissions with engagement metrics
- `samples/sampled_comments.csv` → Top comments for each sampled submission
- `favorites/engagement/engagement_favorite_submissions.csv` → Your manually selected submissions
- `favorites/engagement/engagement_favorite_comments.csv` → Comments for your selected submissions
Balanced Sampling Files:
- `samples/verdict/verdict_all_verdicts.csv` → All extracted verdicts with metadata
- `samples/verdict/verdict_balanced_samples.csv` → Balanced samples based on actual verdicts
- `favorites/balanced/balanced_favorite_comments.csv` → Your selected favorites from balanced samples
- `favorites/balanced/balanced_favorite_submissions.csv` → Corresponding submission context
Stratified Sampling Files:
- `samples/stratified/*_submissions.csv` → Stratified submissions by verdict category
- `samples/stratified/*_comments.csv` → All comments for stratified submissions
- `favorites/stratified/stratified_favorite_submissions.csv` → Your selected favorite submissions
- `favorites/stratified/stratified_favorite_comments.csv` → All comments for selected submissions
TXT Files:

- `samples/sampled_review.txt` → Complete human-readable sample with all submissions and comments
- `favorites/engagement/engagement_favorite_submissions.txt` → Your selected submissions in easy-to-read format
- `samples/*_summary.txt` → Simple text summaries of sampling statistics
YAML Files:

- `config/sampling_config.yaml` → All sampling parameters and configurations
- `samples/sampled_metadata.yaml` → Detailed metadata about each sampling run
- All scripts use the paths defined in `config.py` and `config/sampling_config.yaml`
- TXT files are perfect for manual review and sharing with collaborators
- YAML files store configuration and metadata for reproducibility
- CSV files remain the primary format for data analysis
```bash
# Create a new virtual environment
python -m venv venv

# Activate the virtual environment
source venv/bin/activate   # On macOS/Linux
# or: venv\Scripts\activate   # On Windows

# Verify activation (you should see (venv) in your prompt)
which python   # Should point to venv/bin/python

# Install dependencies in the virtual environment
pip install -r requirements.txt

# Run scripts (make sure venv is activated)
python explore_data.py
python sample_data.py

# Deactivate when done
deactivate
```

Benefits:

- Isolation: Prevents conflicts between project dependencies
- Reproducibility: Ensures a consistent environment across different machines
- Cleanup: Easy to remove all project dependencies by deleting the venv folder
- Best Practice: Standard practice for Python development

Troubleshooting:

- If you see "command not found": Make sure the virtual environment is activated
- If packages aren't found: Run `pip install -r requirements.txt` again
- To remove the environment: Simply delete the `venv/` folder
This system offers complementary sampling approaches:

Engagement-Based Sampling: uses `sample_data.py` to create stratified samples based on post engagement (score quintiles).

Verdict-Based Sampling: creates balanced samples across verdict categories for fair analysis.
The AITA dataset has a natural imbalance in verdicts:
- not the asshole: ~63% of comments
- asshole: ~31% of comments
- everyone sucks: ~4% of comments
- no assholes here: ~2% of comments
Balanced sampling ensures equal representation across all verdict categories.
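A minimal sketch of the balancing step, assuming comments have already been labelled with a `verdict` column (the data below is illustrative, mirroring the real imbalance in miniature):

```python
import pandas as pd

# Illustrative labelled comments: the majority class dominates, as in the
# real ~63/31/4/2 verdict distribution.
comments = pd.DataFrame({
    "id": range(10),
    "verdict": (["not the asshole"] * 6 + ["asshole"] * 2
                + ["everyone sucks", "no assholes here"]),
})

# Draw the same number of comments from every verdict category, so the
# majority class cannot dominate the sample.
per_category = 1
balanced = comments.groupby("verdict").sample(n=per_category, random_state=0)
```
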
```bash
# Complete workflow with interactive selection
python run_balanced_workflow.py --interactive

# Custom parameters
python run_balanced_workflow.py --samples-per-category 10 --max-comment-chars 500

# Individual steps
python extract_verdicts.py --sample-size 100000 --samples-per-category 10
python select_balanced_favorites.py
```

Use Engagement-Based Sampling (`sample_data.py`) when:
- Studying community engagement patterns
- Analyzing what makes posts popular
- Researching content virality
- Need diverse representation across popularity levels
Use Verdict-Based Sampling (extract_verdicts.py) when:
- Analyzing moral judgments and verdicts
- Studying community decision-making
- Need balanced representation across verdict categories
- Researching bias in community judgments
- Similar to your sexism study approach
Use Stratified AITA Sampling (stratified_aita_sample.py) when:
- Want to select submissions (not comments) as your primary unit
- Need balanced representation across verdict categories
- Want to see full AITA stories with all their comments
- Studying narrative patterns in AITA submissions
- Need submission-level analysis with complete comment context
- Similar to paper examples that sample submissions stratified by category
Purpose: Creates stratified, balanced samples from large Reddit datasets.
Key Features:
- Character filtering: Limits content length for manageable analysis
- Engagement stratification: Balances samples across popularity levels
- Oversampling: Generates 5x more samples than needed for selection
- Representative comments: Includes top-scoring comments for each submission
Parameters:
```
python sample_data.py [OPTIONS]

Options:
  --max-submission-chars INT     Maximum characters for submissions (default: 2000)
  --max-comment-chars INT        Maximum characters for comments (default: 500)
  --target-n INT                 Target number of samples (default: 50)
  --oversample-factor INT        Oversample factor (default: 5)
  --comments-per-submission INT  Number of top comments per submission (default: 3)
  --output-prefix STR            Output file prefix (default: sampled)
```

Example Usage:
```bash
# Conservative sample (shorter content)
python sample_data.py --max-submission-chars 1000 --max-comment-chars 300 --target-n 30

# Large sample (longer content)
python sample_data.py --max-submission-chars 3000 --max-comment-chars 800 --target-n 100
```

Purpose: Analyzes data distributions to inform sampling decisions.
Outputs:
- Character length distributions for submissions and comments
- Score/engagement statistics
- Impact analysis of different filtering thresholds
- Sample submissions for manual review
Usage:
```bash
python explore_data.py
```

Purpose: Shows readable previews of generated samples.
Features:
- One example from each engagement tier
- Sample statistics and distributions
- Length breakdowns
- Top comment previews
Usage:
```bash
python preview_sample.py
```

Purpose: Streamlined tool for manually selecting preferred samples.
Features:
- Shows 2 samples from each engagement tier
- Simple y/n/q interface
- Automatic saving of selections
- Error handling and validation
Usage:
```bash
python simple_select.py
```

Purpose: Extracts actual verdicts from comments and creates balanced samples.
Features:
- Uses regex patterns to identify YTA, NTA, ESH, NAH in comment text
- Creates truly balanced samples based on actual verdict distribution
- Provides detailed statistics on verdict distribution
- Filters by comment length for manageable samples
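The pattern matching presumably looks something like the following sketch; the actual regexes in `extract_verdicts.py` may differ (for example, in whether a casual lowercase "nah" should count as a verdict):

```python
import re

# Map verdict acronyms to the category names used in the output files.
VERDICTS = {
    "YTA": "asshole",
    "NTA": "not the asshole",
    "ESH": "everyone sucks",
    "NAH": "no assholes here",
}
# Word boundaries keep "NTA" from matching inside words like "ANTAGONIZE".
PATTERN = re.compile(r"\b(YTA|NTA|ESH|NAH)\b")

def extract_verdict(text):
    """Return the verdict category of the first acronym found, else None."""
    match = PATTERN.search(text.upper())
    return VERDICTS[match.group(1)] if match else None

extract_verdict("NTA, it's not your responsibility...")  # -> "not the asshole"
```

Note the trade-off in upper-casing the text first: it catches lowercase "nta", but also turns a casual "nah" into a "no assholes here" verdict.
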
Usage:
```bash
# Extract verdicts and create balanced samples
python extract_verdicts.py --sample-size 100000 --samples-per-category 10

# Custom parameters
python extract_verdicts.py --sample-size 50000 --samples-per-category 15 --max-comment-chars 400
```

Purpose: Interactive selection from balanced verdict samples.
Features:
- Shows comments grouped by verdict category
- Displays submission context for each comment
- Allows manual selection of preferred comments
- Saves selections in multiple formats
Usage:
```bash
python select_balanced_favorites.py
```

Purpose: One-command execution of the entire balanced sampling workflow.
Features:
- Automates verdict extraction and balanced sampling
- Optional interactive selection
- Customizable parameters
- Comprehensive error handling
Usage:
```bash
# Complete workflow with interactive selection
python run_balanced_workflow.py --interactive

# Custom parameters
python run_balanced_workflow.py --samples-per-category 10 --max-comment-chars 500

# Quick test
python run_balanced_workflow.py --sample-size 10000 --samples-per-category 5
```

Purpose: Creates stratified samples of AITA submissions balanced by verdict category.
Features:
- Samples submissions (not comments) stratified by dominant verdict
- Filters by submission and comment length for manageable content
- 5x oversampling for selection flexibility
- Extracts verdicts from comments to categorize submissions
- Balances samples across verdict categories (YTA, NTA, ESH, NAH)
Usage:
```bash
# Default stratified sampling
python stratified_aita_sample.py

# Custom parameters
python stratified_aita_sample.py --max-submission-chars 2000 --max-comment-chars 500 --oversample-factor 3
```

Output: Creates the `samples/stratified/` directory with balanced submission samples.
Purpose: Interactive selection from stratified AITA submission samples.
Features:
- Shows submissions grouped by verdict category
- Displays full submission text with top comments
- Allows manual selection of preferred submissions
- Saves selections with all associated comments
- Exports to human-readable TXT format
Usage:
```bash
python select_stratified_favorites.py
```

Output: Creates the `favorites/stratified/` directory with selected submissions and all their comments.
- Submissions: Filter by `selftext` length
- Comments: Filter by `message` length
- Rationale: Shorter content is easier to analyze and annotate
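In pandas terms the filter is a one-liner per table. A sketch with toy data, using the `selftext` and `message` column names from the CSV format section:

```python
import pandas as pd

# Toy frames standing in for submission.csv and comment.csv.
subs = pd.DataFrame({"id": ["a", "b"], "selftext": ["short story", "x" * 5000]})
coms = pd.DataFrame({"id": [1, 2], "message": ["NTA", "y" * 900]})

max_submission_chars = 2000   # --max-submission-chars
max_comment_chars = 500       # --max-comment-chars

# Keep only rows whose text fits under the character limit.
subs_kept = subs[subs["selftext"].str.len() <= max_submission_chars]
coms_kept = coms[coms["message"].str.len() <= max_comment_chars]
```
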
- 5 Tiers: Very Low, Low, Medium, High, Very High
- Based on: Submission score (upvotes)
- Balance: Equal representation across popularity levels
- Rationale: Ensures diverse perspectives, not just viral posts
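A sketch of the tiering, assuming quintiles of the score column (`pd.qcut` with five equal-frequency bins; data and column names are illustrative):

```python
import pandas as pd

subs = pd.DataFrame({
    "id": list("abcdefghij"),
    "score": [1, 3, 7, 15, 40, 90, 200, 500, 1200, 2600],
})

# Five equal-frequency bins over the score distribution.
tiers = ["Very Low", "Low", "Medium", "High", "Very High"]
subs["engagement_tier"] = pd.qcut(subs["score"], q=5, labels=tiers)

# Equal representation: the same number of submissions from every tier.
sample = subs.groupby("engagement_tier", observed=True).sample(n=1, random_state=42)
```
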
- Factor: 5x target size
- Purpose: Provides selection flexibility
- Example: Target 50 β Generate 250 β Select 5-10 favorites
- Method: Top-scoring comments per submission
- Count: 2-3 comments per submission
- Rationale: Representative community responses
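Selecting top comments can be sketched as a sort plus a grouped `head` (toy data; the real comment text column is `message`):

```python
import pandas as pd

comments = pd.DataFrame({
    "submission_id": ["s1", "s1", "s1", "s2", "s2"],
    "score": [45, 32, 5, 80, 2],
    "message": ["NTA ...", "You're right ...", "meh", "YTA ...", "ok"],
})

# Highest-scoring comments first, then keep the top N within each submission.
comments_per_submission = 2
top = (comments.sort_values("score", ascending=False)
               .groupby("submission_id")
               .head(comments_per_submission))
```
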
- `submission.csv`: `id,submission_id,title,selftext,created_utc,permalink,score`
- `comment.csv`: `id,submission_id,message,comment_id,parent_id,created_utc,score`
- `sampled_submissions.csv`: `id,submission_id,title,selftext,created_utc,permalink,score,comment_count,avg_comment_score,engagement_tier`

- Content Analysis: Study narrative patterns in AITA posts
- Community Dynamics: Analyze voting patterns and engagement
- Linguistic Studies: Examine writing styles and argument structures
- Social Psychology: Study moral reasoning and judgment
- Text Classification: Train models to predict verdicts
- Sentiment Analysis: Analyze emotional content
- Topic Modeling: Identify common themes and issues
- Engagement Prediction: Model what makes posts popular
```bash
# For very short content (easier annotation)
python sample_data.py --max-submission-chars 500 --max-comment-chars 200

# For longer content (more context)
python sample_data.py --max-submission-chars 4000 --max-comment-chars 1000
```

```bash
# Small pilot study
python sample_data.py --target-n 10 --oversample-factor 3

# Large comprehensive study
python sample_data.py --target-n 200 --oversample-factor 2
```

```bash
# More comments per submission
python sample_data.py --comments-per-submission 5

# Fewer comments (faster processing)
python sample_data.py --comments-per-submission 1
```

- `*_submissions.csv`: Sampled submissions with engagement metrics
- `*_comments.csv`: Top comments for each sampled submission
- `*_summary.txt`: Detailed sampling statistics and distributions
- `engagement_favorite_submissions.csv`: Manually selected submissions for analysis
- `engagement_favorite_comments.csv`: Comments for selected submissions
- `favorite_summary.txt`: Complete text and metadata for selected samples
```
SAMPLING SUMMARY
==================================================
Total submissions sampled: 250
Total comments sampled: 750

ENGAGEMENT TIER DISTRIBUTION:
  Very Low: 50
  Low: 50
  Medium: 50
  High: 50
  Very High: 50

SCORE STATISTICS:
  Mean score: 1649.60
  Median score: 410.50
  Min score: 1
  Max score: 26634
```
- Problem: Large datasets cause memory errors
- Solution: Use smaller character limits or process in chunks

- Problem: Script can't find CSV files
- Solution: Ensure `submission.csv` and `comment.csv` are in the `data/` directory

- Problem: No data after filtering
- Solution: Increase character limits or check data format

- Problem: `select_favorites.py` crashes or doesn't respond
- Solution: Use `simple_select.py` instead (more robust)

- Problem: Script shows "nothing happens"
- Solution: Check that sample files exist and try the simple version first
- pandas: Data manipulation and analysis
- numpy: Numerical operations
- pyyaml: YAML file handling
- matplotlib/seaborn: Data visualization (optional)
To extend this system:
- Add new stratification methods (e.g., by topic, time period)
- Implement different sampling strategies (e.g., cluster sampling)
- Add data validation and cleaning steps
- Create visualization tools for sample analysis
```bash
# 1. Explore your data
python explore_data.py

# 2. Generate a conservative sample
python sample_data.py --max-submission-chars 1000 --max-comment-chars 300 --target-n 30 --oversample-factor 3

# 3. Preview the sample
python preview_sample.py

# 4. Select your favorites
python simple_select.py

# 5. Review your selections
cat favorite_summary.txt
```

After running the complete workflow, you'll have:
- 10-15 manually selected samples across all engagement tiers
- Complete text and comments for each selected sample
- Balanced representation of different popularity levels
- Ready-to-analyze data for your research
This code is provided for research and educational purposes. Please respect Reddit's terms of service and data usage policies.