A Python rebuild of a 2025 MS Business Analytics capstone analysis at St. John's University, sponsored by the New York Post. The original analysis (logistic regression on survey data, descriptive cross-tabs) was done in R; this rebuild reproduces it end-to-end in Python and adds out-of-sample evaluation along the way.
The result is a faithful replication of the original findings — plus two statistically significant effects the R analysis treated as null.
The original capstone asked one question: among Gen Z respondents in the NY Metro area, who would be open to "creator-style" news content — short, engaging, presented like their favorite social media creator? A team of four MS students surveyed 520 St. John's students aged 18–26 and modeled the binary outcome with logistic regression. The deliverable was a 16-slide executive deck for the New York Post.
This rebuild reproduces the analysis in Python from scratch. The motivation is direct: the R analysis is good, but R skills don't transfer easily to industry analytics roles. Rebuilding in pandas / statsmodels / scikit-learn demonstrates bilingual R/Python fluency, and the rebuild process is a chance to add the engineering rigor the original timeline didn't allow.
Two effects the original deck missed. The Python regression flags two predictors as statistically significant that the R version reported as null:
- Distrust of mainstream outlets is associated with less interest in creator-style news (OR = 0.53, p = 0.044), not more. The intuitive hypothesis is that anti-mainstream respondents would be exactly the segment most open to creators. The data says the opposite — distrust appears to be a rejection of all current formats, not a gateway to alternatives. Substantive read: these respondents don't perceive creators as more trustworthy than traditional outlets.
- Age shows a positive linear trend (OR = 1.43 per bucket, p = 0.026). Older Gen Z (24–26) have roughly 2× the odds of wanting creator-style news compared with 18–20s, controlling for engagement type, barriers, and platform usage. This reframes the strategic recommendation: the under-served audience is the older end of Gen Z, not the youngest.
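The "roughly 2×" figure for age follows from compounding the per-bucket odds ratio. The sketch below assumes three age buckets coded 0, 1, 2 (18–20, 21–23, 24–26; the middle bucket and the integer coding are assumptions, not stated above), so the oldest group sits two steps above the youngest. The function name is illustrative. Strictly, the result is a ratio of odds, not of probabilities.

```python
# Per-bucket odds ratio reported for the age trend.
OR_PER_BUCKET = 1.43

# Assumption: age is coded 0, 1, 2 for buckets 18-20, 21-23, 24-26,
# so the oldest bucket is two steps above the youngest.
STEPS = 2


def compounded_odds_ratio(or_per_step: float, steps: int) -> float:
    """Odds ratio implied by moving `steps` units along a linear logit term."""
    return or_per_step ** steps


oldest_vs_youngest = compounded_odds_ratio(OR_PER_BUCKET, STEPS)
print(f"{oldest_vs_youngest:.2f}")  # 2.04
```

Because the model is linear on the log-odds scale, each additional bucket multiplies the odds by the same factor, which is why the two-step comparison is the per-bucket OR squared.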
The original analysis's main story holds. The two strongest positive predictors — barrier_format ("current news is inconvenient", OR = 2.07, p = 0.001) and barrier_social ("social media already covers my news", OR = 1.56, p = 0.048) — replicate cleanly.
Short-form video dominates across every engagement archetype. Even respondents who classified themselves as "News Avoiders" prefer short-form video (56%) over any other format. This was the deck's signature insight, and the Python heatmap reproduces it cell-for-cell.
The pipeline is three modules with strict separation of concerns:
src/
├── cleaning.py # Load → drop text columns → filter to Gen Z → flag NY metro
├── features.py # Recode 27 binary flags + dependent variable + engagement typology
└── modeling.py # Design matrix → statsmodels Logit → coefficient table with ORs
Every function takes a DataFrame and returns a DataFrame. No global state. The R original repeated the same if_else(!is.na(x) & x != "" & x != "0", 1L, 0L) block 27 times; here that's a single _is_selected helper plus three dictionaries mapping output names to source columns — roughly 30 lines instead of 150.
The two notebooks live in notebooks/:
notebooks/
├── 01_eda.ipynb # Reproduces deck slides 5, 6, 9 (platforms, format-by-age, heatmap)
└── 02_modeling.ipynb # Reproduces slide 7 + adds train/test ROC curve
Both render fully inline on GitHub — open them in the browser to see the analysis without running anything.
- Data wrangling: pandas 2.x
- Modeling: statsmodels (inference: coefficients, p-values, CIs), scikit-learn (train/test split, ROC, AUC)
- Visualization: matplotlib, seaborn
- Environment: uv for dependency management, Python 3.12
- Notebooks: Jupyter via the VS Code Jupyter extension
git clone https://github.com/[FILL IN: your username]/nypost-gen-z-python.git
cd nypost-gen-z-python
uv sync # Installs all dependencies into .venv
# Run the modeling pipeline end-to-end
uv run python -m src.modeling
# Or open the notebooks
uv run jupyter notebook
The 2025 capstone was a team effort by Mya Lamadrid, Mohammed Ahmed, Anthony Onwugbenu, and Paul Rodriguez, advised by our capstone faculty advisor at St. John's University, The Peter J. Tobin College of Business, sponsored by the New York Post. The original R analysis and executive deck are the team's and sponsor's work; this repository is a personal Python rebuild for portfolio purposes.
- Bilingual R/Python fluency on the same dataset: same model, same findings, plus more
- Modular package structure with separation of concerns (cleaning → features → modeling)
- Dictionary-driven feature engineering replacing repeated R boilerplate
- Statistical inference with statsmodels (publication-quality coefficient tables with CIs)
- Held-out test methodology with scikit-learn that the R version doesn't include
- Analytical communication: notebooks that walk a reader through the findings with deck-faithful visualizations
- K-means clustering for data-driven personas. The deck's two personas were synthesized in PowerPoint; a proper Python version would derive them with elbow + silhouette validation.
- 5-fold cross-validation for a more stable AUC estimate. The AUC from a single train/test split is a point estimate with wide uncertainty around it.
- Sensitivity analysis on the inclusive outcome (top-3-box of Q11). Strict and inclusive outcomes should produce consistent coefficient signs.
- Unit tests in tests/ for the _is_selected helper and the design matrix construction.
- Streamlit dashboard that lets a user input a hypothetical respondent profile and see the predicted probability + the factors driving it.
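As an illustration of the cross-validation item in the list above, the following sketch estimates AUC over five stratified folds with scikit-learn. It is not part of the current repository. The data is synthetic, and the very large `C` is an assumption chosen to approximate an unpenalized fit, since scikit-learn's `LogisticRegression` applies L2 regularization by default whereas statsmodels `Logit` does not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 520 rows, four binary predictors.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(520, 4)).astype(float)
true_logit = -0.5 + X @ np.array([0.7, 0.4, -0.6, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Large C makes the L2 penalty negligible (assumption: approximates plain Logit).
model = LogisticRegression(C=1e6, max_iter=1000)

# Stratified folds keep the outcome rate similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"AUC per fold: {np.round(fold_auc, 3)}")
print(f"Mean AUC: {fold_auc.mean():.3f} (SD {fold_auc.std(ddof=1):.3f})")
```

Reporting the fold-to-fold spread alongside the mean gives a rough sense of how much a single-split AUC could move by chance.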
Built by Paul Rodriguez — finance and analytics professional with a background in federal grants management, Big Four tax, and international finance consulting. LinkedIn · GitHub




