Bad Egg: negative result on merged-PR suspension detection #49

Closed

jeffreyksmithjr wants to merge 7 commits into main from bad-egg

Conversation

@jeffreyksmithjr (Contributor) commented Mar 13, 2026

Not intended for merge

Closes #45

This is an archival record of the bad-egg experiments. We tried to detect suspended GitHub accounts among authors who have merged PRs. None of the approaches worked well enough to ship.

What was tested

4 commits, ~3,500 lines of experiment code. Methods:

  • Behavioral features (16 PR metadata signals) in logistic regression
  • k-NN proximity (cosine/euclidean on behavioral feature vectors)
  • Graph proximity (Jaccard repo overlap, Personalized PageRank on bipartite author-repo graph)
  • Combined models (LR + proximity as extra features, DeLong paired tests)
  • LLM scoring (Gemini 3.1 Pro, 30k API calls, 3 prompt variants x 3 temporal cutoffs)
  • Second-phase re-ranking (LLM re-ranks top-N from first-phase model)
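Of these, Jaccard repo overlap turned out to be the strongest single signal (see Results below). The measure itself is plain set arithmetic; a minimal sketch with made-up repo names, not the experiment's code:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_repo_overlap(author_repos: set, seed_repo_sets: list) -> float:
    """Max Jaccard overlap between one author's repos and any seed
    (known-suspended) author's repos. Higher = closer to the seed set."""
    return max((jaccard(author_repos, s) for s in seed_repo_sets), default=0.0)


# Hypothetical author and seed data for illustration only.
score = max_repo_overlap({"org/a", "org/b"}, [{"org/a", "org/c"}, {"org/d"}])
```

Here the best-matching seed shares 1 of 3 distinct repos, so the score is 1/3.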

Three evaluation strategies with increasing rigor:

  • Strategy A: discovery-order holdout
  • Strategy B: suspended-only cross-validation
  • Strategy C: temporal holdout with lookahead bias prevention
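Strategy C is the load-bearing one: features may only be computed from activity observed before a cutoff date, with suspension outcomes evaluated afterward. A minimal sketch of such a split, with illustrative field names rather than the experiment's actual schema:

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split author records at a cutoff date to prevent lookahead bias:
    train only on activity observed before the cutoff, evaluate after."""
    train = [r for r in records if r["last_seen"] < cutoff]
    test = [r for r in records if r["last_seen"] >= cutoff]
    return train, test


# Hypothetical records; "last_seen" stands in for whatever temporal
# anchor the real pipeline uses.
records = [
    {"author": "a", "last_seen": date(2022, 5, 1)},
    {"author": "b", "last_seen": date(2023, 6, 1)},
    {"author": "c", "last_seen": date(2024, 2, 1)},
]
train, test = temporal_split(records, date(2023, 1, 1))
```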

Results

| Method | Best AUC | Best P@25 |
| --- | --- | --- |
| Behavioral LR | 0.573 | 0.00 |
| k-NN proximity | 0.570 | 0.08 |
| Jaccard repo overlap | 0.595 | 0.08 |
| LR + Jaccard combined | 0.608 | n/a |
| LLM standalone | 0.570 | 0.04 |
| LLM + Jaccard combined | 0.577 | n/a |

Best AUC is 0.608 (random = 0.50). Best P@25 is 0.08, i.e. 2 out of the top 25 flagged accounts are actually suspended. 92% false positive rate.
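Both headline metrics are easy to recompute from a list of scores and labels; the following is a generic stdlib sketch, not code from this PR:

```python
def auc(scores, labels):
    """ROC AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring accounts."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    return sum(y for _, y in ranked[:k]) / k


# Toy data for illustration.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
```

On the toy data, `auc` gives 5/6 and `precision_at_k` with k=2 gives 0.5.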

Why it doesn't work

The merged-PR population (~2.5% suspended) is too homogeneous. These are authors who passed code review. Suspended accounts with merged PRs look the same as active accounts with merged PRs. Prior work (stage 6) got AUC 0.619 on the full population, but that included zero-PR suspended accounts that are trivially separable. Restricting to merged PRs collapses the signal.
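For context on the base rate: at ~2.5% prevalence, a randomly ordered list would surface about 0.6 suspended accounts in its top 25, so P@25 = 0.08 is roughly a 3x lift over chance yet still far from actionable. The arithmetic, using the numbers stated above:

```python
base_rate = 0.025                      # ~2.5% of merged-PR authors are suspended
k = 25
expected_random_hits = base_rate * k   # suspended accounts expected in a random top-25
observed_hits = 2                      # best model: 2 of top 25 actually suspended
lift = observed_hits / expected_random_hits
```

This gives an expected 0.625 random hits and a lift of 3.2.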

Files

  • experiments/bot_detection/proximity_results/PROXIMITY_ANALYSIS.md - full results and conclusion
  • experiments/bot_detection/proximity_results/*.json - raw results (k-NN, graph, combined, LLM)
  • scripts/proximity_*.py - experiment code (common utils, k-NN, graph, combined, LLM client, LLM experiment)
  • scripts/refit_bad_egg.py - 8-feature Bad Egg scoring model (useful for full population, not merged-PR subset)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request documents a comprehensive experimental investigation into the effectiveness of various methods for detecting suspended GitHub accounts among authors who have successfully merged pull requests. The study concludes that, despite extensive testing of behavioral, proximity-based, and LLM-driven models, no method yields operationally useful results for this specific population. The core finding highlights the challenge of distinguishing suspended accounts that have passed human code review from active contributors, as their behavioral patterns become too homogeneous for reliable automated detection.

Highlights

  • Bad Egg Model Ineffectiveness: Extensive testing across various models (behavioral LR, k-NN, graph proximity, LLM scoring) shows the "Bad Egg" suspension advisory model has no discriminative power on the production-relevant population of authors with merged PRs.
  • Experimental Scope: The investigation involved 4 commits and ~3,500 lines of experimental code, testing behavioral features, k-NN proximity, graph proximity, combined models, and LLM scoring with multiple evaluation strategies.
  • Key Finding - Homogeneous Population: The primary reason for the model's failure is the homogeneity of the merged-PR population; suspended accounts with merged PRs are behaviorally indistinguishable from active accounts with merged PRs, as they have already passed human code review.
  • Actionable Thresholds: The model cannot produce actionable thresholds, with the best AUC at 0.608 (random is 0.50) and precision at 0.08 (92% false positive rate), rendering it operationally useless.
Changelog
  • action.yml
    • Updated action inputs and outputs to support the Bad Egg suspension advisory.
  • experiments/bot_detection/BAD_EGG_VALIDATION.md
    • Added a detailed report validating the ineffectiveness of the Bad Egg suspension advisory model.
  • experiments/bot_detection/proximity_results/PROXIMITY_ANALYSIS.md
    • Added a comprehensive report summarizing the negative results of proximity-based suspension detection.
  • experiments/bot_detection/proximity_results/combined_results.json
    • Added experimental results for combined behavioral and proximity models.
  • experiments/bot_detection/proximity_results/graph_results.json
    • Added experimental results for graph-based proximity methods.
  • experiments/bot_detection/proximity_results/knn_results.json
    • Added experimental results for k-NN proximity methods.
  • experiments/bot_detection/proximity_results/llm_results.json
    • Added experimental results for LLM scoring in suspension detection.
  • scripts/proximity_analysis.py
    • Added a script to aggregate and report proximity experiment results.
  • scripts/proximity_combined.py
    • Added a script to run combined proximity and behavioral model experiments.
  • scripts/proximity_common.py
    • Added a utility script with shared functions for proximity experiments.
  • scripts/proximity_graph_experiment.py
    • Added a script to conduct graph-based proximity experiments.
  • scripts/proximity_knn_experiment.py
    • Added a script to conduct k-NN proximity experiments.
  • scripts/proximity_llm_client.py
    • Added an asynchronous client for LLM API calls with caching.
  • scripts/proximity_llm_experiment.py
    • Added a script to run LLM-based suspension detection experiments.
  • scripts/validate_bad_egg_features.py
    • Added a script to validate Bad Egg features with temporal holdout and correct population.
  • src/good_egg/action.py
    • Updated action to process bad-egg input and output suspicion-level.
  • src/good_egg/cli.py
    • Updated CLI to include a --no-bad-egg option.
  • src/good_egg/config.py
    • Updated configuration to include Bad Egg model settings and cache TTL for repo contributors.
  • src/good_egg/formatter.py
    • Updated output formatting to display suspension advisory information.
  • src/good_egg/github_client.py
    • Updated GitHub client to fetch repository contributors for isolation score calculation.
  • src/good_egg/mcp_server.py
    • Updated server response to include suspicion level and probability.
  • src/good_egg/models.py
    • Defined new models for SuspicionLevel, SuspicionScore, and added suspicion_score to TrustScore and repo_contributors to UserContributionData.
  • src/good_egg/scorer.py
    • Implemented the _compute_suspicion_score method for the Bad Egg model.
  • tests/conftest.py
    • Added fixtures for SuspicionScore and TrustScore with suspicion data.
  • tests/test_action.py
    • Added tests for bad-egg input and suspicion-level output in the action.
  • tests/test_cli.py
    • Added a test for the --no-bad-egg CLI option.
  • tests/test_formatter.py
    • Added tests for formatting suspicion advisory in various outputs.
  • tests/test_github_client.py
    • Added tests for fetching repository contributors.
  • tests/test_models.py
    • Added tests for SuspicionLevel, SuspicionScore, and UserContributionData changes.
  • tests/test_scorer.py
    • Added tests for the _compute_suspicion_score method and its feature calculations.
Activity
  • The pull request documents a thorough experimental investigation into detecting suspended GitHub accounts among authors with merged PRs.
  • The investigation involved extensive testing across various machine learning models and LLM scoring.
  • The conclusion of the investigation is negative, indicating that no tested method produces operationally useful results for this specific population.
  • The PR serves as a historical record of this negative result, highlighting the challenges in distinguishing suspended accounts that have successfully merged PRs.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a "Bad Egg" suspension advisory score feature, including new configuration options, CLI arguments, GitHub Action inputs/outputs, and updates to data models and formatting. The core logic involves calculating an 8-feature logistic regression model to determine a user's suspicion level (HIGH, ELEVATED, NORMAL) based on their contribution patterns and graph-based isolation scores. Experimental results indicate that the model, when applied to the production-relevant population of authors with merged PRs, lacks discriminative power. Review comments highlight the need to update the BadEggModelConfig docstring to reflect this lack of discriminative power and to simplify median calculations in the _compute_suspicion_score method by using Python's statistics.median function.

Adds a suspension risk advisory (Bad Egg) that computes a suspicion
score using an 8-feature logistic regression model when scoring PR
authors. Includes full integration across action, CLI, MCP server,
and formatter outputs.

Feature validation (scripts/validate_bad_egg_features.py) with
temporal holdout on the production-relevant population (authors with
merged PRs only) shows AUCs at chance level (0.47-0.56). Full results
documented in experiments/bot_detection/BAD_EGG_VALIDATION.md.

Test whether k-NN distance, Jaccard repo overlap, and personalized
PageRank can detect suspended accounts in the merged-PR population
where individual behavioral features failed (all AUCs ~0.50).

Key findings:
- H1 SUPPORTED: k-NN achieves AUC 0.570 on merged-PR population
- H2 SUPPORTED: Jaccard max repo overlap achieves AUC 0.595 (best method)
- H3 NOT SUPPORTED: 44 biased seeds don't generalize (AUC 0.44)
- H4 SUPPORTED: Jaccard adds +0.049 AUC to LR (p<0.0001)

Scripts: proximity_common.py (shared lib), proximity_knn_experiment.py,
proximity_graph_experiment.py, proximity_combined.py, proximity_analysis.py

Tests Gemini 3.1 Pro scoring of PR titles/bodies as a signal for
detecting suspended GitHub accounts in the merged-PR population.
Uses temporal cutoffs (Strategy C) to prevent lookahead bias.

Three prompt variants (V1: titles, V2: titles+bodies, V3: full profile)
scored across 3 cutoffs (2022-07-01, 2023-01-01, 2024-01-01) totaling
30,131 API calls. Results: standalone LLM AUC 0.50-0.57, marginal.
Combined LR(F10)+LLM(V2)+Jaccard reaches AUC 0.577 at the 2024-01-01
cutoff (+0.026 over Jaccard alone, significant after Holm-Bonferroni).
Second-phase re-ranking is ineffective. H5 weakly supported.

…R population

Updates the analysis report with a clear conclusion section explaining why
none of the tested methods (behavioral LR, k-NN, Jaccard, PPR, LLM) produce
operationally useful results for detecting suspended accounts among authors
with merged PRs. Best AUC 0.608 with near-zero precision. The merged-PR
population is too homogeneous — these authors passed code review by definition.
@github-actions

🥚 Good Egg: HIGH Trust

Score: 83%

Top Contributions

| Repository | PRs | Language | Stars |
| --- | --- | --- | --- |
| 2ndSetAI/good-egg | 20 | Python | 22 |
| jeffreyksmithjr/verskyt | 9 | Python | 2 |
| jeffreyksmithjr/galapagos_nao | 7 | Elixir | 21 |
| aws-samples/aws-big-data-blog | 3 | Java | 894 |
| pytorch/pytorch.github.io | 2 | HTML | 278 |
| numpy/numpy | 1 | Python | 31593 |
| melissawm/open-source-ai-contribution-policies | 1 | N/A | 130 |
| nerves-project/nerves_examples | 1 | Elixir | 402 |
| kilimchoi/engineering-blogs | 1 | Ruby | 37415 |
| kdeldycke/plumage | 1 | CSS | 55 |

@jeffreyksmithjr jeffreyksmithjr requested a review from rlronan March 13, 2026 11:56
@jeffreyksmithjr (Contributor, Author)

This work did not result in a viable bot detection methodology, so the PR is being closed.

@rlronan (Contributor) left a comment


Did not review in too much detail; my main concerns are: the k-NN proximity score returns scores that are < 0; the code has a lot of places where NaNs are infilled with medians or 0s, which isn't necessarily wrong, but if there are many NaNs it might obscure data issues; and there appears to be data leakage in run_experiment_a.

Comment on lines +36 to +52
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_proximity_score(
    seed_x: np.ndarray,
    eval_x: np.ndarray,
    k: int,
    metric: str,
) -> np.ndarray:
    """Score eval set by negative mean distance to k nearest seeds.

    Higher score = closer to seeds = more suspicious.
    """
    effective_k = min(k, len(seed_x))
    if effective_k == 0:
        return np.zeros(len(eval_x))
    nn = NearestNeighbors(n_neighbors=effective_k, metric=metric)
    nn.fit(seed_x)
    distances, _ = nn.kneighbors(eval_x)
    return -distances.mean(axis=1)

This feels like an odd way to convert distance to a score; you end up with the 'highest-scoring' values being small negative numbers and the 'lowest-scoring' values being large negative numbers. It seems to work, but it's strange for the score contribution to be -0.1 when suspicious and -5.0 when not suspicious, IMO.
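One way to address this, as a sketch rather than what the PR implements, is to map the mean distance into a bounded (0, 1] similarity so that higher unambiguously means closer:

```python
def knn_similarity_score(mean_distances):
    """Map mean k-NN distances d >= 0 into a bounded similarity in (0, 1]:
    1.0 means sitting on top of the seed set, values near 0 mean far away.
    An illustrative alternative scoring, not the PR's implementation."""
    return [1.0 / (1.0 + d) for d in mean_distances]


scores = knn_similarity_score([0.1, 5.0])  # suspicious vs. not suspicious
```

The monotonic transform preserves the ranking (and hence AUC) while producing intuitively signed, bounded scores.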

Comment on lines +92 to +100
# Scale on seeds + active (no test suspended)
all_train_x = np.vstack([
seed_x,
prepare_features(
test_df[test_df["account_status"] == "active"], fs,
),
])
scaler = StandardScaler()
scaler.fit(all_train_x)

This appears to be using the test data when fitting the feature scaler, which is a form of data leakage.
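The leakage-free pattern is to fit scaling statistics on the training rows only and then apply them to the test rows. A dependency-free sketch of that pattern (the PR itself uses scikit-learn's StandardScaler):

```python
from statistics import mean, stdev


def fit_scaler(train_rows):
    """Compute per-feature (mean, std) on training rows only."""
    cols = list(zip(*train_rows))
    # Guard zero-variance columns with std = 1.0 to avoid division by zero.
    return [(mean(c), stdev(c) or 1.0) for c in cols]


def transform(rows, params):
    """Standardize rows using previously fitted statistics."""
    return [[(x - m) / s for x, (m, s) in zip(row, params)] for row in rows]


# Toy feature matrix for illustration.
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
test = [[3.0, 30.0]]
params = fit_scaler(train)          # fitted on train only, never on test
train_z = transform(train, params)
test_z = transform(test, params)    # test rows are transformed, not fitted on
```

The fit/transform split mirrors scikit-learn's API; the key property is that no test-set statistic influences the scaler.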

Development

Successfully merging this pull request may close these issues: Bad Egg v1: suspension advisory score