Bad Egg: negative result on merged-PR suspension detection #49

Closed

jeffreyksmithjr wants to merge 7 commits into main from bad-egg

Conversation

@jeffreyksmithjr (Contributor) commented Mar 13, 2026

Not intended for merge

Closes #45

This is an archival record of the bad-egg experiments. We tried to detect suspended GitHub accounts among authors who have merged PRs. None of the approaches worked well enough to ship.

What was tested

4 commits, ~3,500 lines of experiment code. Methods:

  • Behavioral features (16 PR metadata signals) in logistic regression
  • k-NN proximity (cosine/euclidean on behavioral feature vectors)
  • Graph proximity (Jaccard repo overlap, Personalized PageRank on bipartite author-repo graph)
  • Combined models (LR + proximity as extra features, DeLong paired tests)
  • LLM scoring (Gemini 3.1 Pro, 30k API calls, 3 prompt variants x 3 temporal cutoffs)
  • Second-phase re-ranking (LLM re-ranks top-N from first-phase model)
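Of these, Jaccard repo overlap turned out to be the strongest single signal (see Results below). The measure itself is plain set arithmetic; a minimal sketch with made-up repo names, not the experiment's code:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_repo_overlap(author_repos: set, seed_repo_sets: list) -> float:
    """Max Jaccard overlap between one author's repos and any seed
    (known-suspended) author's repos. Higher = closer to the seed set."""
    return max((jaccard(author_repos, s) for s in seed_repo_sets), default=0.0)


# Hypothetical author and seed data for illustration only.
score = max_repo_overlap({"org/a", "org/b"}, [{"org/a", "org/c"}, {"org/d"}])
```

Here the best-matching seed shares 1 of 3 distinct repos, so the score is 1/3.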

Three evaluation strategies with increasing rigor:

  • Strategy A: discovery-order holdout
  • Strategy B: suspended-only cross-validation
  • Strategy C: temporal holdout with lookahead bias prevention
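Strategy C is the load-bearing one: features may only be computed from activity observed before a cutoff date, with suspension outcomes evaluated afterward. A minimal sketch of such a split, with illustrative field names rather than the experiment's actual schema:

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split author records at a cutoff date to prevent lookahead bias:
    train only on activity observed before the cutoff, evaluate after."""
    train = [r for r in records if r["last_seen"] < cutoff]
    test = [r for r in records if r["last_seen"] >= cutoff]
    return train, test


# Hypothetical records; "last_seen" stands in for whatever temporal
# anchor the real pipeline uses.
records = [
    {"author": "a", "last_seen": date(2022, 5, 1)},
    {"author": "b", "last_seen": date(2023, 6, 1)},
    {"author": "c", "last_seen": date(2024, 2, 1)},
]
train, test = temporal_split(records, date(2023, 1, 1))
```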

Results

| Method | Best AUC | Best P@25 |
| --- | --- | --- |
| Behavioral LR | 0.573 | 0.00 |
| k-NN proximity | 0.570 | 0.08 |
| Jaccard repo overlap | 0.595 | 0.08 |
| LR + Jaccard combined | 0.608 | n/a |
| LLM standalone | 0.570 | 0.04 |
| LLM + Jaccard combined | 0.577 | n/a |

Best AUC is 0.608 (random = 0.50). Best P@25 is 0.08, i.e. 2 out of the top 25 flagged accounts are actually suspended. 92% false positive rate.
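Both headline metrics are easy to recompute from a list of scores and labels; the following is a generic stdlib sketch, not code from this PR:

```python
def auc(scores, labels):
    """ROC AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring accounts."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    return sum(y for _, y in ranked[:k]) / k


# Toy data for illustration.
scores = [0.9, 0.8, 0.7, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
```

On the toy data, `auc` gives 5/6 and `precision_at_k` with k=2 gives 0.5.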

Why it doesn't work

The merged-PR population (~2.5% suspended) is too homogeneous. These are authors who passed code review. Suspended accounts with merged PRs look the same as active accounts with merged PRs. Prior work (stage 6) got AUC 0.619 on the full population, but that included zero-PR suspended accounts that are trivially separable. Restricting to merged PRs collapses the signal.
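For context on the base rate: at ~2.5% prevalence, a randomly ordered list would surface about 0.6 suspended accounts in its top 25, so P@25 = 0.08 is roughly a 3x lift over chance yet still far from actionable. The arithmetic, using the numbers stated above:

```python
base_rate = 0.025                      # ~2.5% of merged-PR authors are suspended
k = 25
expected_random_hits = base_rate * k   # suspended accounts expected in a random top-25
observed_hits = 2                      # best model: 2 of top 25 actually suspended
lift = observed_hits / expected_random_hits
```

This gives an expected 0.625 random hits and a lift of 3.2.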

Files

  • experiments/bot_detection/proximity_results/PROXIMITY_ANALYSIS.md - full results and conclusion
  • experiments/bot_detection/proximity_results/*.json - raw results (k-NN, graph, combined, LLM)
  • scripts/proximity_*.py - experiment code (common utils, k-NN, graph, combined, LLM client, LLM experiment)
  • scripts/refit_bad_egg.py - 8-feature Bad Egg scoring model (useful for full population, not merged-PR subset)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request documents a comprehensive experimental investigation into the effectiveness of various methods for detecting suspended GitHub accounts among authors who have successfully merged pull requests. The study concludes that, despite extensive testing of behavioral, proximity-based, and LLM-driven models, no method yields operationally useful results for this specific population. The core finding highlights the challenge of distinguishing suspended accounts that have passed human code review from active contributors, as their behavioral patterns become too homogeneous for reliable automated detection.

Highlights

  • Bad Egg Model Ineffectiveness: Extensive testing across various models (behavioral LR, k-NN, graph proximity, LLM scoring) shows the "Bad Egg" suspension advisory model has no discriminative power on the production-relevant population of authors with merged PRs.
  • Experimental Scope: The investigation involved 4 commits and ~3,500 lines of experimental code, testing behavioral features, k-NN proximity, graph proximity, combined models, and LLM scoring with multiple evaluation strategies.
  • Key Finding - Homogeneous Population: The primary reason for the model's failure is the homogeneity of the merged-PR population; suspended accounts with merged PRs are behaviorally indistinguishable from active accounts with merged PRs, as they have already passed human code review.
  • Actionable Thresholds: The model cannot produce actionable thresholds, with the best AUC at 0.608 (random is 0.50) and precision at 0.08 (92% false positive rate), rendering it operationally useless.
Changelog
  • action.yml
    • Updated action inputs and outputs to support the Bad Egg suspension advisory.
  • experiments/bot_detection/BAD_EGG_VALIDATION.md
    • Added a detailed report validating the ineffectiveness of the Bad Egg suspension advisory model.
  • experiments/bot_detection/proximity_results/PROXIMITY_ANALYSIS.md
    • Added a comprehensive report summarizing the negative results of proximity-based suspension detection.
  • experiments/bot_detection/proximity_results/combined_results.json
    • Added experimental results for combined behavioral and proximity models.
  • experiments/bot_detection/proximity_results/graph_results.json
    • Added experimental results for graph-based proximity methods.
  • experiments/bot_detection/proximity_results/knn_results.json
    • Added experimental results for k-NN proximity methods.
  • experiments/bot_detection/proximity_results/llm_results.json
    • Added experimental results for LLM scoring in suspension detection.
  • scripts/proximity_analysis.py
    • Added a script to aggregate and report proximity experiment results.
  • scripts/proximity_combined.py
    • Added a script to run combined proximity and behavioral model experiments.
  • scripts/proximity_common.py
    • Added a utility script with shared functions for proximity experiments.
  • scripts/proximity_graph_experiment.py
    • Added a script to conduct graph-based proximity experiments.
  • scripts/proximity_knn_experiment.py
    • Added a script to conduct k-NN proximity experiments.
  • scripts/proximity_llm_client.py
    • Added an asynchronous client for LLM API calls with caching.
  • scripts/proximity_llm_experiment.py
    • Added a script to run LLM-based suspension detection experiments.
  • scripts/validate_bad_egg_features.py
    • Added a script to validate Bad Egg features with temporal holdout and correct population.
  • src/good_egg/action.py
    • Updated action to process bad-egg input and output suspicion-level.
  • src/good_egg/cli.py
    • Updated CLI to include a --no-bad-egg option.
  • src/good_egg/config.py
    • Updated configuration to include Bad Egg model settings and cache TTL for repo contributors.
  • src/good_egg/formatter.py
    • Updated output formatting to display suspension advisory information.
  • src/good_egg/github_client.py
    • Updated GitHub client to fetch repository contributors for isolation score calculation.
  • src/good_egg/mcp_server.py
    • Updated server response to include suspicion level and probability.
  • src/good_egg/models.py
    • Defined new models for SuspicionLevel, SuspicionScore, and added suspicion_score to TrustScore and repo_contributors to UserContributionData.
  • src/good_egg/scorer.py
    • Implemented the _compute_suspicion_score method for the Bad Egg model.
  • tests/conftest.py
    • Added fixtures for SuspicionScore and TrustScore with suspicion data.
  • tests/test_action.py
    • Added tests for bad-egg input and suspicion-level output in the action.
  • tests/test_cli.py
    • Added a test for the --no-bad-egg CLI option.
  • tests/test_formatter.py
    • Added tests for formatting suspicion advisory in various outputs.
  • tests/test_github_client.py
    • Added tests for fetching repository contributors.
  • tests/test_models.py
    • Added tests for SuspicionLevel, SuspicionScore, and UserContributionData changes.
  • tests/test_scorer.py
    • Added tests for the _compute_suspicion_score method and its feature calculations.
Activity
  • The pull request documents a thorough experimental investigation into detecting suspended GitHub accounts among authors with merged PRs.
  • The investigation involved extensive testing across various machine learning models and LLM scoring.
  • The conclusion of the investigation is negative, indicating that no tested method produces operationally useful results for this specific population.
  • The PR serves as a historical record of this negative result, highlighting the challenges in distinguishing suspended accounts that have successfully merged PRs.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a "Bad Egg" suspension advisory score feature, including new configuration options, CLI arguments, GitHub Action inputs/outputs, and updates to data models and formatting. The core logic involves calculating an 8-feature logistic regression model to determine a user's suspicion level (HIGH, ELEVATED, NORMAL) based on their contribution patterns and graph-based isolation scores. Experimental results indicate that the model, when applied to the production-relevant population of authors with merged PRs, lacks discriminative power. Review comments highlight the need to update the BadEggModelConfig docstring to reflect this lack of discriminative power and to simplify median calculations in the _compute_suspicion_score method by using Python's statistics.median function.

Adds a suspension risk advisory (Bad Egg) that computes a suspicion
score using an 8-feature logistic regression model when scoring PR
authors. Includes full integration across action, CLI, MCP server,
and formatter outputs.

Feature validation (scripts/validate_bad_egg_features.py) with
temporal holdout on the production-relevant population (authors with
merged PRs only) shows AUCs at chance level (0.47-0.56). Full results
documented in experiments/bot_detection/BAD_EGG_VALIDATION.md.

Test whether k-NN distance, Jaccard repo overlap, and personalized
PageRank can detect suspended accounts in the merged-PR population
where individual behavioral features failed (all AUCs ~0.50).

Key findings:
- H1 SUPPORTED: k-NN achieves AUC 0.570 on merged-PR population
- H2 SUPPORTED: Jaccard max repo overlap achieves AUC 0.595 (best method)
- H3 NOT SUPPORTED: 44 biased seeds don't generalize (AUC 0.44)
- H4 SUPPORTED: Jaccard adds +0.049 AUC to LR (p<0.0001)

Scripts: proximity_common.py (shared lib), proximity_knn_experiment.py,
proximity_graph_experiment.py, proximity_combined.py, proximity_analysis.py

Tests Gemini 3.1 Pro scoring of PR titles/bodies as a signal for
detecting suspended GitHub accounts in the merged-PR population.
Uses temporal cutoffs (Strategy C) to prevent lookahead bias.

Three prompt variants (V1: titles, V2: titles+bodies, V3: full profile)
scored across 3 cutoffs (2022-07-01, 2023-01-01, 2024-01-01) totaling
30,131 API calls. Results: standalone LLM AUC 0.50-0.57, marginal.
Combined LR(F10)+LLM(V2)+Jaccard reaches AUC 0.577 at the 2024-01-01
cutoff (+0.026 over Jaccard alone, significant after Holm-Bonferroni).
Second-phase re-ranking is ineffective. H5 weakly supported.

…R population

Updates the analysis report with a clear conclusion section explaining why
none of the tested methods (behavioral LR, k-NN, Jaccard, PPR, LLM) produce
operationally useful results for detecting suspended accounts among authors
with merged PRs. Best AUC 0.608 with near-zero precision. The merged-PR
population is too homogeneous — these authors passed code review by definition.
@github-actions

🥚 Good Egg: HIGH Trust

Score: 83%

Top Contributions

| Repository | PRs | Language | Stars |
| --- | --- | --- | --- |
| 2ndSetAI/good-egg | 20 | Python | 22 |
| jeffreyksmithjr/verskyt | 9 | Python | 2 |
| jeffreyksmithjr/galapagos_nao | 7 | Elixir | 21 |
| aws-samples/aws-big-data-blog | 3 | Java | 894 |
| pytorch/pytorch.github.io | 2 | HTML | 278 |
| numpy/numpy | 1 | Python | 31593 |
| melissawm/open-source-ai-contribution-policies | 1 | N/A | 130 |
| nerves-project/nerves_examples | 1 | Elixir | 402 |
| kilimchoi/engineering-blogs | 1 | Ruby | 37415 |
| kdeldycke/plumage | 1 | CSS | 55 |

@jeffreyksmithjr jeffreyksmithjr requested a review from rlronan March 13, 2026 11:56
@jeffreyksmithjr (Contributor, Author)

This work did not result in a viable bot detection methodology, so the PR is being closed.

@rlronan (Contributor) left a comment


Did not review in too much detail; my main concerns are: the k-NN proximity score returns scores that are < 0; the code has a lot of places where NaNs are infilled with medians or 0s, which isn't necessarily wrong, but if there are many NaNs it might obscure data issues; and there appears to be data leakage in run_experiment_a.

Comment on lines +36 to +52
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_proximity_score(
    seed_x: np.ndarray,
    eval_x: np.ndarray,
    k: int,
    metric: str,
) -> np.ndarray:
    """Score eval set by negative mean distance to k nearest seeds.

    Higher score = closer to seeds = more suspicious.
    """
    effective_k = min(k, len(seed_x))
    if effective_k == 0:
        return np.zeros(len(eval_x))
    nn = NearestNeighbors(n_neighbors=effective_k, metric=metric)
    nn.fit(seed_x)
    distances, _ = nn.kneighbors(eval_x)
    return -distances.mean(axis=1)

This feels like an odd way to convert distance to a score; you end up with the 'highest-scoring' values being small negative numbers and the 'lowest-scoring' values being large negative numbers. It seems to work, but it's strange for the score contribution to be -0.1 when suspicious and -5.0 when not suspicious, IMO.
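One way to address this, as a sketch rather than what the PR implements, is to map the mean distance into a bounded (0, 1] similarity so that higher unambiguously means closer:

```python
def knn_similarity_score(mean_distances):
    """Map mean k-NN distances d >= 0 into a bounded similarity in (0, 1]:
    1.0 means sitting on top of the seed set, values near 0 mean far away.
    An illustrative alternative scoring, not the PR's implementation."""
    return [1.0 / (1.0 + d) for d in mean_distances]


scores = knn_similarity_score([0.1, 5.0])  # suspicious vs. not suspicious
```

The monotonic transform preserves the ranking (and hence AUC) while producing intuitively signed, bounded scores.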

Comment on lines +92 to +100
# Scale on seeds + active (no test suspended)
all_train_x = np.vstack([
seed_x,
prepare_features(
test_df[test_df["account_status"] == "active"], fs,
),
])
scaler = StandardScaler()
scaler.fit(all_train_x)

This appears to be using the test data when fitting the feature scaler, which is a form of data leakage.
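The leakage-free pattern is to fit scaling statistics on the training rows only and then apply them to the test rows. A dependency-free sketch of that pattern (the PR itself uses scikit-learn's StandardScaler):

```python
from statistics import mean, stdev


def fit_scaler(train_rows):
    """Compute per-feature (mean, std) on training rows only."""
    cols = list(zip(*train_rows))
    # Guard zero-variance columns with std = 1.0 to avoid division by zero.
    return [(mean(c), stdev(c) or 1.0) for c in cols]


def transform(rows, params):
    """Standardize rows using previously fitted statistics."""
    return [[(x - m) / s for x, (m, s) in zip(row, params)] for row in rows]


# Toy feature matrix for illustration.
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
test = [[3.0, 30.0]]
params = fit_scaler(train)          # fitted on train only, never on test
train_z = transform(train, params)
test_z = transform(test, params)    # test rows are transformed, not fitted on
```

The fit/transform split mirrors scikit-learn's API; the key property is that no test-set statistic influences the scaler.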

Development

Successfully merging this pull request may close these issues: Bad Egg v1: suspension advisory score