Bad Egg: negative result on merged-PR suspension detection #49

jeffreyksmithjr wants to merge 7 commits into main from
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request documents a comprehensive experimental investigation into the effectiveness of various methods for detecting suspended GitHub accounts among authors who have successfully merged pull requests. The study concludes that, despite extensive testing of behavioral, proximity-based, and LLM-driven models, no method yields operationally useful results for this specific population. The core finding highlights the challenge of distinguishing suspended accounts that have passed human code review from active contributors: their behavioral patterns are too homogeneous for reliable automated detection.
Code Review
This pull request introduces a "Bad Egg" suspension advisory score feature, including new configuration options, CLI arguments, GitHub Action inputs/outputs, and updates to data models and formatting. The core logic applies an 8-feature logistic regression model to assign a user a suspicion level (HIGH, ELEVATED, NORMAL) based on contribution patterns and graph-based isolation scores. Experimental results indicate that the model, when applied to the production-relevant population of authors with merged PRs, lacks discriminative power. Review comments highlight the need to update the BadEggModelConfig docstring to reflect this lack of discriminative power, and to simplify median calculations in the _compute_suspicion_score method by using Python's statistics.median function.
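The statistics.median suggestion can be illustrated with a minimal sketch. The hand-rolled helper below is hypothetical (the PR's actual _compute_suspicion_score code is not quoted here); it only shows that the stdlib call is equivalent and shorter:

```python
import statistics

def median_manual(values):
    # Hand-rolled median of the kind the review suggests replacing
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

values = [3.0, 1.0, 4.0, 1.5]
assert median_manual(values) == statistics.median(values) == 2.25
```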
Adds a suspension risk advisory (Bad Egg) that computes a suspicion score using an 8-feature logistic regression model when scoring PR authors. Includes full integration across action, CLI, MCP server, and formatter outputs. Feature validation (scripts/validate_bad_egg_features.py) with temporal holdout on the production-relevant population (authors with merged PRs only) shows AUCs at chance level (0.47-0.56). Full results documented in experiments/bot_detection/BAD_EGG_VALIDATION.md.
Test whether k-NN distance, Jaccard repo overlap, and personalized PageRank can detect suspended accounts in the merged-PR population where individual behavioral features failed (all AUCs ~0.50).

Key findings:
- H1 SUPPORTED: k-NN achieves AUC 0.570 on merged-PR population
- H2 SUPPORTED: Jaccard max repo overlap achieves AUC 0.595 (best method)
- H3 NOT SUPPORTED: 44 biased seeds don't generalize (AUC 0.44)
- H4 SUPPORTED: Jaccard adds +0.049 AUC to LR (p<0.0001)

Scripts: proximity_common.py (shared lib), proximity_knn_experiment.py, proximity_graph_experiment.py, proximity_combined.py, proximity_analysis.py
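The Jaccard repo-overlap feature described above can be sketched in a few lines. The function names and data shapes here are illustrative assumptions, not the PR's actual experiment code:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| of two repo-name sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def max_repo_overlap(candidate_repos: set, seed_repo_sets: list) -> float:
    """Highest Jaccard overlap between a candidate and any suspended seed account."""
    if not seed_repo_sets:
        return 0.0
    return max(jaccard(candidate_repos, s) for s in seed_repo_sets)

# A candidate sharing one of three distinct repos with a seed scores 1/3
score = max_repo_overlap({"org/a", "org/b"}, [{"org/a", "org/c"}, {"org/d"}])
```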
Tests Gemini 3.1 Pro scoring of PR titles/bodies as a signal for detecting suspended GitHub accounts in the merged-PR population. Uses temporal cutoffs (Strategy C) to prevent lookahead bias. Three prompt variants (V1: titles, V2: titles+bodies, V3: full profile) scored across 3 cutoffs (2022-07-01, 2023-01-01, 2024-01-01) totaling 30,131 API calls. Results: standalone LLM AUC 0.50-0.57, marginal. Combined LR(F10)+LLM(V2)+Jaccard reaches AUC 0.577 at the 2024-01-01 cutoff (+0.026 over Jaccard alone, significant after Holm-Bonferroni). Second-phase re-ranking is ineffective. H5 weakly supported.
…R population

Updates the analysis report with a clear conclusion section explaining why none of the tested methods (behavioral LR, k-NN, Jaccard, PPR, LLM) produces operationally useful results for detecting suspended accounts among authors with merged PRs. Best AUC 0.608 with near-zero precision. The merged-PR population is too homogeneous: these authors passed code review by definition.
85b9644 to 02e9a71
🥚 Good Egg: HIGH
TrustScore: 83%
Top Contributions
This work did not result in a viable bot detection methodology, so it is being closed.
rlronan left a comment
Did not review in too much detail; main concerns are: the k-NN proximity score returns scores that are < 0; the code has a lot of places where NaNs are infilled with medians or 0s, which isn't necessarily wrong, but if there are a lot of NaNs it might obscure data issues; and there looks to be data leakage in run_experiment_a.
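The NaN concern can be made actionable with a quick missingness audit before any infilling. A minimal pandas sketch (the column names are hypothetical, not from the PR):

```python
import pandas as pd

df = pd.DataFrame({
    "prs_merged": [3.0, None, 7.0, None],
    "followers": [10.0, 2.0, None, 5.0],
})

# Per-column NaN rates: high values here mean median-infilling would be
# masking a real data problem rather than patching a few gaps
nan_rates = df.isna().mean()

# Impute with column medians only once the rates look acceptable
filled = df.fillna(df.median())
```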
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_proximity_score(
    seed_x: np.ndarray,
    eval_x: np.ndarray,
    k: int,
    metric: str,
) -> np.ndarray:
    """Score eval set by negative mean distance to k nearest seeds.

    Higher score = closer to seeds = more suspicious.
    """
    effective_k = min(k, len(seed_x))
    if effective_k == 0:
        return np.zeros(len(eval_x))
    nn = NearestNeighbors(n_neighbors=effective_k, metric=metric)
    nn.fit(seed_x)
    distances, _ = nn.kneighbors(eval_x)
    return -distances.mean(axis=1)
```
This feels like a weird way to convert distance to a score; you end up with the 'highest scoring' values being small negative values and the 'lowest scoring' values being larger negative values. Seems like it works, but it's weird for the score contribution to be -0.1 when suspicious and -5.0 when not suspicious IMO.
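One conventional way to address this is to map the mean distance through 1/(1+d), which keeps scores in (0, 1] without changing their rank order (so AUC is unaffected). A hedged, numpy-only sketch, not the PR's code:

```python
import numpy as np

def knn_similarity_score(seed_x: np.ndarray, eval_x: np.ndarray, k: int) -> np.ndarray:
    """Score eval rows in (0, 1]: 1.0 means on top of the seeds, -> 0 far away."""
    effective_k = min(k, len(seed_x))
    if effective_k == 0:
        return np.zeros(len(eval_x))
    # Pairwise Euclidean distances, shape (n_eval, n_seed)
    d = np.linalg.norm(eval_x[:, None, :] - seed_x[None, :, :], axis=2)
    # Mean distance to the k nearest seeds, then squash into (0, 1]
    knn_mean = np.sort(d, axis=1)[:, :effective_k].mean(axis=1)
    return 1.0 / (1.0 + knn_mean)

seeds = np.array([[0.0, 0.0], [1.0, 0.0]])
evals = np.array([[0.0, 0.0], [10.0, 0.0]])
scores = knn_similarity_score(seeds, evals, k=1)  # [1.0, 0.1]
```

Because 1/(1+d) is strictly decreasing in d, any rank-based metric computed on these scores matches the negative-distance version exactly.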
```python
# Scale on seeds + active (no test suspended)
all_train_x = np.vstack([
    seed_x,
    prepare_features(
        test_df[test_df["account_status"] == "active"], fs,
    ),
])
scaler = StandardScaler()
scaler.fit(all_train_x)
```
This appears to be using the test data for feature scaling, which is a form of data leakage.
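A standard fix is to fit the scaling statistics on the training split only and then apply them unchanged to the test split. This numpy sketch mirrors StandardScaler's fit/transform separation, under the assumption that seed_x holds the training features:

```python
import numpy as np

rng = np.random.default_rng(0)
seed_x = rng.normal(loc=5.0, size=(100, 4))  # training features only
test_x = rng.normal(loc=5.0, size=(50, 4))   # held out, never used for fitting

# "fit": statistics come from the training split alone
mu = seed_x.mean(axis=0)
sigma = seed_x.std(axis=0)

# "transform": both splits use the train-derived statistics
seed_scaled = (seed_x - mu) / sigma   # mean ~0, std ~1 by construction
test_scaled = (test_x - mu) / sigma   # no test rows influenced mu or sigma
```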
Not intended for merge
Closes #45
This is an archival record of the bad-egg experiments. We tried to detect suspended GitHub accounts among authors who have merged PRs. None of the approaches worked well enough to ship.
What was tested
4 commits, ~3,500 lines of experiment code. Methods:
Three evaluation strategies with increasing rigor:
Results
Best AUC is 0.608 (random = 0.50). Best P@25 is 0.08: only 2 of the top 25 flagged accounts are actually suspended, a 92% false positive rate among flags.
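P@25 as quoted above is just the hit rate inside the top 25 ranked accounts; a minimal sketch of the metric (the toy data is invented to reproduce the 0.08 figure, not drawn from the experiments):

```python
import numpy as np

def precision_at_k(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of the k highest-scored accounts whose label is 1 (suspended)."""
    order = np.argsort(scores)[::-1]      # highest score first
    return float(np.asarray(labels)[order[:k]].mean())

# Toy ranking where exactly 2 of the top 25 are truly suspended -> P@25 = 0.08
scores = np.arange(100, 0, -1, dtype=float)  # account 0 ranked first
labels = np.zeros(100)
labels[[3, 17]] = 1.0
p_at_25 = precision_at_k(scores, labels, k=25)  # 0.08
```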
Why it doesn't work
The merged-PR population (~2.5% suspended) is too homogeneous. These are authors who passed code review. Suspended accounts with merged PRs look the same as active accounts with merged PRs. Prior work (stage 6) got AUC 0.619 on the full population, but that included zero-PR suspended accounts that are trivially separable. Restricting to merged PRs collapses the signal.
Files
- experiments/bot_detection/proximity_results/PROXIMITY_ANALYSIS.md - full results and conclusion
- experiments/bot_detection/proximity_results/*.json - raw results (k-NN, graph, combined, LLM)
- scripts/proximity_*.py - experiment code (common utils, k-NN, graph, combined, LLM client, LLM experiment)
- scripts/refit_bad_egg.py - 8-feature Bad Egg scoring model (useful for full population, not merged-PR subset)