Good Egg v3: simplify scoring to merge_rate only for unknown contributors

## Summary

Simplify the GE scoring formula for unknown contributors (the population scored when `skip_known_contributors=true`). Drop hub_score and log_account_age from the v2 logistic regression (LR) and use alltime merge_rate as the sole scoring input.

## Evidence

Experiments on the `bot-detection` branch (PR #44), tracked in [`experiments/bot_detection/RESULTS.md`](https://github.com/2ndSetAI/good-egg/blob/bot-detection/experiments/bot_detection/RESULTS.md):

**hub_score hurts unknown contributors (stage17):** merge_rate alone outperforms every model that includes hub_score across all repo size tiers:

| Tier | mr_only | mr+hub | Delta (mr+hub - mr_only) |
|---|---|---|---|
| All medium+ | **0.516** | 0.408 | -0.108 |
| Large (500-1999 PRs) | **0.553** | 0.484 | -0.069 |
| XL (2000+ PRs) | **0.533** | 0.405 | -0.128 |

**log_account_age adds nothing (stage19):** On 4 stable cutoffs (n=130 to n=1014, 5-fold CV), mr+age never beats mr_only. DeLong p > 0.07 at every cutoff. age_only AUC is 0.505-0.522 (barely above chance).

| Cutoff | N | mr_only | mr+age | DeLong p |
|---|---|---|---|---|
| T_2022 | 130 | 0.584 | 0.576 | 0.807 |
| T_2022-07 | 431 | 0.606 | 0.606 | 0.992 |
| T_2023 | 474 | 0.552 | 0.534 | 0.076 |
| T_2024 | 1014 | 0.580 | 0.569 | 0.111 |

**Recency windows don't help (stage18):** No significant difference between alltime, 2yr, 1yr, 6mo, or 3mo merge_rate for unknown contributors (zero significant DeLong tests across all tiers and cutoffs).

**Cross-repo merge prediction confirms the hub_score result (stage13):** hub_score hurts here too. `ge_v2_proxy` (hub_score + merge_rate) reaches AUC 0.542 vs 0.576 for `merge_rate_only`.

**PR #27 validation study corroborates:** account_age was LRT-significant (p = 1.2e-5) against graph_score but did not improve AUC ranking (DeLong p = 0.65 for GE + merge_rate + age vs GE alone).

## Implementation tasks

- [ ] Remove hub_score (graph_score) from the scoring formula in `scorer.py`
- [ ] Remove log_account_age from the scoring formula in `scorer.py`
- [ ] Replace the v2 3-feature LR with alltime merge_rate as sole input for unknown contributors
- [ ] Keep graph construction (needed for repo discovery and contributor mapping)
- [ ] Keep `skip_known_contributors` logic (fast-tracks known contributors)
- [ ] Update thresholds to map merge_rate directly to trust levels
- [ ] Update docs and config reference
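
A minimal sketch of what the simplified scoring path could look like, assuming merge_rate is the fraction of a contributor's PRs that were merged. Every name here (`TrustLevel`, `MergeRateThresholds`, `alltime_merge_rate`, `score_unknown_contributor`) is hypothetical and not taken from `scorer.py`. The threshold values are placeholders, not results from the experiments, and scoring a contributor with no PR history as 0.0 is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class TrustLevel(Enum):
    # Hypothetical trust levels; the real set lives in the project's config.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class MergeRateThresholds:
    # Placeholder cutoffs for illustration only; real values need calibration.
    medium: float = 0.4
    high: float = 0.7


def alltime_merge_rate(merged_prs: int, total_prs: int) -> float:
    """Fraction of a contributor's PRs that were merged, over all time."""
    if total_prs <= 0:
        # Assumption: a contributor with no PR history scores 0.0.
        return 0.0
    return merged_prs / total_prs


def score_unknown_contributor(
    merged_prs: int,
    total_prs: int,
    thresholds: MergeRateThresholds = MergeRateThresholds(),
) -> tuple[float, TrustLevel]:
    """Map alltime merge_rate directly to a trust level (no hub_score, no age)."""
    rate = alltime_merge_rate(merged_prs, total_prs)
    if rate >= thresholds.high:
        level = TrustLevel.HIGH
    elif rate >= thresholds.medium:
        level = TrustLevel.MEDIUM
    else:
        level = TrustLevel.LOW
    return rate, level
```

Known contributors would still be fast-tracked by the `skip_known_contributors` path before this function is reached, so it only needs to handle the unknown population.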
