corpus: EN collection wave (measure-only, G008) #492

Merged
devswha merged 1 commit into main from bot/corpus-en-wave2 on Jun 14, 2026

devswha (Owner) commented Jun 14, 2026

Summary

Wave 2 of the approved corpus-expansion plan. Measure-only: no detector threshold change, no src/features change.

Manifest

artifacts/rebaseline-2025/manifest.en.scored.public.jsonl — 330 hash-only rows:

  • 200 natural-human from HAP-E (browndw/human-ai-parallel-corpus, MIT), balanced academic-summary / blog
  • 120 ai-like across 3 families (gpt 40 / claude 40 / gemini 40)
  • 5 lightly-edited-ai + 5 heavily-edited-ai (one light + one heavy per register)

Raw text stays gitignored; only hashes/metadata/scores committed.

Findings

  • rebaseline-en-latest: accuracy 85.8%, recall 86.9%, FP 15.0% (EN detection is notably stronger than KO's 75%/59.2%).
  • rebaseline-low-fpr-en-latest: TPR at 1% and 5% FPR for en and en × register. academic-summary and blog are supported (blog TPR 88.5% at 5% FPR); product-doc, chat-update, and technical-how-to honestly report no_negatives, because HAP-E maps only 2 registers. Overall TPR at low FPR still collapses to 0% (high-scoring human controls).
  • rebaseline-audit-en-latest: operator audit found 0 mislabeled and 0 too-easy rows; heavy human edits evade detection in 4 of 5 registers.
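
For readers who want to sanity-check the headline numbers against the manifest composition, here is a minimal sketch. The four confusion-matrix counts are an assumption: they are inferred from the reported percentages and are not stated anywhere in this PR. The sketch also assumes that the 10 edited-ai rows count as positives alongside the 120 ai-like rows, and that "FP" is the false-positive rate over the 200 human controls. The names `Confusion`, `summarize`, and `pct` are hypothetical helpers, not part of the repo.

```typescript
// Hypothetical helper types/functions for illustration only.
interface Confusion {
  tp: number; // AI rows flagged as AI
  fn: number; // AI rows missed
  fp: number; // human controls flagged as AI
  tn: number; // human controls passed
}

function summarize(c: Confusion) {
  const total = c.tp + c.fn + c.fp + c.tn;
  return {
    total,
    accuracy: (c.tp + c.tn) / total,
    recall: c.tp / (c.tp + c.fn),
    fpRate: c.fp / (c.fp + c.tn),
  };
}

// Format a ratio as a percentage with one decimal place.
const pct = (x: number): string => (x * 100).toFixed(1) + "%";

// ASSUMPTION: inferred counts, not taken from the reports.
// 130 positives = 120 ai-like + 5 lightly-edited + 5 heavily-edited;
// 200 negatives = natural-human controls.
const inferred: Confusion = { tp: 113, fn: 17, fp: 30, tn: 170 };
const m = summarize(inferred);

// Under these assumed counts the three reported figures are mutually consistent:
// pct(m.accuracy) is "85.8%", pct(m.recall) is "86.9%", pct(m.fpRate) is "15.0%".
```

If the real confusion matrix in rebaseline-en-latest.json differs, the report is authoritative; this only shows that one plausible set of counts reproduces all three percentages on 330 rows.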

Verification

  • npm test 766/766
  • npm run benchmark 100% / ROC-AUC 1.000 / PR-AUC 1.000
  • benchmark:report, benchmark:robustness, check:no-private-assets, lint all pass

Wave 2 of the approved corpus-expansion plan. Measure-only: no detector
threshold change, no src/features change.

Manifest artifacts/rebaseline-2025/manifest.en.scored.public.jsonl (330 rows,
hash-only): 200 natural-human controls from HAP-E (browndw/human-ai-parallel-corpus,
MIT; balanced academic-summary/blog) + 120 ai-like positives across 3 model
families (gpt 40 / claude 40 / gemini 40) + 5 lightly-edited-ai + 5
heavily-edited-ai (one light + one heavy per register). Raw text stays in the
gitignored private workspace; only hashes/metadata/scores are committed.

Reports (docs/benchmarks/):
- rebaseline-en-latest.{md,json}: accuracy 85.8%, recall 86.9%, FP 15.0%
  (EN detection notably stronger than KO).
- rebaseline-low-fpr-en-latest.{md,json}: B4 TPR@1%/5%FPR for en and
  en x register. academic-summary/blog supported (blog TPR 88.5% at 5% FPR);
  other registers honestly report no_negatives (HAP-E maps only 2 registers).
  Overall TPR at low FPR still collapses to 0% (high-scoring human controls).
- rebaseline-audit-en-latest.md: operator audit; 0 mislabeled, 0 too-easy.
  Heavy human edits evade detection in 4/5 registers.
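
The low-FPR behaviour described above (TPR collapsing to 0% because of high-scoring human controls, and no_negatives for registers without human rows) can be sketched as follows. This is not the repo's B4 implementation: `tprAtFpr`, `LowFprResult`, the strict `score > threshold` rule, and the synthetic scores are all assumptions made for illustration.

```typescript
// Hypothetical result shape; "no_negatives" mirrors the status named in the report.
type LowFprResult =
  | { status: "ok"; threshold: number; tpr: number; fpr: number }
  | { status: "no_negatives" };

// Pick the lowest threshold whose false-positive rate stays within maxFpr,
// then measure how many positives still score above it.
function tprAtFpr(positives: number[], negatives: number[], maxFpr: number): LowFprResult {
  // Without human controls there is no FPR to constrain.
  if (negatives.length === 0) return { status: "no_negatives" };

  // Number of negatives we may flag within the FPR budget
  // (the small epsilon guards against floating-point rounding).
  const allowed = Math.floor(maxFpr * negatives.length + 1e-9);
  const sortedNeg = negatives.slice().sort((a, b) => b - a); // descending
  const threshold = sortedNeg[Math.min(allowed, sortedNeg.length - 1)];

  // A row is flagged only if it scores strictly above the threshold,
  // so at most `allowed` negatives can be flagged even with ties.
  const flagged = (s: number): boolean => s > threshold;
  const tp = positives.filter(flagged).length;
  const fp = negatives.filter(flagged).length;
  return {
    status: "ok",
    threshold,
    tpr: positives.length > 0 ? tp / positives.length : 0,
    fpr: fp / negatives.length,
  };
}

// Synthetic scores (made up): two human controls score above every AI row.
const humanScores = [
  0.99, 0.99, 0.4, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1,
  0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
];
const aiScores = [0.95, 0.9, 0.85, 0.8, 0.6];

// With the two outliers, the 5% budget forces the threshold up to 0.99.
const collapsed = tprAtFpr(aiScores, humanScores, 0.05);
// Without them, the same budget leaves every AI row above the threshold.
const recovered = tprAtFpr(aiScores, humanScores.slice(2), 0.05);
// A register with no human rows cannot be evaluated at all.
const unsupported = tprAtFpr(aiScores, [], 0.05);
```

In this toy data a couple of very high-scoring human controls are enough to push the threshold above every AI score, which is one way an aggregate TPR can read 0% while a single register such as blog still reports a usable figure.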

Verify: npm test 766/766; npm run benchmark 100% / ROC-AUC 1.000 / PR-AUC 1.000;
benchmark:report, benchmark:robustness, check:no-private-assets, lint all pass.

vercel (bot, Contributor) commented Jun 14, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| patina | Ready | Preview, Comment | Jun 14, 2026 12:23pm |

devswha merged commit 986eee6 into main on Jun 14, 2026
8 checks passed
devswha deleted the bot/corpus-en-wave2 branch on June 14, 2026 12:23