Merged
2 changes: 1 addition & 1 deletion .patina.default.yaml
@@ -212,7 +212,7 @@ stylometry:
lexicon:
enabled: true
languages: [en, ko, zh, ja]
density_threshold: 2.0 # matches per 1000 tokens; > threshold → SUSPECT
density_threshold: 3.0 # matches per 1000 tokens; > threshold → SUSPECT
# Lexicon files auto-discovered via Glob lexicon/ai-{lang}.md.
# en/ko use the calibrated baseline from HC3 + Wikipedia + NamuWiki:
# AI catch 66% → 76% with Wikipedia FP staying at 25% boundary.
2 changes: 1 addition & 1 deletion SKILL.md
@@ -325,7 +325,7 @@ For paragraph P with tokens T:
hot iff density > lexicon.density_threshold
```

Default threshold = `2.0`. Adjustable via `lexicon.density_threshold` in `.patina.default.yaml`.
Default threshold = `3.0`. Adjustable via `lexicon.density_threshold` in `.patina.default.yaml`.
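
The density gate above can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the function names `lexicon_density` and `is_hot` are hypothetical, and the only facts taken from the diff are the unit (matches per 1,000 tokens), the strict `>` comparison, and the defaults `2.0` (old) and `3.0` (new).

```python
def lexicon_density(match_count: int, token_count: int) -> float:
    """Lexicon matches per 1,000 tokens (hypothetical helper)."""
    if token_count <= 0:
        return 0.0
    return match_count * 1000.0 / token_count


def is_hot(match_count: int, token_count: int,
           density_threshold: float = 3.0) -> bool:
    """Density gate only: hot iff density is strictly above the threshold."""
    return lexicon_density(match_count, token_count) > density_threshold


# One match in a 400-token paragraph has density 2.5:
# it passed the old 2.0 gate but does not pass the new 3.0 gate.
old_verdict = is_hot(1, 400, density_threshold=2.0)
new_verdict = is_hot(1, 400, density_threshold=3.0)
```

Because the comparison is strict, a paragraph sitting exactly at the threshold (for example 3 matches in 1,000 tokens) is not flagged under this sketch.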

### Hot decision rule extension (3-signal OR)

400 changes: 200 additions & 200 deletions artifacts/rebaseline-2025/manifest.en.scored.public.jsonl

Large diffs are not rendered by default.

500 changes: 250 additions & 250 deletions artifacts/rebaseline-2025/manifest.ko.scored.public.jsonl

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions core/stylometry.md
@@ -677,7 +677,7 @@ For paragraph P with tokens T:
hot iff density > threshold AND min_hot_matches is satisfied
```

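Default threshold = `2.0` (2 matches per 1,000 tokens). Configurable via `lexicon.density_threshold`.
Default threshold = `3.0` (3 matches per 1,000 tokens). Configurable via `lexicon.density_threshold`.
The default `min_hot_matches` is 1 for English and 2 for Korean/Chinese/Japanese. A single CJK lexicon hit
is surfaced as an audit hint but does not make the paragraph hot.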
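
A rough sketch of how the density gate and the per-language minimum hit count might combine. The per-language minimums (English 1, Korean/Chinese/Japanese 2) come from the text above. Everything else is an assumption made for illustration: the name `classify_paragraph`, the three string labels, the fallback of 1 for unlisted languages, and the choice to surface the audit hint only when the density gate has already passed.

```python
# Per-language minimum lexicon hits, as stated in the text above.
MIN_HOT_MATCHES = {"en": 1, "ko": 2, "zh": 2, "ja": 2}


def classify_paragraph(lang: str, match_count: int, token_count: int,
                       density_threshold: float = 3.0) -> str:
    """Return 'hot', 'audit_hint', or 'clear' (hypothetical labels)."""
    if match_count <= 0 or token_count <= 0:
        return "clear"
    density = match_count * 1000.0 / token_count
    if density <= density_threshold:
        return "clear"
    # Assumption: languages missing from the table fall back to 1 hit.
    if match_count >= MIN_HOT_MATCHES.get(lang, 1):
        return "hot"
    # Density gate passed but too few hits, e.g. a single CJK hit.
    return "audit_hint"
```

Under these assumptions, `classify_paragraph("ko", 1, 100)` yields an audit hint, while the same single hit in an English paragraph of the same length is hot.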

@@ -760,7 +760,7 @@ Pareto frontier (3-signal OR, threshold sweep):

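### Rationale for threshold selection

Adopt `density_threshold = 2.0`. Every point in the 0.5–5.0 plateau yields the same catch/FP, so the spec default (2.0) is used. Operational meaning: "a paragraph is suspect when AI lexicon entries appear more than 2 times per 1,000 tokens and the per-language minimum hit count is met". This matches the spec §3 Recommendation. After the 2026-05 Korean 25-row register pilot, a single CJK hit was downgraded from hot to an audit hint.
Adopt `density_threshold = 3.0`. On the v3.5.1 calibration corpus (HC3/Wikipedia/NamuWiki) the 0.5–5.0 range was a single catch/FP plateau. In the 2026-06 modern-model rebaseline (rb26 KO/EN measure-only corpus), however, the false-positive rate on English natural-human controls falls from 15% at threshold 2.0 to 5% at 3.0, English AI recall (86.9%) and Korean recall are unchanged, and the 49-fixture benchmark stays at 100% / ROC·PR-AUC 1.000, so 3.0 is adopted. Operational meaning: "a paragraph is suspect when AI lexicon entries appear more than 3 times per 1,000 tokens and the per-language minimum hit count is met". Korean false positives are driven by the burstiness signal, so they are split out into a separate calibration delta. After the 2026-05 Korean 25-row register pilot, a single CJK hit was downgraded from hot to an audit hint.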

### Calibration drop list

10 changes: 10 additions & 0 deletions docs/benchmarks/rebaseline-audit-en-latest.md
@@ -71,3 +71,13 @@ under the analyzer's tells. Correctly labeled by construction. **Verdict: genuin
(accuracy 85.8% vs 75.0%, recall 86.9% vs 59.2%), but low-FPR operation still
collapses to TPR 0% overall — high-scoring human controls block a clean low-FPR
point, the honest measure-only outcome.

## Post-calibration update (lexicon density_threshold 2.0 → 3.0)

After the calibration delta, this manifest is re-scored with
`density_threshold = 3.0`. EN human FP drops from **15.0% (30/200) to 5.0%
(10/200)** with AI recall unchanged at **86.9%**, and the 49 checked-in fixtures
stay 100% / ROC-AUC·PR-AUC 1.000. This is the lexicon calibration's clean win:
the lexicon signal was largely a false-positive generator for English, so
tightening its density gate cut FPs by two-thirds with no recall cost. All verdicts
above stand (0 mislabeled, 0 too-easy).
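
The headline figures in this update can be re-derived from the confusion-matrix counts reported in `rebaseline-en-latest.json` (TP/FP/FN/TN of 113/30/17/170 before, 113/10/17/190 after). The sketch below uses the standard metric definitions; the function name `confusion_metrics` is illustrative and is not taken from the repository.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from raw counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        "falsePositiveRate": fp / (fp + tn) if (fp + tn) else 0.0,
        "falseNegativeRate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


before = confusion_metrics(tp=113, fp=30, fn=17, tn=170)  # threshold 2.0
after = confusion_metrics(tp=113, fp=10, fn=17, tn=190)   # threshold 3.0
```

Since TP and FN are identical in both runs, recall cannot move. Only the 20 human controls that shift from FP to TN change, which is why precision, accuracy, and F1 rise together.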
10 changes: 10 additions & 0 deletions docs/benchmarks/rebaseline-audit-ko-latest.md
@@ -91,3 +91,13 @@ construction (revisions of known AI positives). **Verdict: genuine.**
(overall TPR at 5% FPR is 0.0% — high-scoring human controls block low-FPR
operation), which is the honest measure-only outcome motivating a future,
separately-approved calibration delta.

## Post-calibration update (lexicon density_threshold 2.0 → 3.0)

After the calibration delta, this manifest is re-scored at the current analyzer
with `density_threshold = 3.0`. KO human FP is **14.0% (35/250)**, recall
unchanged at 59.2%. The earlier 16.8% figure reflected the 2026-05-22 analyzer;
re-scoring corrects it. The lexicon threshold change does **not** move KO FP —
KO false-positives are driven by the burstiness signal, which is intentionally
out of scope here and deferred to a separate burstiness-calibration delta. All
verdicts above stand (0 mislabeled, 0 too-easy).
82 changes: 41 additions & 41 deletions docs/benchmarks/rebaseline-en-latest.json
@@ -1,6 +1,6 @@
{
"schemaVersion": 1,
"generatedAt": "2026-06-14T12:20:56.313Z",
"generatedAt": "2026-06-14T14:07:13.464Z",
"input": "artifacts/rebaseline-2025/manifest.en.scored.public.jsonl",
"targets": {
"protocolPerLanguageClassRegister": 25,
@@ -13,23 +13,23 @@
"en": 330
},
"byClass": {
"natural-human": 200,
"ai-like": 120,
"lightly-edited-ai": 5,
"heavily-edited-ai": 5
"heavily-edited-ai": 5,
"natural-human": 200
},
"byRegister": {
"academic-summary": 126,
"blog": 126,
"academic-summary": 126,
"product-doc": 26,
"chat-update": 26,
"technical-how-to": 26
},
"byModelFamily": {
"human-reference": 200,
"gpt-family": 50,
"claude-family": 40,
"gemini-family": 40
"gemini-family": 40,
"human-reference": 200
},
"protocolCoverage": {
"totalCells": 80,
@@ -116,19 +116,19 @@
},
"metrics": {
"tp": 113,
"fp": 30,
"fp": 10,
"fn": 17,
"tn": 170,
"tn": 190,
"total": 330,
"accuracy": 0.858,
"precision": 0.79,
"accuracy": 0.918,
"precision": 0.919,
"recall": 0.869,
"f1": 0.828,
"falsePositiveRate": 0.15,
"f1": 0.893,
"falsePositiveRate": 0.05,
"falseNegativeRate": 0.131,
"accuracyCi": {
"low": 0.816,
"high": 0.891,
"low": 0.884,
"high": 0.943,
"method": "Wilson score interval, 95%"
},
"recallCi": {
@@ -137,8 +137,8 @@
"method": "Wilson score interval, 95%"
},
"falsePositiveRateCi": {
"low": 0.107,
"high": 0.206,
"low": 0.027,
"high": 0.09,
"method": "Wilson score interval, 95%"
}
},
@@ -187,32 +187,32 @@
"en": {
"language": "en",
"n": 200,
"falsePositives": 30,
"trueNegatives": 170,
"falsePositiveRate": 0.15,
"falsePositives": 10,
"trueNegatives": 190,
"falsePositiveRate": 0.05,
"falsePositiveRateCi": {
"low": 0.107,
"high": 0.206,
"low": 0.027,
"high": 0.09,
"method": "Wilson score interval, 95%"
}
}
},
"metricsByRegister": {
"academic-summary": {
"tp": 25,
"fp": 25,
"fp": 9,
"fn": 1,
"tn": 75,
"tn": 91,
"total": 126,
"accuracy": 0.794,
"precision": 0.5,
"accuracy": 0.921,
"precision": 0.735,
"recall": 0.962,
"f1": 0.658,
"falsePositiveRate": 0.25,
"f1": 0.833,
"falsePositiveRate": 0.09,
"falseNegativeRate": 0.038,
"accuracyCi": {
"low": 0.715,
"high": 0.855,
"low": 0.86,
"high": 0.956,
"method": "Wilson score interval, 95%"
},
"recallCi": {
@@ -221,26 +221,26 @@
"method": "Wilson score interval, 95%"
},
"falsePositiveRateCi": {
"low": 0.175,
"high": 0.343,
"low": 0.048,
"high": 0.162,
"method": "Wilson score interval, 95%"
}
},
"blog": {
"tp": 23,
"fp": 5,
"fp": 1,
"fn": 3,
"tn": 95,
"tn": 99,
"total": 126,
"accuracy": 0.937,
"precision": 0.821,
"accuracy": 0.968,
"precision": 0.958,
"recall": 0.885,
"f1": 0.852,
"falsePositiveRate": 0.05,
"f1": 0.92,
"falsePositiveRate": 0.01,
"falseNegativeRate": 0.115,
"accuracyCi": {
"low": 0.88,
"high": 0.967,
"low": 0.921,
"high": 0.988,
"method": "Wilson score interval, 95%"
},
"recallCi": {
@@ -249,8 +249,8 @@
"method": "Wilson score interval, 95%"
},
"falsePositiveRateCi": {
"low": 0.022,
"high": 0.112,
"low": 0.002,
"high": 0.054,
"method": "Wilson score interval, 95%"
}
},
22 changes: 11 additions & 11 deletions docs/benchmarks/rebaseline-en-latest.md
@@ -1,6 +1,6 @@
# Rebaseline Manifest Summary

- Generated at: 2026-06-14T12:20:56.313Z
- Generated at: 2026-06-14T14:07:13.464Z
- Input: `artifacts/rebaseline-2025/manifest.en.scored.public.jsonl`
- Records: 330
- Protocol target: 25 samples per language × class × register cell
@@ -94,16 +94,16 @@ Public performance claim: **BLOCKED**

| metric | value |
|---|---:|
| accuracy | 85.8% |
| accuracy CI | 81.6%–89.1% |
| precision | 79.0% |
| accuracy | 91.8% |
| accuracy CI | 88.4%–94.3% |
| precision | 91.9% |
| recall | 86.9% |
| recall CI | 80.1%–91.7% |
| F1 | 0.828 |
| false positive rate | 15.0% |
| false positive rate CI | 10.7%–20.6% |
| F1 | 0.893 |
| false positive rate | 5.0% |
| false positive rate CI | 2.7%–9.0% |
| false negative rate | 13.1% |
| TP/FP/FN/TN | 113/30/17/170 |
| TP/FP/FN/TN | 113/10/17/190 |
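
The CI rows in this table are labelled "Wilson score interval, 95%" in the JSON report. A minimal sketch of that interval, using the standard formula, reproduces the false-positive-rate bounds. The function name `wilson_interval` and the hard-coded z-value are choices made for this example, not necessarily what the report generator uses.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default 95%)."""
    if n <= 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


# EN human controls: 30 of 200 flagged before, 10 of 200 after.
fp_ci_before = wilson_interval(30, 200)
fp_ci_after = wilson_interval(10, 200)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small counts such as 10/200, which is presumably why the report uses it.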

### Catch rate by language × model family

@@ -117,14 +117,14 @@ Public performance claim: **BLOCKED**

| language | n | false-positive rate | 95% CI | FP/TN |
|---|---:|---:|---:|---:|
| en | 200 | 15.0% | 10.7%–20.6% | 30/170 |
| en | 200 | 5.0% | 2.7%–9.0% | 10/190 |

### By register

| register | n | FP rate | FN rate | TP/FP/FN/TN |
|---|---:|---:|---:|---:|
| blog | 126 | 5.0% | 11.5% | 23/5/3/95 |
| academic-summary | 126 | 25.0% | 3.8% | 25/25/1/75 |
| blog | 126 | 1.0% | 11.5% | 23/1/3/99 |
| academic-summary | 126 | 9.0% | 3.8% | 25/9/1/91 |
| product-doc | 26 | 0.0% | 23.1% | 20/0/6/0 |
| chat-update | 26 | 0.0% | 15.4% | 22/0/4/0 |
| technical-how-to | 26 | 0.0% | 11.5% | 23/0/3/0 |