devswha · Jun 14, 2026
diff --git a/.patina.default.yaml b/.patina.default.yaml
@@ -212,7 +212,7 @@ stylometry:
 lexicon:
   enabled: true
   languages: [en, ko, zh, ja]
-  density_threshold: 2.0     # matches per 1000 tokens; > threshold → SUSPECT
+  density_threshold: 3.0     # matches per 1000 tokens; > threshold → SUSPECT
   # Lexicon files auto-discovered via Glob lexicon/ai-{lang}.md.
   # en/ko use the calibrated baseline from HC3 + Wikipedia + NamuWiki:
   # AI catch 66% → 76% with Wikipedia FP staying at 25% boundary.

diff --git a/SKILL.md b/SKILL.md
@@ -325,7 +325,7 @@ For paragraph P with tokens T:
   hot iff density > lexicon.density_threshold
 ```
 
-Default threshold = `2.0`. Adjustable via `lexicon.density_threshold` in `.patina.default.yaml`.
+Default threshold = `3.0`. Adjustable via `lexicon.density_threshold` in `.patina.default.yaml`.
 
 ### Hot decision rule extension (3-signal OR)
 

diff --git a/artifacts/rebaseline-2025/manifest.en.scored.public.jsonl b/artifacts/rebaseline-2025/manifest.en.scored.public.jsonl
diff --git a/artifacts/rebaseline-2025/manifest.ko.scored.public.jsonl b/artifacts/rebaseline-2025/manifest.ko.scored.public.jsonl
diff --git a/core/stylometry.md b/core/stylometry.md
@@ -677,7 +677,7 @@ For paragraph P with tokens T:
   hot iff density > threshold AND min_hot_matches is satisfied
 ```
 
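-Default threshold = `2.0` (2 matches per 1,000 tokens). Configurable via `lexicon.density_threshold`.
+Default threshold = `3.0` (3 matches per 1,000 tokens). Configurable via `lexicon.density_threshold`.
 The default `min_hot_matches` is 1 for English and 2 for Korean/Chinese/Japanese. A single CJK lexicon hit is
 flagged as an audit hint but does not make the paragraph hot.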
 
@@ -760,7 +760,7 @@ Pareto frontier (3-signal OR, threshold sweep):
 
 ### Rationale for threshold selection
 
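-Adopt `density_threshold = 2.0`. Any value on the 0.5–5.0 plateau yields the same catch/FP, so the spec default (2.0) is used. Operational meaning: "a paragraph is suspect when AI lexicon entries appear more than 2 times per 1,000 tokens and the per-language minimum hit count is met". This matches the spec's §3 Recommendation. After the 2026-05 Korean 25-row register pilot, a single CJK hit was downgraded from hot to an audit hint.
+Adopt `density_threshold = 3.0`. On the v3.5.1 calibration corpus (HC3/Wikipedia/NamuWiki) the 0.5–5.0 range was a plateau with identical catch/FP. In the 2026-06 modern-model rebaseline (rb26 KO/EN measure-only corpus), however, the false-positive rate on English natural controls drops from 15% at threshold 2.0 to 5% at 3.0, English AI recall (86.9%) and Korean recall are unchanged, and the 49-fixture benchmark holds 100% / ROC·PR-AUC 1.000, so 3.0 is adopted. Operational meaning: "a paragraph is suspect when AI lexicon entries appear more than 3 times per 1,000 tokens and the per-language minimum hit count is met". Korean false positives are driven by the burstiness signal, so they are split off into a separate calibration delta. After the 2026-05 Korean 25-row register pilot, a single CJK hit was downgraded from hot to an audit hint.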
 
 ### Calibration drop list
 

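The density rule in the `core/stylometry.md` hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the analyzer's actual code: the names `lexicon_verdict`, `LexiconVerdict`, and `MIN_HOT_MATCHES` are hypothetical, and treating "over the density threshold but under the minimum hit count" as the audit-hint case is an assumption drawn from the prose about single CJK hits.

```python
from dataclasses import dataclass

# Per-language minimum hit counts as stated in core/stylometry.md:
# English needs 1 match; Korean/Chinese/Japanese need 2.
MIN_HOT_MATCHES = {"en": 1, "ko": 2, "zh": 2, "ja": 2}


@dataclass
class LexiconVerdict:
    density: float    # lexicon matches per 1,000 tokens
    hot: bool         # paragraph flagged by the lexicon signal
    audit_hint: bool  # dense enough, but too few hits to be hot (assumed meaning)


def lexicon_verdict(match_count: int, token_count: int, lang: str,
                    density_threshold: float = 3.0) -> LexiconVerdict:
    """Apply 'hot iff density > threshold AND min_hot_matches is satisfied'."""
    if token_count <= 0:
        return LexiconVerdict(0.0, False, False)
    density = match_count * 1000.0 / token_count
    over_threshold = density > density_threshold
    enough_hits = match_count >= MIN_HOT_MATCHES.get(lang, 1)
    return LexiconVerdict(
        density=density,
        hot=over_threshold and enough_hits,
        audit_hint=over_threshold and not enough_hits,
    )


# One English match in 400 tokens is 2.5 per 1,000 tokens:
# hot under the old 2.0 default, not hot under the new 3.0 default.
old = lexicon_verdict(1, 400, "en", density_threshold=2.0)
new = lexicon_verdict(1, 400, "en", density_threshold=3.0)
# A single Korean hit in 200 tokens (density 5.0) stays an audit hint.
ko_single = lexicon_verdict(1, 200, "ko")
```

Under these assumptions, the threshold change only affects paragraphs whose density falls in the (2.0, 3.0] band, which is consistent with the diff's claim that recall on denser AI text is unchanged.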
diff --git a/docs/benchmarks/rebaseline-audit-en-latest.md b/docs/benchmarks/rebaseline-audit-en-latest.md
@@ -71,3 +71,13 @@ under the analyzer's tells. Correctly labeled by construction. **Verdict: genuin
   (accuracy 85.8% vs 75.0%, recall 86.9% vs 59.2%), but low-FPR operation still
   collapses to TPR 0% overall — high-scoring human controls block a clean low-FPR
   point, the honest measure-only outcome.
+
+## Post-calibration update (lexicon density_threshold 2.0 → 3.0)
+
+After the calibration delta, this manifest is re-scored with
+`density_threshold = 3.0`. EN human FP drops from **15.0% (30/200) to 5.0%
+(10/200)** with AI recall unchanged at **86.9%**, and the 49 checked-in fixtures
+stay 100% / ROC-AUC·PR-AUC 1.000. This is the lexicon calibration's clean win:
+the lexicon signal was largely a false-positive generator for English, so
+tightening its density gate cut FPs by two-thirds with no recall cost. All verdicts
+above stand (0 mislabeled, 0 too-easy).
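The headline numbers in this update can be re-derived from the confusion counts reported in `rebaseline-en-latest.json` (TP 113 and FN 17 unchanged, FP 30 → 10, TN 170 → 190). The sketch below uses the standard metric definitions; the function name `confusion_metrics` is hypothetical, and rounding to three decimals is an assumption chosen to match the report's precision.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics, rounded to 3 decimals."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    raw = {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "falsePositiveRate": fpr,
        "falseNegativeRate": fnr,
    }
    return {name: round(value, 3) for name, value in raw.items()}


before = confusion_metrics(tp=113, fp=30, fn=17, tn=170)  # threshold 2.0
after = confusion_metrics(tp=113, fp=10, fn=17, tn=190)   # threshold 3.0
```

Recall depends only on TP and FN, so it stays at 0.869 in both runs, while precision, accuracy, F1, and the false-positive rate all move with the 20 controls that shift from FP to TN.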
diff --git a/docs/benchmarks/rebaseline-audit-ko-latest.md b/docs/benchmarks/rebaseline-audit-ko-latest.md
@@ -91,3 +91,13 @@ construction (revisions of known AI positives). **Verdict: genuine.**
   (overall TPR at 5% FPR is 0.0% — high-scoring human controls block low-FPR
   operation), which is the honest measure-only outcome motivating a future,
   separately-approved calibration delta.
+
+## Post-calibration update (lexicon density_threshold 2.0 → 3.0)
+
+After the calibration delta, this manifest is re-scored at the current analyzer
+with `density_threshold = 3.0`. KO human FP is **14.0% (35/250)**, recall
+unchanged at 59.2%. The earlier 16.8% figure reflected the 2026-05-22 analyzer;
+re-scoring corrects it. The lexicon threshold change does **not** move KO FP —
+KO false positives are driven by the burstiness signal, which is intentionally
+out of scope here and deferred to a separate burstiness-calibration delta. All
+verdicts above stand (0 mislabeled, 0 too-easy).
diff --git a/docs/benchmarks/rebaseline-en-latest.json b/docs/benchmarks/rebaseline-en-latest.json
@@ -1,6 +1,6 @@
 {
   "schemaVersion": 1,
-  "generatedAt": "2026-06-14T12:20:56.313Z",
+  "generatedAt": "2026-06-14T14:07:13.464Z",
   "input": "artifacts/rebaseline-2025/manifest.en.scored.public.jsonl",
   "targets": {
     "protocolPerLanguageClassRegister": 25,
@@ -13,23 +13,23 @@
     "en": 330
   },
   "byClass": {
-    "natural-human": 200,
     "ai-like": 120,
     "lightly-edited-ai": 5,
-    "heavily-edited-ai": 5
+    "heavily-edited-ai": 5,
+    "natural-human": 200
   },
   "byRegister": {
-    "academic-summary": 126,
     "blog": 126,
+    "academic-summary": 126,
     "product-doc": 26,
     "chat-update": 26,
     "technical-how-to": 26
   },
   "byModelFamily": {
-    "human-reference": 200,
     "gpt-family": 50,
     "claude-family": 40,
-    "gemini-family": 40
+    "gemini-family": 40,
+    "human-reference": 200
   },
   "protocolCoverage": {
     "totalCells": 80,
@@ -116,19 +116,19 @@
   },
   "metrics": {
     "tp": 113,
-    "fp": 30,
+    "fp": 10,
     "fn": 17,
-    "tn": 170,
+    "tn": 190,
     "total": 330,
-    "accuracy": 0.858,
-    "precision": 0.79,
+    "accuracy": 0.918,
+    "precision": 0.919,
     "recall": 0.869,
-    "f1": 0.828,
-    "falsePositiveRate": 0.15,
+    "f1": 0.893,
+    "falsePositiveRate": 0.05,
     "falseNegativeRate": 0.131,
     "accuracyCi": {
-      "low": 0.816,
-      "high": 0.891,
+      "low": 0.884,
+      "high": 0.943,
       "method": "Wilson score interval, 95%"
     },
     "recallCi": {
@@ -137,8 +137,8 @@
       "method": "Wilson score interval, 95%"
     },
     "falsePositiveRateCi": {
-      "low": 0.107,
-      "high": 0.206,
+      "low": 0.027,
+      "high": 0.09,
       "method": "Wilson score interval, 95%"
     }
   },
@@ -187,32 +187,32 @@
     "en": {
       "language": "en",
       "n": 200,
-      "falsePositives": 30,
-      "trueNegatives": 170,
-      "falsePositiveRate": 0.15,
+      "falsePositives": 10,
+      "trueNegatives": 190,
+      "falsePositiveRate": 0.05,
       "falsePositiveRateCi": {
-        "low": 0.107,
-        "high": 0.206,
+        "low": 0.027,
+        "high": 0.09,
         "method": "Wilson score interval, 95%"
       }
     }
   },
   "metricsByRegister": {
     "academic-summary": {
       "tp": 25,
-      "fp": 25,
+      "fp": 9,
       "fn": 1,
-      "tn": 75,
+      "tn": 91,
       "total": 126,
-      "accuracy": 0.794,
-      "precision": 0.5,
+      "accuracy": 0.921,
+      "precision": 0.735,
       "recall": 0.962,
-      "f1": 0.658,
-      "falsePositiveRate": 0.25,
+      "f1": 0.833,
+      "falsePositiveRate": 0.09,
       "falseNegativeRate": 0.038,
       "accuracyCi": {
-        "low": 0.715,
-        "high": 0.855,
+        "low": 0.86,
+        "high": 0.956,
         "method": "Wilson score interval, 95%"
       },
       "recallCi": {
@@ -221,26 +221,26 @@
         "method": "Wilson score interval, 95%"
       },
       "falsePositiveRateCi": {
-        "low": 0.175,
-        "high": 0.343,
+        "low": 0.048,
+        "high": 0.162,
         "method": "Wilson score interval, 95%"
       }
     },
     "blog": {
       "tp": 23,
-      "fp": 5,
+      "fp": 1,
       "fn": 3,
-      "tn": 95,
+      "tn": 99,
       "total": 126,
-      "accuracy": 0.937,
-      "precision": 0.821,
+      "accuracy": 0.968,
+      "precision": 0.958,
       "recall": 0.885,
-      "f1": 0.852,
-      "falsePositiveRate": 0.05,
+      "f1": 0.92,
+      "falsePositiveRate": 0.01,
       "falseNegativeRate": 0.115,
       "accuracyCi": {
-        "low": 0.88,
-        "high": 0.967,
+        "low": 0.921,
+        "high": 0.988,
         "method": "Wilson score interval, 95%"
       },
       "recallCi": {
@@ -249,8 +249,8 @@
         "method": "Wilson score interval, 95%"
       },
       "falsePositiveRateCi": {
-        "low": 0.022,
-        "high": 0.112,
+        "low": 0.002,
+        "high": 0.054,
         "method": "Wilson score interval, 95%"
       }
     },

diff --git a/docs/benchmarks/rebaseline-en-latest.md b/docs/benchmarks/rebaseline-en-latest.md
@@ -1,6 +1,6 @@
 # Rebaseline Manifest Summary
 
-- Generated at: 2026-06-14T12:20:56.313Z
+- Generated at: 2026-06-14T14:07:13.464Z
 - Input: `artifacts/rebaseline-2025/manifest.en.scored.public.jsonl`
 - Records: 330
 - Protocol target: 25 samples per language × class × register cell
@@ -94,16 +94,16 @@ Public performance claim: **BLOCKED**
 
 | metric | value |
 |---|---:|
-| accuracy | 85.8% |
-| accuracy CI | 81.6%–89.1% |
-| precision | 79.0% |
+| accuracy | 91.8% |
+| accuracy CI | 88.4%–94.3% |
+| precision | 91.9% |
 | recall | 86.9% |
 | recall CI | 80.1%–91.7% |
-| F1 | 0.828 |
-| false positive rate | 15.0% |
-| false positive rate CI | 10.7%–20.6% |
+| F1 | 0.893 |
+| false positive rate | 5.0% |
+| false positive rate CI | 2.7%–9.0% |
 | false negative rate | 13.1% |
-| TP/FP/FN/TN | 113/30/17/170 |
+| TP/FP/FN/TN | 113/10/17/190 |
 
 ### Catch rate by language × model family
 
@@ -117,14 +117,14 @@ Public performance claim: **BLOCKED**
 
 | language | n | false-positive rate | 95% CI | FP/TN |
 |---|---:|---:|---:|---:|
-| en | 200 | 15.0% | 10.7%–20.6% | 30/170 |
+| en | 200 | 5.0% | 2.7%–9.0% | 10/190 |
 
 ### By register
 
 | register | n | FP rate | FN rate | TP/FP/FN/TN |
 |---|---:|---:|---:|---:|
-| blog | 126 | 5.0% | 11.5% | 23/5/3/95 |
-| academic-summary | 126 | 25.0% | 3.8% | 25/25/1/75 |
+| blog | 126 | 1.0% | 11.5% | 23/1/3/99 |
+| academic-summary | 126 | 9.0% | 3.8% | 25/9/1/91 |
 | product-doc | 26 | 0.0% | 23.1% | 20/0/6/0 |
 | chat-update | 26 | 0.0% | 15.4% | 22/0/4/0 |
 | technical-how-to | 26 | 0.0% | 11.5% | 23/0/3/0 |
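The confidence intervals in these reports are labeled "Wilson score interval, 95%". A minimal sketch of that formula follows, assuming the usual two-sided normal quantile z ≈ 1.96; the function name `wilson_interval` is hypothetical and the report generator's own implementation is not shown in this diff.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


# English false-positive rate before and after the threshold change.
fp_before = tuple(round(x, 3) for x in wilson_interval(30, 200))
fp_after = tuple(round(x, 3) for x in wilson_interval(10, 200))
# academic-summary register after the change: 9 FP out of 100 human controls.
fp_academic = tuple(round(x, 3) for x in wilson_interval(9, 100))
```

With these inputs the sketch yields 10.7%–20.6% for 30/200, 2.7%–9.0% for 10/200, and 4.8%–16.2% for 9/100, matching the intervals in the tables and JSON above.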