corpus: JA source-feasibility gate (G010) #494

Merged: devswha merged 1 commit into main from bot/corpus-ja-feasibility on Jun 14, 2026
devswha (Owner) commented Jun 14, 2026

Summary

Wave 4 source-feasibility gate (G010). Metadata-only; no raw text; no code change. Candidate source inventory + dry-run evidence + GO/NO-GO recommendation. The GO/NO-GO decision and any collection are reserved for the maintainer.

Deliverable

  • artifacts/rebaseline-2025/sources.ja-public.jsonl — 18 Wikimedia CC-BY-SA candidate sources across academic-summary / blog / technical-how-to, full schema, no raw text. The ja script filter requires kana (JA/ZH disambiguation holds).
  • docs/research/ja-source-feasibility.md — dry-run evidence + recommendation.
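The kana requirement mentioned above can be illustrated with a minimal sketch. It assumes the filter only needs to find at least one hiragana or katakana character. The function name `looks_japanese`, the `min_kana` parameter, and the exact code-point ranges are illustrative assumptions, not the project's actual filter.

```python
import re

# Hiragana letters (U+3041-U+3096) and katakana letters (U+30A1-U+30FA).
# Assumption: the real ja script filter may use different ranges or thresholds.
KANA_RE = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def looks_japanese(text: str, min_kana: int = 1) -> bool:
    """Return True if `text` contains at least `min_kana` kana characters.

    Han characters (kanji/hanzi) are shared between Japanese and Chinese,
    so they are deliberately ignored: only kana counts as evidence of JA.
    """
    return len(KANA_RE.findall(text)) >= min_kana
```

Under this sketch a kanji-only string is rejected even if it is Japanese, which trades some recall for keeping Chinese text out of the JA inventory.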

Dry-run evidence

The inventory validates (18 rows, 0 errors). A bounded dry-run would collect 15 candidates across 3 registers (academic 5 / blog 5 / technical 5) with 1 warning; this projects to 100+ at full caps.
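As a rough sketch of what the validation and bounded dry-run steps involve, the following checks a JSONL inventory and tallies candidates per register without reading or writing any text. The field names in `REQUIRED`, the function names, and the one-candidate-per-row simplification are all hypothetical; the real schema and collector are not shown in this PR.

```python
import json
from collections import Counter

# Hypothetical schema fields; the PR does not list the actual schema.
REQUIRED = ("id", "url", "license", "register")
REGISTERS = {"academic-summary", "blog", "technical-how-to"}


def validate_inventory(lines):
    """Parse JSONL lines and return (valid_rows, error_messages)."""
    rows, errors = [], []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {n}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED if key not in row]
        if missing:
            errors.append(f"line {n}: missing fields {missing}")
            continue
        if row["register"] not in REGISTERS:
            errors.append(f"line {n}: unknown register {row['register']!r}")
            continue
        rows.append(row)
    return rows, errors


def dry_run_counts(rows, cap_per_register):
    """Count rows that would be collected per register, up to a cap.

    Simplification: each inventory row is treated as one candidate.
    Nothing is fetched or written.
    """
    counts = Counter()
    for row in rows:
        if counts[row["register"]] < cap_per_register:
            counts[row["register"]] += 1
    return counts
```

With a cap of 5 per register and enough rows in each of three registers, `dry_run_counts` would report 5 per register, matching the 5 / 5 / 5 shape of the split reported above.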

Recommendation: GO (conditional)

Feasible via CC-BY-SA Wikimedia (stronger yield than ZH), conditioned on maintainer ratification of hash-only redistribution, a full run to confirm ≥100 yield, and the 3-register scope. STOP for maintainer GO/NO-GO before any JA collection.
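One way to read "hash-only redistribution" is that a released record carries a content hash and metadata but never the text itself. The sketch below assumes that reading; the function name `hash_only_record`, the record fields, and the choice of SHA-256 over NFC-normalized UTF-8 are illustrative assumptions rather than the project's ratified policy.

```python
import hashlib
import unicodedata


def hash_only_record(source_id: str, text: str) -> dict:
    """Build a redistributable record that contains no raw text.

    The text is NFC-normalized first so that visually identical strings
    with different Unicode compositions produce the same digest.
    """
    normalized = unicodedata.normalize("NFC", text)
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return {
        "source_id": source_id,           # hypothetical field name
        "sha256": digest,                 # hex digest of the normalized text
        "char_count": len(normalized),    # length only, not content
    }
```

A consumer who fetches the same source independently could recompute the digest to confirm they hold the same text, without the text ever being redistributed.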

Verify: check:no-private-assets passes (0 forbidden); no threshold or src/features change.

Wave 4 source-feasibility gate. Builds a metadata-only candidate source
inventory artifacts/rebaseline-2025/sources.ja-public.jsonl (18 Wikimedia
CC-BY-SA sources across academic-summary/blog/technical-how-to) with the full
schema. No raw text. The ja script filter requires kana, so JA/ZH stay
disambiguated.

Dry-run evidence (no text written): inventory validates (18 rows, 0 errors);
would-collect 15 candidates across 3 registers at small caps (only 1 warning),
projecting to well over 100 across >=3 registers at full caps.

Recommendation: GO (conditional) — feasible via CC-BY-SA Wikimedia (stronger
yield than ZH), conditioned on maintainer ratification of hash-only
redistribution, a full collection run to confirm the >=100 yield, and accepting
the 3-register scope. See docs/research/ja-source-feasibility.md.

STOP for maintainer GO/NO-GO before any JA collection. Measure-only; no
threshold or src/features change; check:no-private-assets passes.
vercel (Bot, Contributor) commented Jun 14, 2026

The latest updates on your projects:

patina: deployment Ready (actions: Preview, Comment); updated Jun 14, 2026 12:34pm UTC

devswha merged commit 1bd8f03 into main on Jun 14, 2026
8 checks passed
devswha deleted the bot/corpus-ja-feasibility branch on June 14, 2026 12:34