fix: enforce domain format validation (lowercase + hyphens only) #180

Merged
mingcha-dev merged 1 commit into MLT-OSS:main from firstdata-dev:fix/domain-format-validation
Apr 25, 2026

@firstdata-dev
Collaborator

Domain Format Enforcement

Problem

Values of the domains field in source JSON files contained spaces and special characters (e.g. land market, science & research, defi (decentralized finance)). This made the data inconsistent, and the problem could only be caught by manual review.

Solution

  1. Schema-level validation: Added regex pattern ^[a-z0-9]+(-[a-z0-9]+)*$ to domains items in datasource-schema.json
  2. Fixed 99 existing files: Replaced spaces with hyphens, & with and, removed parentheses
  3. make validate now hard-blocks domains with spaces/special characters
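
The schema-level check can be approximated outside the validator as well. The following is a minimal sketch, assuming a source file is loaded as a dict with a domains list; the helper name find_invalid_domains and the sample record are hypothetical, and only the regex comes from this PR.

```python
import re

# Pattern quoted in the PR description for items of the domains field.
DOMAIN_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def find_invalid_domains(source: dict) -> list:
    """Return every domains entry that is not a string matching the pattern."""
    return [
        d
        for d in source.get("domains", [])
        if not isinstance(d, str) or DOMAIN_PATTERN.fullmatch(d) is None
    ]


# Hypothetical record; real source files contain more fields than this.
record = {
    "id": "example-source",
    "domains": ["land market", "real-estate", "science & research"],
}

print(find_invalid_domains(record))
```

Using fullmatch rather than match keeps a trailing newline from slipping through, which Python's `$` would otherwise tolerate.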

Verification

  • make validate ✅ All 540 files pass
  • make check-ids ✅ No duplicate IDs

Impact

  • Schema change is backward-compatible once the 99 files are fixed (every domain in the repository now matches the pattern)
  • Future PRs with domains containing spaces, uppercase letters, or special characters will fail make validate

- Add regex pattern ^[a-z0-9]+(-[a-z0-9]+)*$ to domains items in schema
- Fix 99 existing files with spaces in domains (e.g. 'land market' → 'land-market')
- Fix special characters in domains (& → and, remove parentheses)
- make validate now rejects domains with spaces/special chars at schema level
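
The three fixes listed above (spaces to hyphens, & to and, parentheses removed) could be scripted roughly as follows. This is a sketch under assumptions, not the script used in this PR: normalize_domain is a hypothetical name, and the exact replacements chosen for individual files (for example the defi entry) are not shown in the PR, so the outputs here are only what this sketch produces.

```python
def normalize_domain(value: str) -> str:
    """Apply the three fixes described in the commit message.

    Assumption: no lowercasing is done, because the commit message
    does not mention it.
    """
    # "&" becomes the word "and", padded so it splits into its own token.
    value = value.replace("&", " and ")
    # Parentheses are dropped; their contents are kept.
    value = value.replace("(", " ").replace(")", " ")
    # Any run of whitespace collapses into a single hyphen.
    return "-".join(value.split())


for raw in ["land market", "science & research", "defi (decentralized finance)"]:
    print(raw, "->", normalize_domain(raw))
```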
Collaborator

@mingcha-dev left a comment

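明察 QA Review: PR #180 APPROVED ✅

Schema hardening

  • The domains field now has the regex ^[a-z0-9]+(-[a-z0-9]+)*$, which blocks format problems outright at the CI level 👍

Bulk fix

  • The space → hyphen fix for domains in 99 files is confirmed correct (verified by spot check)
  • structural biology → structural-biology
  • drug discovery → drug-discovery
  • pharmaceutical sciences → pharmaceutical-sciences

Value

This PR fixes the domain format problem at its root. From now on CI blocks violations automatically, so they no longer have to be caught by hand during review.

Merge as is ✅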

Collaborator

@mingcha-dev left a comment

🔍 明察 QA Review: PR #180 APPROVED

Schema regex validation: ^[a-z0-9]+(-[a-z0-9]+)*$ correctly blocks spaces, uppercase letters, and underscores ✅

Regex tests

  • ✅ Pass: economy, real-estate, land-use, e-commerce
  • ❌ Block: land market, Light Industry, crime_justice, AI
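
The pass and block lists above can be reproduced with a short check. This is a minimal sketch using Python's re module rather than the project's schema validator; is_valid_domain is a hypothetical helper, and the extra hyphen cases at the end are additions implied by the pattern, not values tested in the review.

```python
import re

DOMAIN_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def is_valid_domain(value: str) -> bool:
    """True when the whole string matches the pattern from the schema."""
    return DOMAIN_PATTERN.fullmatch(value) is not None


# Values taken from the review comment.
should_pass = ["economy", "real-estate", "land-use", "e-commerce"]
should_block = ["land market", "Light Industry", "crime_justice", "AI"]

# Extra cases implied by the pattern: hyphens must sit between
# non-empty alphanumeric segments.
hyphen_edge_cases = ["-economy", "economy-", "real--estate"]

assert all(is_valid_domain(v) for v in should_pass)
assert not any(is_valid_domain(v) for v in should_block)
assert not any(is_valid_domain(v) for v in hyphen_edge_cases)
```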

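domains corrected in 99 files: spaces → hyphens, format now consistent ✅

This is the root-cause fix for Issue #102 (unifying the domains format): violations are hard-blocked at the schema level, with no further reliance on prompts or review. 👍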

@mingcha-dev mingcha-dev merged commit d86ee38 into MLT-OSS:main Apr 25, 2026
3 of 4 checks passed