[codex] Add API validation provenance and schemas #16
Review Summary by Qodo
Add API validation, provenance tracking, and JSON Schema artifacts
Walkthroughs
Description
• Adds per-certificate algorithm extraction provenance that tracks source, cache/fallback status, and row counts
• Implements an in-run PDF cache to centralize Security Policy fetch reuse across certificates
• Generates JSON Schema artifacts for API responses with validation support
• Adds a validate_api.py tool for internal consistency checks before publishing
• Includes real FIPS 140-2 and FIPS 140-3 policy text fixtures for regression testing
• Wires validation into the weekly update workflow to prevent publishing inconsistent artifacts
Diagram
flowchart LR
A["Algorithm Extraction"] -->|"provenance metadata"| B["AlgorithmExtractionResult"]
B -->|"source, cache, fallback"| C["Module/Detail Records"]
D["PDF Fetch"] -->|"in-run cache"| E["Reuse Detection"]
E -->|"cache hits tracked"| F["Extraction Metrics"]
G["Generated Artifacts"] -->|"validate_api.py"| H["Consistency Check"]
H -->|"JSON Schema validation"| I["Publish Gate"]
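The provenance record in the diagram could look roughly like the sketch below. Only the class name `AlgorithmExtractionResult` and the ideas of source, cache/fallback status, and row counts come from the description; every field name, default, and the `provenance()` helper are hypothetical and may not match the PR's actual code.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class AlgorithmExtractionResult:
    """Illustrative shape only; field names are assumptions, not the PR's API."""

    algorithms: List[str] = field(default_factory=list)
    source: str = "unknown"       # hypothetical: which extractor produced the rows
    cache_hit: bool = False       # hypothetical: PDF served from the in-run cache
    fallback_used: bool = False   # hypothetical: a fallback extractor was needed
    row_count: int = 0            # hypothetical: number of algorithm rows parsed

    def provenance(self) -> Dict[str, Any]:
        """Return everything except the algorithm list, for attaching to a record."""
        data = asdict(self)
        data.pop("algorithms")
        return data
```

Keeping provenance separate from the algorithm list lets a module record carry the metadata as an optional field without changing the existing response shape.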
Code Review by Qodo
1. Broken schema $ref paths
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ee83a15e4
    task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
    pdf_cache[url] = task

    result = await task
    return result if isinstance(result, bytes) else None, cache_hit
Evict failed PDF fetches from cache
Caching the asyncio.Task unconditionally means a transient fetch failure is memoized for the rest of the run: when fetch_with_retry(...) returns None, later certificates that reference the same policy URL hit the cached failed task and never retry. In datasets where multiple certs share one Security Policy URL, one temporary network error can cascade into persistent algorithm misses for every dependent certificate.
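One way to address this, sketched under assumptions: keep sharing the in-flight task so concurrent callers do not fetch twice, but drop the cache entry when the task yields anything other than bytes. The function name `fetch_pdf_cached` and the injected `fetch` callable are hypothetical stand-ins for the PR's code around `fetch_with_retry`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical cache type: policy URL -> in-flight or completed fetch task.
PdfCache = Dict[str, "asyncio.Task[Optional[bytes]]"]


async def fetch_pdf_cached(
    url: str,
    pdf_cache: PdfCache,
    fetch: Callable[[str], Awaitable[Optional[bytes]]],
) -> Tuple[Optional[bytes], bool]:
    """Return (pdf_bytes_or_None, cache_hit); failed fetches are evicted."""
    task = pdf_cache.get(url)
    cache_hit = task is not None
    if task is None:
        task = asyncio.create_task(fetch(url))
        pdf_cache[url] = task

    try:
        result = await task
    except Exception:
        result = None  # treat a raised error like a failed fetch

    if not isinstance(result, bytes):
        # Evict only if the cache still holds this task, so a newer
        # retry stored by another caller is not dropped by mistake.
        if pdf_cache.get(url) is task:
            del pdf_cache[url]
        return None, cache_hit
    return result, cache_hit
```

With this shape, a later certificate that references the same URL starts a fresh fetch after a failure, while successful downloads are still reused for the rest of the run.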
    metadata_path = "/api/schemas/metadata.schema.json"
    module_path = "/api/schemas/module.schema.json"
    module_in_process_path = "/api/schemas/module-in-process.schema.json"
Use deploy-base-aware schema refs
These schema reference paths are root-absolute (/api/...), but each schema $id is under https://hackidle.github.io/nist-cmvp-api/...; JSON Schema resolution will therefore point to https://hackidle.github.io/api/... instead of the repo subpath. Validators resolving $ref from hosted schemas on GitHub Pages will fail to locate linked schemas.
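The resolution behavior can be reproduced with the standard library's `urljoin`, which follows the same reference-resolution rules as `$ref` against `$id`. The exact `$id` below is an assumption for illustration; only the host and the `/nist-cmvp-api/` subpath come from the review.

```python
from urllib.parse import urljoin

# Assumed $id; the precise path under /nist-cmvp-api/ is hypothetical.
schema_id = "https://hackidle.github.io/nist-cmvp-api/api/schemas/modules.schema.json"

# A leading slash resets resolution to the host root and drops the repo subpath.
root_absolute = urljoin(schema_id, "/api/schemas/metadata.schema.json")

# A bare filename resolves next to the referencing schema.
relative = urljoin(schema_id, "metadata.schema.json")

print(root_absolute)  # https://hackidle.github.io/api/schemas/metadata.schema.json
print(relative)       # https://hackidle.github.io/nist-cmvp-api/api/schemas/metadata.schema.json
```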
    "properties": {
      "metadata": {
        "$ref": "/api/schemas/metadata.schema.json"
      },
      "modules": {
        "type": "array",
        "items": {
          "$ref": "/api/schemas/module.schema.json"
        }
      }
1. Broken schema $ref paths (Bug, Correctness)
The published JSON Schemas use absolute-path $ref values (e.g., "/api/schemas/metadata.schema.json") while their $id lives under "/nist-cmvp-api/...", so resolvers will fetch https://hackidle.github.io/api/... (missing "/nist-cmvp-api") and fail to resolve referenced schemas. Because scraper-generated schemas hardcode the same leading-slash refs, future runs will keep regenerating broken references.
Agent Prompt
## Issue description
JSON Schema documents under `api/schemas/` have `$id` values rooted at `https://hackidle.github.io/nist-cmvp-api/...`, but they reference other schemas using leading-slash `$ref` paths like `/api/schemas/metadata.schema.json`. Per URI resolution rules, a leading `/` resets to the host root, so these `$ref`s resolve to `https://hackidle.github.io/api/...` instead of `https://hackidle.github.io/nist-cmvp-api/api/...`, breaking schema consumers.
## Issue Context
Schemas are advertised for consumer validation; broken `$ref` breaks validation tooling for the response schemas that compose other schemas.
## Fix Focus Areas
- api/schemas/modules.schema.json[11-20]
- api/schemas/historical-modules.schema.json[11-20]
- api/schemas/modules-in-process.schema.json[11-20]
- scraper.py[2787-2797]
- scraper.py[2890-2942]
## Implementation notes
- Prefer **relative refs** (e.g., `"metadata.schema.json"`, `"module.schema.json"`, etc.) so they resolve relative to each schema’s `$id`.
- Alternatively, emit fully-qualified refs using `PUBLIC_BASE_URL` + schema path, but keep it consistent across all schema files.
- Regenerate/overwrite the checked-in schema JSON files to match the corrected generator output.
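As a rough illustration of the relative-ref option, the sketch below rewrites root-absolute `$ref` values into sibling filenames. It assumes all schemas sit in the same `api/schemas/` directory; the helper name `relativize_refs` is hypothetical, and the actual fix may belong in the generator in `scraper.py` rather than in a post-processing pass.

```python
from typing import Any

SCHEMA_PREFIX = "/api/schemas/"  # root-absolute prefix quoted in the review


def relativize_refs(node: Any) -> Any:
    """Return a copy of a schema with root-absolute $ref values made relative."""
    if isinstance(node, dict):
        rewritten = {}
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str) and value.startswith(SCHEMA_PREFIX):
                # "/api/schemas/module.schema.json" -> "module.schema.json"
                rewritten[key] = value[len(SCHEMA_PREFIX):]
            else:
                rewritten[key] = relativize_refs(value)
        return rewritten
    if isinstance(node, list):
        return [relativize_refs(item) for item in node]
    return node


schema = {
    "properties": {
        "metadata": {"$ref": "/api/schemas/metadata.schema.json"},
        "modules": {
            "type": "array",
            "items": {"$ref": "/api/schemas/module.schema.json"},
        },
    }
}
fixed = relativize_refs(schema)
```

Because the function returns a new structure, the input schema is left unchanged, which makes it easy to diff generator output before and after.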
Summary
Adds the first reliability and API-quality slice from the CMVP API audit:
- validate_api.py and wires the weekly update workflow to validate generated artifacts before committing
- api/algorithms.json when algorithm extraction is skipped
Impact
This keeps existing response fields intact while adding optional provenance/schema fields for consumers that want stronger validation and traceability. The weekly workflow now fails before publishing if generated artifacts are internally inconsistent or if a new run reports Firecrawl as the extraction source.
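A publish gate of this kind might be sketched as follows. This is not the contents of `validate_api.py`: the `metadata` and `modules` keys appear in the schemas quoted in the review, but `total_modules`, `algorithm_source`, and the `"firecrawl"` value are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List


def check_artifact(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means publishable."""
    errors: List[str] = []
    metadata = doc.get("metadata", {})
    modules = doc.get("modules", [])

    # Hypothetical consistency rule: the declared count must match the payload.
    declared = metadata.get("total_modules")
    if declared != len(modules):
        errors.append(f"metadata.total_modules={declared!r} but {len(modules)} modules present")

    # Hypothetical provenance rule: block runs that report Firecrawl as the source.
    if metadata.get("algorithm_source") == "firecrawl":
        errors.append("algorithm extraction source is Firecrawl")

    return errors
```

In a workflow step, a caller could print the returned problems and exit with a non-zero status whenever the list is non-empty, which stops the job before the commit step runs.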
Validation
- python3 validate_api.py
- python3 -m py_compile scraper.py test_scraper.py validate_api.py
- test_scraper.py in a temporary venv with scraper runtime dependencies
- git diff --check
Note: I did not run a full live scrape locally because the temporary local test environment did not include Crawl4AI. The GitHub update workflow installs full requirements and now validates the generated output after the real scrape.