
[codex] Add API validation provenance and schemas #16

Merged
ethanolivertroy merged 2 commits into main from codex/add-api-validation-provenance on May 14, 2026

@ethanolivertroy

Summary

Adds the first reliability and API-quality slice from the CMVP API audit:

  • records per-certificate algorithm extraction provenance, including configured source, actual source, cache/fallback state, source URL, and extracted row counts
  • adds extraction metrics to generated metadata and algorithm summary output
  • centralizes local Security Policy PDF fetch reuse through an in-run cache
  • adds representative real NIST FIPS 140-2 and FIPS 140-3 policy text fixtures
  • publishes JSON Schema artifacts for API responses and advertises them from docs/index payloads
  • adds validate_api.py and wires the weekly update workflow to validate generated artifacts before committing
  • removes stale api/algorithms.json when algorithm extraction is skipped

Impact

This keeps existing response fields intact while adding optional provenance/schema fields for consumers that want stronger validation and traceability. The weekly workflow now fails before publishing if generated artifacts are internally inconsistent or if a new run reports Firecrawl as the extraction source.
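
To make the provenance fields and the publish gate concrete, here is a minimal sketch of what a per-certificate record and a pre-publish check could look like. The field names, the `"firecrawl"` source label, and the `check_publishable` helper are assumptions for illustration; they are not taken from scraper.py or validate_api.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractionProvenance:
    """Hypothetical per-certificate provenance record (field names assumed)."""

    configured_source: str     # source the run was configured to use
    actual_source: str         # source that actually produced the rows
    cache_hit: bool            # PDF bytes were reused from the in-run cache
    used_fallback: bool        # a fallback path produced the result
    source_url: Optional[str]  # Security Policy URL, if known
    row_count: int             # number of extracted algorithm rows


def check_publishable(records: List[ExtractionProvenance]) -> List[str]:
    """Return error strings; an empty list means the run may be published."""
    errors = []
    for i, rec in enumerate(records):
        # "firecrawl" is an assumed label for the Firecrawl extraction source.
        if rec.actual_source == "firecrawl":
            errors.append(f"record {i}: Firecrawl reported as extraction source")
        if rec.row_count < 0:
            errors.append(f"record {i}: negative row count")
    return errors
```

A workflow step built on this idea would exit non-zero whenever the returned list is non-empty, so nothing is committed for a run that fails the gate.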

Validation

  • python3 validate_api.py
  • python3 -m py_compile scraper.py test_scraper.py validate_api.py
  • test_scraper.py in a temporary venv with scraper runtime dependencies
  • git diff --check

Note: I did not run a full live scrape locally because the temporary local test environment did not include Crawl4AI. The GitHub update workflow installs full requirements and now validates the generated output after the real scrape.

@ethanolivertroy ethanolivertroy marked this pull request as ready for review May 14, 2026 04:39
@ethanolivertroy ethanolivertroy merged commit 0e73814 into main May 14, 2026
2 checks passed
@ethanolivertroy ethanolivertroy deleted the codex/add-api-validation-provenance branch May 14, 2026 04:40
@qodo-code-review

Review Summary by Qodo

Add API validation, provenance tracking, and JSON Schema artifacts

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Adds per-certificate algorithm extraction provenance that records the source, cache/fallback
  status, and row counts
• Implements in-run PDF cache to centralize Security Policy fetch reuse across certificates
• Generates JSON Schema artifacts for API responses with validation support
• Adds validate_api.py tool for internal consistency checks before publishing
• Includes real FIPS 140-2 and FIPS 140-3 policy text fixtures for regression testing
• Wires validation into weekly update workflow to prevent publishing inconsistent artifacts
Diagram
flowchart LR
  A["Algorithm Extraction"] -->|"provenance metadata"| B["AlgorithmExtractionResult"]
  B -->|"source, cache, fallback"| C["Module/Detail Records"]
  D["PDF Fetch"] -->|"in-run cache"| E["Reuse Detection"]
  E -->|"cache hits tracked"| F["Extraction Metrics"]
  G["Generated Artifacts"] -->|"validate_api.py"| H["Consistency Check"]
  H -->|"JSON Schema validation"| I["Publish Gate"]


File Changes

1. scraper.py ✨ Enhancement +695/-38

Add extraction provenance, metrics, and PDF caching

scraper.py


2. validate_api.py 🧪 Tests +508/-0

New validation tool for API artifact consistency

validate_api.py


3. test_scraper.py 🧪 Tests +271/-0

Add provenance, metrics, and fixture-based tests

test_scraper.py


4. api/schemas/index.schema.json 📝 Documentation +49/-0

JSON Schema discovery document for API responses

api/schemas/index.schema.json


5. api/schemas/metadata.schema.json 📝 Documentation +70/-0

JSON Schema for API metadata endpoint

api/schemas/metadata.schema.json


6. api/schemas/module.schema.json 📝 Documentation +155/-0

JSON Schema for module row structure

api/schemas/module.schema.json


7. api/schemas/module-in-process.schema.json 📝 Documentation +30/-0

JSON Schema for in-process module rows

api/schemas/module-in-process.schema.json


8. api/schemas/modules.schema.json 📝 Documentation +22/-0

JSON Schema for active modules response

api/schemas/modules.schema.json


9. api/schemas/historical-modules.schema.json 📝 Documentation +22/-0

JSON Schema for historical modules response

api/schemas/historical-modules.schema.json


10. api/schemas/modules-in-process.schema.json 📝 Documentation +22/-0

JSON Schema for modules-in-process response

api/schemas/modules-in-process.schema.json


11. api/schemas/certificate-detail.schema.json 📝 Documentation +195/-0

JSON Schema for certificate detail response

api/schemas/certificate-detail.schema.json


12. api/schemas/algorithms.schema.json 📝 Documentation +50/-0

JSON Schema for algorithms summary response

api/schemas/algorithms.schema.json


13. README.md 📝 Documentation +36/-4

Document extraction provenance and JSON schemas

README.md


14. tests/fixtures/nist_security_policies/5260_fips_140_3_algorithms.txt 🧪 Tests +17/-0

FIPS 140-3 policy text fixture for testing

tests/fixtures/nist_security_policies/5260_fips_140_3_algorithms.txt


15. tests/fixtures/nist_security_policies/5152_fips_140_2_algorithms.txt 🧪 Tests +21/-0

FIPS 140-2 policy text fixture for testing

tests/fixtures/nist_security_policies/5152_fips_140_2_algorithms.txt


16. .github/workflows/validate.yml ⚙️ Configuration changes +47/-0

New validation workflow for PR checks

.github/workflows/validate.yml


17. .github/workflows/update-data.yml ⚙️ Configuration changes +5/-0

Wire validation into weekly update workflow

.github/workflows/update-data.yml



@qodo-code-review

qodo-code-review Bot commented May 14, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)



Action required

1. Broken schema $ref paths 🐞 Bug ≡ Correctness
Description
The published JSON Schemas use absolute-path $ref values (e.g., "/api/schemas/metadata.schema.json")
while their $id lives under "/nist-cmvp-api/...", so resolvers will fetch
https://hackidle.github.io/api/... (missing "/nist-cmvp-api") and fail to resolve referenced
schemas. Because scraper-generated schemas hardcode the same leading-slash refs, future runs will
keep regenerating broken references.
Code

api/schemas/modules.schema.json[R11-20]

+  "properties": {
+    "metadata": {
+      "$ref": "/api/schemas/metadata.schema.json"
+    },
+    "modules": {
+      "type": "array",
+      "items": {
+        "$ref": "/api/schemas/module.schema.json"
+      }
+    }
Evidence
modules.schema.json has a $id rooted under /nist-cmvp-api/... but references other schemas via
leading-slash $ref, which resolves to the site root instead of the repo base path. The schema
generator in scraper.py hardcodes the same leading-slash ref strings, so the issue will persist
across regenerations until fixed in code.

api/schemas/modules.schema.json[2-20]
api/schemas/modules-in-process.schema.json[2-20]
scraper.py[2787-2797]
scraper.py[2890-2942]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
JSON Schema documents under `api/schemas/` have `$id` values rooted at `https://hackidle.github.io/nist-cmvp-api/...`, but they reference other schemas using leading-slash `$ref` paths like `/api/schemas/metadata.schema.json`. Per URI resolution rules, a leading `/` resets to the host root, so these `$ref`s resolve to `https://hackidle.github.io/api/...` instead of `https://hackidle.github.io/nist-cmvp-api/api/...`, breaking schema consumers.

## Issue Context
Schemas are advertised for consumer validation; broken `$ref` breaks validation tooling for the response schemas that compose other schemas.

## Fix Focus Areas
- api/schemas/modules.schema.json[11-20]
- api/schemas/historical-modules.schema.json[11-20]
- api/schemas/modules-in-process.schema.json[11-20]
- scraper.py[2787-2797]
- scraper.py[2890-2942]

## Implementation notes
- Prefer **relative refs** (e.g., `"metadata.schema.json"`, `"module.schema.json"`, etc.) so they resolve relative to each schema’s `$id`.
- Alternatively, emit fully-qualified refs using `PUBLIC_BASE_URL` + schema path, but keep it consistent across all schema files.
- Regenerate/overwrite the checked-in schema JSON files to match the corrected generator output.
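
The resolution behavior described above can be reproduced with the standard library. This is a small sketch, assuming a full `$id` of `https://hackidle.github.io/nist-cmvp-api/api/schemas/modules.schema.json`; the review only states that `$id` values live under `https://hackidle.github.io/nist-cmvp-api/...`, so the exact path is an assumption. `resolve_ref` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed $id for illustration; only the /nist-cmvp-api/ prefix is confirmed
# by the review text.
SCHEMA_ID = "https://hackidle.github.io/nist-cmvp-api/api/schemas/modules.schema.json"


def resolve_ref(schema_id: str, ref: str) -> str:
    """Resolve a $ref against the schema's $id using URI reference rules."""
    return urljoin(schema_id, ref)


# A leading slash resets the path to the host root and drops /nist-cmvp-api.
root_absolute = resolve_ref(SCHEMA_ID, "/api/schemas/metadata.schema.json")

# A relative ref resolves next to the referencing schema.
relative = resolve_ref(SCHEMA_ID, "metadata.schema.json")

print(root_absolute)  # https://hackidle.github.io/api/schemas/metadata.schema.json
print(relative)       # https://hackidle.github.io/nist-cmvp-api/api/schemas/metadata.schema.json
```

Under that assumption, the relative form is the only one of the two that stays inside the repository's GitHub Pages subpath.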




Remediation recommended

2. PDF cache retains all bytes 🐞 Bug ➹ Performance
Description
fetch_policy_pdf_bytes caches an asyncio.Task per PDF URL in a shared dict and never evicts
entries, so completed tasks retain PDF bytes in memory for the entire scrape run and the cache grows
with the number of distinct policy URLs fetched. Within a run, a failed fetch result is also reused
for that URL (no re-fetch), which can amplify transient failures.
Code

scraper.py[R1329-1344]

+async def fetch_policy_pdf_bytes(
+    client: httpx.AsyncClient,
+    url: str,
+    pdf_cache: Dict[str, asyncio.Task],
+    pdf_cache_lock: asyncio.Lock,
+) -> Tuple[Optional[bytes], bool]:
+    """Fetch Security Policy PDF bytes through an in-run task cache."""
+    async with pdf_cache_lock:
+        task = pdf_cache.get(url)
+        cache_hit = task is not None
+        if task is None:
+            task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
+            pdf_cache[url] = task
+
+    result = await task
+    return result if isinstance(result, bytes) else None, cache_hit
Evidence
fetch_policy_pdf_bytes inserts a created task into pdf_cache and never removes it.
build_certificate_artifacts creates a single shared pdf_cache dict for the dataset run; with
thousands of certificate details, this implies the cache can grow to thousands of retained task
results (and thus retained PDF bytes) during one run.

scraper.py[1295-1345]
scraper.py[1752-1762]
api/metadata.json[2-8]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`fetch_policy_pdf_bytes()` stores a task in `pdf_cache[url]` and never removes it. Since the task result holds the PDF bytes, the cache retains every fetched PDF in memory until the run ends; for large runs this can materially increase peak memory. Additionally, if the first fetch for a URL fails, later calls within the same run reuse the same failed task result and do not retry.

## Issue Context
The cache is intended to dedupe concurrent/duplicate fetches. You can keep that benefit while avoiding retaining all completed PDFs.

## Fix Focus Areas
- scraper.py[1329-1344]
- scraper.py[1752-1762]

## Implementation notes
- Treat this as an **in-flight** dedupe cache:
 - Insert a task under the lock.
 - `await` it.
 - Then, under the lock, `pdf_cache.pop(url, None)` so completed tasks can be GC’d.
- If you want reuse beyond in-flight dedupe, implement a bounded LRU of bytes (max entries / max total bytes) and ensure failures are not cached (or expire quickly).
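
The in-flight dedupe variant from these notes might look like the sketch below. It is not the code in scraper.py: `fetch_bytes_deduped` is a hypothetical name, and the `fetch` parameter stands in for `fetch_with_retry(client, url, response_type="bytes")` so the example stays self-contained.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, Optional, Tuple

# Stand-in type for the real fetcher; it returns bytes on success, None on failure.
Fetch = Callable[[str], Coroutine[Any, Any, Optional[bytes]]]


async def fetch_bytes_deduped(
    url: str,
    fetch: Fetch,
    inflight: Dict[str, "asyncio.Task[Optional[bytes]]"],
    lock: asyncio.Lock,
) -> Tuple[Optional[bytes], bool]:
    """Share one fetch among concurrent callers, then forget the task."""
    async with lock:
        task = inflight.get(url)
        shared = task is not None  # True when we joined an existing fetch
        if task is None:
            task = asyncio.create_task(fetch(url))
            inflight[url] = task

    try:
        result = await task
    finally:
        # Evict once the fetch settles so the bytes can be garbage collected
        # and a later call for the same URL triggers a fresh attempt.
        async with lock:
            if inflight.get(url) is task:
                del inflight[url]

    return (result if isinstance(result, bytes) else None), shared
```

With this shape the dict only ever holds fetches that are still running, so peak memory tracks concurrency rather than the number of distinct policy URLs. The trade-off is that a URL requested again after its fetch completes is downloaded again.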



3. PR CI skips strict validation 🐞 Bug ☼ Reliability
Description
The new PR validation workflow runs validate_api.py without --require-current-schema, so CI will
not fail when checked-in artifacts are missing newly introduced fields/links (schemas advertising,
extraction_metrics, algorithm_extraction provenance). In this branch, api/index.json and
api/metadata.json are still missing the new schema/metrics fields, demonstrating the gap.
Code

.github/workflows/validate.yml[R41-44]

+      - name: Validate checked-in API artifacts
+        run: |
+          python validate_api.py
+
Evidence
The workflow runs the validator without the flag that turns on current-schema checks;
validate_api.py shows those checks are conditional on require_current_schema. The current
checked-in api/index.json lacks the new schemas field and feature flags, and api/metadata.json
lacks extraction_metrics/algorithm_extraction_schema_version, which strict validation would
catch.

.github/workflows/validate.yml[35-44]
validate_api.py[374-460]
api/index.json[14-36]
api/metadata.json[2-13]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`.github/workflows/validate.yml` runs `python validate_api.py` without `--require-current-schema`, but `validate_api.py` only enforces new schema/metrics/provenance requirements when that flag is enabled. This allows PRs to merge with stale checked-in artifacts even when code/docs claim new API surface.

## Issue Context
The update workflow already runs strict validation (`--require-current-schema`). PR CI should match that strictness (or the PR should regenerate checked-in artifacts before enabling it).

## Fix Focus Areas
- .github/workflows/validate.yml[41-44]
- validate_api.py[388-460]

## Implementation notes
- Update the workflow step to:
 - `python validate_api.py --require-current-schema`
- Then regenerate and commit the `api/*.json`, docs, and schema outputs so PR CI passes under the stricter validation (or temporarily gate the flag until artifacts are updated in this PR).
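
To show why the flag matters, here is a minimal sketch of a validator whose newer checks only run in strict mode. It is an illustration of the pattern, not the contents of validate_api.py: the `validate` and `parse_args` helpers are hypothetical, and only the flag name and the field names (`schemas`, `extraction_metrics`, `algorithm_extraction_schema_version`) come from the review text.

```python
import argparse
from typing import Any, Dict, List, Optional, Sequence


def validate(
    index: Dict[str, Any],
    metadata: Dict[str, Any],
    require_current_schema: bool,
) -> List[str]:
    """Return a list of problems; strict checks run only when requested."""
    errors: List[str] = []
    if require_current_schema:
        if "schemas" not in index:
            errors.append("index.json: missing 'schemas'")
        for key in ("extraction_metrics", "algorithm_extraction_schema_version"):
            if key not in metadata:
                errors.append(f"metadata.json: missing '{key}'")
    return errors


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Validate generated API artifacts")
    parser.add_argument("--require-current-schema", action="store_true")
    return parser.parse_args(argv)
```

In this sketch, stale artifacts pass silently without the flag (`validate({}, {}, False)` returns an empty list) and produce three errors with it, which is the gap the review describes for PR CI.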


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ee83a15e4


Comment thread scraper.py
Comment on lines +1340 to +1344
            task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
            pdf_cache[url] = task

    result = await task
    return result if isinstance(result, bytes) else None, cache_hit

P1 Badge Evict failed PDF fetches from cache

Caching the asyncio.Task unconditionally means a transient fetch failure is memoized for the rest of the run: when fetch_with_retry(...) returns None, later certificates that reference the same policy URL hit the cached failed task and never retry. In datasets where multiple certs share one Security Policy URL, one temporary network error can cascade into persistent algorithm misses for every dependent certificate.
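
One way to keep successful results cached for the run while letting failures be retried is to evict a task whose result is `None`. The sketch below assumes, as the comment states, that the fetcher returns `None` on failure. `fetch_cached_success_only` is a hypothetical name, and `fetch` stands in for the real `fetch_with_retry` call.

```python
import asyncio
from typing import Any, Callable, Coroutine, Dict, Optional

# Stand-in type for the real fetcher; it returns bytes on success, None on failure.
Fetch = Callable[[str], Coroutine[Any, Any, Optional[bytes]]]


async def fetch_cached_success_only(
    url: str,
    fetch: Fetch,
    cache: Dict[str, "asyncio.Task[Optional[bytes]]"],
    lock: asyncio.Lock,
) -> Optional[bytes]:
    """Reuse successful fetches within a run, but never memoize a failure."""
    async with lock:
        task = cache.get(url)
        if task is None:
            task = asyncio.create_task(fetch(url))
            cache[url] = task

    result = await task
    if result is None:
        # Drop the failed task so the next certificate that references this
        # URL starts a fresh fetch instead of inheriting the failure.
        async with lock:
            if cache.get(url) is task:
                del cache[url]
    return result
```

This keeps the dedupe benefit for certificates that share a Security Policy URL, at the cost of still retaining successful PDF bytes until the run ends.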


Comment thread scraper.py
Comment on lines +2892 to +2894
metadata_path = "/api/schemas/metadata.schema.json"
module_path = "/api/schemas/module.schema.json"
module_in_process_path = "/api/schemas/module-in-process.schema.json"

P2 Badge Use deploy-base-aware schema refs

These schema reference paths are root-absolute (/api/...), but each schema $id is under https://hackidle.github.io/nist-cmvp-api/...; JSON Schema resolution will therefore point to https://hackidle.github.io/api/... instead of the repo subpath. Validators resolving $ref from hosted schemas on GitHub Pages will fail to locate linked schemas.
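
A small lint can catch this class of problem before schemas are published. The sketch below walks a schema document and reports every `$ref` that starts with a slash. `find_root_absolute_refs` is a hypothetical helper, not part of validate_api.py.

```python
from typing import Any, Iterator


def find_root_absolute_refs(node: Any, path: str = "#") -> Iterator[str]:
    """Yield the location of each $ref whose value starts with '/'."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == "$ref" and isinstance(value, str) and value.startswith("/"):
                yield child
            else:
                yield from find_root_absolute_refs(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_root_absolute_refs(item, f"{path}/{i}")


schema = {
    "properties": {
        "metadata": {"$ref": "/api/schemas/metadata.schema.json"},
        "modules": {"type": "array", "items": {"$ref": "module.schema.json"}},
    }
}
print(list(find_root_absolute_refs(schema)))  # ['#/properties/metadata/$ref']
```

Running such a check over every file in `api/schemas/` would flag refs that escape the deploy base path, whichever fix (relative or fully qualified refs) is chosen.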


Comment on lines +11 to +20
  "properties": {
    "metadata": {
      "$ref": "/api/schemas/metadata.schema.json"
    },
    "modules": {
      "type": "array",
      "items": {
        "$ref": "/api/schemas/module.schema.json"
      }
    }

Action required

1. Broken schema $ref paths 🐞 Bug ≡ Correctness

The published JSON Schemas use absolute-path $ref values (e.g., "/api/schemas/metadata.schema.json")
while their $id lives under "/nist-cmvp-api/...", so resolvers will fetch
https://hackidle.github.io/api/... (missing "/nist-cmvp-api") and fail to resolve referenced
schemas. Because scraper-generated schemas hardcode the same leading-slash refs, future runs will
keep regenerating broken references.
