[codex] Add API validation provenance and schemas by ethanolivertroy · Pull Request #16 · hackIDLE/nist-cmvp-api

ethanolivertroy · 2026-05-14T04:38:00Z

Summary

Adds the first reliability and API-quality slice from the CMVP API audit:

- records per-certificate algorithm extraction provenance, including configured source, actual source, cache/fallback state, source URL, and extracted row counts
- adds extraction metrics to generated metadata and algorithm summary output
- centralizes local Security Policy PDF fetch reuse through an in-run cache
- adds representative real NIST FIPS 140-2 and FIPS 140-3 policy text fixtures
- publishes JSON Schema artifacts for API responses and advertises them from docs/index payloads
- adds validate_api.py and wires the weekly update workflow to validate generated artifacts before committing
- removes stale api/algorithms.json when algorithm extraction is skipped
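
To make the provenance bullet concrete, here is a minimal sketch of what one per-certificate record and a run-level metrics rollup could look like. The class name, every field name, the source labels, and the example URL are assumptions chosen for illustration; the PR description does not spell out the actual definitions in scraper.py.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExtractionProvenance:
    """Hypothetical per-certificate provenance record (all field names assumed)."""

    configured_source: str      # extractor the run was configured to use
    actual_source: str          # extractor that actually produced the rows
    cache_hit: bool             # True if the PDF came from the in-run cache
    used_fallback: bool         # True if the configured source was not used
    source_url: Optional[str]   # Security Policy URL, if one was available
    extracted_rows: int         # number of algorithm rows extracted


def summarize(records: List[ExtractionProvenance]) -> Dict[str, object]:
    """Roll per-certificate provenance up into run-level extraction metrics."""
    by_source = Counter(r.actual_source for r in records)
    return {
        "certificates": len(records),
        "by_actual_source": dict(by_source),
        "cache_hits": sum(r.cache_hit for r in records),
        "fallbacks": sum(r.used_fallback for r in records),
        "extracted_rows": sum(r.extracted_rows for r in records),
    }
```

Keeping `configured_source` and `actual_source` as separate fields is what lets a consumer see when a fallback happened, rather than inferring it from missing data.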

Impact

This keeps existing response fields intact while adding optional provenance/schema fields for consumers that want stronger validation and traceability. The weekly workflow now fails before publishing if generated artifacts are internally inconsistent or if a new run reports Firecrawl as the extraction source.
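
The Firecrawl gate described above could be sketched roughly as follows. The `algorithm_extraction` key, the `actual_source` field, and the `certificate_number` key are assumed names for illustration only; validate_api.py may structure this check differently.

```python
from typing import Any, Iterable, List, Mapping


def firecrawl_violations(details: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return one message per certificate whose extraction source is Firecrawl."""
    problems: List[str] = []
    for detail in details:
        # Records without provenance are skipped; the fields are optional.
        provenance = detail.get("algorithm_extraction") or {}
        source = str(provenance.get("actual_source", "")).lower()
        if source == "firecrawl":
            cert = detail.get("certificate_number", "<unknown>")
            problems.append(f"certificate {cert}: extracted via Firecrawl")
    return problems
```

A workflow step would then exit non-zero whenever the returned list is non-empty, which blocks the commit.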

Validation

- `python3 validate_api.py`
- `python3 -m py_compile scraper.py test_scraper.py validate_api.py`
- `test_scraper.py` in a temporary venv with scraper runtime dependencies
- `git diff --check`

Note: I did not run a full live scrape locally because the temporary local test environment did not include Crawl4AI. The GitHub update workflow installs full requirements and now validates the generated output after the real scrape.

qodo-code-review · 2026-05-14T04:40:15Z

Review Summary by Qodo

Add API validation, provenance tracking, and JSON Schema artifacts

✨ Enhancement 🧪 Tests

Walkthroughs

Description

• Adds per-certificate algorithm extraction provenance tracking source, cache/fallback status, and
  row counts
• Implements in-run PDF cache to centralize Security Policy fetch reuse across certificates
• Generates JSON Schema artifacts for API responses with validation support
• Adds validate_api.py tool for internal consistency checks before publishing
• Includes real FIPS 140-2 and FIPS 140-3 policy text fixtures for regression testing
• Wires validation into weekly update workflow to prevent publishing inconsistent artifacts

Diagram

flowchart LR
  A["Algorithm Extraction"] -->|"provenance metadata"| B["AlgorithmExtractionResult"]
  B -->|"source, cache, fallback"| C["Module/Detail Records"]
  D["PDF Fetch"] -->|"in-run cache"| E["Reuse Detection"]
  E -->|"cache hits tracked"| F["Extraction Metrics"]
  G["Generated Artifacts"] -->|"validate_api.py"| H["Consistency Check"]
  H -->|"JSON Schema validation"| I["Publish Gate"]

File Changes

1. scraper.py ✨ Enhancement +695/-38

Add extraction provenance, metrics, and PDF caching

2. validate_api.py 🧪 Tests +508/-0

New validation tool for API artifact consistency

3. test_scraper.py 🧪 Tests +271/-0

Add provenance, metrics, and fixture-based tests

4. api/schemas/index.schema.json 📝 Documentation +49/-0

JSON Schema discovery document for API responses

5. api/schemas/metadata.schema.json 📝 Documentation +70/-0

JSON Schema for API metadata endpoint

6. api/schemas/module.schema.json 📝 Documentation +155/-0

JSON Schema for module row structure

7. api/schemas/module-in-process.schema.json 📝 Documentation +30/-0

JSON Schema for in-process module rows

8. api/schemas/modules.schema.json 📝 Documentation +22/-0

JSON Schema for active modules response

9. api/schemas/historical-modules.schema.json 📝 Documentation +22/-0

JSON Schema for historical modules response

10. api/schemas/modules-in-process.schema.json 📝 Documentation +22/-0

JSON Schema for modules-in-process response

11. api/schemas/certificate-detail.schema.json 📝 Documentation +195/-0

JSON Schema for certificate detail response

12. api/schemas/algorithms.schema.json 📝 Documentation +50/-0

JSON Schema for algorithms summary response

13. README.md 📝 Documentation +36/-4

Document extraction provenance and JSON schemas

14. tests/fixtures/nist_security_policies/5260_fips_140_3_algorithms.txt 🧪 Tests +17/-0

FIPS 140-3 policy text fixture for testing

15. tests/fixtures/nist_security_policies/5152_fips_140_2_algorithms.txt 🧪 Tests +21/-0

FIPS 140-2 policy text fixture for testing

16. .github/workflows/validate.yml ⚙️ Configuration changes +47/-0

New validation workflow for PR checks

17. .github/workflows/update-data.yml ⚙️ Configuration changes +5/-0

Wire validation into weekly update workflow

qodo-code-review · 2026-05-14T04:40:17Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0)

1. Broken schema $ref paths 🐞 Bug ≡ Correctness

Description

The published JSON Schemas use absolute-path $ref values (e.g., "/api/schemas/metadata.schema.json")
while their $id lives under "/nist-cmvp-api/...", so resolvers will fetch
https://hackidle.github.io/api/... (missing "/nist-cmvp-api") and fail to resolve referenced
schemas. Because scraper-generated schemas hardcode the same leading-slash refs, future runs will
keep regenerating broken references.

Code

api/schemas/modules.schema.json[R11-20]

+  "properties": {
+    "metadata": {
+      "$ref": "/api/schemas/metadata.schema.json"
+    },
+    "modules": {
+      "type": "array",
+      "items": {
+        "$ref": "/api/schemas/module.schema.json"
+      }
+    }

Evidence

modules.schema.json has a $id rooted under /nist-cmvp-api/... but references other schemas via
leading-slash $ref, which resolves to the site root instead of the repo base path. The schema
generator in scraper.py hardcodes the same leading-slash ref strings, so the issue will persist
across regenerations until fixed in code.

api/schemas/modules.schema.json[2-20]
api/schemas/modules-in-process.schema.json[2-20]
scraper.py[2787-2797]
scraper.py[2890-2942]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
JSON Schema documents under `api/schemas/` have `$id` values rooted at `https://hackidle.github.io/nist-cmvp-api/...`, but they reference other schemas using leading-slash `$ref` paths like `/api/schemas/metadata.schema.json`. Per URI resolution rules, a leading `/` resets to the host root, so these `$ref`s resolve to `https://hackidle.github.io/api/...` instead of `https://hackidle.github.io/nist-cmvp-api/api/...`, breaking schema consumers.

## Issue Context
Schemas are advertised for consumer validation; broken `$ref` breaks validation tooling for the response schemas that compose other schemas.

## Fix Focus Areas
- api/schemas/modules.schema.json[11-20]
- api/schemas/historical-modules.schema.json[11-20]
- api/schemas/modules-in-process.schema.json[11-20]
- scraper.py[2787-2797]
- scraper.py[2890-2942]

## Implementation notes
- Prefer **relative refs** (e.g., `"metadata.schema.json"`, `"module.schema.json"`, etc.) so they resolve relative to each schema’s `$id`.
- Alternatively, emit fully-qualified refs using `PUBLIC_BASE_URL` + schema path, but keep it consistent across all schema files.
- Regenerate/overwrite the checked-in schema JSON files to match the corrected generator output.
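
The resolution behaviour described in this issue can be reproduced with the standard library. The sketch below uses `urllib.parse.urljoin`, which applies the usual URI reference-resolution rules. The full `$id` shown is an assumption (the review only states that `$id` values are rooted under `https://hackidle.github.io/nist-cmvp-api/`), and `resolve_ref` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed $id for illustration; only the host and "/nist-cmvp-api/" prefix
# are stated in the review.
SCHEMA_ID = "https://hackidle.github.io/nist-cmvp-api/api/schemas/modules.schema.json"


def resolve_ref(ref: str, base: str = SCHEMA_ID) -> str:
    """Resolve a $ref against the schema's $id as a URI reference."""
    return urljoin(base, ref)


# A leading slash resets the path to the host root, dropping "/nist-cmvp-api".
print(resolve_ref("/api/schemas/metadata.schema.json"))
# -> https://hackidle.github.io/api/schemas/metadata.schema.json

# A relative ref stays next to the referencing schema.
print(resolve_ref("metadata.schema.json"))
# -> https://hackidle.github.io/nist-cmvp-api/api/schemas/metadata.schema.json
```

Relative refs have the added benefit that the schemas keep resolving if the site is ever served from a different base path.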

2. PDF cache retains all bytes 🐞 Bug ➹ Performance

Description

fetch_policy_pdf_bytes caches an asyncio.Task per PDF URL in a shared dict and never evicts
entries, so completed tasks retain PDF bytes in memory for the entire scrape run and the cache grows
with the number of distinct policy URLs fetched. Within a run, a failed fetch result is also reused
for that URL (no re-fetch), which can amplify transient failures.

Code

scraper.py[R1329-1344]

+async def fetch_policy_pdf_bytes(
+    client: httpx.AsyncClient,
+    url: str,
+    pdf_cache: Dict[str, asyncio.Task],
+    pdf_cache_lock: asyncio.Lock,
+) -> Tuple[Optional[bytes], bool]:
+    """Fetch Security Policy PDF bytes through an in-run task cache."""
+    async with pdf_cache_lock:
+        task = pdf_cache.get(url)
+        cache_hit = task is not None
+        if task is None:
+            task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
+            pdf_cache[url] = task
+
+    result = await task
+    return result if isinstance(result, bytes) else None, cache_hit

Evidence

fetch_policy_pdf_bytes inserts a created task into pdf_cache and never removes it.
build_certificate_artifacts creates a single shared pdf_cache dict for the dataset run; with
thousands of certificate details, this implies the cache can grow to thousands of retained task
results (and thus retained PDF bytes) during one run.

scraper.py[1295-1345]
scraper.py[1752-1762]
api/metadata.json[2-8]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
`fetch_policy_pdf_bytes()` stores a task in `pdf_cache[url]` and never removes it. Since the task result holds the PDF bytes, the cache retains every fetched PDF in memory until the run ends; for large runs this can materially increase peak memory. Additionally, if the first fetch for a URL fails, later calls within the same run reuse the same failed task result and do not retry.

## Issue Context
The cache is intended to dedupe concurrent/duplicate fetches. You can keep that benefit while avoiding retaining all completed PDFs.

## Fix Focus Areas
- scraper.py[1329-1344]
- scraper.py[1752-1762]

## Implementation notes
- Treat this as an **in-flight** dedupe cache:
  - Insert a task under the lock.
  - `await` it.
  - Then, under the lock, `pdf_cache.pop(url, None)` so completed tasks can be GC’d.
- If you want reuse beyond in-flight dedupe, implement a bounded LRU of bytes (max entries / max total bytes) and ensure failures are not cached (or expire quickly).
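
The in-flight dedupe variant from these notes might look like the sketch below. It is not the scraper's code: `fetch_deduped` and the injected `fetch` callable are hypothetical stand-ins (the real function calls `fetch_with_retry` with an `httpx.AsyncClient`), and the fetcher is assumed to return `None` on failure rather than raise.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical fetcher type: returns PDF bytes, or None on failure.
Fetcher = Callable[[str], Awaitable[Optional[bytes]]]


async def fetch_deduped(
    url: str,
    fetch: Fetcher,
    in_flight: Dict[str, "asyncio.Task[Optional[bytes]]"],
    lock: asyncio.Lock,
) -> Tuple[Optional[bytes], bool]:
    """Share one request among concurrent callers, then forget the task."""
    async with lock:
        task = in_flight.get(url)
        shared = task is not None
        if task is None:
            task = asyncio.create_task(fetch(url))
            in_flight[url] = task
    try:
        result = await task
    finally:
        async with lock:
            # Remove only our own task so a newer request is not evicted.
            if in_flight.get(url) is task:
                del in_flight[url]
    return (result if isinstance(result, bytes) else None), shared
```

Because the entry is dropped as soon as the task completes, the dict never holds more PDFs than there are concurrent requests, and a failed fetch is retried on the next call. The trade-off is that sequential requests for the same URL fetch again.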

3. PR CI skips strict validation 🐞 Bug ☼ Reliability

Description

The new PR validation workflow runs validate_api.py without --require-current-schema, so CI will
not fail when checked-in artifacts are missing newly introduced fields/links (schemas advertising,
extraction_metrics, algorithm_extraction provenance). In this branch, api/index.json and
api/metadata.json are still missing the new schema/metrics fields, demonstrating the gap.

Code

.github/workflows/validate.yml[R41-44]

+      - name: Validate checked-in API artifacts
+        run: |
+          python validate_api.py
+

Evidence

The workflow runs the validator without the flag that turns on current-schema checks;
validate_api.py shows those checks are conditional on require_current_schema. The current
checked-in api/index.json lacks the new schemas field and feature flags, and api/metadata.json
lacks extraction_metrics/algorithm_extraction_schema_version, which strict validation would
catch.

.github/workflows/validate.yml[35-44]
validate_api.py[374-460]
api/index.json[14-36]
api/metadata.json[2-13]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution.

## Issue description
`.github/workflows/validate.yml` runs `python validate_api.py` without `--require-current-schema`, but `validate_api.py` only enforces new schema/metrics/provenance requirements when that flag is enabled. This allows PRs to merge with stale checked-in artifacts even when code/docs claim new API surface.

## Issue Context
The update workflow already runs strict validation (`--require-current-schema`). PR CI should match that strictness (or the PR should regenerate checked-in artifacts before enabling it).

## Fix Focus Areas
- .github/workflows/validate.yml[41-44]
- validate_api.py[388-460]

## Implementation notes
- Update the workflow step to:
  - `python validate_api.py --require-current-schema`
- Then regenerate and commit the `api/*.json`, docs, and schema outputs so PR CI passes under the stricter validation (or temporarily gate the flag until artifacts are updated in this PR).
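
For readers unfamiliar with how such a flag typically gates checks, here is a minimal sketch. The flag name and the three field names (`schemas`, `extraction_metrics`, `algorithm_extraction_schema_version`) come from the review above; the function names and overall structure are assumptions and do not reproduce validate_api.py.

```python
import argparse
from typing import Any, List, Mapping, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Validate generated API artifacts")
    parser.add_argument(
        "--require-current-schema",
        action="store_true",
        help="also fail when newly introduced fields are missing",
    )
    return parser.parse_args(argv)


def check_artifacts(
    index: Mapping[str, Any],
    metadata: Mapping[str, Any],
    require_current_schema: bool,
) -> List[str]:
    """Return error messages; an empty list means the artifacts passed."""
    errors: List[str] = []
    if require_current_schema:
        # Strict mode: stale artifacts that predate the new fields are rejected.
        if "schemas" not in index:
            errors.append("index.json: missing 'schemas'")
        for key in ("extraction_metrics", "algorithm_extraction_schema_version"):
            if key not in metadata:
                errors.append(f"metadata.json: missing '{key}'")
    return errors
```

Run without the flag, stale artifacts pass silently, which is the gap this finding describes.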

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ee83a15e4

chatgpt-codex-connector · 2026-05-14T04:43:59Z

+            task = asyncio.create_task(fetch_with_retry(client, url, response_type="bytes"))
+            pdf_cache[url] = task
+
+    result = await task
+    return result if isinstance(result, bytes) else None, cache_hit


Evict failed PDF fetches from cache

Caching the asyncio.Task unconditionally means a transient fetch failure is memoized for the rest of the run: when fetch_with_retry(...) returns None, later certificates that reference the same policy URL hit the cached failed task and never retry. In datasets where multiple certs share one Security Policy URL, one temporary network error can cascade into persistent algorithm misses for every dependent certificate.
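
One way to act on this suggestion while keeping successful results cached for the rest of the run is to evict only failed tasks. The sketch below is an assumption-laden illustration: `fetch_with_failure_eviction` and the injected `fetch` callable are hypothetical names, and the fetcher is assumed to signal failure by returning `None`, as the comment says `fetch_with_retry(...)` does.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional, Tuple

# Hypothetical fetcher type: returns PDF bytes, or None on failure.
Fetcher = Callable[[str], Awaitable[Optional[bytes]]]


async def fetch_with_failure_eviction(
    url: str,
    fetch: Fetcher,
    cache: Dict[str, "asyncio.Task[Optional[bytes]]"],
    lock: asyncio.Lock,
) -> Tuple[Optional[bytes], bool]:
    """Cache successful fetches for the run; drop failures so they retry."""
    async with lock:
        task = cache.get(url)
        cache_hit = task is not None
        if task is None:
            task = asyncio.create_task(fetch(url))
            cache[url] = task

    result = await task
    if not isinstance(result, bytes):
        async with lock:
            # Evict only the failed task we awaited, not a newer replacement.
            if cache.get(url) is task:
                del cache[url]
        return None, cache_hit
    return result, cache_hit
```

This addresses the cascading-failure concern but, unlike a pure in-flight cache, still retains the bytes of every successful fetch until the run ends.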

chatgpt-codex-connector · 2026-05-14T04:43:59Z

+    metadata_path = "/api/schemas/metadata.schema.json"
+    module_path = "/api/schemas/module.schema.json"
+    module_in_process_path = "/api/schemas/module-in-process.schema.json"


Use deploy-base-aware schema refs

These schema reference paths are root-absolute (/api/...), but each schema $id is under https://hackidle.github.io/nist-cmvp-api/...; JSON Schema resolution will therefore point to https://hackidle.github.io/api/... instead of the repo subpath. Validators resolving $ref from hosted schemas on GitHub Pages will fail to locate linked schemas.
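
A deploy-base-aware generator could build every `$id` and `$ref` from one base URL, as sketched below. The `PUBLIC_BASE_URL` name appears in the Qodo notes for the same issue, but its value here, the `SCHEMA_DIR` constant, and both function names are assumptions for illustration rather than the code in scraper.py.

```python
from typing import Any, Dict
from urllib.parse import urljoin

# Assumed value; the trailing slash matters for urljoin.
PUBLIC_BASE_URL = "https://hackidle.github.io/nist-cmvp-api/"
SCHEMA_DIR = "api/schemas/"  # no leading slash, so the base path is kept


def schema_url(filename: str) -> str:
    """Build a fully qualified schema URL under the deployed base path."""
    return urljoin(PUBLIC_BASE_URL, SCHEMA_DIR + filename)


def modules_schema() -> Dict[str, Any]:
    """Hypothetical generator for the modules response schema."""
    return {
        "$id": schema_url("modules.schema.json"),
        "type": "object",
        "properties": {
            "metadata": {"$ref": schema_url("metadata.schema.json")},
            "modules": {
                "type": "array",
                "items": {"$ref": schema_url("module.schema.json")},
            },
        },
    }
```

Fully qualified refs are more verbose than relative ones, but they make the intended target explicit and keep `$id` and `$ref` derived from a single setting.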

ethanolivertroy added 2 commits May 14, 2026 04:37

add api validation provenance and schemas

bbcdc23

add pr validation workflow

6ee83a1

ethanolivertroy marked this pull request as ready for review May 14, 2026 04:39

ethanolivertroy merged commit 0e73814 into main May 14, 2026
2 checks passed

ethanolivertroy deleted the codex/add-api-validation-provenance branch May 14, 2026 04:40
