Release v1.4 — MCP server, Define-JSON, CRF specializations, and full study entity coverage#254

Open
pendingintent wants to merge 67 commits into master from release-v-1.4

@pendingintent

Owner

What

This release expands the SoA Workbench from a visit/activity scheduling tool into a full USDM study definition platform. v1.4 ships an MCP server for agent-driven workflows, a complete Define-JSON generator, CRF specialization assignment, and twelve new study entity domains (organizations, roles, persons, estimands, indications, study interventions, titles, identifiers, amendment extensions, geographic scopes, BC categories, and SoA bundle import/export).

Why

The workbench previously covered the Schedule of Activities core (visits, activities, arms, epochs) but lacked the surrounding USDM study metadata needed to produce a submission-ready USDM package or a Define-XML/JSON artifact. v1.4 closes those gaps and adds an MCP server so AI agents can drive the workbench programmatically.

How

  • MCP server (src/soa_builder/mcp/server.py): 11 tools covering SoA CRUD, visit/activity management, matrix retrieval, and USDM/Define-JSON export — enables Claude and other agents to interact with the workbench via the Model Context Protocol.
  • Define-JSON generator (src/usdm/create_define_json.py, generate_define_json.py): Produces CDISC Define-JSON v2.1 from the workbench database, including concepts and conceptProperties; documented in help/DEFINE_JSON_GENERATOR_INTEGRATION.md.
  • CRF specializations (routers/activities.py + templates/crf_*.html): Assigns CRF specializations to activity instances with extensionAttributes in USDM export.
  • New entity domains — each with router, migration, USDM generator, templates, and full test coverage: Organizations, Persons, Roles, Estimands, Indications, Study Interventions, Study Titles, Study Identifiers, BC Categories.
  • Amendments extended: Enrollments, geographic scopes (regions + country codes), and governance dates added as sub-entities.
  • SoA bundle (routers/soa_bundle.py): Export/import of a complete SoA as a portable JSON bundle.
  • HTML matrix export (templates/soa_matrix_export.html): Standalone HTML page of the SoA matrix for easy web deployment.
  • Audit trail (audit.py): Centralised before/after audit logging extracted from inline router code.
  • USDM systemVersion: Stamped from the current git branch name; release-v-* prefix normalized automatically.
  • BCP response code cleanup (scripts/cleanup_bcp_response_codes.py): Fixes stale and mismatched BiomedicalConceptProperty response codes across all SoAs.
  • config.env: Centralised environment variable management for deployment.

Changes

  • MCP: New src/soa_builder/mcp/ package — server.py with 11 tools; pyproject.toml entry point added
  • Define-JSON: src/usdm/create_define_json.py (4 501 lines), generate_define_json.py, UI at templates/define_json.html
  • New routers: bc_categories, estimands, indications, organizations, persons, roles, soa_bundle, study_identifiers, study_interventions, study_titles
  • CRF specializations: templates/crf_cell.html, crf_specialization_detail.html, crf_specializations.html
  • Amendments: Extended templates for enrollments, geographic scopes, governance dates
  • USDM generators: New generate_estimands, generate_indications, generate_organizations, generate_roles, generate_study_identifiers, generate_study_interventions, generate_study_titles, generate_bc_categories
  • Audit: web/audit.py (319 lines) extracted from inline router code
  • Migrations: migrate_database.py extended with all new entity tables
  • Docs/help: Moved docs → help; added ALIGNMENT_EAGER_BCP_VS_DEFINE_JSON, BIOMEDICAL_CONCEPT_PROPERTY_EAGER_POPULATION, DEFINE_JSON_GENERATOR_INTEGRATION, DIFF_REPORT_ALL_USDM_ENTITIES, ORGANIZATIONS guides
  • Output JSON: Updated USDM and Define-JSON snapshots for H2Q-MC-LZZT and NCT01797120
  • CI: Removed release-* branch match from azure-deploy.yml (deploy only from master)

Testing

  • Full test suite passes — 18 new test files added covering all new routers and generators
  • MCP server tested end-to-end (tests/test_mcp_server.py, 295 lines)
  • Define-JSON generation tested with concept/conceptProperty population (tests/test_define_json_concepts.py, test_define_json_generator.py)
  • BCP response code cleanup script tested (tests/test_bcp_response_code_cleanup.py)
  • CRF specialization CRUD and USDM export tested (tests/test_routers_activities_crf.py)
  • SoA bundle export/import round-trip tested (tests/test_routers_soa_bundle.py)
  • All new entity routers have dedicated test files

Notes for reviewers

  • The src/usdm/create_define_json.py file is large (4 500+ lines) — it is a self-contained generator and can be skimmed structurally rather than line-by-line.
  • migrate_database.py is the source of truth for all schema changes; review the new ALTER TABLE / CREATE TABLE blocks there.
  • Subject data CSV files (files/subject_data/NCT01797120/) were removed — they are no longer needed in the repo.
  • The large files/D1_Master Protocol…pdf (38 MB) was added to files/ as reference material; confirm this is intentional before merge if storage is a concern.

Extended the diff report to cover all entity classes for USDM created in
the SOA Workbench

Added an HTML page extract of the SOA Matrix to allow users to easily
review and deploy to web servers.

Added ability to delete an SOA with double confirmation

Adds 0..N organizations per SOA with name, label, identifier,
identifierScheme, type (DDF CT C215480), and optional legalAddress
(text, lines, city, district, state, postalCode, country via ISO 3166).

- DB: organization + organization_audit tables with migrations
- Router: JSON CRUD + HTMX add/delete endpoints
- USDM: generate_organizations.py → Organization-Output + Address-Output
- UI: Organizations study-meta-card below Study Metadata on edit page
- Tests: 9 new tests (434 total, 0 failures)
- Resolved merge conflicts with pendingintent-add-titles changes

Replaces the hardcoded USDM systemVersion "1.0.0" with the active branch
name from `git rev-parse --abbrev-ref HEAD` so each export is traceable
to the branch that generated it. Falls back to "unknown" if git
is unavailable.

Strips the "release-v-" prefix and ensures three dot-separated
components (e.g. release-v-1.4 -> 1.4.0, release-v-1.4.1 -> 1.4.1).
Non-release branches remain unchanged.
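
The branch lookup and prefix normalization described above can be sketched roughly as follows. The function names `current_branch` and `normalize_system_version` are hypothetical and are not taken from the repository; only the git command, the "unknown" fallback, and the normalization rule come from the commit messages.

```python
import subprocess


def current_branch() -> str:
    """Return the active git branch, or "unknown" if git cannot be queried."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True, timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        # git missing, not a repository, or the call timed out
        return "unknown"
    return result.stdout.strip() or "unknown"


def normalize_system_version(branch: str) -> str:
    """Turn release-v-X.Y into X.Y.0; leave any other branch name untouched."""
    prefix = "release-v-"
    rest = branch[len(prefix):]
    if not branch.startswith(prefix) or not rest:
        return branch
    parts = rest.split(".")
    # Pad with zeros until there are three dot-separated components.
    while len(parts) < 3:
        parts.append("0")
    return ".".join(parts)
```

With this sketch, `normalize_system_version(current_branch())` would yield the value stamped into the export.
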
_populate_bcp_locked always deleted existing BCP/alias_code/code rows
before re-inserting, but never guarded against the case where the
replacement data was empty (API failure or timeout). If
_get_biomedical_concept_data returned {} the delete committed with zero
inserts, permanently destroying BCP rows until the next successful
backfill.

Added _has_insertable_data() which validates that at least one BCP
would actually be written before the delete proceeds. If no data is
available the function logs a warning and returns early, preserving
the existing rows unchanged.
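
A minimal sketch of the guard-before-delete pattern, assuming a single hypothetical `bcp` table and a payload shaped like `{"properties": [{"name": ...}]}`. The real `_has_insertable_data()` and `_populate_bcp_locked` operate on the workbench schema, which is not reproduced here.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def has_insertable_data(bc_data) -> bool:
    """True only if at least one property row would be written.

    The "properties" key is an assumed payload shape for illustration.
    """
    if not isinstance(bc_data, dict):
        return False
    return bool(bc_data.get("properties"))


def repopulate_bcp(conn: sqlite3.Connection, bc_id: str, bc_data) -> bool:
    """Replace the rows for bc_id, but never delete when nothing can be inserted."""
    if not has_insertable_data(bc_data):
        logger.warning("No BCP data for %s; keeping existing rows", bc_id)
        return False
    with conn:  # delete and inserts commit (or roll back) together
        conn.execute("DELETE FROM bcp WHERE bc_id = ?", (bc_id,))
        conn.executemany(
            "INSERT INTO bcp (bc_id, name) VALUES (?, ?)",
            [(bc_id, prop["name"]) for prop in bc_data["properties"]],
        )
    return True
```

The point of the ordering is that an empty API response becomes a no-op instead of a committed delete.
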
Root cause: 189 biomedical_concept rows had alias_code entries whose
standard_code (code row) had been deleted by a prior migration. The
INNER JOIN on alias_code/code in build_usdm_biomedical_concepts silently
excluded every BC with a broken chain — 128 were referenced in activity
biomedicalConceptIds but absent from the biomedicalConcepts array.

Two fixes:
1. _migrate_repair_broken_bc_code_chains: at startup, finds BCs whose
   alias_code exists but code row is missing, re-creates the code and
   alias_code rows from activity_concept.concept_code, and repoints
   biomedical_concept.code to the new valid alias. All 168 affected BCs
   are repaired on next server start.

2. build_usdm_biomedical_concepts: change INNER JOIN to LEFT JOIN on
   alias_code/code so any future broken chains do not silently drop BCs
   from output. Missing code info emits empty strings rather than
   excluding the BC entirely.
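
The effect of the second fix can be illustrated with an in-memory SQLite database. The table layout and code values below are simplified placeholders, not the workbench schema; only the join behaviour is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE code (id INTEGER PRIMARY KEY, code TEXT);
    CREATE TABLE alias_code (id INTEGER PRIMARY KEY, standard_code INTEGER);
    CREATE TABLE biomedical_concept (id INTEGER PRIMARY KEY, name TEXT, code INTEGER);
    INSERT INTO code VALUES (1, 'C0001');
    -- alias 11 points at code row 2, which does not exist (broken chain)
    INSERT INTO alias_code VALUES (10, 1), (11, 2);
    INSERT INTO biomedical_concept VALUES (100, 'BC_A', 10), (101, 'BC_B', 11);
""")

INNER_SQL = """
    SELECT bc.name, c.code
    FROM biomedical_concept bc
    JOIN alias_code a ON a.id = bc.code
    JOIN code c ON c.id = a.standard_code
    ORDER BY bc.id
"""

LEFT_SQL = """
    SELECT bc.name, COALESCE(c.code, '')
    FROM biomedical_concept bc
    LEFT JOIN alias_code a ON a.id = bc.code
    LEFT JOIN code c ON c.id = a.standard_code
    ORDER BY bc.id
"""

inner_rows = conn.execute(INNER_SQL).fetchall()  # BC_B is silently dropped
left_rows = conn.execute(LEFT_SQL).fetchall()    # BC_B kept with an empty code
```

Here `inner_rows` contains only `BC_A`, while `left_rows` contains both concepts, with an empty string standing in for the missing code.
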
…ric values for type if missing for roles and orgs
@pendingintent pendingintent added the documentation and enhancement labels Jun 22, 2026
@pendingintent pendingintent added this to the version 1.4 milestone Jun 22, 2026

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copilot AI review requested due to automatic review settings June 22, 2026 18:54

- freezes.py: replace str(exc) in 3 error responses with generic
  messages; log details server-side only
- footnotes.py: cast soa_id to int in all redirect URLs (open redirect)
- bc_surrogates.py: cast soa_id to int in all redirect URLs (open redirect)
- app.py: remove resp.text snippets from logs and cache; suppress
  exception details from status API endpoints and UI responses
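
A small sketch of the integer cast used for redirect targets. The helper name and the `/ui/soa/...` path are hypothetical; the idea is only that a value coerced with `int()` cannot carry a host or scheme into the URL.

```python
def soa_redirect_url(soa_id, suffix: str = "edit") -> str:
    """Build a same-site redirect path from a numeric SoA id.

    int() raises ValueError for anything that is not a plain integer,
    so strings such as "//evil.example" never reach the URL.
    """
    safe_id = int(soa_id)
    return f"/ui/soa/{safe_id}/{suffix}"
```

A caller would typically turn the `ValueError` into a 400 response with a generic message and log the details server-side.
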
Copilot AI review requested due to automatic review settings June 22, 2026 19:26

…posure

- app.py: remove user-controlled 'category' parameter from all log
  calls in fetch_biomedical_concepts_by_category to satisfy
  py/clear-text-logging-sensitive-data (5 alerts)
- activities.py: replace href.startswith() SSRF guard with explicit
  urlparse netloc comparison so CodeQL recognises the host allowlist
  (py/full-ssrf, 2 alerts)
- amendments.py: replace html.escape(str(exc)) in governance date
  handler with logger.exception + generic message
  (py/stack-trace-exposure, 1 alert)
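
To show why a parsed-host comparison is stricter than a prefix check, here is a sketch under the assumption of a single trusted base URL. `ALLOWED_BASE` and `is_allowed_href` are illustrative names, and the host is a placeholder.

```python
from urllib.parse import urlparse

ALLOWED_BASE = "https://api.example.org"  # placeholder for the trusted API host


def is_allowed_href(href: str, base: str = ALLOWED_BASE) -> bool:
    """Accept href only if its scheme and host exactly match the trusted base."""
    target = urlparse(href)
    trusted = urlparse(base)
    return target.scheme == trusted.scheme and target.netloc == trusted.netloc


# A prefix check accepts a look-alike host; the netloc comparison rejects it.
lookalike = "https://api.example.org.evil.test/bc"
prefix_says_ok = lookalike.startswith(ALLOWED_BASE)
netloc_says_ok = is_allowed_href(lookalike)
```
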
…xposure

- activities.py: reconstruct request URL from trusted _p.scheme/_p.netloc
  + user-provided path/query via urlunparse, so CodeQL can verify the
  host is never user-controlled (py/full-ssrf, 2 locations)
- amendments.py: fix missed html.escape(str(exc)) in geographic scope
  handler — replace with logger.exception + generic message
  (py/stack-trace-exposure, line 2222)
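
The URL reconstruction can be sketched as below. `TRUSTED` plays the role of the `_p` object mentioned in the commit message; the host is a placeholder and `rebuild_url` is a hypothetical name.

```python
from urllib.parse import urlparse, urlunparse

TRUSTED = urlparse("https://api.example.org")  # placeholder trusted base


def rebuild_url(href: str) -> str:
    """Keep only the path and query from href; scheme and host are always trusted."""
    user = urlparse(href)
    return urlunparse(
        (TRUSTED.scheme, TRUSTED.netloc, user.path, "", user.query, "")
    )
```

Because the host is never copied from the input, a hostile `href` can at most change the path and query sent to the trusted server.
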
Copilot AI review requested due to automatic review settings June 22, 2026 19:56

- SDTM path: prefer DSS variable's dataType over DEC's generic dataType
  (e.g. a float result in a specialization is more specific than the
  DEC's generic 'string'); matches Dave Iberson-Hurst's cdisc_bc_library
  approach
- Add 7 validation tests:
  - exclusion list exactly matches Dave's _process_property reference
  - all excluded SDTM suffixes rejected by _include_property
  - all data-carrying variables pass the filter
  - every required USDM BCP attribute is present and correctly typed
  - isEnabled is always True
  - mandatoryValue: false maps to isRequired: False
  - DSS dataType takes priority over DEC dataType

Closes #218
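
The dataType priority rule and the mandatoryValue mapping might look like the following. The dictionary shapes and function names are assumptions for illustration, and the empty-string fallback when neither source has a dataType is a choice made for this sketch, not something stated in the commit.

```python
from typing import Optional


def resolve_data_type(dec: dict, dss_variable: Optional[dict]) -> str:
    """Prefer the DSS variable's dataType; fall back to the DEC's generic one."""
    if dss_variable and dss_variable.get("dataType"):
        return dss_variable["dataType"]
    # Assumed fallback: empty string when the DEC has no dataType either.
    return dec.get("dataType", "")


def is_required(dss_variable: dict) -> bool:
    """Map mandatoryValue to isRequired (a missing value is treated as False here)."""
    return bool(dss_variable.get("mandatoryValue", False))
```
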
MD5 is flagged by CodeQL (py/weak-sensitive-data-hashing) because the
content passed to this function may include clinical trial data. The
function only needs deterministic, stable OIDs for Define-XML and has no
cryptographic security requirement, but switching to SHA-256 satisfies
both the security scanner and the use case with an identical interface.
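
A minimal sketch of deriving a deterministic OID from content with SHA-256. The `prefix.digest` format, the truncation length, and the name `stable_oid` are hypothetical and do not reflect the generator's actual OID layout.

```python
import hashlib


def stable_oid(prefix: str, content: str, length: int = 12) -> str:
    """Return the same OID for the same content on every run."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # Truncating the hex digest keeps OIDs short; longer values lower collision risk.
    return f"{prefix}.{digest[:length]}"
```

Since `hashlib.md5` and `hashlib.sha256` expose the same `hexdigest()` interface, swapping one for the other leaves callers unchanged, although the generated OID values themselves differ.
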
Copilot AI review requested due to automatic review settings June 22, 2026 20:20

@pendingintent pendingintent moved this from Todo to In Progress in SOA Workbench version 1.4 Jun 23, 2026
