AD-324: Switch RO-Crate provenance export to a PROV-shaped model#262
AD-324: Switch RO-Crate provenance export to a PROV-shaped model#262
Conversation
Adopt the data team's PROV-shaped RO-Crate metadata as Torc's canonical generation and export format. Update file, job, software, workflow, and run provenance entities to use the new relationships and type arrays. Adjust export/import remapping, refresh the RO-Crate docs, and add a rationale document covering the design choices and assumptions behind the change.
…ithub.com/NatLabRockies/torc into AD-324-ro-create-mods-for-naerm-data-team
…ithub.com/NatLabRockies/torc into AD-324-ro-create-mods-for-naerm-data-team
Remove the accidentally committed tmp workspace files from the index while keeping them on disk locally. Keep /tmp in .gitignore so future scratch notes and examples stay untracked by default.
There was a problem hiding this comment.
Pull request overview
This PR switches Torc’s RO-Crate provenance generation/export to a PROV-shaped model so stored metadata and exported ro-crate-metadata.json align with the requested PROV interpretation.
Changes:
- Update generated File/CreateAction/Software entities to use PROV properties and
@typearrays (e.g.,prov:wasGeneratedBy,prov:Activity,prov:SoftwareAgent). - Add workflow-level provenance entities (
#torc-workflow,#torc-run-{run_id}) and ensure export can synthesize them when missing. - Update tests and documentation to reflect the PROV-shaped RO-Crate output and access-group naming changes.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_workflow_export.rs | Updates job-id remapping test to use prov:wasGeneratedBy and PROV @type arrays. |
| tests/test_auto_ro_crate.rs | Adjusts auto-generation assertions for PROV-shaped metadata and synthetic workflow/run entities. |
| tests/test_access_groups.rs | Renames “data-team” to “analytics-team” in tests and helper setup. |
| tests/common.rs | Updates access-control fixture docs to reflect “Analytics team” naming. |
| src/server/api/ro_crate.rs | Updates server-side input-file entity generation to PROV @type arrays and adds hashing/size fields; adds workflow provenance entity upsert logic. |
| src/client/workflow_manager.rs | Creates workflow provenance entities during input-file initialization path. |
| src/client/ro_crate_utils.rs | Shifts client-side entity builders to PROV shape; adds workflow plan/run entity builders and provenance links (prov:used, prov:wasDerivedFrom, etc.). |
| src/client/commands/workflow_export.rs | Updates ID remapping tests/docs to remap prov:wasGeneratedBy. |
| src/client/commands/ro_crate.rs | Changes export assembly to preserve stored @id/@type, synthesize workflow/run entities, add localEvidenceGraph, and emit PROV context. |
| docs/src/specialized/admin/access-groups-tutorial.md | Renames “Data Team” to “Analytics Team” in the tutorial examples. |
| docs/src/core/how-to/ro-crate-metadata.md | Updates how-to to describe PROV-shaped export and context array. |
| docs/src/core/concepts/ro-crate.md | Updates conceptual docs for PROV-shaped entities and new provenance relationships. |
| .gitignore | Ignores /tmp. |
| .github/workflows/lint.yml | Ensures DATABASE_URL is set for the OpenAPI codegen parity test step. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if let Ok(meta) = std::fs::metadata(&file.path) { | ||
| metadata["contentSize"] = serde_json::json!(meta.len()); | ||
| } | ||
|
|
||
| if let Some(hash) = compute_file_sha256(&file.path) { | ||
| metadata["sha256"] = serde_json::json!(hash); | ||
| } | ||
|
|
There was a problem hiding this comment.
compute_file_sha256(&file.path) performs synchronous file I/O (open/read loop) inside an async server API method. For large inputs or slow storage this can block the async executor thread and impact request latency for other endpoints. Consider moving hashing (and possibly std::fs::metadata) into tokio::task::spawn_blocking, or making hashing optional / deferred so the API remains non-blocking.
| if let Ok(meta) = std::fs::metadata(&file.path) { | |
| metadata["contentSize"] = serde_json::json!(meta.len()); | |
| } | |
| if let Some(hash) = compute_file_sha256(&file.path) { | |
| metadata["sha256"] = serde_json::json!(hash); | |
| } | |
| let file_path = file.path.clone(); | |
| match tokio::task::spawn_blocking(move || { | |
| let content_size = std::fs::metadata(&file_path).ok().map(|meta| meta.len()); | |
| let sha256 = compute_file_sha256(&file_path); | |
| (content_size, sha256) | |
| }) | |
| .await | |
| { | |
| Ok((Some(content_size), sha256)) => { | |
| metadata["contentSize"] = serde_json::json!(content_size); | |
| if let Some(hash) = sha256 { | |
| metadata["sha256"] = serde_json::json!(hash); | |
| } | |
| } | |
| Ok((None, sha256)) => { | |
| if let Some(hash) = sha256 { | |
| metadata["sha256"] = serde_json::json!(hash); | |
| } | |
| } | |
| Err(e) => { | |
| log::warn!( | |
| "Failed to collect file metadata for '{}': {}", | |
| file.path, | |
| e | |
| ); | |
| } | |
| } |
There was a problem hiding this comment.
This should be fixed.
There was a problem hiding this comment.
Actually, this is entirely invalid. The server does not have access to files. The client provides this information.
| pub fn find_entity_by_entity_id( | ||
| config: &Configuration, | ||
| workflow_id: i64, | ||
| entity_id: &str, | ||
| ) -> Option<RoCrateEntityModel> { | ||
| match apis::ro_crate_entities_api::list_ro_crate_entities( | ||
| config, | ||
| workflow_id, | ||
| None, | ||
| None, | ||
| None, | ||
| None, | ||
| ) { |
There was a problem hiding this comment.
find_entity_by_entity_id calls list_ro_crate_entities and then scans the returned page for a match. Because this is used by create_or_update_entity_by_entity_id (and is called for every provenance entity upsert), it can fetch and deserialize up to MAX_RECORD_TRANSFER_COUNT entities repeatedly, which is expensive for workflows with many entities. Consider adding a server-side endpoint/query parameter to fetch by entity_id, or at least request a stable sort and paginate until the target is found so you can stop early.
| let run_entity = | ||
| build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None); | ||
| create_or_update_entity_by_entity_id(config, workflow_id, run_entity); |
There was a problem hiding this comment.
create_workflow_provenance_entities always builds the run entity with Utc::now() as startTime and then updates the existing #torc-run-{run_id} entity if present. Since this function is called more than once (e.g., from job completion), the run's startTime will drift forward over time and no longer represent when the run actually started. Preserve an existing startTime on update (only set it when inserting), or pass an explicit run start timestamp captured once at run start.
| let run_entity = | |
| build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None); | |
| create_or_update_entity_by_entity_id(config, workflow_id, run_entity); | |
| let run_entity_id = format!("#torc-run-{}", run_id); | |
| if find_entity_by_entity_id(config, workflow_id, &run_entity_id).is_none() { | |
| let run_entity = | |
| build_workflow_run_entity(workflow_id, run_id, workflow_name, Utc::now(), None); | |
| create_or_update_entity_by_entity_id(config, workflow_id, run_entity); | |
| } |
| "@context": [ | ||
| "https://w3id.org/ro/crate/1.1/context", | ||
| {"torc": "https://github.com/NatLabRockies/torc/terms/"} | ||
| {"prov": "http://www.w3.org/ns/prov#"} |
There was a problem hiding this comment.
The exported RO-Crate still includes Torc-specific terms in stored metadata (e.g., torc:git_hash from software entities), but @context no longer declares the torc prefix. This makes the JSON-LD context incomplete and can break consumers that expand/validate the graph. Include both prov and torc namespace mappings in @context (or remove/fully-qualify all torc: properties in emitted metadata).
| {"prov": "http://www.w3.org/ns/prov#"} | |
| { | |
| "prov": "http://www.w3.org/ns/prov#", | |
| "torc": "https://torc.dev/ns#" | |
| } |
| crate::client::ro_crate_utils::create_workflow_provenance_entities( | ||
| &self.config, | ||
| self.workflow_id, | ||
| self.run_id, | ||
| &self.workflow.name, | ||
| ); | ||
| crate::client::ro_crate_utils::create_software_entities( | ||
| &self.config, | ||
| self.workflow_id, | ||
| self.run_id, | ||
| ); |
There was a problem hiding this comment.
Can you explain what's happening here? You appear to be calling workflow-specific methods whenever individual jobs complete.
| /// Creates or updates: | ||
| /// - `#torc-workflow` as the workflow plan entity | ||
| /// - `#torc-run-{run_id}` as the run activity entity | ||
| pub async fn create_workflow_provenance_entities( |
There was a problem hiding this comment.
This function is never called.
| if let Ok(meta) = std::fs::metadata(&file.path) { | ||
| metadata["contentSize"] = serde_json::json!(meta.len()); | ||
| } | ||
|
|
||
| if let Some(hash) = compute_file_sha256(&file.path) { | ||
| metadata["sha256"] = serde_json::json!(hash); | ||
| } | ||
|
|
There was a problem hiding this comment.
Actually, this is entirely invalid. The server does not have access to files. The client provides this information.
| @@ -408,53 +419,130 @@ fn handle_export( | |||
| return; | |||
There was a problem hiding this comment.
Do we need to reconsider returning here? We won't include all the info you're adding below.
| "@type": "Dataset", | ||
| "name": workflow_name, | ||
| "hasPart": has_part, | ||
| "localEvidenceGraph": { "@id": "provenance-graph.html" } |
| Some(current_min) if current_min <= start_time => Some(current_min), | ||
| _ => Some(start_time), | ||
| }; | ||
| run_end_time = match run_end_time { |
There was a problem hiding this comment.
The workflow run time will include all runs, including dead time. This could be confusing. I'm not sure that it provides value.
| .with_run_id(run_id) | ||
| .with_all_runs(true), | ||
| ) | ||
| .unwrap_or_default(); |
There was a problem hiding this comment.
What happens on failure?
|
|
||
| - **ML Team**: Alice and Bob work on machine learning workflows | ||
| - **Data Team**: Carol and Dave work on data processing workflows | ||
| - **Analytics Team**: Carol and Dave work on data processing workflows |
There was a problem hiding this comment.
Why was this file changed?
| pub fn build_file_entity( | ||
| workflow_id: i64, | ||
| run_id: i64, | ||
| _run_id: i64, |
There was a problem hiding this comment.
Should this parameter be removed?
| @@ -0,0 +1,171 @@ | |||
| # PR 262 Comment Response Report | |||
There was a problem hiding this comment.
Doesn't belong in the repo.
19fb680 to
0581f58
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 17 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let status = Command::new("cargo") | ||
| .arg("build") | ||
| .arg("--workspace") | ||
| .status() | ||
| .expect("Failed to execute cargo build"); |
There was a problem hiding this comment.
cargo build --workspace here does not enable the server-bin and slurm-runner features, so the feature-gated binaries this test harness later spawns/validates (torc-server, torc-htpasswd, torc-slurm-job-runner) will not be built (see Cargo.toml [[bin]] required-features). Build with --features server-bin,slurm-runner (and ideally --bins) or reuse build_test_binaries()/ensure_test_binaries_built() to keep the binary build logic consistent.
There was a problem hiding this comment.
Why did you make this change? I agree with Copilot...reverting is better.
| build_test_binaries(); | ||
| let status = Command::new("cargo") | ||
| .arg("build") | ||
| .arg("--workspace") |
There was a problem hiding this comment.
Same issue as start_process: this cargo build --workspace invocation does not enable server-bin/slurm-runner, so ./target/debug/torc-server and ./target/debug/torc-htpasswd may not exist when the fixture tries to spawn them. Use cargo build --workspace --features server-bin,slurm-runner (optionally --bins) or call the shared build_test_binaries() helper.
| .arg("--workspace") | |
| .arg("--workspace") | |
| .arg("--features") | |
| .arg("server-bin,slurm-runner") |
| // Create each user with a strong password using torc-htpasswd add | ||
| // Use cost 4 (minimum) for fast test execution - cost 12 default takes ~250ms per hash | ||
| for user in users.iter() { | ||
| let status = Command::new(get_exe_path("./target/debug/torc-htpasswd")) | ||
| .arg("add") | ||
| .arg("--file") | ||
| .arg(&htpasswd_path) | ||
| .arg("--password") | ||
| .arg("correct horse battery staple") | ||
| .arg("--cost") | ||
| .arg("4") | ||
| .arg(user) |
There was a problem hiding this comment.
torc-htpasswd add defaults to bcrypt cost 12 (per src/bin/torc-htpasswd.rs), which is intentionally slow and can add noticeable latency to access-control tests that create many users. Consider passing --cost 4 here (or another low test-only cost) to keep the integration suite fast.
There was a problem hiding this comment.
Why did you change this? I agree with Copilot. This needs to be reverted.
| //! resource-specific modules, so this module preserves the old surface by | ||
| //! re-exporting the commonly used functions. | ||
|
|
||
| pub use crate::client::apis::access_control_api::{ |
There was a problem hiding this comment.
What are these changes for?
| let status = Command::new("cargo") | ||
| .arg("build") | ||
| .arg("--workspace") | ||
| .status() | ||
| .expect("Failed to execute cargo build"); |
There was a problem hiding this comment.
Why did you make this change? I agree with Copilot...reverting is better.
| let created_workflow = apis::workflows_api::create_workflow(config, workflow) | ||
| .expect("Failed to create test workflow"); | ||
| let created_workflow = | ||
| default_api::create_workflow(config, workflow).expect("Failed to create test workflow"); |
There was a problem hiding this comment.
I think the original is more clear. I don't want to have to import all new methods into default_api going forward.
| // Create each user with a strong password using torc-htpasswd add | ||
| // Use cost 4 (minimum) for fast test execution - cost 12 default takes ~250ms per hash | ||
| for user in users.iter() { | ||
| let status = Command::new(get_exe_path("./target/debug/torc-htpasswd")) | ||
| .arg("add") | ||
| .arg("--file") | ||
| .arg(&htpasswd_path) | ||
| .arg("--password") | ||
| .arg("correct horse battery staple") | ||
| .arg("--cost") | ||
| .arg("4") | ||
| .arg(user) |
There was a problem hiding this comment.
Why did you change this? I agree with Copilot. This needs to be reverted.
| build_test_binaries(); | ||
| let status = Command::new("cargo") | ||
| .arg("build") | ||
| .arg("--workspace") |
| // ========================================================================= | ||
|
|
||
| // Test status command - this should work for authorized users | ||
| let report_output = run_cli_command_with_auth( |
There was a problem hiding this comment.
Why did you delete this?
|
|
||
| use rstest::rstest; | ||
| use torc::client::{Configuration, apis}; | ||
| use torc::client::{Configuration, default_api}; |
There was a problem hiding this comment.
Did you need to change this file for the RO-crate spec change?
| let owner_config = config_with_auth(config, "some_owner"); | ||
| let workflow_name = format!("revoke-test-workflow-{}", unique_suffix()); | ||
| let workflow = create_workflow_with_user(&owner_config, &workflow_name, "some_owner"); | ||
| let workflow = create_workflow_with_user(&owner_config, "revoke-test-workflow", "some_owner"); |
There was a problem hiding this comment.
The original code used unique_suffix to ensure uniqueness. Why did you change it?
| .mcp.json | ||
| .dprint-cache/ | ||
| /tmp | ||
| /reviews No newline at end of file |
There was a problem hiding this comment.
All files must end with a newline character. Not sure why there isn't linting guard...please fix and ensure your editor is configured to do that.
| serde_json::json!({ "@id": id.as_ref() }) | ||
| } | ||
|
|
||
| fn typed_entity(primary_type: &str, prov_type: &str) -> serde_json::Value { |
There was a problem hiding this comment.
This is duplicated in src/server/api/ro_crate.rs
Torc RO-Crate Provenance Change Rationale
Decision
Torc now uses a PROV-shaped RO-Crate format as the canonical export and generation
model. I chose the breaking-change path because the assignment explicitly allowed it and because a
translation layer would have kept two provenance models alive at once.
That would have increased long-term cost in three ways:
differ
Using the target model directly keeps Torc's stored entities, auto-generated metadata, and exported
ro-crate-metadata.jsonaligned.Core Modifications
1. File provenance now uses the PROV-facing shape
Generated file entities now use:
@type: ["File", "prov:Entity"]prov:wasGeneratedByprov:wasAttributedToprov:wasDerivedFromRemoved
torc:run_idbecause it was Torc-specific bookkeeping, not a provenance relationship inthe requested model.
2. Job provenance is modeled as PROV activities
Generated job entities now use:
@type: ["CreateAction", "prov:Activity"]prov:hadPlanisPartOfprov:usedprov:wasAssociatedWithThis makes job execution records describe both the workflow plan they follow and the inputs they
consume, instead of only pointing at outputs.
3. Workflow-level provenance entities were added
Torc now creates:
#torc-workflow#torc-run-{run_id}These entities are necessary because the requested model refers to a workflow plan and a workflow
run explicitly. Without them,
prov:hadPlanand run attribution would point to synthetic IDs thatdid not exist as entities.
4. Software entities were aligned with the target model
Torc software records now use:
@type: ["SoftwareApplication", "prov:SoftwareAgent"]That keeps Torc's own binaries compatible with both RO-Crate consumers and the data team's PROV
interpretation.
5. Export now preserves the richer stored metadata
The exporter no longer flattens stored metadata back to Torc's older shape. It now:
@typearrays@idvalues when present#torc-workflowand#torc-run-{run_id}if older records do not already have themlocalEvidenceGraphprovnamespace in@contextThis was important because switching the generators alone would not have been enough. The exported
crate had to look like the data team's example even when some metadata was entered manually or came
from older workflows.
6. Workflow export/import remapping still works
The import/export ID remapping logic was updated so job provenance references continue to remap when
entity IDs change. The key case here was switching from
wasGeneratedBytoprov:wasGeneratedBy.Assumptions
These choices were made explicitly:
input_file_ids#torc-run-{run_id}run_idis the right identifier to use for workflow-run provenanceagain during output generation so they stay present and current
larger agent-model redesign
Why I Did Not Add a Mapping Layer
I did not keep the old storage model and export through a conversion layer because that would have
preserved internal semantics the data team explicitly does not want. A mapper would be useful only
if Torc still needed to support both formats as first-class outputs. That was not the assignment's
bias.
Why I Did Not Change the Database Schema
The database already stores RO-Crate metadata as JSON strings plus a few indexing fields
(
workflow_id,file_id,entity_id,entity_type). That was already flexible enough for thenew model.
Changing the schema would not have improved provenance quality. It would only have increased risk
and migration cost for no practical gain.
Validation Status
Validated directly:
Partially blocked in this worktree:
Those failures were not caused by the RO-Crate logic itself. This workspace already has unrelated
server-feature build issues and test-harness assumptions about feature-gated binaries.
Known Follow-Ups
If this needs to be production-hardened further, the next useful follow-ups are:
SoftwareApplication + prov:Planor move to amore domain-specific plan entity later