feat: SHACL validation reports — fluree validate CLI, Rust API, and HTTP endpoint#1424
Conversation
efca0e8 to
2aa1755
Compare
…ion result Adds the W3C constraint-component IRI constants to fluree-vocab, a Constraint::component() mapping, and a constraint_component field on ValidationResult threaded through all emit sites. Qualified-count results distinguish the min/max bound that actually failed; paths that fail to compile report a Fluree-specific component IRI since W3C SHACL has no component for that condition. Groundwork for the W3C-shaped validation report (fluree validate).
The graph-events adapter treated any nested object with @id as a pure reference, silently dropping its other properties. An embedded node object carrying more than a bare @id now recurses so its own triples are asserted (JSON-LD embedded-node semantics). Affected JSON-LD bulk import and inline shapes/ontology documents.
Adds fluree_db_api::validate — the shared core behind the upcoming
'fluree validate' CLI command and HTTP endpoint:
- ValidateOptions { graph, shapes, include_attached } with ShapesSource
= Attached | Graph(iri) | InlineJsonLd | InlineTurtle. Ad-hoc shapes
REPLACE attached shapes by default (include_attached unions them) and
ride the non-persisting inline-shapes bundle; replace-mode compiles
the bundle against an empty genesis snapshot so the ledger's own
graph-0 shapes never leak into the scan.
- Validation runs over the query-visible view (snapshot + novelty);
graph IRIs resolve through snapshot.graph_registry.
- ValidateReport with IRI-resolved results (component IRIs, severity,
path when a single predicate) and W3C sh:ValidationReport JSON-LD
serialization; deterministic result ordering.
- Turtle counterpart for the inline-shapes bundle parser.
- Five integration tests: attached-shapes report over a non-conforming
ledger, conforming state, inline-Turtle replace vs include_attached,
inline JSON-LD, unknown-graph NotFound.
… files Ledger mode validates a local ledger's current state (attached shapes, --shacl <file> ad-hoc shapes with replace-by-default semantics and --include-attached union, or --shacl-graph <iri>); file mode loads an RDF file into an ephemeral in-memory ledger with staging-time SHACL disabled via the config graph, so files that embed their own shapes report violations instead of failing to load. Output formats: table (default), jsonld, turtle; exit codes 0 = conforms, 1 = findings at or above --fail-on (violation|warning|info), 2 = usage error. Adds ValidateReport::to_turtle() (W3C report as Turtle, with blank-node label sanitization for skolemized ids), CLI integration tests, docs/cli/validate.md, and cookbook cross-references.
…dpoint HTTP surface of fluree_db_api::validate, following the /show read- endpoint conventions (peer-mode forwarding, data-auth can_read gate, proxy-storage NotImplemented guard, request span). POST body selects the data graph and shapes source: inline shapes as a JSON-LD object or a Turtle string (replace-by-default, includeAttached unions), or a same-ledger shapesGraph IRI; GET validates attached shapes with defaults. Accept negotiation: JSON summary envelope (default), application/ld+json and text/turtle for the W3C sh:ValidationReport. Non-conformance is a 200, never an error; unknown ledger/graph map to 404 via ApiError::is_not_found. Feature-gated behind shacl (default on). Eight HTTP integration tests and endpoint documentation with CLI/cookbook cross-links.
…mbership An ad-hoc shapes doc can ship the controlled vocabulary its sh:class constraints refer to (ex:CA rdf:type ex:State), matching f:shapesSource semantics where value-sets live with the shapes — but the detached inline bundle was invisible to membership lookups, so every such value reported as not-an-instance. CrossLedgerMembership gains a same_term_space mode: the bundle is encoded against the data ledger's namespace registry, so membership probes use the data-side Sids directly instead of the decode/re-encode translation (which always misses against the bundle's empty genesis snapshot). validate_view threads the bundle db in via the new validate_all_with_membership; the existing cross-ledger path is unchanged (same_term_space: false).
…s::exit run(cli) is a library entry point; terminating the host process from inside command dispatch breaks embedding. Non-conforming validation now returns CliError::ExitCode(1), which exit_with_error maps to a silent exit in the binary — observable behavior unchanged (integration tests still assert exit codes 0/1/2).
…ports
ValidationResult carried only a FlakeValue, so reports could not
represent language tags or non-native datatypes. ConstraintViolation
gains value_index (stamped by the per-value dispatcher, unique-lang's
internal position, the pair-constraint wrapper, and the class loop —
validator signatures unchanged); result construction recovers the
value's datatype and language from the parallel arrays into new
ValidationResult::{value_datatype, value_lang} fields.
Report emission renders sh:value as a JSON-LD value object: language-
tagged literals as {"@value", "@language"}, non-native datatypes as
{"@value", "@type"} with the lexical form, stringified-IRI facet
values as {"@id"}, and self-describing temporal/numeric variants with
inferred XSD types; only JSON-native XSD types stay bare scalars. The
Turtle report renders the same terms as "lex"@lang / "lex"^^<dt>.
The Turtle grammar allows `[ ... ] .` as a whole statement — the predicate-object list after a blankNodePropertyList subject is optional. The parser required it, rejecting valid documents (found via the W3C SHACL test suite, e.g. core/node/class-002.ttl).
Mirrors testsuite-sparql: workspace-excluded crate with the W3C data-shapes repo as a git submodule. Walks the core manifest tree, runs each sht:Validate case through fluree_db_api::validate (data loaded into an ephemeral memory ledger with staging-time SHACL disabled; shapes via ShapesSource::InlineTurtle), and compares the produced report against the expected sh:ValidationReport as a result multiset on focusNode / resultPath / severity / component / value, with documented leniency for blank nodes and absent fields (sourceShape and resultMessage are not compared). Make targets: count, summary, test-cat CAT=, test-one TEST=, failures, report-json, show. Report-only by default (SHACL_STRICT=1 to fail on any miss). First run: 46/98 (46.9%) — property 31/38, targets 5/7, node 5/32. The node category is dominated by two known engine gaps the suite now surfaces precisely: sh:targetNode with literal targets, and node-shape constraint results missing sh:value = focus node. See docs/contributing/shacl-compliance.md.
…closed Per spec, a node shape's value node is the focus node itself — its constraint results now carry sh:value = focus: structural constraints (sh:node/not/or/xone; sh:and collapses to one result per focus with aggregated messages instead of propagating nested results), direct value constraints (defaulting when the validator produced no value, and reporting the focus node rather than its stringified IRI for the string facets), and sh:closed results carry the offending flake's datatype/language. sh:closed no longer implicitly ignores rdf:type — the spec never grants that (W3C core/node/closed-001 pins it); shapes targeting typed instances must declare sh:ignoredProperties (rdf:type), as the spec's own examples do. Tests and cookbook updated accordingly. W3C SHACL core: 46.9% -> 55.1% (node 5/32 -> 13/32); no category regressed. Remaining node failures are the literal sh:targetNode gap.
Literal targets compile into TargetType::LiteralNode (value + datatype
+ language from the flake) and validate through a dedicated path: value
constraints evaluate against the literal directly (the value node IS
the focus), structural constraints test the literal's conformance to
the nested shapes, and property shapes see an empty value set (only
the minimum-count constraints can fire).
ValidationResult.focus_node becomes a FocusNode enum (Node(Sid) |
Literal). The report layer emits literal focus nodes as JSON-LD value
forms — plain strings wrap as {"@value": ...} so they stay distinct
from IRI strings — and ReportResult.focus_node is now a JsonValue
(string for IRIs/bnodes; wire format unchanged for those).
Also registers subjects explicitly typed sh:NodeShape during compile:
a shape whose only markers are the type + value constraints (e.g. an
implicit-class-target shape carrying just sh:in) was previously never
created at all.
W3C SHACL core: 55.1% -> 69.4% (node 13/32 -> 25/32, property 32/38).
Remaining node failures are documented model differences (numeric
value identity, ill-formed literal ingest, list dedup) or future work
(sh:equals per-value reporting, xsd:dateTime timezone partial order) —
see docs/contributing/shacl-compliance.md.
…reads A property shape carrying its own targets (ex:S a sh:PropertyShape ; sh:path ...; sh:targetNode ...) with no wrapping node shape never compiled — finalize only merged a shape's own entry when it was path-less. The shape now attaches its own path-bearing entry and validates its focus nodes directly; this is the shape every W3C path test uses. Path resolution reads a bare rdf:first list before the operator keys, so an ill-formed node carrying both a sequence structure and sh:inversePath resolves as the sequence (path-strange-001/002). Harness: a list-valued sh:resultPath in an expected report is a complex path structure — matched leniently like blank-node paths instead of being misread as its first element. W3C SHACL core: 69.4% -> 78.6% (path 2/13 -> 12/13). uniqueLang-002 flips pass->fail honestly: it passed only because the shape never compiled; sh:uniqueLang "1"^^xsd:boolean must be ignored per spec but canonicalizes to true in Fluree's value space — documented as a model difference alongside path-complex-002's duplicate-list-entry collapse.
…argets Four spec-conformance fixes surfaced by the W3C suite: - sh:conforms is true iff the report has NO results (warnings and infos included), per spec §3.4.2 — previously only violations counted. The two transaction-enforcement gates now check violation_count() explicitly, so warn/info results still never block a commit. - Per-value sh:and on property shapes produces ONE result per value node, not one per failing conjunct (matches the node-shape collapse). - sh:targetObjectsOf includes literal objects as focus nodes, routed through the literal-focus path (deduplicated by value). - Class-target focus discovery unions a live rdfs:subClassOf descendant walk with the indexed hierarchy, so novelty-only subclass relations reach sh:targetClass / implicit class targets (mirrors the live walk sh:class membership already had). W3C SHACL core: 78.6% -> 82.7% (targets 7/7, misc 4/5, property 32/38). The 17 remaining failures are all documented: value-model differences (ill-formed literals, numeric term identity, list dedup) or deferred features (per-value pair reporting, dateTime timezone partial order, nested sh:property on property shapes, custom severity IRIs, sh:sparql) — see docs/contributing/shacl-compliance.md.
969c384 to
19c1a11
Compare
aaj3f
left a comment
There was a problem hiding this comment.
Excited for this PR's work towards a SHACL validation mechanism
| // A String carrying the `@id` datatype is a stringified IRI | ||
| // (STR() semantics in the string facets) — report the IRI node. | ||
| if dt_iri == "@id" || &*dt.name == "id" { | ||
| if let FlakeValue::String(s) = value { | ||
| return json!({"@id": s}); | ||
| } | ||
| } |
There was a problem hiding this comment.
Don't know if we'd ever see this, but the @id-datatype heuristic dt_iri == "@id" || &*dt.name == "id" will misfire for any real datatype whose local name happens to be id (e.g. a literal "foo"^^<http://example.org/id>): dt_iri won't equal "@id", but dt.name == "id" is true, so the string literal is rendered as {"@id":"foo"} — a plain literal reported as an IRI node in sh:value. Report-display only (no effect on the conformance decision), but wrong. Prefer keying off the full IRI (or the datatype's namespace code) rather than the bare local name:
// drop the `|| &*dt.name == "id"` fallback, or gate it on the id namespace:
if dt_iri == "@id" {
if let FlakeValue::String(s) = value {
return json!({"@id": s});
}
}There was a problem hiding this comment.
validate_staged_nodes still computes conforms = all_results.iter().all(|r| r.severity != Violation) (violation-based), while fluree_db_shacl::ShaclEngine::validate_staged was changed in this PR to conforms = all_results.is_empty() (spec semantics). Two functions with near-identical names now stamp ValidationReport.conforms with two different meanings. It's benign today (every enforcement gate here reads violation_count(), not .conforms), but it's a latent trap for the next caller who trusts validate_staged_nodes(...).conforms to mean what the crate-level report means. Recommend aligning to all_results.is_empty() for one consistent definition of conforms across crates, or a comment explaining the divergence.
| // Build list of classes to query: target class + all subclasses. | ||
| // The indexed hierarchy misses novelty-added subclass relations, | ||
| // so a live rdfs:subClassOf descendant walk unions them in | ||
| // (mirrors the live walk sh:class membership already does). | ||
| let mut classes_to_query = vec![class.clone()]; | ||
| if let Some(h) = hierarchy { | ||
| classes_to_query.extend(h.subclasses_of(class).iter().cloned()); | ||
| } | ||
| let sub_class_of = Sid::new(fluree_vocab::namespaces::RDFS, "subClassOf"); | ||
| let mut queue: std::collections::VecDeque<Sid> = | ||
| classes_to_query.iter().cloned().collect(); | ||
| let mut visited: HashSet<Sid> = classes_to_query.iter().cloned().collect(); | ||
| while let Some(cls) = queue.pop_front() { | ||
| let sub_flakes = db | ||
| .range( | ||
| IndexType::Opst, | ||
| RangeTest::Eq, | ||
| RangeMatch::predicate_object( | ||
| sub_class_of.clone(), | ||
| FlakeValue::Ref(cls), | ||
| ), | ||
| ) | ||
| .await?; | ||
| for flake in &sub_flakes { | ||
| if visited.insert(flake.s.clone()) { | ||
| classes_to_query.push(flake.s.clone()); | ||
| queue.push_back(flake.s.clone()); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
The live rdfs:subClassOf descendant BFS added to get_focus_nodes issues one Opst range query per discovered class and re-runs in full for every shape that targets the same class, on top of the already-consulted indexed hierarchy. Additionally, for a sh:targetObjectsOf shape the target predicate is range-scanned twice — once here (line 665, Ref objects) and once in the literal-target loop (line 388, literal objects). This path is only reached by the explicit validate endpoint/CLI (validate_all_with_membership), not the transaction-commit hot path, so it is not a hot-path regression and I'll 100% defer to you if it's even worth fixing — but for large ontologies it is avoidable repeated I/O. Consider memoizing the subclass closure once per validate call (keyed by class Sid) and folding the ObjectsOf literal/Ref discovery into a single predicate scan.
There was a problem hiding this comment.
Addressed the repeated subclass BFS + instance scans in 40e7e6f: get_focus_nodes now memoizes the per-class focus-node set for the duration of a validate call (keyed by the target class Sid — db and hierarchy are constant across the shape loop), so multiple shapes targeting the same class compute it once. Added it_validate_report.rs (first integration coverage for the validate endpoint) asserting subclass expansion + both shapes firing on the shared memoized focus set.
I deliberately left the sh:targetObjectsOf double-scan alone: the Ref-object and literal-object discovery live in separate scopes (get_focus_nodes vs the literal-target loop), so folding them into one scan means restructuring that boundary for a win that's just one duplicate scan per ObjectsOf shape — not amplified by hierarchy depth like the BFS was. Poor risk-to-reward on this cold path, so I'm consciously deferring it.
| /// Render an IRI or blank-node label as a Turtle term. | ||
| /// | ||
| /// Skolemized blank-node labels may carry characters that are invalid in a | ||
| /// Turtle BLANK_NODE_LABEL (e.g. `/` and `:` from embedded ledger ids) — | ||
| /// sanitize them so the emitted document always parses. | ||
| fn turtle_term(iri_or_bnode: &str) -> String { | ||
| match iri_or_bnode.strip_prefix("_:") { | ||
| Some(label) => { | ||
| let clean: String = label | ||
| .chars() | ||
| .map(|c| { | ||
| if c.is_ascii_alphanumeric() || c == '-' || c == '_' { | ||
| c | ||
| } else { | ||
| '-' | ||
| } | ||
| }) | ||
| .collect(); | ||
| format!("_:{clean}") | ||
| } | ||
| None => format!("<{iri_or_bnode}>"), | ||
| } | ||
| } |
There was a problem hiding this comment.
Turtle IRI serialization is not escaped. turtle_term sanitizes blank-node labels (the _: branch) but renders IRIs as the raw format!("<{iri_or_bnode}>"). An IRI carrying a Turtle-IRIREF-illegal character (space, <, >, ", {, }, |, backtick, \) would emit non-parseable Turtle for the text/turtle response / to_turtle(). Low likelihood (SIDs decode to IRIs from normally well-formed parsed input) and the JSON-LD path is unaffected, but it is the same class of bug the author already guarded against for blank nodes — worth a percent-encode or guard on the IRI branch. Not merge-blocking.
…mantics Three review-driven fixes to the validation report path: - value_json: match the JSON-LD @id datatype by namespace (JSON_LD + local "id"), not the bare local name, so a real datatype whose local name is "id" (e.g. <http://example.org/id>) is no longer misreported as an IRI node in sh:value. - Turtle IRIREF serialization: \uXXXX-escape the characters the Turtle grammar forbids inside <…> (controls, space, <>"{}|^`\) for both subject/ object IRIs and datatype IRIs. UCHAR escapes round-trip to the original IRI, unlike percent-encoding. Adds a parser round-trip test. - validate_staged_nodes: compute `conforms` as `all_results.is_empty()` (spec semantics) to match ShaclEngine::validate_staged, so the field means the same thing across crates. Enforcement gates already key off violation_count(), so behavior is unchanged.
get_focus_nodes ran the rdfs:subClassOf descendant BFS and the rdf:type instance scans once per shape, so N shapes targeting the same class paid that I/O N times. Memoize class -> focus-node list per validate call (db and hierarchy are constant across the shape loop), keyed by the target class Sid, so shapes sharing a class compute it once. Cold path only — the explicit validate endpoint/CLI, not transaction commit. Adds it_validate_report.rs, the first integration coverage for the validate endpoint: two shapes targeting the same class over a subclass hierarchy, asserting subclass expansion (a subclass instance is a focus node) and that both shapes fire on the shared, memoized focus set.
Re-merge onto the current origin branch after resetting away a stale main-merge. Four conflicts, all in shared SHACL touch points: - path.rs: keep main's IRI sh:path fast-path AND this branch's list-check-first ordering (they compose). - compile.rs: adopt main's hoisted implicit-class-target scan (already present after the loop); drop the redundant in-loop copy while keeping this branch's in-loop sh:NodeShape registration. - tx.rs: main's fail-loud cross-ledger value-set graph error plus this branch's same_term_space field. - Cargo.toml: union the workspace excludes. SHACL suites green (fluree-db-shacl 59, api grp_graphsource 82); workspace builds --all-features --all-targets.
Adds static SHACL validation as a first-class feature: instead of rejecting a transaction the way staging-time enforcement does, you can now ask "is my existing data clean?" and get a full W3C-shaped validation report — from the CLI, the Rust API, or over HTTP. Built on top of the property-paths work in the base branch.
fluree validate(CLI).flureeat all — data loads into an ephemeral memory ledger (staging-time enforcement disabled via the config graph, so files that embed their own shapes report violations instead of failing to load). This is the pre-flight companion to bulk import, which deliberately never runs SHACL.--include-attachedunions them.table(human),--format jsonld|turtlefor the W3Csh:ValidationReport.0conforms,1findings at/above--fail-on violation|warning|info,2usage.Rust API —
fluree_db_api::validateThe shared core behind both surfaces:
Fluree::validate_ledger(alias, &ValidateOptions)/validate_view(...)withShapesSource::Attached | Graph(iri) | InlineJsonLd | InlineTurtle, returning aValidateReportwith IRI-resolved results andto_jsonld()/to_turtle()serializers. Validation always runs over the query-visible view (snapshot + novelty); graph IRIs resolve throughgraph_registry, matchingf:shapesSource. Inline shapes ride the non-persisting inline-shapes bundle, compiled against a detached genesis snapshot so replace semantics never leak the ledger's own shapes — while their value-set facts (e.g.ex:CA a ex:Stateshipped with ash:class ex:Stateconstraint) still reach membership checks via a same-term-space membership source.HTTP —
GET|POST /validate/{ledger…}Follows the read-endpoint conventions (peer-mode forwarding, bearer
can_readgate). POST body selects data graph + shapes source;Acceptnegotiates a JSON summary envelope (default),application/ld+json, ortext/turtle. Non-conformance is a200— the report is the answer.Report core
Validation results now carry everything a standard report needs:
sh:sourceConstraintComponentIRIs on every result, full RDF term fidelity forsh:value(language tags, non-native datatypes as{"@value", "@type"}), and focus nodes that can be literals (FocusNodeenum) for literalsh:targetNodetargets.W3C compliance harness
New
testsuite-shacl/crate mirroringtestsuite-sparql/conventions: workspace-excluded, the officialdata-shapessuite as a git submodule,make count / summary / test-cat / test-one / failurestargets. Each test runs through the samefluree_db_api::validatecore as the CLI. Comparison semantics, deliberate leniency rules, and the remaining-failure taxonomy are documented indocs/contributing/shacl-compliance.md; current status and follow-ups will be tracked in a separate issue.Engine conformance fixes (driven by the suite)
ex:S a sh:PropertyShape ; sh:path … ; sh:targetNode …with no wrapping node shape) now compile and validate — previously a silent no-op.sh:targetNode/ literalsh:targetObjectsOfobjects validate as focus nodes.rdfs:subClassOfwalk with the indexed hierarchy (novelty-only subclass relations were invisible to targets).sh:value= focus node;sh:andproduces one result per value node.sh:NodeShapewith only value constraints (e.g. implicit-class-target shapes carrying justsh:in) now register during compile.sh:closedno longer implicitly ignoresrdf:type— the spec never grants that; shapes targeting typed instances must declaresh:ignoredProperties (rdf:type)(as the spec's own examples do). Existing shapes relying on the auto-ignore will start flaggingrdf:type.sh:conformsis spec-semantics: false when the report has any results, warnings/infos included. Transaction enforcement is unchanged — it gates on violations explicitly.@id+ properties now assert their triples (previously silently dropped as bare references). This fixes data loss in JSON-LD bulk import and inline shape/ontology documents.[ … ] .blank-node-property-list statements (valid Turtle, previously rejected).Docs
docs/cli/validate.md,/validatesection indocs/api/endpoints.md,docs/contributing/shacl-compliance.md, and cookbook updates (validation reports section, strictsh:closedguidance).