feat: SHACL property paths and full constraint-coverage sweep #1423
bplatz wants to merge 19 commits into
sh:path previously stored whatever the sh:path object pointed at as a bare predicate Sid. Complex paths (sh:inversePath, sequence lists, sh:alternativePath, sh:zeroOrMore/oneOrMore/zeroOrOnePath) arrived as blank-node refs and were scanned as if they were predicates: SPOT found nothing, so minCount fired spuriously and every other constraint passed silently.

Add a PropertyPath AST (fluree-db-shacl/src/path.rs):

- resolve_sh_path compiles the sh:path object, handling bare predicates, Turtle blank-node path expressions (rdf:first/rdf:rest sequences, sh:alternativePath lists, the transitive *Path predicates) and the JSON-LD @list encoding (ordered flakes via metadata index).
- eval_path evaluates a path to its value-node set: forward = SPOT, inverse = OPST, sequence = chained frontier, alternative = union, */+/? = BFS closure (reflexive for * and ?), deduplicated.
- validate_property_shape keeps the single-predicate SPOT fast path; sh:resultPath is populated only for single-predicate paths.

The inverse of a composite path (^(p1/p2)) and any blank-node path that cannot be resolved are rejected at compile time with ShaclError::InvalidConstraint instead of misbehaving silently.

Adds vocab constants for the path predicates, five integration tests (inverse, sequence, alternative, oneOrMore, unsupported-rejection), and documents the feature in docs/contributing/shacl-implementation.md.
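The evaluation rules listed above can be sketched over an in-memory triple list. This is a minimal illustration, not the code in fluree-db-shacl/src/path.rs: the `Path` enum, the `Triple` alias, and the `eval`/`closure` functions are hypothetical stand-ins, nodes are plain strings rather than Sids, and linear scans replace the SPOT/OPST index lookups.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Illustrative path AST (assumption: names and shape are not the real types).
#[derive(Debug, Clone, PartialEq)]
pub enum Path {
    Predicate(String),
    /// Inverse of a single predicate only, matching the restriction in this commit.
    Inverse(String),
    Sequence(Vec<Path>),
    Alternative(Vec<Path>),
    ZeroOrMore(Box<Path>),
    OneOrMore(Box<Path>),
    ZeroOrOne(Box<Path>),
}

/// (subject, predicate, object) triple standing in for a flake.
pub type Triple = (&'static str, &'static str, &'static str);

/// Evaluate `path` from `focus`; the set type deduplicates value nodes.
pub fn eval(g: &[Triple], focus: &str, path: &Path) -> BTreeSet<String> {
    match path {
        // Forward step: subject is bound (the role SPOT plays in the engine).
        Path::Predicate(p) => g
            .iter()
            .filter(|t| t.0 == focus && t.1 == p.as_str())
            .map(|t| t.2.to_string())
            .collect(),
        // Inverse step: object is bound (the role OPST plays in the engine).
        Path::Inverse(p) => g
            .iter()
            .filter(|t| t.2 == focus && t.1 == p.as_str())
            .map(|t| t.0.to_string())
            .collect(),
        // Sequence: each step is evaluated from every node of the previous frontier.
        Path::Sequence(steps) => {
            let mut frontier = BTreeSet::from([focus.to_string()]);
            for step in steps {
                frontier = frontier.iter().flat_map(|n| eval(g, n, step)).collect();
            }
            frontier
        }
        // Alternative: union of the branches.
        Path::Alternative(alts) => alts.iter().flat_map(|a| eval(g, focus, a)).collect(),
        // `?` is reflexive and takes at most one step.
        Path::ZeroOrOne(inner) => {
            let mut out = eval(g, focus, inner);
            out.insert(focus.to_string());
            out
        }
        Path::ZeroOrMore(inner) => closure(g, focus, inner, true),
        Path::OneOrMore(inner) => closure(g, focus, inner, false),
    }
}

/// Breadth-first closure; `visited` is seeded with the focus so cycles are not re-expanded.
fn closure(g: &[Triple], focus: &str, inner: &Path, reflexive: bool) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    if reflexive {
        out.insert(focus.to_string());
    }
    let mut visited = BTreeSet::from([focus.to_string()]);
    let mut queue = VecDeque::from([focus.to_string()]);
    while let Some(node) = queue.pop_front() {
        for next in eval(g, &node, inner) {
            out.insert(next.clone());
            if visited.insert(next.clone()) {
                queue.push_back(next);
            }
        }
    }
    out
}
```

With triples `a parent b` and `b parent c`, evaluating `parent+` from `a` yields `{b, c}`, while `parent*` also includes `a` itself.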
…r internals doc

Add a Property paths section to the SHACL cookbook covering inverse, sequence, alternative, and transitive (*/+/?) paths with Turtle and JSON-LD (@list) examples, plus the unsupported inverse-of-composite case.

Remove docs/contributing/shacl-implementation.md; we don't keep a dedicated internals guide per feature. The user-facing semantics it carried (shapesSource, predicate-target discovery, per-graph config, value-sets) already live in the cookbook and config reference. Clean up its references in SUMMARY.md and contributing/README.md.
sh:message was compiled onto property and node shapes but never read when building validation results, so custom messages documented in the SHACL cookbook were silently ignored. Violations now use the property shape's sh:message (property constraints and per-value logical constraints) or the node shape's sh:message (sh:closed and node-level logical constraints), falling back to the generated message when no custom message is declared. Nested anonymous shapes keep generated messages (NestedShape carries no message).
…le paths to targets

Property-path hardening follow-ups:

- Nested shape members (sh:or/sh:and/sh:xone/sh:not) now carry the compiled PropertyPath and evaluate it, instead of scanning the path blank node as a bare predicate. A complex path on a nested member (e.g. an inverse path inside sh:or) no longer silently never matches.
- An unsupported or unresolvable sh:path now compiles to PropertyPath::Unresolvable(reason) instead of failing shape compilation. The reason surfaces as a violation only when the owning shape fires on a focus node, so one broken shape no longer wedges every transaction on the ledger; the failure is scoped to that shape's targets.
- references_blank_node() walks the whole AST so a path whose structure lives in a not-yet-scanned graph isn't accepted with a bnode predicate.
- ordered_objects keeps indexed and unindexed flakes from interleaving.
- Literal members in a sequence path now error instead of being dropped.
- closure() seeds visited with the focus (no re-expansion on cycles).
- Reuse fluree_db_core::id_datatype_sid() for ref value-node datatypes.

Tests: complex-path-in-nested-or, unsupported-path scoped-to-targets.
Six SHACL constructs compiled (or loaded) without error but never constrained data. All now enforce:

- sh:node: new NodeConstraint::Node compiled like the other shape-ref constraints: per-value on property shapes, focus-node on node shapes, with anonymous inline shapes inlined via build_nested_shape. Recursive shape references over cyclic data (FriendShape -> knows -> sh:node FriendShape) terminate via a (focus, shape) active-check guard in validate_shape that assumes conformance on re-entry.
- Value constraints directly on a node shape (no sh:path) previously accumulated in a path-less PropertyShapeData that finalize() dropped; they now become node_constraints evaluated against the focus node, and sh:message/sh:name that landed on that entry backfill the shape.
- sh:deactivated is now parsed; deactivated node and property shapes are ignored entirely, including when referenced via sh:node or logical constraints.
- Implicit class targets: a shape that is also rdfs:Class / owl:Class targets its own instances (bound-object rdf:type scans, cost scales with declared classes).
- sh:qualifiedValueShape + sh:qualifiedMin/MaxCount: conforming values are counted against the qualified shape (top-level property shapes; qualifiedValueShapesDisjoint remains unsupported).
- sh:ignoredProperties in Turtle RDF-list form is now expanded; the unexpanded list-head blank node was previously treated as the ignored property, so closed shapes rejected the actual members.
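The recursion guard described for sh:node can be illustrated with a small sketch. It assumes a single hard-coded shape (every focus needs a name, and every `knows` value must conform to the same shape); the `Graph` struct and `conforms` function are hypothetical and much simpler than validate_shape.

```rust
use std::collections::{HashMap, HashSet};

/// Toy data graph (assumption: not the engine's representation).
pub struct Graph {
    pub knows: HashMap<&'static str, Vec<&'static str>>,
    pub named: HashSet<&'static str>,
}

/// Checks a hypothetical FriendShape: the focus has a name, and every
/// `knows` value conforms to FriendShape (the sh:node self-reference).
pub fn conforms(
    g: &Graph,
    focus: &'static str,
    active: &mut HashSet<(&'static str, &'static str)>,
) -> bool {
    let key = (focus, "FriendShape");
    // Re-entry on a (focus, shape) pair already being validated:
    // assume conformance so cyclic data terminates.
    if !active.insert(key) {
        return true;
    }
    let mut ok = g.named.contains(focus);
    if ok {
        if let Some(friends) = g.knows.get(focus) {
            for &friend in friends {
                if !conforms(g, friend, active) {
                    ok = false;
                    break;
                }
            }
        }
    }
    // Leave the active set as we found it for sibling checks.
    active.remove(&key);
    ok
}
```

On a two-node cycle (alice knows bob, bob knows alice) the second visit to alice hits the guard and returns immediately, so the outer call finishes with the result of the checks that actually ran.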
sh:closed, sh:node, and the logical constraints (sh:not/and/or/xone) hardcoded Severity::Violation on their results, so a shape marked sh:severity sh:Warning still rejected transactions for those constraints while property constraints honored severity correctly. Structural results now carry the shape's severity. Nested-shape internal results keep Violation — they are conformance signals for the logical operators, not surfaced directly.
sh:pattern unconditionally rejected every non-string value, so
legitimate shapes over numeric, boolean, or date/time literals (e.g.
^\d{4}$ on an integer vintage year) always violated. Values now match
on their lexical form per SPARQL STR() semantics. Non-literals still
violate: blank nodes per spec; IRIs because matching them needs
namespace decoding this pure path doesn't have (noted as a follow-up).
Two db-access constraints silently no-oped inside nested shapes:

- sh:class on an inline logical member (sh:or [ sh:path p ; sh:class C ]) or an anonymous sh:node value shape fell through to the pure constraint dispatch, which skips db-access constraints, so the check always passed. Nested property constraints and value shapes now resolve class membership, and the ClassMembershipCtx (f:shapesSource vocabulary graphs, per-txn memo, cross-ledger model) is threaded through the nested/referenced-shape paths instead of being dropped at the named-ref boundary, so value-set lookups behave the same at any nesting depth.
- sh:qualifiedValueShape on a property shape used as a logical member was never compiled into the member. build_nested_shape now attaches it (cycle-guarded via a seen-set: a qualified reference cycle between anonymous property shapes falls back to named-ref resolution, where the runtime recursion guard applies), and the nested constraint loop counts conformance like the top-level arm.
Both constraints were parsed but silently unenforced, the last of the loads-fine-does-nothing pair. Language tags already live in flake metadata (FlakeMeta::lang); validation now carries a langs column parallel to values/datatypes, including through property-path evaluation (PathValue gains the language tag; inverse/closure steps and the focus node itself are untagged).

- sh:languageIn matches via SPARQL langMatches basic filtering (RFC 4647): case-insensitive, "en" matches "en-US", "*" matches any tag; untagged values violate.
- sh:uniqueLang true reports one violation per duplicated tag; untagged values are ignored.
- Compile fix: sh:languageIn previously produced one singleton constraint per JSON-LD list member (an unsatisfiable conjunction had it ever been enforced) and dropped the Turtle RDF-list form entirely. Tags now accumulate into a single constraint, with the Turtle list head expanded like sh:in.
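The matching rules for the two language constraints can be sketched as follows. The function names are hypothetical, tags are passed as `Option<&str>` where `None` means untagged, and the case-insensitive grouping in `duplicated_langs` follows the later review fix in this PR rather than this commit alone.

```rust
use std::collections::BTreeMap;

/// Basic filtering in the style of SPARQL langMatches (RFC 4647):
/// case-insensitive, a range matches an equal tag or a tag that
/// extends it at a '-' boundary, and "*" matches any non-empty tag.
pub fn lang_matches(tag: &str, range: &str) -> bool {
    if tag.is_empty() {
        return false;
    }
    if range == "*" {
        return true;
    }
    let tag = tag.to_ascii_lowercase();
    let range = range.to_ascii_lowercase();
    tag == range
        || (tag.starts_with(range.as_str()) && tag.as_bytes().get(range.len()) == Some(&b'-'))
}

/// sh:languageIn for one value: untagged values violate.
pub fn language_in(tag: Option<&str>, allowed: &[&str]) -> bool {
    match tag {
        Some(t) => allowed.iter().any(|r| lang_matches(t, r)),
        None => false,
    }
}

/// sh:uniqueLang: one entry per duplicated tag; untagged values are ignored.
pub fn duplicated_langs(tags: &[Option<&str>]) -> Vec<String> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for tag in tags.iter().flatten() {
        *counts.entry(tag.to_ascii_lowercase()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .map(|(tag, _)| tag)
        .collect()
}
```

The '-' boundary check is what keeps a range of "en" from matching an unrelated tag that merely starts with the same letters.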
Membership used term equality, so sh:in (1.0 2.0) never matched an integer 1 and sh:hasValue 42 never matched a decimal 42.00 — while the range facets already compared across numeric representations via numeric_cmp. sh:in / sh:hasValue now use the same value equality for numeric pairs; everything else keeps term equality.
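A minimal sketch of the distinction between term equality and value equality, assuming a toy `Value` enum in which `f64` stands in for the decimal representation. The real code reuses numeric_cmp; the `value_eq`, `in_list`, and `has_value` names here are hypothetical.

```rust
/// Toy value type (assumption: f64 approximates the decimal representation).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Decimal(f64),
    Text(String),
}

/// Value equality for numeric pairs, term equality for everything else.
pub fn value_eq(a: &Value, b: &Value) -> bool {
    match (a, b) {
        // Mixed numeric representations compare by numeric value.
        // The i64 -> f64 cast is lossy for very large integers; fine for a sketch.
        (Value::Integer(i), Value::Decimal(d)) | (Value::Decimal(d), Value::Integer(i)) => {
            (*i as f64) == *d
        }
        // Same-variant and non-numeric pairs fall back to derived equality.
        _ => a == b,
    }
}

/// sh:in: the value must equal some member of the list.
pub fn in_list(value: &Value, allowed: &[Value]) -> bool {
    allowed.iter().any(|member| value_eq(value, member))
}

/// sh:hasValue: some value node must equal the required value.
pub fn has_value(values: &[Value], required: &Value) -> bool {
    values.iter().any(|v| value_eq(v, required))
}
```

Under plain term equality `Integer(1)` and `Decimal(1.0)` are different terms, which is why sh:in (1.0 2.0) previously rejected an integer 1.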
sh:pattern / sh:minLength / sh:maxLength on IRI values previously violated unconditionally. Per SPARQL STR() they now match against the full IRI: value vectors containing non-blank IRI refs are rewritten with the decoded IRI string before the string-facet dispatch (top-level and nested property constraints). Blank nodes still fail per spec, and an IRI whose namespace was allocated in the same transaction cannot be decoded against the base snapshot — it fails closed with the generic message rather than silently passing.
NestedShape now carries the member's sh:message (populated from the anonymous property-shape entry at compile time; named references keep resolving their message from the referenced CompiledShape at validation time). Nested property-constraint violations prefer it over the generated text, so a custom message on an sh:and/or/xone/not/node member reaches the transaction error.
A qualified value shape marked disjoint now excludes values that conform to a sibling qualified shape (the qualified shapes declared by the other property shapes of the same node shape, gathered in a finalize pass). The canonical crew example works: requiring one pilot and one navigator as distinct members rejects a single member holding both roles, while the same data passes without disjointness. Sibling collection is top-level only; qualified constraints inside logical members keep counting without disjointness.
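The crew example can be sketched with roles standing in for qualified shapes. This is an illustration under simplifying assumptions: a value "conforms" to a qualified shape when it holds the corresponding role, and `Member`, `has_role`, and `qualified_count` are hypothetical names.

```rust
/// A crew member value node with the roles it holds.
pub struct Member {
    pub name: &'static str,
    pub roles: Vec<&'static str>,
}

fn has_role(member: &Member, role: &str) -> bool {
    member.roles.iter().any(|r| *r == role)
}

/// Count values conforming to the own qualified shape. When `disjoint`
/// is set, values that also conform to any sibling shape are excluded.
pub fn qualified_count(values: &[Member], own: &str, siblings: &[&str], disjoint: bool) -> usize {
    values
        .iter()
        .filter(|m| has_role(m, own))
        .filter(|m| !disjoint || !siblings.iter().any(|s| has_role(m, s)))
        .count()
}
```

A single member holding both roles counts as zero pilots under disjointness, so a qualified minimum of one fails; without disjointness the same member counts once for each role and the data passes.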
Bulk import is a trusted, high-throughput load path: validate source data against the shapes before importing; transaction-time validation keeps the ledger clean from there.
…ing facets

Review fixes:

- A value-only anonymous member of a node-level logical constraint (sh:or ([ sh:class ex:C ]), sh:not [ sh:nodeKind ... ]) produced no checks at all: validate_nested_shape never evaluated value_constraints against the focus node, so such members always conformed. The focus-node evaluation core is now shared (focus_value_violations) between direct node-shape constraints and nested value-only members, including pair constraints, sh:class, and string facets.
- String facets are now consistent everywhere: node-shape constraints and anonymous value shapes stringify IRI values (STR(iri)) like top-level property shapes, and sh:minLength/sh:maxLength use the same lexical form as sh:pattern; non-literals (blank nodes, undecodable IRIs) violate instead of measuring the SID name fragment.
- sh:severity on a node shape with direct value constraints was lost (the metadata arms prefer the path-less property-shape entry; message and name were backfilled but severity was not).
- sh:uniqueLang counts tags case-insensitively per BCP 47.
- Nested qualified constraints honor disjoint/sibling fields uniformly (siblings remain empty for members not referenced via sh:property, which matches the spec's sibling definition).
- Removed the now-unused validate_constraint_set.
Review fixes:

- sh:inversePath was limited to a single predicate, rejecting valid SHACL like ^(p1/p2). Inversion now rewrites into the AST: inverse of a sequence is the reversed sequence of inverses, inverse of an alternative distributes, inverse of a closure wraps the inverted inner path, and a double inverse collapses.
- Malformed paths error instead of compiling nondeterministically: path operators take exactly one value (multiple distinct refs for sh:inversePath / closures error via sole_ref/operand_path), and multiple sh:path or operator-operand objects only form a sequence under the JSON-LD @list encoding (every flake carries a list index); un-indexed multiples are distinct assertions and error. Operator operands accept both encodings via operand_path.
- fluree-vocab: full-IRI constants for the path operators, sh:node, and sh:deactivated to match the local-name additions.

The unsupported-path tests now use a genuinely malformed fixture (a literal sequence step); a new test covers ^(parent/parent).
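The inversion rewrite can be sketched as a pure function over a path AST. The `PathExpr` enum below is a hypothetical stand-in for PropertyPath, and it assumes inversion is pushed all the way down so that `Inverse` only ever wraps a single predicate; whether the real AST normalizes exactly this way is not shown in the PR text.

```rust
/// Illustrative path AST (assumption: inverse appears only at the leaves).
#[derive(Debug, Clone, PartialEq)]
pub enum PathExpr {
    Predicate(String),
    Inverse(String),
    Sequence(Vec<PathExpr>),
    Alternative(Vec<PathExpr>),
    ZeroOrMore(Box<PathExpr>),
    OneOrMore(Box<PathExpr>),
    ZeroOrOne(Box<PathExpr>),
}

/// Rewrite the inverse of `path` into an equivalent inverse-free-at-the-top AST.
pub fn invert(path: PathExpr) -> PathExpr {
    match path {
        PathExpr::Predicate(p) => PathExpr::Inverse(p),
        // Double inverse collapses back to the forward predicate.
        PathExpr::Inverse(p) => PathExpr::Predicate(p),
        // ^(p1/p2) = ^p2/^p1: reverse the steps and invert each one.
        PathExpr::Sequence(steps) => {
            PathExpr::Sequence(steps.into_iter().rev().map(invert).collect())
        }
        // Inverse distributes over alternatives.
        PathExpr::Alternative(alts) => {
            PathExpr::Alternative(alts.into_iter().map(invert).collect())
        }
        // Inverse of a closure is the closure of the inverted inner path.
        PathExpr::ZeroOrMore(inner) => PathExpr::ZeroOrMore(Box::new(invert(*inner))),
        PathExpr::OneOrMore(inner) => PathExpr::OneOrMore(Box::new(invert(*inner))),
        PathExpr::ZeroOrOne(inner) => PathExpr::ZeroOrOne(Box::new(invert(*inner))),
    }
}
```

Because every case maps back into the same AST, the evaluator needs no separate "inverse of a composite" code path; applying `invert` twice returns the original expression.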
Per the spec's sibling definition the set excludes the constraint's own sh:qualifiedValueShape by value, so two property shapes referencing the same qualified shape don't disqualify each other's values.
efca0e8 to
2aa1755
aaj3f left a comment
Nice fix and nice path support addition!
/// Resolve a single `sh:path` object node (a predicate IRI or a path-expression
/// blank node) into a [`PropertyPath`].
fn resolve_path_node<'a>(db: GraphDbRef<'a>, node: &'a Sid) -> PathFuture<'a, PropertyPath> {
    Box::pin(async move {
        // sh:inversePath — inverse of any path, rewritten into the AST
        // (inverse of a sequence = reversed sequence of inverses, etc.).
        if let Some(inner) = operand_path(db, node, &shacl(predicates::INVERSE_PATH)).await? {
            return Ok(invert(inner));
        }
resolve_path_node probes six operator predicates (sh:inversePath, sh:alternativePath, the three closures, and rdf:first) via operand_path/has_object, each a db.range SPOT scan, before falling through to PropertyPath::Predicate. For a plain-predicate sh:path (the overwhelmingly common case) that is ~6 index probes that all return empty. This runs inside resolve_paths, which compile_from_dbs invokes per transaction on any ledger that has shapes (fluree-db-api/src/tx.rs:578 → from_dbs_with_overlay; shapes are recompiled every transaction, not cached by schema_epoch). So a schema with N property shapes adds ~6·N empty index scans to every write. Per the SHACL spec an IRI-valued sh:path is always a predicate, and path-expression structures are always blank nodes, so a namespace_code != BLANK_NODE short-circuit preserves correctness and is a large constant-factor win. The base branch stored the path as a bare Sid with zero scans, so this is a genuine regression introduced here.
fn resolve_path_node<'a>(db: GraphDbRef<'a>, node: &'a Sid) -> PathFuture<'a, PropertyPath> {
    Box::pin(async move {
        // Fast path: an IRI in sh:path is always a plain predicate (SHACL spec);
        // only blank nodes carry path-expression structure. Skip the six operator
        // probes for the common case — this runs per property shape per compile,
        // and compiles happen per transaction.
        if node.namespace_code != BLANK_NODE {
            return Ok(PropertyPath::Predicate(node.clone()));
        }
        // sh:inversePath — inverse of any path, rewritten into the AST ...
        if let Some(inner) = operand_path(db, node, &shacl(predicates::INVERSE_PATH)).await? {
            return Ok(invert(inner));
        }
        // ... rest unchanged

(BLANK_NODE is already imported at line 22.)
/// Deduplicate value nodes (SHACL value nodes are a set).
fn dedup(mut values: Vec<PathValue>) -> Vec<PathValue> {
    let mut seen: HashSet<String> = HashSet::new();
    values.retain(|(v, dt, lang)| seen.insert(format!("{v:?}|{dt:?}|{lang:?}")));
    values
}
dedup builds a fresh heap String per value node via format!("{v:?}|{dt:?}|{lang:?}") as its hash-set key. On a sh:zeroOrMorePath/oneOrMorePath closure over a large graph (or a wide alternative/sequence) this is an O(n) allocation + formatting pass on every step's output. Prefer a structural key that hashes the components directly — FlakeValue, Sid, and Option<String> should be Hash/Eq already (they key the class-membership memo and visited sets elsewhere), so:
fn dedup(mut values: Vec<PathValue>) -> Vec<PathValue> {
    let mut seen: HashSet<(FlakeValue, Sid, Option<String>)> = HashSet::new();
    values.retain(|(v, dt, lang)| seen.insert((v.clone(), dt.clone(), lang.clone())));
    values
}

Still clones, but avoids the format-string allocation and the Debug-representation fragility. Scope is complex-path shapes only (opt-in), hence minor. (Verification: the suggested tuple key compiles: FlakeValue has total manual Hash/Eq impls at value.rs:747/795/865, and Sid/Option<String> are Hash+Eq. Note it also makes dedup collapse cross-type-numeric-equal values that share a datatype, which is arguably more correct.)
// Collect subjects typed as a class — a shape that is also a class
// implicitly targets its own instances (SHACL "implicit class
// targets"). Bound-object scans, so cost scales with the number of
// declared classes, not the data.
let rdf_type = Sid::new(RDF, rdf_names::TYPE);
for class_class in [
    Sid::new(fluree_vocab::namespaces::RDFS, "Class"),
    Sid::new(fluree_vocab::namespaces::OWL, "Class"),
] {
    let flakes = db
        .range(
            IndexType::Opst,
            RangeTest::Eq,
            RangeMatch::predicate_object(
                rdf_type.clone(),
                FlakeValue::Ref(class_class),
            ),
        )
        .await?;
    class_typed.extend(flakes.iter().map(|f| f.s.clone()));
}
The implicit-class-target discovery issues two Opst bound-object scans (rdf:type → rdfs:Class, rdf:type → owl:Class) per input graph on every compile (i.e. every transaction with shapes), regardless of whether any shape id is actually a class. The result set is bounded by the number of declared classes, so this is cheap in absolute terms, but it is fixed per-transaction overhead that didn't exist before. Acceptable as-is; noting it so it is a conscious cost. If it ever shows up in profiles, gate it on !self.shapes.is_empty() (already implied) or restrict to shape ids.
Three review-driven optimizations to compile_from_dbs, which recompiles shapes on every transaction against a shape-bearing ledger:

- resolve_path_node: short-circuit IRI-valued sh:path to a plain predicate before the six operator probes. Per the SHACL spec only blank nodes carry path-expression structure, so N property shapes no longer add ~6N empty SPOT scans per compile. Restores the base branch's zero-scan behavior.
- dedup: key value nodes on a structural (FlakeValue, Sid, Option<String>) tuple instead of a per-value format! heap allocation, dropping the O(n) Debug-format pass on closure/alternative output.
- implicit class targets: hoist the two Opst rdf:type->Class scans out of the per-graph loop and skip them entirely when no shapes were found. Cross-graph resolution is preserved (runs after all graphs processed).
Overview
Builds on the class-value-set branch with two thrusts: full sh:path property-path support, and a sweep that closes every silently-broken or unenforced SHACL constraint found in a full audit of the engine. Before this branch, several constructs loaded without error but never constrained data, the worst failure mode for a validation system.

Property paths (sh:path)
- Paths compile to a PropertyPath AST and evaluate against the focus node: inverse (over any path; ^(p1/p2) rewrites to ^p2/^p1), sequences (Turtle RDF lists and JSON-LD @list), alternatives, and the three closures, all nestable.
- An unsupported or unresolvable path compiles to an Unresolvable marker that surfaces as a violation scoped to the shape's targets, so one broken shape cannot wedge unrelated transactions on the ledger.

Constraints that now enforce (previously silent no-ops)
- sh:node: per-value on property shapes, focus-node on node shapes. Recursive shape references over cyclic data (e.g. FriendShape → knows → sh:node FriendShape) terminate via an active-(focus, shape) guard that assumes conformance on re-entry.
- Value constraints directly on a node shape (no sh:path), including value-only anonymous members of logical constraints.
- sh:qualifiedValueShape + min/max counts, including inside logical members, plus sh:qualifiedValueShapesDisjoint (sibling set per the spec definition: all property shapes of the parent node shape, excluding the own qualified shape by value).
- sh:uniqueLang / sh:languageIn: language tags threaded from flake metadata through validation (including path evaluation); langMatches basic-range semantics; case-insensitive uniqueness per BCP 47.
- sh:deactivated, implicit class targets (shape that is also rdfs:Class / owl:Class), Turtle RDF-list sh:ignoredProperties.

Semantics fixes in already-enforced constraints
- sh:message now surfaces (top-level shapes and anonymous nested members).
- sh:severity honored on node-level structural constraints and on node shapes with direct value constraints; warn-severity shapes no longer reject.
- sh:pattern / sh:minLength / sh:maxLength match the lexical form per SPARQL STR(): numeric/date literals participate, IRIs match their full decoded IRI, blank nodes fail per spec. (An IRI whose namespace is allocated in the same transaction can't be decoded against the base snapshot and fails closed.)
- sh:in / sh:hasValue compare numerics by value across representations (1 matches 1.0).
- sh:class and qualified shapes evaluate inside nested members with the full membership context (f:shapesSource vocabulary graphs, per-txn memo, cross-ledger model) instead of dropping it at the named-ref boundary.

Policy
Bulk import deliberately bypasses SHACL (documented in the cookbook): validate source data against the shapes before importing; transaction-time validation keeps the ledger clean from there.
Testing
62 SHACL integration tests (each constraint has pass + violate cases; regressions pinned via temp-revert where practical, e.g. the disjoint test carries its own control) and 57 unit tests. Full workspace clippy --all-features --all-targets and nextest --workspace --all-features pass; the one unrelated flake (raft_multi_node::liveness_monitor_demotes_killed_follower) passes in isolation.

Known limitations (documented)
- sh:targetNode with literal values is not compiled (focus nodes are subject ids throughout the engine).
- sh:sparql constraints are not implemented.
- Shapes are recompiled on every transaction; ShaclCacheKey.schema_epoch is ready for a cross-transaction cache if profiling warrants it.