
feat: SHACL property paths and full constraint-coverage sweep#1423

Open
bplatz wants to merge 19 commits into feature/dataset-reasoning-parity from feature/shacl-property-paths

Conversation

@bplatz

@bplatz bplatz commented Jul 2, 2026

Contributor

Overview

Builds on the class-value-set branch with two thrusts: full sh:path property-path support, and a sweep that closes every silently-broken or unenforced SHACL constraint found in a full audit of the engine. Before this branch, several constructs loaded without error but never constrained data — the worst failure mode for a validation system.

Property paths (sh:path)

  • Path expressions compile into a PropertyPath AST and evaluate against the focus node: inverse (over any path — ^(p1/p2) rewrites to ^p2/^p1), sequences (Turtle RDF lists and JSON-LD @list), alternatives, and the three closures, all nestable.
  • Complex paths work inside nested/logical members too, not just top-level property shapes.
  • Malformed paths (literal steps, multiple un-listed operator values) compile to an Unresolvable marker that surfaces as a violation scoped to the shape's targets — one broken shape cannot wedge unrelated transactions on the ledger.
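
The inverse rewrite described above can be sketched as follows. This is a minimal illustration, not the PR's code: the `Path` enum and its variant names are hypothetical stand-ins for the real `PropertyPath` AST, and predicates are plain strings rather than `Sid`s.

```rust
/// Hypothetical, simplified path AST (an assumption; the real `PropertyPath` differs).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Path {
    Predicate(String),
    Inverse(Box<Path>),
    Sequence(Vec<Path>),
    Alternative(Vec<Path>),
    ZeroOrMore(Box<Path>),
    OneOrMore(Box<Path>),
    ZeroOrOne(Box<Path>),
}

/// Rewrite `^path` so that inverses end up only on leaf predicates.
pub fn invert(path: Path) -> Path {
    match path {
        Path::Predicate(p) => Path::Inverse(Box::new(Path::Predicate(p))),
        // A double inverse collapses: ^(^x) = x.
        Path::Inverse(inner) => *inner,
        // ^(p1/p2) = ^p2/^p1: reverse the steps and invert each one.
        Path::Sequence(steps) => Path::Sequence(steps.into_iter().rev().map(invert).collect()),
        // Inverse distributes over alternatives.
        Path::Alternative(alts) => Path::Alternative(alts.into_iter().map(invert).collect()),
        // Inverse of a closure wraps the inverted inner path.
        Path::ZeroOrMore(inner) => Path::ZeroOrMore(Box::new(invert(*inner))),
        Path::OneOrMore(inner) => Path::OneOrMore(Box::new(invert(*inner))),
        Path::ZeroOrOne(inner) => Path::ZeroOrOne(Box::new(invert(*inner))),
    }
}
```

Rewriting at compile time keeps the evaluator simple: it only ever needs forward and inverse scans on single predicates.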

Constraints that now enforce (previously silent no-ops)

  • sh:node — per-value on property shapes, focus-node on node shapes. Recursive shape references over cyclic data (e.g. FriendShape → knows → sh:node FriendShape) terminate via an active-(focus, shape) guard that assumes conformance on re-entry.
  • Value constraints directly on a node shape (no sh:path), including value-only anonymous members of logical constraints.
  • sh:qualifiedValueShape + min/max counts, including inside logical members, plus sh:qualifiedValueShapesDisjoint (sibling set per the spec definition: all property shapes of the parent node shape, excluding the own qualified shape by value).
  • sh:uniqueLang / sh:languageIn — language tags threaded from flake metadata through validation (including path evaluation); langMatches basic-range semantics; case-insensitive uniqueness per BCP 47.
  • sh:deactivated, implicit class targets (shape that is also rdfs:Class/owl:Class), Turtle RDF-list sh:ignoredProperties.
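
The active-(focus, shape) guard for recursive sh:node references can be sketched roughly like this. Everything here is a hypothetical simplification: `Shape`, `conforms`, and the in-memory `knows` map are illustrative names, and the real guard lives in `validate_shape` against the database.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical shape: the focus needs at least `min_knows` values of `knows`,
/// and each value must conform to `node_shape` (the sh:node reference), if any.
pub struct Shape {
    pub min_knows: usize,
    pub node_shape: Option<&'static str>,
}

/// Returns whether `focus` conforms to `shape_id`. Panics on an unknown shape id
/// (acceptable for a sketch).
pub fn conforms(
    shapes: &HashMap<&'static str, Shape>,
    knows: &HashMap<&'static str, Vec<&'static str>>,
    active: &mut HashSet<(&'static str, &'static str)>,
    focus: &'static str,
    shape_id: &'static str,
) -> bool {
    // Re-entry on an in-flight (focus, shape) pair: assume conformance so that
    // cyclic data terminates instead of recursing forever.
    if !active.insert((focus, shape_id)) {
        return true;
    }
    let shape = &shapes[shape_id];
    let values: &[&'static str] = knows.get(focus).map(|v| v.as_slice()).unwrap_or(&[]);
    let mut ok = values.len() >= shape.min_knows;
    if let Some(inner) = shape.node_shape {
        for &value in values {
            ok &= conforms(shapes, knows, active, value, inner);
        }
    }
    // Leave the guard so later, independent checks of this pair run normally.
    active.remove(&(focus, shape_id));
    ok
}
```

With alice and bob knowing each other, the check on alice re-enters (alice, FriendShape) through bob and returns instead of looping.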

Semantics fixes in already-enforced constraints

  • sh:message now surfaces (top-level shapes and anonymous nested members).
  • sh:severity honored on node-level structural constraints and on node shapes with direct value constraints — warn-severity shapes no longer reject.
  • sh:pattern/sh:minLength/sh:maxLength match the lexical form per SPARQL STR(): numeric/date literals participate, IRIs match their full decoded IRI, blank nodes fail per spec. (An IRI whose namespace is allocated in the same transaction can't be decoded against the base snapshot and fails closed.)
  • sh:in/sh:hasValue compare numerics by value across representations (1 matches 1.0).
  • sh:class and qualified shapes evaluate inside nested members with the full membership context (f:shapesSource vocabulary graphs, per-txn memo, cross-ledger model) instead of dropping it at the named-ref boundary.

Policy

Bulk import deliberately bypasses SHACL (documented in the cookbook): validate source data against the shapes before importing; transaction-time validation keeps the ledger clean from there.

Testing

62 SHACL integration tests (each constraint has pass + violate cases; regressions pinned via temp-revert where practical, e.g. the disjoint test carries its own control) and 57 unit tests. Full workspace clippy --all-features --all-targets and nextest --workspace --all-features pass; the one unrelated flake (raft_multi_node::liveness_monitor_demotes_killed_follower) passes in isolation.

Known limitations (documented)

  • sh:targetNode with literal values is not compiled (focus nodes are subject ids throughout the engine).
  • sh:sparql constraints are not implemented.
  • Shapes recompile per transaction; ShaclCacheKey.schema_epoch is ready for a cross-transaction cache if profiling warrants it.

@bplatz bplatz requested review from aaj3f and zonotope July 2, 2026 15:12
bplatz added 18 commits July 2, 2026 20:10
sh:path previously stored whatever the sh:path object pointed at as a bare
predicate Sid. Complex paths (sh:inversePath, sequence lists,
sh:alternativePath, sh:zeroOrMore/oneOrMore/zeroOrOnePath) arrived as
blank-node refs and were scanned as if they were predicates — SPOT found
nothing, so minCount fired spuriously and every other constraint passed
silently.

Add a PropertyPath AST (fluree-db-shacl/src/path.rs):
- resolve_sh_path compiles the sh:path object, handling bare predicates,
  Turtle blank-node path expressions (rdf:first/rdf:rest sequences,
  sh:alternativePath lists, the transitive *Path predicates) and the
  JSON-LD @list encoding (ordered flakes via metadata index).
- eval_path evaluates a path to its value-node set: forward = SPOT,
  inverse = OPST, sequence = chained frontier, alternative = union,
  */+/? = BFS closure (reflexive for * and ?), deduplicated.
- validate_property_shape keeps the single-predicate SPOT fast path;
  sh:resultPath is populated only for single-predicate paths.
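
As a rough illustration of the evaluation rules listed above, here is a sketch over an in-memory triple list. It assumes hypothetical names (`Triple`, `Path`, `eval`, `closure`) and linear scans in place of the SPOT/OPST index lookups; inverse is limited to a single predicate, matching this commit.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Hypothetical in-memory triple: (subject, predicate, object).
pub type Triple = (&'static str, &'static str, &'static str);

pub enum Path {
    Forward(&'static str),
    Inverse(&'static str),
    Sequence(Vec<Path>),
    Alternative(Vec<Path>),
    ZeroOrMore(Box<Path>),
    OneOrMore(Box<Path>),
    ZeroOrOne(Box<Path>),
}

/// Evaluate `path` from `focus` to its (deduplicated) set of value nodes.
pub fn eval(graph: &[Triple], focus: &'static str, path: &Path) -> BTreeSet<&'static str> {
    match path {
        // Forward step: subject-bound scan.
        Path::Forward(p) => graph.iter().filter(|(s, q, _)| *s == focus && q == p).map(|(_, _, o)| *o).collect(),
        // Inverse step: object-bound scan.
        Path::Inverse(p) => graph.iter().filter(|(_, q, o)| *o == focus && q == p).map(|(s, _, _)| *s).collect(),
        // Sequence: each step maps the current frontier to the next one.
        Path::Sequence(steps) => {
            let mut frontier = BTreeSet::from([focus]);
            for step in steps {
                frontier = frontier.iter().flat_map(|n| eval(graph, *n, step)).collect();
            }
            frontier
        }
        // Alternative: union of the branches.
        Path::Alternative(alts) => alts.iter().flat_map(|alt| eval(graph, focus, alt)).collect(),
        Path::ZeroOrMore(inner) => closure(graph, focus, inner, true),
        Path::OneOrMore(inner) => closure(graph, focus, inner, false),
        Path::ZeroOrOne(inner) => {
            let mut out = eval(graph, focus, inner);
            out.insert(focus); // reflexive
            out
        }
    }
}

/// Breadth-first closure; `reflexive` adds the focus itself (used for `*`).
fn closure(graph: &[Triple], focus: &'static str, inner: &Path, reflexive: bool) -> BTreeSet<&'static str> {
    let mut visited = BTreeSet::from([focus]); // never re-expand the focus on cycles
    let mut out = BTreeSet::new();
    if reflexive {
        out.insert(focus);
    }
    let mut queue = VecDeque::from([focus]);
    while let Some(node) = queue.pop_front() {
        for next in eval(graph, node, inner) {
            out.insert(next);
            if visited.insert(next) {
                queue.push_back(next);
            }
        }
    }
    out
}
```

Using a set as the result type gives the deduplication for free, since SHACL value nodes are a set.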

The inverse of a composite path (^(p1/p2)) and any blank-node path that
cannot be resolved are rejected at compile time with
ShaclError::InvalidConstraint instead of misbehaving silently.

Adds vocab constants for the path predicates, five integration tests
(inverse, sequence, alternative, oneOrMore, unsupported-rejection), and
documents the feature in docs/contributing/shacl-implementation.md.
…r internals doc

Add a Property paths section to the SHACL cookbook covering inverse,
sequence, alternative, and transitive (*/+/?) paths with Turtle and
JSON-LD (@list) examples, plus the unsupported inverse-of-composite case.

Remove docs/contributing/shacl-implementation.md — we don't keep a
dedicated internals guide per feature. The user-facing semantics it
carried (shapesSource, predicate-target discovery, per-graph config,
value-sets) already live in the cookbook and config reference. Clean up
its references in SUMMARY.md and contributing/README.md.
sh:message was compiled onto property and node shapes but never read
when building validation results, so custom messages documented in the
SHACL cookbook were silently ignored. Violations now use the property
shape's sh:message (property constraints and per-value logical
constraints) or the node shape's sh:message (sh:closed and node-level
logical constraints), falling back to the generated message when no
custom message is declared. Nested anonymous shapes keep generated
messages (NestedShape carries no message).
…le paths to targets

Property-path hardening follow-ups:

- Nested shape members (sh:or/sh:and/sh:xone/sh:not) now carry the
  compiled PropertyPath and evaluate it, instead of scanning the path
  blank node as a bare predicate. A complex path on a nested member
  (e.g. an inverse path inside sh:or) no longer silently never matches.

- An unsupported or unresolvable sh:path now compiles to
  PropertyPath::Unresolvable(reason) instead of failing shape
  compilation. The reason surfaces as a violation only when the owning
  shape fires on a focus node, so one broken shape no longer wedges every
  transaction on the ledger — the failure is scoped to that shape's
  targets.

- references_blank_node() walks the whole AST so a path whose structure
  lives in a not-yet-scanned graph isn't accepted with a bnode predicate.
- ordered_objects keeps indexed and unindexed flakes from interleaving.
- literal members in a sequence path now error instead of being dropped.
- closure() seeds visited with the focus (no re-expansion on cycles).
- reuse fluree_db_core::id_datatype_sid() for ref value-node datatypes.

Tests: complex-path-in-nested-or, unsupported-path scoped-to-targets.
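
A minimal sketch of the scoping behavior, assuming a hypothetical `Shape` with a single class target and a two-variant `Path` (neither is the real type): the unresolvable reason becomes a violation only when the shape fires on a focus node.

```rust
/// Hypothetical stand-ins for the compiled path and shape.
pub enum Path {
    Predicate(&'static str),
    Unresolvable(String),
}

pub struct Shape {
    pub target_class: &'static str,
    pub path: Path,
}

/// Validate one focus node (identified here only by its class) against one shape.
pub fn validate(shape: &Shape, focus_class: &str) -> Vec<String> {
    // The shape does not target this node: a broken path cannot affect it.
    if focus_class != shape.target_class {
        return Vec::new();
    }
    match &shape.path {
        // Compilation succeeded; the failure surfaces only for the shape's targets.
        Path::Unresolvable(reason) => vec![format!("unresolvable sh:path: {reason}")],
        // Normal constraint evaluation is elided in this sketch.
        Path::Predicate(_) => Vec::new(),
    }
}
```
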
Six SHACL constructs compiled (or loaded) without error but never
constrained data. All now enforce:

- sh:node: new NodeConstraint::Node compiled like the other shape-ref
  constraints — per-value on property shapes, focus-node on node shapes,
  with anonymous inline shapes inlined via build_nested_shape. Recursive
  shape references over cyclic data (FriendShape -> knows -> sh:node
  FriendShape) terminate via a (focus, shape) active-check guard in
  validate_shape that assumes conformance on re-entry.
- Value constraints directly on a node shape (no sh:path) previously
  accumulated in a path-less PropertyShapeData that finalize() dropped;
  they now become node_constraints evaluated against the focus node,
  and sh:message/sh:name that landed on that entry backfill the shape.
- sh:deactivated is now parsed; deactivated node and property shapes
  are ignored entirely, including when referenced via sh:node or
  logical constraints.
- Implicit class targets: a shape that is also rdfs:Class / owl:Class
  targets its own instances (bound-object rdf:type scans, cost scales
  with declared classes).
- sh:qualifiedValueShape + sh:qualifiedMin/MaxCount: conforming values
  are counted against the qualified shape (top-level property shapes;
  qualifiedValueShapesDisjoint remains unsupported).
- sh:ignoredProperties in Turtle RDF-list form is now expanded; the
  unexpanded list-head blank node was previously treated as the ignored
  property, so closed shapes rejected the actual members.
sh:closed, sh:node, and the logical constraints (sh:not/and/or/xone)
hardcoded Severity::Violation on their results, so a shape marked
sh:severity sh:Warning still rejected transactions for those
constraints while property constraints honored severity correctly.
Structural results now carry the shape's severity. Nested-shape
internal results keep Violation — they are conformance signals for
the logical operators, not surfaced directly.
sh:pattern unconditionally rejected every non-string value, so
legitimate shapes over numeric, boolean, or date/time literals (e.g.
^\d{4}$ on an integer vintage year) always violated. Values now match
on their lexical form per SPARQL STR() semantics. Non-literals still
violate: blank nodes per spec; IRIs because matching them needs
namespace decoding this pure path doesn't have (noted as a follow-up).
Two db-access constraints silently no-oped inside nested shapes:

- sh:class on an inline logical member (sh:or [ sh:path p ; sh:class C ])
  or an anonymous sh:node value shape fell through to the pure constraint
  dispatch, which skips db-access constraints — the check always passed.
  Nested property constraints and value shapes now resolve class
  membership, and the ClassMembershipCtx (f:shapesSource vocabulary
  graphs, per-txn memo, cross-ledger model) is threaded through the
  nested/referenced-shape paths instead of being dropped at the
  named-ref boundary, so value-set lookups behave the same at any
  nesting depth.
- sh:qualifiedValueShape on a property shape used as a logical member
  was never compiled into the member. build_nested_shape now attaches
  it (cycle-guarded via a seen-set: a qualified reference cycle between
  anonymous property shapes falls back to named-ref resolution, where
  the runtime recursion guard applies), and the nested constraint loop
  counts conformance like the top-level arm.
sh:uniqueLang and sh:languageIn were both parsed but silently unenforced, the
last pair of loads-fine-does-nothing constraints. Language tags already live in flake
metadata (FlakeMeta::lang); validation now carries a langs column
parallel to values/datatypes, including through property-path
evaluation (PathValue gains the language tag; inverse/closure steps
and the focus node itself are untagged).

- sh:languageIn matches via SPARQL langMatches basic filtering
  (RFC 4647): case-insensitive, "en" matches "en-US", "*" matches any
  tag; untagged values violate.
- sh:uniqueLang true reports one violation per duplicated tag;
  untagged values are ignored.
- Compile fix: sh:languageIn previously produced one singleton
  constraint per JSON-LD list member (an unsatisfiable conjunction had
  it ever been enforced) and dropped the Turtle RDF-list form
  entirely. Tags now accumulate into a single constraint, with the
  Turtle list head expanded like sh:in.
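
The language checks above can be sketched as below. The function names are hypothetical, and the case-insensitive counting in `duplicated_langs` reflects the later review fix rather than this commit.

```rust
use std::collections::BTreeMap;

/// SPARQL `langMatches` with RFC 4647 basic filtering: the range matches a tag
/// that equals it or extends it at a `-` boundary, case-insensitively.
pub fn lang_matches(tag: &str, range: &str) -> bool {
    if tag.is_empty() {
        return false; // untagged values never match
    }
    if range == "*" {
        return true; // wildcard matches any tag
    }
    let tag = tag.to_ascii_lowercase();
    let range = range.to_ascii_lowercase();
    tag == range
        || (tag.starts_with(range.as_str()) && tag.as_bytes().get(range.len()) == Some(&b'-'))
}

/// sh:languageIn: the value's tag must match at least one allowed range.
pub fn language_in(tag: Option<&str>, allowed: &[&str]) -> bool {
    match tag {
        Some(t) => allowed.iter().any(|r| lang_matches(t, r)),
        None => false,
    }
}

/// sh:uniqueLang: tags that occur more than once (one violation each);
/// untagged values are ignored.
pub fn duplicated_langs(tags: &[Option<&str>]) -> Vec<String> {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for tag in tags.iter().flatten() {
        *counts.entry(tag.to_ascii_lowercase()).or_insert(0) += 1;
    }
    counts.into_iter().filter(|(_, n)| *n > 1).map(|(t, _)| t).collect()
}
```
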
sh:in / sh:hasValue membership used term equality, so sh:in (1.0 2.0) never matched an
integer 1 and sh:hasValue 42 never matched a decimal 42.00 — while the
range facets already compared across numeric representations via
numeric_cmp. sh:in / sh:hasValue now use the same value equality for
numeric pairs; everything else keeps term equality.
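
The difference between term equality and value equality can be illustrated with a small sketch. The `Value` enum and function names are hypothetical; the real code reuses `numeric_cmp` and covers more numeric types (decimals in particular), and the `i64` to `f64` cast here loses precision for very large integers.

```rust
/// Hypothetical value type with two numeric representations.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Integer(i64),
    Double(f64),
    Text(String),
}

/// Value equality for numeric pairs, term equality for everything else.
pub fn value_eq(a: &Value, b: &Value) -> bool {
    match (a, b) {
        // Mixed representations compare by numeric value (1 matches 1.0).
        (Value::Integer(i), Value::Double(d)) | (Value::Double(d), Value::Integer(i)) => {
            (*i as f64) == *d
        }
        // Same representation, or non-numeric: plain term equality.
        _ => a == b,
    }
}

/// sh:in membership using value equality.
pub fn sh_in(value: &Value, allowed: &[Value]) -> bool {
    allowed.iter().any(|member| value_eq(value, member))
}
```
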
sh:pattern / sh:minLength / sh:maxLength on IRI values previously
violated unconditionally. Per SPARQL STR() they now match against the
full IRI: value vectors containing non-blank IRI refs are rewritten
with the decoded IRI string before the string-facet dispatch (top-level
and nested property constraints). Blank nodes still fail per spec, and
an IRI whose namespace was allocated in the same transaction cannot be
decoded against the base snapshot — it fails closed with the generic
message rather than silently passing.
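
A sketch of the fail-closed string-facet behavior, under simplifying assumptions: `Node` is a hypothetical value type, and an IRI that cannot be decoded is modeled as `Iri(None)`. Only the length facets are shown; sh:pattern would match against the same lexical form.

```rust
/// Hypothetical value node. `Iri(None)` models an IRI whose namespace cannot be
/// decoded against the base snapshot.
pub enum Node {
    Text(String),
    Integer(i64),
    Iri(Option<String>),
    Blank,
}

/// Lexical form in the spirit of SPARQL STR(); `None` means there is no usable form.
pub fn lexical_form(node: &Node) -> Option<String> {
    match node {
        Node::Text(s) => Some(s.clone()),
        Node::Integer(i) => Some(i.to_string()),
        Node::Iri(Some(full_iri)) => Some(full_iri.clone()),
        Node::Iri(None) | Node::Blank => None,
    }
}

/// sh:minLength / sh:maxLength over the lexical form. Nodes without a lexical
/// form violate rather than silently passing.
pub fn length_ok(node: &Node, min: usize, max: usize) -> bool {
    match lexical_form(node) {
        Some(s) => {
            let len = s.chars().count();
            len >= min && len <= max
        }
        None => false,
    }
}
```
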
NestedShape now carries the member's sh:message (populated from the
anonymous property-shape entry at compile time; named references keep
resolving their message from the referenced CompiledShape at validation
time). Nested property-constraint violations prefer it over the
generated text, so a custom message on an sh:and/or/xone/not/node
member reaches the transaction error.
A qualified value shape marked disjoint now excludes values that
conform to a sibling qualified shape (the qualified shapes declared by
the other property shapes of the same node shape, gathered in a
finalize pass). The canonical crew example works: requiring one pilot
and one navigator as distinct members rejects a single member holding
both roles, while the same data passes without disjointness.
Sibling collection is top-level only; qualified constraints inside
logical members keep counting without disjointness.
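
The crew example can be sketched as follows. This is an illustration only: `Member`, `qualified_count`, and `crew_conforms` are hypothetical names, and "conforms to a qualified shape" is reduced to "holds a role".

```rust
use std::collections::HashSet;

/// Hypothetical crew member with a set of roles.
pub struct Member {
    pub name: &'static str,
    pub roles: HashSet<&'static str>,
}

pub fn member(name: &'static str, roles: &[&'static str]) -> Member {
    Member { name, roles: roles.iter().copied().collect() }
}

/// Count values conforming to the qualified shape `role`. With `disjoint`,
/// values that also conform to a sibling qualified shape are excluded.
pub fn qualified_count(crew: &[Member], role: &str, siblings: &[&str], disjoint: bool) -> usize {
    crew.iter()
        .filter(|m| m.roles.contains(role))
        .filter(|m| !(disjoint && siblings.iter().any(|s| m.roles.contains(*s))))
        .count()
}

/// Require at least one pilot and at least one navigator.
pub fn crew_conforms(crew: &[Member], disjoint: bool) -> bool {
    qualified_count(crew, "pilot", &["navigator"], disjoint) >= 1
        && qualified_count(crew, "navigator", &["pilot"], disjoint) >= 1
}
```

A single member holding both roles satisfies both counts without disjointness, but is excluded from both once the shapes are disjoint.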
Bulk import is a trusted, high-throughput load path: validate source
data against the shapes before importing; transaction-time validation
keeps the ledger clean from there.
…ing facets

Review fixes:
- A value-only anonymous member of a node-level logical constraint
  (sh:or ([ sh:class ex:C ]), sh:not [ sh:nodeKind ... ]) produced no
  checks at all — validate_nested_shape never evaluated
  value_constraints against the focus node, so such members always
  conformed. The focus-node evaluation core is now shared
  (focus_value_violations) between direct node-shape constraints and
  nested value-only members, including pair constraints, sh:class, and
  string facets.
- String facets are now consistent everywhere: node-shape constraints
  and anonymous value shapes stringify IRI values (STR(iri)) like
  top-level property shapes, and sh:minLength/sh:maxLength use the same
  lexical form as sh:pattern — non-literals (blank nodes, undecodable
  IRIs) violate instead of measuring the SID name fragment.
- sh:severity on a node shape with direct value constraints was lost
  (the metadata arms prefer the path-less property-shape entry; message
  and name were backfilled but severity was not).
- sh:uniqueLang counts tags case-insensitively per BCP 47.
- Nested qualified constraints honor disjoint/sibling fields uniformly
  (siblings remain empty for members not referenced via sh:property,
  which matches the spec's sibling definition).
- Removed the now-unused validate_constraint_set.
Review fixes:
- sh:inversePath was limited to a single predicate, rejecting valid
  SHACL like ^(p1/p2). Inversion now rewrites into the AST: inverse of
  a sequence is the reversed sequence of inverses, inverse of an
  alternative distributes, inverse of a closure wraps the inverted
  inner path, and a double inverse collapses.
- Malformed paths error instead of compiling nondeterministically:
  path operators take exactly one value (multiple distinct refs for
  sh:inversePath / closures error via sole_ref/operand_path), and
  multiple sh:path or operator-operand objects only form a sequence
  under the JSON-LD @list encoding (every flake carries a list index)
  — un-indexed multiples are distinct assertions and error. Operator
  operands accept both encodings via operand_path.
- fluree-vocab: full-IRI constants for the path operators, sh:node,
  and sh:deactivated to match the local-name additions.

The unsupported-path tests now use a genuinely malformed fixture (a
literal sequence step); a new test covers ^(parent/parent).
Per the spec's sibling definition the set excludes the constraint's own
sh:qualifiedValueShape by value, so two property shapes referencing the
same qualified shape don't disqualify each other's values.
@bplatz bplatz force-pushed the feature/shacl-property-paths branch from efca0e8 to 2aa1755 Compare July 3, 2026 00:13
@bplatz bplatz changed the base branch from feature/shacl-class-value-set to feature/dataset-reasoning-parity July 3, 2026 00:13

@aaj3f aaj3f left a comment

Contributor

Nice fix and nice path support addition!

Comment on lines +194 to +202
/// Resolve a single `sh:path` object node (a predicate IRI or a path-expression
/// blank node) into a [`PropertyPath`].
fn resolve_path_node<'a>(db: GraphDbRef<'a>, node: &'a Sid) -> PathFuture<'a, PropertyPath> {
    Box::pin(async move {
        // sh:inversePath: inverse of any path, rewritten into the AST
        // (inverse of a sequence = reversed sequence of inverses, etc.).
        if let Some(inner) = operand_path(db, node, &shacl(predicates::INVERSE_PATH)).await? {
            return Ok(invert(inner));
        }

Contributor

resolve_path_node probes six operator predicates (sh:inversePath, sh:alternativePath, the three closures, and rdf:first) via operand_path/has_object, each a db.range SPOT scan, before falling through to PropertyPath::Predicate. For a plain-predicate sh:path (the overwhelmingly common case) that is ~6 index probes that all return empty. This runs inside resolve_paths, which compile_from_dbs invokes per transaction on any ledger that has shapes (fluree-db-api/src/tx.rs:578 → from_dbs_with_overlay; shapes are recompiled every transaction, not cached by schema_epoch). So a schema with N property shapes adds ~6·N empty index scans to every write.

Per the SHACL spec an IRI-valued sh:path is always a predicate, and path-expression structures are always blank nodes, so a namespace_code != BLANK_NODE short-circuit is both correctness-preserving and a large constant-factor win. The base branch stored the path as a bare Sid with zero scans, so this is a genuine regression introduced here.

fn resolve_path_node<'a>(db: GraphDbRef<'a>, node: &'a Sid) -> PathFuture<'a, PropertyPath> {
    Box::pin(async move {
        // Fast path: an IRI in sh:path is always a plain predicate (SHACL spec);
        // only blank nodes carry path-expression structure. Skip the six operator
        // probes for the common case — this runs per property shape per compile,
        // and compiles happen per transaction.
        if node.namespace_code != BLANK_NODE {
            return Ok(PropertyPath::Predicate(node.clone()));
        }

        // sh:inversePath — inverse of any path, rewritten into the AST ...
        if let Some(inner) = operand_path(db, node, &shacl(predicates::INVERSE_PATH)).await? {
            return Ok(invert(inner));
        }
        // ... rest unchanged

(BLANK_NODE is already imported at line 22.)

Contributor Author

Addressed in 2063e08

Comment on lines +475 to +480
/// Deduplicate value nodes (SHACL value nodes are a set).
fn dedup(mut values: Vec<PathValue>) -> Vec<PathValue> {
    let mut seen: HashSet<String> = HashSet::new();
    values.retain(|(v, dt, lang)| seen.insert(format!("{v:?}|{dt:?}|{lang:?}")));
    values
}

Contributor

dedup builds a fresh heap String per value node via format!("{v:?}|{dt:?}|{lang:?}") as its hash-set key. On a sh:zeroOrMorePath/oneOrMorePath closure over a large graph (or a wide alternative/sequence) this is an O(n) allocation + formatting pass on every step's output. Prefer a structural key that hashes the components directly — FlakeValue, Sid, and Option<String> should be Hash/Eq already (they key the class-membership memo and visited sets elsewhere), so:

fn dedup(mut values: Vec<PathValue>) -> Vec<PathValue> {
    let mut seen: HashSet<(FlakeValue, Sid, Option<String>)> = HashSet::new();
    values.retain(|(v, dt, lang)| seen.insert((v.clone(), dt.clone(), lang.clone())));
    values
}

Still clones, but avoids the format-string allocation and the Debug-representation fragility. Scope is complex-path shapes only (opt-in), hence minor. (Verification: the suggested tuple key compiles — FlakeValue has total manual Hash/Eq impls at value.rs:747/795/865, and Sid/Option<String> are Hash+Eq. Note it also makes dedup collapse cross-type-numeric-equal values that share a datatype, which is arguably more correct.)

Contributor Author

Addressed in 2063e08

Comment thread fluree-db-shacl/src/compile.rs Outdated
Comment on lines +284 to +304
// Collect subjects typed as a class: a shape that is also a class
// implicitly targets its own instances (SHACL "implicit class
// targets"). Bound-object scans, so cost scales with the number of
// declared classes, not the data.
let rdf_type = Sid::new(RDF, rdf_names::TYPE);
for class_class in [
    Sid::new(fluree_vocab::namespaces::RDFS, "Class"),
    Sid::new(fluree_vocab::namespaces::OWL, "Class"),
] {
    let flakes = db
        .range(
            IndexType::Opst,
            RangeTest::Eq,
            RangeMatch::predicate_object(
                rdf_type.clone(),
                FlakeValue::Ref(class_class),
            ),
        )
        .await?;
    class_typed.extend(flakes.iter().map(|f| f.s.clone()));
}

Contributor

The implicit-class-target discovery issues two Opst bound-object scans (rdf:type → rdfs:Class, rdf:type → owl:Class) per input graph on every compile (i.e. every transaction with shapes), regardless of whether any shape id is actually a class. The result set is bounded by the number of declared classes, so this is cheap in absolute terms, but it is fixed per-transaction overhead that didn't exist before. Acceptable as-is; noting it so it is a conscious cost. If it ever shows up in profiles, gate it on !self.shapes.is_empty() (already implied) or restrict to shape ids.

Contributor Author

Addressed in 2063e08

Three review-driven optimizations to compile_from_dbs, which recompiles
shapes on every transaction against a shape-bearing ledger:

- resolve_path_node: short-circuit IRI-valued sh:path to a plain predicate
  before the six operator probes. Per the SHACL spec only blank nodes carry
  path-expression structure, so N property shapes no longer add ~6N empty
  SPOT scans per compile. Restores the base branch's zero-scan behavior.

- dedup: key value nodes on a structural (FlakeValue, Sid, Option<String>)
  tuple instead of a per-value format! heap allocation, dropping the O(n)
  Debug-format pass on closure/alternative output.

- implicit class targets: hoist the two Opst rdf:type->Class scans out of
  the per-graph loop and skip them entirely when no shapes were found.
  Cross-graph resolution is preserved (runs after all graphs processed).
2 participants