Skip to content

Fix Cypher label disjunction panic and improve schema handling#64

Open
milliondreams wants to merge 5 commits intorustic-ai:mainfrom
milliondreams:main
Open

Fix Cypher label disjunction panic and improve schema handling#64
milliondreams wants to merge 5 commits intorustic-ai:mainfrom
milliondreams:main

Conversation

@milliondreams
Copy link
Copy Markdown
Contributor

No description provided.

Closes rustic-ai#62. `MATCH (n:A|B)` and the `WHERE n:A OR n:B`
rewrite both lower to a `Union` of per-label `Scan` operators (rustic-ai#56).
When the labels carry different property sets the per-label `Scan`
outputs have different Arrow widths and DataFusion's `union_schema`
panics with `index out of bounds` — the panic escapes the await as a
process abort. The existing rustic-ai#56 fixtures all used uniform `name:
String` schemas, so the hazard stayed hidden.

Three layers fix it:

- `function_props`: register `ID`, `ELEMENTID`, `TYPE` as
  `entity_arg_only`. The unknown-function fallback was marking
  entities as `*` (full-entity wildcard) for plain `id(b)` access,
  which forced `df_planner::resolve_properties` to expand per-label
  schemas under each Union branch. Cures the most common trigger.

- `planner`: add `label_branches_share_property_schema` and route
  both `plan_unbound_node` and `replace_scan_all_with_label_union`
  through a heterogeneous fallback. When branch schemas would diverge
  (different property names/types, or any schemaless label in the
  mix), every branch becomes a *single-label* `ScanMainByLabels`
  instead of a per-label `Scan`. The per-branch Union shape is kept
  (multi-label `ScanMainByLabels` has AND/intersection semantics —
  wrong for disjunction), but each branch now emits a uniform
  schemaless schema so the Union is safe.

- `df_planner::plan_union`: defense-in-depth schema-mismatch check
  before `UnionExec::try_new`. Any future logical-Union mismatch now
  becomes a typed `Plan` error instead of a DataFusion panic.

Tests in `cypher_label_disjunction_support.rs` add a heterogeneous
fixture (DestA: 2 props, DestB: 4 props) and cover the inline
`(:A|B)`, common-property projection, and `WHERE`-form variants.
Restores the [patch.crates-io] override pointing at ../uni-xervo, mirroring
the pattern used for 0.9.0 prior to commit 668844e. The patch can be
removed once 0.10.0 is published to crates.io.
Closes rustic-ai#63. Re-applying the same schema (the documented
"register on every KB-open" pattern) was duplicating index entries in
schema.indexes super-linearly: explicit append in apply()'s AddIndex
arm, plus the rebuild loop calling create_*_index whose epilogue
re-appends every existing entry on top. Production catalogs hit 60K+
duplicates within ~15 reopens, dragging KB-open into multi-minute
territory.

Three layered fixes:

- uni-common SchemaManager::add_index is now upsert-by-name. The
  trailing add_index in every IndexManager::create_*_index becomes
  idempotent for free, so neither apply()'s rebuild loop nor Cypher
  CREATE INDEX (which depends on that epilogue) needs plumbing
  changes. Index names are stable per (label, property), so name is
  a correct upsert key.

- apply()'s AddIndex arm short-circuits the synchronous Lance rebuild
  when the index is already registered with identical config. This is
  what closes the "KB open takes minutes" gap — the post-fix repro
  goes from 15.7 s to 0.020 s for 10 applies.

- SchemaManager::load_from_store self-heals legacy bloated catalogs by
  collapsing duplicate entries by name (last writer wins, matching the
  upsert semantics). Emits a tracing warning once per bloated load. No
  immediate save() on load — the cleanup persists on the next mutation
  that calls save(), keeping load() free of surprise I/O.

Tests in schema_apply_idempotency.rs cover the issue's three repros
plus a bloated-catalog migration case. Unit tests in schema.rs cover
upsert and load-dedup semantics directly.
The defense-in-depth guard added in 3f71cd4 (issue rustic-ai#62) compared both
field names and types per position before UnionExec::try_new, which
over-rejected the user-facing Cypher UNION clause. UNION routinely
produces branches whose internal columns are namespaced by different
pattern variables (e.g. `MATCH (a:A) RETURN a UNION MATCH (b:B) RETURN
b` yields `a._vid` vs `b._vid` at position 1). DataFusion's union_schema
handles name divergence by adopting left names; only width/type
mismatches are the panic source. Caught by uni-tck Union1[4] / Union2[4]
which went from FAIL to PASS after relaxing to type-only comparison.

Also folds in cargo fmt for two long let-bindings introduced in
4786f4a / 3f71cd4 that the CI fmt check flagged.
uni-xervo 0.11.0 ships two reranker-relevant fixes:

- Cross-encoder loader auto-detects token_type_ids from session.inputs()
  (commit d08411c). Pre-0.11.0 the loader hardcoded the input, which
  errored at ORT load for BAAI/bge-reranker-base (its ONNX export
  doesn't declare that input). So BGE reranker support was effectively
  blocked on this bump.
- Generative-style reranker dispatch for Qwen3-Reranker family
  (commit 21642b5). Selected via `options.style = "generative"` on the
  ModelAliasSpec; default `cross-encoder` is unchanged so existing
  catalogs are safe.

The bump is non-breaking on the consumer surface (runtime.reranker(alias)
and RerankerModel::rerank signatures unchanged), so no uni-db code edits
are forced.

Two new #[ignore] integration tests in reranker_integration.rs lock in
real-ONNX coverage for both families through uni-db's alias-based rerank
pipeline:

- test_real_onnx_bge_reranker_reranks_by_relevance — BAAI/bge-reranker-base
  via the default cross-encoder path. Validates the token_type_ids
  auto-detection lands.
- test_real_onnx_qwen3_reranker_reranks_by_relevance — onnx-community/
  Qwen3-Reranker-0.6B-ONNX (quantized model_q4 ~500 MB) via generative
  dispatch. Pins rerank_score in [0, 1] so any future regression to raw
  logits surfaces in the assertion rather than as a silent score-shape
  drift.

Both new tests use query_with(...).timeout(...).fetch_all() with extended
timeouts to absorb first-run model download cost; the default 30 s
session timeout is too tight for the larger models. Run via the existing
CI step: `cargo nextest run -p uni-db --features provider-onnx --test
reranker_integration --run-ignored all`.

Verified: 29/29 reranker tests pass (25 mock + 4 real-ONNX). Workspace
3281/3281 pass with no regressions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant