Fix Cypher label disjunction panic and improve schema handling#64
Open
milliondreams wants to merge 5 commits intorustic-ai:mainfrom
Open
Fix Cypher label disjunction panic and improve schema handling#64milliondreams wants to merge 5 commits intorustic-ai:mainfrom
milliondreams wants to merge 5 commits intorustic-ai:mainfrom
Conversation
Closes rustic-ai#62. `MATCH (n:A|B)` and the `WHERE n:A OR n:B` rewrite both lower to a `Union` of per-label `Scan` operators (rustic-ai#56). When the labels carry different property sets the per-label `Scan` outputs have different Arrow widths and DataFusion's `union_schema` panics with `index out of bounds` — the panic escapes the await as a process abort. The existing rustic-ai#56 fixtures all used uniform `name: String` schemas, so the hazard stayed hidden. Three layers fix it: - `function_props`: register `ID`, `ELEMENTID`, `TYPE` as `entity_arg_only`. The unknown-function fallback was marking entities as `*` (full-entity wildcard) for plain `id(b)` access, which forced `df_planner::resolve_properties` to expand per-label schemas under each Union branch. Cures the most common trigger. - `planner`: add `label_branches_share_property_schema` and route both `plan_unbound_node` and `replace_scan_all_with_label_union` through a heterogeneous fallback. When branch schemas would diverge (different property names/types, or any schemaless label in the mix), every branch becomes a *single-label* `ScanMainByLabels` instead of a per-label `Scan`. The per-branch Union shape is kept (multi-label `ScanMainByLabels` has AND/intersection semantics — wrong for disjunction), but each branch now emits a uniform schemaless schema so the Union is safe. - `df_planner::plan_union`: defense-in-depth schema-mismatch check before `UnionExec::try_new`. Any future logical-Union mismatch now becomes a typed `Plan` error instead of a DataFusion panic. Tests in `cypher_label_disjunction_support.rs` add a heterogeneous fixture (DestA: 2 props, DestB: 4 props) and cover the inline `(:A|B)`, common-property projection, and `WHERE`-form variants.
Restores the [patch.crates-io] override pointing at ../uni-xervo, mirroring the pattern used for 0.9.0 prior to commit 668844e. The patch can be removed once 0.10.0 is published to crates.io.
Closes rustic-ai#63. Re-applying the same schema (the documented "register on every KB-open" pattern) was duplicating index entries in schema.indexes super-linearly: explicit append in apply()'s AddIndex arm, plus the rebuild loop calling create_*_index whose epilogue re-appends every existing entry on top. Production catalogs hit 60K+ duplicates within ~15 reopens, dragging KB-open into multi-minute territory. Three layered fixes: - uni-common SchemaManager::add_index is now upsert-by-name. The trailing add_index in every IndexManager::create_*_index becomes idempotent for free, so neither apply()'s rebuild loop nor Cypher CREATE INDEX (which depends on that epilogue) needs plumbing changes. Index names are stable per (label, property), so name is a correct upsert key. - apply()'s AddIndex arm short-circuits the synchronous Lance rebuild when the index is already registered with identical config. This is what closes the "KB open takes minutes" gap — the post-fix repro goes from 15.7 s to 0.020 s for 10 applies. - SchemaManager::load_from_store self-heals legacy bloated catalogs by collapsing duplicate entries by name (last writer wins, matching the upsert semantics). Emits a tracing warning once per bloated load. No immediate save() on load — the cleanup persists on the next mutation that calls save(), keeping load() free of surprise I/O. Tests in schema_apply_idempotency.rs cover the issue's three repros plus a bloated-catalog migration case. Unit tests in schema.rs cover upsert and load-dedup semantics directly.
The defense-in-depth guard added in 3f71cd4 (issue rustic-ai#62) compared both field names and types per position before UnionExec::try_new, which over-rejected the user-facing Cypher UNION clause. UNION routinely produces branches whose internal columns are namespaced by different pattern variables (e.g. `MATCH (a:A) RETURN a UNION MATCH (b:B) RETURN b` yields `a._vid` vs `b._vid` at position 1). DataFusion's union_schema handles name divergence by adopting left names; only width/type mismatches are the panic source. Caught by uni-tck Union1[4] / Union2[4] which went from FAIL to PASS after relaxing to type-only comparison. Also folds in cargo fmt for two long let-bindings introduced in 4786f4a / 3f71cd4 that the CI fmt check flagged.
uni-xervo 0.11.0 ships two reranker-relevant fixes: - Cross-encoder loader auto-detects token_type_ids from session.inputs() (commit d08411c). Pre-0.11.0 the loader hardcoded the input, which errored at ORT load for BAAI/bge-reranker-base (its ONNX export doesn't declare that input). So BGE reranker support was effectively blocked on this bump. - Generative-style reranker dispatch for Qwen3-Reranker family (commit 21642b5). Selected via `options.style = "generative"` on the ModelAliasSpec; default `cross-encoder` is unchanged so existing catalogs are safe. The bump is non-breaking on the consumer surface (runtime.reranker(alias) and RerankerModel::rerank signatures unchanged), so no uni-db code edits are forced. Two new #[ignore] integration tests in reranker_integration.rs lock in real-ONNX coverage for both families through uni-db's alias-based rerank pipeline: - test_real_onnx_bge_reranker_reranks_by_relevance — BAAI/bge-reranker-base via the default cross-encoder path. Validates the token_type_ids auto-detection lands. - test_real_onnx_qwen3_reranker_reranks_by_relevance — onnx-community/ Qwen3-Reranker-0.6B-ONNX (quantized model_q4 ~500 MB) via generative dispatch. Pins rerank_score in [0, 1] so any future regression to raw logits surfaces in the assertion rather than as a silent score-shape drift. Both new tests use query_with(...).timeout(...).fetch_all() with extended timeouts to absorb first-run model download cost; the default 30 s session timeout is too tight for the larger models. Run via the existing CI step: `cargo nextest run -p uni-db --features provider-onnx --test reranker_integration --run-ignored all`. Verified: 29/29 reranker tests pass (25 mock + 4 real-ONNX). Workspace 3281/3281 pass with no regressions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.