refactor: validate query field references via parse-tree walk#285
Merged
refactor: validate query field references via parse-tree walk#285
Conversation
Replace the regex-based pre-check in `UnifiedQueryParser::parse` with an authoritative walk over the parsed lexical query tree. The previous heuristic stripped quoted regions and skipped a hard-coded keyword allow-list (`AND`/`OR`/`NOT`/`TO`), which created false positives on range literals (`[A TO Z]`) and false negatives whenever a user wanted to declare a field that happened to share a name with a reserved keyword. Implementation: - Add `Query::collect_field_refs(&self, out: &mut HashSet<String>)` with a default that inserts `self.field()` if it is `Some`. - Override in `BooleanQuery` (recurse into all clauses), `AdvancedQuery` (core query plus filters / negative-filters / post-filters), and `SpanQueryWrapper` (insert `span_query.field_name()` since span queries are single-field). - Replace `check_known_fields` and `FIELD_REF_RE` in `engine::query::UnifiedQueryParser` with `validate_field_refs`, which collects the field set, sorts unknowns deterministically, and uses the same error wording as before. Validation runs after the cheap lexical parse but before the expensive vector parse, so a typo'd query is rejected before any embedder cost is incurred. Vector clauses do not need a separate walk: `split_query` only routes a clause to the vector parser when its field is in the known vector-field set, so any unknown field name remains in the lexical portion and is caught by the walker. Tests: 32 existing + 10 new unit tests covering field literally named `AND` / `TO`, `[A TO Z]` range literal, deeply nested boolean queries, quoted phrases containing `:`, and the empty `known_fields` fallback. The existing `dynamic_schema_test::query_dsl_rejects_unknown_field` integration test still passes (error message format preserved). User-visible behaviour is unchanged; documentation copy stays the same. Closes #283
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
UnifiedQueryParserwith an authoritative parse-tree walker that visits every parsed lexical query node and asserts each referenced field is declared in the schema.Query::collect_field_refsdefault trait method (overridden byBooleanQuery,AdvancedQuery, andSpanQueryWrapper); the engine-level validator collects the set, sorts unknowns deterministically, and surfaces them with the same error wording as before.[A TO Z]range literals, false negatives where users declared fields namedAND/OR/NOT/TO, and brittleness against future DSL extensions.Why
PR #280 introduced
check_known_fields, a regex pre-check that strips quoted regions and skips a hard-coded keyword allow-list. It was explicitly marked as a temporary heuristic — Issue #283 captured the follow-up to migrate to a tree walk once the implementation was fresh.Implementation
laurus/src/lexical/query.rs— AddQuery::collect_field_refs(&self, out: &mut HashSet<String>); default impl usesfield().laurus/src/lexical/query/boolean.rs— Override; recurse into every clause.laurus/src/lexical/query/advanced_query.rs— Override; recurse into core query plus filters / negative-filters / post-filters.laurus/src/lexical/query/span.rs— Override onSpanQueryWrapperto insertspan_query.field_name()(span queries operate on a single field).laurus/src/engine/query.rs— Replacecheck_known_fieldswithvalidate_field_refs; removeFIELD_REF_RE. Validation runs after the cheap lexical parse but before the expensive vector parse, so a typo'd query is rejected before any embedder cost is incurred. Vector clauses do not need a separate walk:split_queryonly routes a clause to the vector parser when its field is in the known vector-field set, so any unknown field name will have remained in the lexical portion and be caught by the walker.Tests
cargo test -p laurus --lib engine::query— 32 existing + 10 new unit tests.New tests target cases the regex got wrong:
AND(andTO) — accepted.[A TO Z]range literal —TOis no longer mistaken for a field name.:— never reaches the validator.known_fields— validation is a no-op, preserving backward compatibility.cargo test -p laurus --test dynamic_schema_test—query_dsl_rejects_unknown_fieldstill passes (error message format preserved).Out of scope
split_queryrouting).concepts/query_dsl.mdalready describe the contract correctly).Closes #283