Skip to content

refactor: validate query field references via parse-tree walk#285

Merged
mosuka merged 1 commit intomainfrom
refactor/query-dsl-parse-tree-validation
Apr 25, 2026
Merged

refactor: validate query field references via parse-tree walk#285
mosuka merged 1 commit intomainfrom
refactor/query-dsl-parse-tree-validation

Conversation

@mosuka
Copy link
Copy Markdown
Owner

@mosuka mosuka commented Apr 25, 2026

Summary

  • Replace the regex-based pre-check in UnifiedQueryParser with an authoritative parse-tree walker that visits every parsed lexical query node and asserts each referenced field is declared in the schema.
  • Adds a Query::collect_field_refs default trait method (overridden by BooleanQuery, AdvancedQuery, and SpanQueryWrapper); the engine-level validator collects the set, sorts unknowns deterministically, and surfaces them with the same error wording as before.
  • Resolves false positives on [A TO Z] range literals, false negatives where users declared fields named AND / OR / NOT / TO, and brittleness against future DSL extensions.

Why

PR #280 introduced check_known_fields, a regex pre-check that strips quoted regions and skips a hard-coded keyword allow-list. It was explicitly marked as a temporary heuristic — Issue #283 captured the follow-up to migrate to a tree walk once the implementation was fresh.

Implementation

  • laurus/src/lexical/query.rs — Add Query::collect_field_refs(&self, out: &mut HashSet<String>); default impl uses field().
  • laurus/src/lexical/query/boolean.rs — Override; recurse into every clause.
  • laurus/src/lexical/query/advanced_query.rs — Override; recurse into core query plus filters / negative-filters / post-filters.
  • laurus/src/lexical/query/span.rs — Override on SpanQueryWrapper to insert span_query.field_name() (span queries operate on a single field).
  • laurus/src/engine/query.rs — Replace check_known_fields with validate_field_refs; remove FIELD_REF_RE. Validation runs after the cheap lexical parse but before the expensive vector parse, so a typo'd query is rejected before any embedder cost is incurred. Vector clauses do not need a separate walk: split_query only routes a clause to the vector parser when its field is in the known vector-field set, so any unknown field name will have remained in the lexical portion and be caught by the walker.

Tests

cargo test -p laurus --lib engine::query — 32 existing + 10 new unit tests.

New tests target cases the regex got wrong:

  • Field literally named AND (and TO) — accepted.
  • [A TO Z] range literal — TO is no longer mistaken for a field name.
  • Deeply nested boolean — every leaf is visited.
  • Quoted phrase containing a : — never reaches the validator.
  • Empty known_fields — validation is a no-op, preserving backward compatibility.

cargo test -p laurus --test dynamic_schema_testquery_dsl_rejects_unknown_field still passes (error message format preserved).

Out of scope

  • Vector-side AST walk (covered transitively by split_query routing).
  • Documentation copy changes (user-visible behaviour is unchanged; the previous "Field validation" sections in concepts/query_dsl.md already describe the contract correctly).

Closes #283

Replace the regex-based pre-check in `UnifiedQueryParser::parse` with an
authoritative walk over the parsed lexical query tree. The previous
heuristic stripped quoted regions and skipped a hard-coded keyword
allow-list (`AND`/`OR`/`NOT`/`TO`), which created false positives on
range literals (`[A TO Z]`) and false negatives whenever a user wanted
to declare a field that happened to share a name with a reserved
keyword.

Implementation:

- Add `Query::collect_field_refs(&self, out: &mut HashSet<String>)`
  with a default that inserts `self.field()` if it is `Some`.
- Override in `BooleanQuery` (recurse into all clauses),
  `AdvancedQuery` (core query plus filters / negative-filters /
  post-filters), and `SpanQueryWrapper` (insert
  `span_query.field_name()` since span queries are single-field).
- Replace `check_known_fields` and `FIELD_REF_RE` in
  `engine::query::UnifiedQueryParser` with `validate_field_refs`,
  which collects the field set, sorts unknowns deterministically, and
  uses the same error wording as before. Validation runs after the
  cheap lexical parse but before the expensive vector parse, so a
  typo'd query is rejected before any embedder cost is incurred.
  Vector clauses do not need a separate walk: `split_query` only
  routes a clause to the vector parser when its field is in the known
  vector-field set, so any unknown field name remains in the lexical
  portion and is caught by the walker.

Tests: 32 existing + 10 new unit tests covering field literally named
`AND` / `TO`, `[A TO Z]` range literal, deeply nested boolean queries,
quoted phrases containing `:`, and the empty `known_fields` fallback.
The existing `dynamic_schema_test::query_dsl_rejects_unknown_field`
integration test still passes (error message format preserved).

User-visible behaviour is unchanged; documentation copy stays the same.

Closes #283
@mosuka mosuka merged commit ad931db into main Apr 25, 2026
22 checks passed
@mosuka mosuka deleted the refactor/query-dsl-parse-tree-validation branch April 25, 2026 06:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace regex-based field validation in UnifiedQueryParser with parse-tree walking

1 participant