fix(zql): execute sibling exists from selective roots #1

Draft
Karavil wants to merge 13 commits into codex/mixed-or-costing from codex/query-engine-normalization

Karavil commented Apr 21, 2026

Stacked on rocicorp#5851.

This PR makes selective whereExists plans executable without app code manually setting {flip: true}. The previous stack teaches the planner to cost mixed OR branches more honestly. This PR adds the missing physical shapes: a root union for independent OR branches, and a child-key intersection for sibling EXISTS branches on the same relationship.

I regenerated the SQL below against current origin/main and this branch with the same hypothetical education app schema inspired by goblinsapp.com. The data is synthetic, but the shape is the one that mattered for us: 2,000 assignments, sparse assignment_to_student rows, and permission-style filters expressed as EXISTS. There are no personal names or real customer rows.

Current Zero already handles a simple single EXISTS well. This is not where the PR wins:

assignment
  .where('archived_at', 'IS', null)
  .whereExists('assignment_to_student', q =>
    q.where('student_id', '=', 'student-1'),
  );

Current origin/main and this branch both generate the child-root shape:

SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ? AND "archived_at" IS ?
ORDER BY "created_at" desc, "id" asc

The first real win is a mixed parent predicate and child EXISTS under OR.

assignment.where(({and, cmp, exists, or}) =>
  and(
    cmp('archived_at', 'IS', null),
    or(
      cmp('teacher_id', '=', 1),
      exists('assignment_to_student', q =>
        q.where('student_id', '=', 'student-1'),
      ),
    ),
  ),
);

Current origin/main chooses a semi-join for the membership branch. On the 2,000-assignment scenario, that is one assignment scan plus 2,003 membership probes, 2,004 SQL calls in total:

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "archived_at" IS ?
ORDER BY "created_at" desc, "id" asc

-- repeated 2,003 times
SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "assignment_id" = ? AND "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

With this stack, the membership branch starts from the student index and fetches only matching parents. SQL work drops from 2,004 calls to 5 (one assignment scan, one membership scan, and three parent fetches):

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "archived_at" IS ?
ORDER BY "created_at" desc, "id" asc

SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

-- repeated 3 times
SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ? AND "archived_at" IS ?
ORDER BY "created_at" desc, "id" asc
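
The call counts quoted for this shape follow directly from the two traces. As a quick arithmetic check, here is a minimal tally. The `PlanTrace` type and its field names are hypothetical bookkeeping for this description, not anything in Zero.

```typescript
// Hypothetical bookkeeping type: one field per kind of SQL call in a trace.
interface PlanTrace {
  parentScans: number; // scans over "assignment"
  childScans: number; // scans over "assignment_to_student" by student_id
  parentFetches: number; // "assignment" lookups by id
  childProbes: number; // lookups by (assignment_id, student_id)
}

function totalCalls(t: PlanTrace): number {
  return t.parentScans + t.childScans + t.parentFetches + t.childProbes;
}

// Counts copied from the two traces in this description.
const semiJoinOnMain: PlanTrace = {
  parentScans: 1,
  childScans: 0,
  parentFetches: 0,
  childProbes: 2003,
};
const flippedOnStack: PlanTrace = {
  parentScans: 1,
  childScans: 1,
  parentFetches: 3,
  childProbes: 0,
};

console.log(totalCalls(semiJoinOnMain), totalCalls(flippedOnStack)); // 2004 5
```

The parent scan survives in both plans because the teacher_id branch still has to be evaluated against archived_at-filtered assignments. Only the membership branch changes shape.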

The second win is root union for a plain parent OR EXISTS shape.

assignment.where(({cmp, exists, or}) =>
  or(
    cmp('teacher_id', '=', 1),
    exists('assignment_to_student', q =>
      q.where('student_id', '=', 'student-1'),
    ),
  ),
);

Current origin/main and the stacked base both use the child index for the EXISTS, but the parent branch is still a full assignment scan:

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
ORDER BY "created_at" desc, "id" asc

SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

-- repeated 3 times
SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ?
ORDER BY "created_at" desc, "id" asc

This PR turns that into two selective roots. In the scenario seed, the old parent branch reads 2,000 assignments. The new parent branch reads the 20 assignments with teacher_id = 1.

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "teacher_id" = ?
ORDER BY "created_at" desc, "id" asc

SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

-- repeated 3 times
SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ?
ORDER BY "created_at" desc, "id" asc
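
Conceptually, a root union runs each OR branch from its own selective root, drops parents that more than one branch produced, and restores the query order. The sketch below illustrates that idea on in-memory rows. `unionRoots`, the row type, and the sample rows are assumptions made for illustration and do not mirror the actual operator in builder.ts.

```typescript
interface AssignmentRow {
  id: number;
  teacher_id: number;
  created_at: number;
}

// Query order from the SQL above: created_at desc, then id asc.
function compareRows(a: AssignmentRow, b: AssignmentRow): number {
  if (a.created_at !== b.created_at) return b.created_at - a.created_at;
  return a.id - b.id;
}

// Merge branch outputs, keeping each parent once even if several branches match it.
function unionRoots(branches: AssignmentRow[][]): AssignmentRow[] {
  const seen: Record<number, boolean> = {};
  const merged: AssignmentRow[] = [];
  for (const branch of branches) {
    for (const row of branch) {
      if (seen[row.id]) continue;
      seen[row.id] = true;
      merged.push(row);
    }
  }
  return merged.sort(compareRows);
}

// Made-up rows: assignment 102 matches both branches.
const teacherBranch: AssignmentRow[] = [
  {id: 7, teacher_id: 1, created_at: 300},
  {id: 102, teacher_id: 1, created_at: 200},
];
const membershipBranch: AssignmentRow[] = [
  {id: 101, teacher_id: 2, created_at: 250},
  {id: 102, teacher_id: 1, created_at: 200},
];

const unionIds = unionRoots([teacherBranch, membershipBranch]).map(r => r.id);
console.log(unionIds); // [ 7, 101, 102 ]
```

Deduplicating by primary key is what lets both branches be selective independently: neither branch needs to know what the other returned.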

The third win is sibling EXISTS on the same relationship.

assignment
  .whereExists('assignment_to_student', q =>
    q.where('student_id', '=', 'student-1'),
  )
  .whereExists('assignment_to_student', q =>
    q.where('student_id', '=', 'student-2'),
  );

Current origin/main and the stacked base partially flip this. They scan student-1, fetch two parent assignments, then probe the second membership predicate once per fetched parent, plus the final stream exhaustion probe:

SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

-- repeated 2 times
SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ? AND TRUE
ORDER BY "created_at" desc, "id" asc

-- repeated 3 times
SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "assignment_id" = ? AND "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

This PR scans both child predicates first, intersects by assignment_id, and loads the one surviving parent. SQL work drops from 6 calls to 3 relative to current Zero, and from 2,005 calls to 3 relative to a parent-pinned baseline:

-- repeated 2 times, once per student bind value
SELECT "assignment_id","student_id","created_at"
FROM "assignment_to_student"
WHERE "student_id" = ?
ORDER BY "assignment_id" asc, "student_id" asc

SELECT "id","teacher_id","archived_at","created_at"
FROM "assignment"
WHERE "id" = ?
ORDER BY "created_at" desc, "id" asc

The scenario seed makes the correctness check concrete: student-1 is attached to assignments 101 and 102, student-2 is attached to assignments 102 and 1500, and the only assignment returned is 102.
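
The same check can be replayed in a few lines. This is a minimal sketch of the child-key intersection idea using the seed rows just described. `intersectParentKeys` and `MembershipRow` are hypothetical names, and the sketch assumes each child scan yields at most one row per assignment_id.

```typescript
interface MembershipRow {
  assignment_id: number;
  student_id: string;
}

// Keep only the parent keys that appear in every child scan.
// Assumes each scan yields at most one row per assignment_id.
function intersectParentKeys(scans: MembershipRow[][]): number[] {
  const first = scans[0];
  if (first === undefined) return [];
  let keys = first.map(row => row.assignment_id);
  for (const scan of scans.slice(1)) {
    const present: Record<number, boolean> = {};
    for (const row of scan) present[row.assignment_id] = true;
    keys = keys.filter(key => present[key] === true);
  }
  return keys;
}

// Seed rows from the scenario.
const student1Scan: MembershipRow[] = [
  {assignment_id: 101, student_id: 'student-1'},
  {assignment_id: 102, student_id: 'student-1'},
];
const student2Scan: MembershipRow[] = [
  {assignment_id: 102, student_id: 'student-2'},
  {assignment_id: 1500, student_id: 'student-2'},
];

const survivingKeys = intersectParentKeys([student1Scan, student2Scan]);
// Two child scans plus one parent fetch per surviving key.
const sqlCalls = 2 + survivingKeys.length;
console.log(survivingKeys, sqlCalls); // [ 102 ] 3
```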

The implementation stays conservative. Root union refuses start, limit, and root related rows. The intersection path refuses nested related rows, nested subqueries, cursors, limits, explicit flip: false, incompatible relationship shapes, and child scans that are not unique for the correlation key. Root union also strips condition-only relationship payloads before merging branches, so the union schema stays honest.
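
To make the refusal rules easier to scan, here is a sketch of them as two predicates. The shape types and every field name are hypothetical simplifications written for this description. The real checks operate on the planner's own AST and may be structured differently.

```typescript
// Hypothetical, simplified view of a root query; field names are illustrative.
interface RootShape {
  hasStart: boolean; // a start cursor is present
  hasLimit: boolean;
  hasRootRelated: boolean; // related rows requested at the root
}

// Hypothetical, simplified view of a group of sibling EXISTS branches.
interface SiblingExistsShape {
  hasNestedRelated: boolean;
  hasNestedSubquery: boolean;
  hasCursor: boolean;
  hasLimit: boolean;
  flip: boolean | undefined; // undefined means the app did not set it
  compatibleRelationship: boolean;
  uniquePerCorrelationKey: boolean;
}

function canUseRootUnion(q: RootShape): boolean {
  return !q.hasStart && !q.hasLimit && !q.hasRootRelated;
}

function canUseIntersection(s: SiblingExistsShape): boolean {
  return (
    !s.hasNestedRelated &&
    !s.hasNestedSubquery &&
    !s.hasCursor &&
    !s.hasLimit &&
    s.flip !== false && // only an explicit flip: false opts out
    s.compatibleRelationship &&
    s.uniquePerCorrelationKey
  );
}

const plainSiblings: SiblingExistsShape = {
  hasNestedRelated: false,
  hasNestedSubquery: false,
  hasCursor: false,
  hasLimit: false,
  flip: undefined,
  compatibleRelationship: true,
  uniquePerCorrelationKey: true,
};

console.log(canUseIntersection(plainSiblings)); // true
console.log(canUseIntersection({...plainSiblings, flip: false})); // false
```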

The scenario harness now asserts optimized AST fragments, planner debug output, generated SQL, compacted SQL call counts, and returned rows. That gives us regression coverage for the thing we actually care about: same results, much less physical SQL work.
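
For readers unfamiliar with the term, "compacted SQL call counts" is read here as collapsing identical statements into one entry with a repeat count, the same way the traces above use "-- repeated N times". The sketch below shows that reading. `compactCalls` and the abbreviated statement strings are hypothetical and are not the harness's real API.

```typescript
interface CompactedCall {
  sql: string;
  count: number;
}

// Collapse identical statements into one entry each, in first-seen order.
function compactCalls(log: string[]): CompactedCall[] {
  const bySql: Record<string, CompactedCall> = {};
  const out: CompactedCall[] = [];
  for (const sql of log) {
    const existing = bySql[sql];
    if (existing === undefined) {
      const entry: CompactedCall = {sql, count: 1};
      bySql[sql] = entry;
      out.push(entry);
    } else {
      existing.count += 1;
    }
  }
  return out;
}

// Abbreviated statements standing in for the sibling EXISTS trace on origin/main.
const childScan = 'FROM "assignment_to_student" WHERE "student_id" = ?';
const parentFetch = 'FROM "assignment" WHERE "id" = ? AND TRUE';
const childProbe =
  'FROM "assignment_to_student" WHERE "assignment_id" = ? AND "student_id" = ?';

// The interleaving is an assumption; only the per-statement counts come from the trace.
const log = [childScan, parentFetch, childProbe, parentFetch, childProbe, childProbe];
const compacted = compactCalls(log);
console.log(compacted.map(c => c.count)); // [ 1, 2, 3 ]
```

Asserting on the compacted form keeps the expectations short while still failing loudly if a plan regresses from three calls back to thousands.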

Verified with:

npm --workspace=zql run format
npm --workspace=zqlite run format
npm --workspace=zql run check-types
npm --workspace=zqlite run check-types
npm --workspace=zql run lint
npm --workspace=zqlite run lint
npm --workspace=zql run test
npm --workspace=zqlite run test

I am keeping this as a draft because it is stacked, and because the main review question is architectural. The optimization is good database-engine behavior, similar in spirit to SQLite's OR-by-union and PostgreSQL's bitmap-style key combination, but these physical alternatives probably want to move out of builder.ts if Zero keeps growing this planner surface.

Karavil added 5 commits April 21, 2026 00:41
Why: makes the query optimizer rules easier to audit and extend.

* Replace parallel OR and AND merge bookkeeping with a shared column domain rewrite rule

* Add coverage for overlapping IN predicate unions

Why: pin the edge cases that can quietly reintroduce broad scans or dead child branches.

* Absorb stricter OR branches even when the shared predicate is not common to every branch

* Collapse impossible non-scalar EXISTS branches and add a scenario that proves the child scan disappears

* Extend idempotence coverage to generated correlated subquery filters

Karavil changed the title from "fix(zql): normalize planner filters before costing" to "fix(zql): execute sibling exists from selective roots" Apr 21, 2026