Fix: pushed-down WHERE + LIMIT/OFFSET silently drops matching rows #27

Merged
platypii merged 1 commit into master from fix/scan-limit-with-pushed-filter
Jun 23, 2026

@philcunliffe

Contributor

Fixes #26.

Problem

icebergDataSource.scan() pushed LIMIT/OFFSET down by physical row position (seeking past offset rows; bounding the per-file read at fileRowStart + remaining) whenever the WHERE was resolved — which includes a WHERE fully pushed into the parquet read. But a pushed-down WHERE is matched per row, so the first N physical rows of a file may contain fewer than N (or zero) matches. Bounding by position reads only the leading rows and silently drops every match that sorts later in the file.

SELECT node_type FROM node WHERE node_type = 'File'            -- ✅ 4925 rows
SELECT node_type FROM node WHERE node_type = 'File' LIMIT 5    -- ❌ 0 rows (leading rows are all 'Session')
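
The failure mode can be reproduced without Iceberg or parquet. The following is a minimal sketch over an in-memory array; `Row`, `file`, `limitByPosition`, and `limitByMatches` are hypothetical names for illustration and are not part of the library.

```typescript
// Hypothetical stand-in for one data file: 1000 leading 'Session' rows,
// followed by 10 'File' rows.
type Row = { node_type: string }

const file: Row[] = [
  ...Array.from({ length: 1000 }, (): Row => ({ node_type: 'Session' })),
  ...Array.from({ length: 10 }, (): Row => ({ node_type: 'File' })),
]

const isFile = (row: Row): boolean => row.node_type === 'File'

// Buggy shape: bound the read by physical position, then filter.
// Only rows [0, limit) are ever inspected.
function limitByPosition(rows: Row[], pred: (r: Row) => boolean, limit: number): Row[] {
  return rows.slice(0, limit).filter(pred)
}

// Correct shape: filter while reading and stop once `limit` matches are found.
function limitByMatches(rows: Row[], pred: (r: Row) => boolean, limit: number): Row[] {
  const out: Row[] = []
  for (const row of rows) {
    if (out.length >= limit) break
    if (pred(row)) out.push(row)
  }
  return out
}

const byPosition = limitByPosition(file, isFile, 5).length // 0: all five inspected rows are 'Session'
const byMatches = limitByMatches(file, isFile, 5).length // 5
```

Both functions stop early, but only the second counts matched rows rather than physical rows.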

Fix

Gate position-based pushdown on !where instead of whereResolved. A pushed-down filter now takes the same path deletes already use: emit up to offset + limit matched rows and let the engine apply the final slice. The per-row remaining break still terminates early, so later files/row groups are skipped once enough matches are found.

Row position and result count line up only when every physical row is also a result row (no WHERE), so the absence of a WHERE is the precise condition for position-based pushdown.
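
As a rough illustration of the gate, here is a simplified sketch and not the actual `scan()` implementation. It ignores deletes, models files as in-memory arrays, and all names except `appliedLimitOffset` (`scanSketch`, `ScanOptions`, `ScanResult`) are assumptions made for the example.

```typescript
type Row = Record<string, unknown>

// Hypothetical option and result shapes, for illustration only.
interface ScanOptions {
  where?: (row: Row) => boolean // stands in for a pushed-down predicate
  limit?: number
  offset?: number
}

interface ScanResult {
  rows: Row[]
  appliedLimitOffset: boolean // true only when the scan already applied the final slice
}

function scanSketch(files: Row[][], { where, limit = Infinity, offset = 0 }: ScanOptions): ScanResult {
  const rows: Row[] = []

  if (!where) {
    // No filter: every physical row is a result row, so position == count.
    // Seek past `offset` rows and bound each file read by what is still needed.
    let skip = offset
    let remaining = limit
    for (const file of files) {
      if (remaining <= 0) break
      if (skip >= file.length) {
        skip -= file.length
        continue
      }
      const end = Math.min(file.length, skip + remaining)
      const chunk = file.slice(skip, end)
      rows.push(...chunk)
      remaining -= chunk.length
      skip = 0
    }
    return { rows, appliedLimitOffset: true }
  }

  // Filter present: count matched rows, not positions. Emit up to
  // offset + limit matches and leave the final slice to the engine.
  let remaining = offset + limit
  outer: for (const file of files) {
    for (const row of file) {
      if (remaining <= 0) break outer // early termination skips later files
      if (where(row)) {
        rows.push(row)
        remaining--
      }
    }
  }
  return { rows, appliedLimitOffset: false }
}
```

A caller that sees `appliedLimitOffset === false` would then apply `rows.slice(offset, offset + limit)` itself.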

Tests

  • New regression test: a match that sorts after the LIMIT window must still be returned (fails on master, passes here).
  • Corrected two tests that asserted the buggy appliedLimitOffset === true contract for a pushed WHERE, and strengthened the LIMIT/OFFSET case to verify the actual rows against a full-scan slice oracle (not just the count).
  • Full suite green: 577 passed.
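
The "full-scan slice oracle" idea from the strengthened test can be sketched as follows. This is an assumption about the general shape of such a check, not the repository's test code; `table`, `queryWithLimit`, `sliceOracle`, and `sameRows` are hypothetical names.

```typescript
type Row = { id: number; node_type: string }

// Hypothetical table: matches are scattered, not clustered at the start.
const table: Row[] = Array.from({ length: 50 }, (_, id): Row => ({
  id,
  node_type: id % 7 === 3 ? 'File' : 'Session',
}))

const isFile = (row: Row): boolean => row.node_type === 'File'

// Stand-in for the path under test: emit up to offset + limit matches,
// then apply the final slice as the engine would.
function queryWithLimit(rows: Row[], pred: (r: Row) => boolean, limit: number, offset: number): Row[] {
  const emitted: Row[] = []
  for (const row of rows) {
    if (emitted.length >= offset + limit) break
    if (pred(row)) emitted.push(row)
  }
  return emitted.slice(offset, offset + limit)
}

// Oracle: filter the whole table with no early exit, then slice.
function sliceOracle(rows: Row[], pred: (r: Row) => boolean, limit: number, offset: number): Row[] {
  return rows.filter(pred).slice(offset, offset + limit)
}

// Compare the actual rows by id, not just the row count.
function sameRows(a: Row[], b: Row[]): boolean {
  return a.length === b.length && a.every((row, i) => row.id === b[i]?.id)
}

const matchesOracle = sameRows(
  queryWithLimit(table, isFile, 3, 2),
  sliceOracle(table, isFile, 3, 2),
)
```

Comparing row identity catches a scan that returns the right number of rows from the wrong positions, which a count-only assertion would miss.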

icebergDataSource.scan() pushed LIMIT/OFFSET down by physical row position
(seeking past `offset` rows and bounding the per-file read at
`fileRowStart + remaining`) whenever the WHERE was *resolved* — which
includes a WHERE fully pushed into the parquet read. But a pushed-down
WHERE is matched per row, so the first N physical rows of a file may
contain fewer than N (or zero) matches. Bounding by position then reads
only the leading rows and silently drops every match that sorts later in
the file.

Example: `SELECT ... WHERE node_type = 'File' LIMIT 5` over a file whose
leading 1000 rows are all `Session` reads physical rows [0,5), matches
nothing, and returns 0 rows — while `COUNT(*)` (no LIMIT) returns the
true count. Any `WHERE <pushable predicate> LIMIT n` can under-return.

Gate position-based pushdown on `!where` instead of `whereResolved`, so a
pushed-down filter takes the same path deletes already use: emit up to
`offset + limit` matched rows and let the engine apply the final slice.
Early termination via the per-row `remaining` break is preserved.

Adds a regression test (a match that sorts after the LIMIT window must
still be returned) and corrects two tests that asserted the buggy
`appliedLimitOffset === true` contract for a pushed WHERE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@platypii platypii merged commit 8242857 into master Jun 23, 2026
6 checks passed