Fix: pushed-down WHERE + LIMIT/OFFSET silently drops matching rows#27
Merged
Conversation
icebergDataSource.scan() pushed LIMIT/OFFSET down by physical row position (seeking past `offset` rows and bounding the per-file read at `fileRowStart + remaining`) whenever the WHERE was *resolved* — which includes a WHERE fully pushed into the parquet read. But a pushed-down WHERE is matched per row, so the first N physical rows of a file may contain fewer than N (or zero) matches. Bounding by position then reads only the leading rows and silently drops every match that sorts later in the file. Example: `SELECT ... WHERE node_type = 'File' LIMIT 5` over a file whose leading 1000 rows are all `Session` reads physical rows [0,5), matches nothing, and returns 0 rows — while `COUNT(*)` (no LIMIT) returns the true count. Any `WHERE <pushable predicate> LIMIT n` can under-return. Gate position-based pushdown on `!where` instead of `whereResolved`, so a pushed-down filter takes the same path deletes already use: emit up to `offset + limit` matched rows and let the engine apply the final slice. Early termination via the per-row `remaining` break is preserved. Adds a regression test (a match that sorts after the LIMIT window must still be returned) and corrects two tests that asserted the buggy `appliedLimitOffset === true` contract for a pushed WHERE. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
platypii
approved these changes
Jun 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #26.
Problem
icebergDataSource.scan()pushed LIMIT/OFFSET down by physical row position (seeking pastoffsetrows; bounding the per-file read atfileRowStart + remaining) whenever the WHERE was resolved — which includes a WHERE fully pushed into the parquet read. But a pushed-down WHERE is matched per row, so the first N physical rows of a file may contain fewer than N (or zero) matches. Bounding by position reads only the leading rows and silently drops every match that sorts later in the file.Fix
Gate position-based pushdown on
!whereinstead ofwhereResolved. A pushed-down filter now takes the same path deletes already use: emit up tooffset + limitmatched rows and let the engine apply the final slice. The per-rowremainingbreak still terminates early, so later files/row groups are skipped once enough matches are found.Position→count only line up when every physical row is also a result row (no WHERE), so this is the precise condition.
Tests
master, passes here).appliedLimitOffset === truecontract for a pushed WHERE, and strengthened the LIMIT/OFFSET case to verify the actual rows against a full-scan slice oracle (not just the count).