Fix: pushed-down WHERE + LIMIT/OFFSET silently drops matching rows by philcunliffe · Pull Request #27 · hyparam/icebird

philcunliffe · 2026-06-22T18:08:28Z

Fixes #26.

Problem

icebergDataSource.scan() pushed LIMIT/OFFSET down by physical row position (seeking past offset rows; bounding the per-file read at fileRowStart + remaining) whenever the WHERE was resolved — which includes a WHERE fully pushed into the parquet read. But a pushed-down WHERE is matched per row, so the first N physical rows of a file may contain fewer than N (or zero) matches. Bounding by position reads only the leading rows and silently drops every match that sorts later in the file.

SELECT node_type FROM node WHERE node_type = 'File'            -- ✅ 4925 rows
SELECT node_type FROM node WHERE node_type = 'File' LIMIT 5    -- ❌ 0 rows (leading rows are all 'Session')
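
The failure can be reproduced without icebird at all. The sketch below is a toy model, not the library's code: `scanByPosition` and `scanByMatchCount` are hypothetical names, and the row counts are illustrative. It contrasts bounding the read by physical position with counting matched rows.

```javascript
// Toy model of one data file: 1000 leading 'Session' rows, then 20 'File' rows.
// The counts are illustrative, not taken from the real table.
const rows = [
  ...Array.from({ length: 1000 }, () => ({ node_type: 'Session' })),
  ...Array.from({ length: 20 }, () => ({ node_type: 'File' })),
]
const where = row => row.node_type === 'File'

// Buggy shape: bound the read by physical position, then filter what was read.
function scanByPosition(rows, where, limit, offset = 0) {
  return rows.slice(offset, offset + limit).filter(where)
}

// Intended shape: filter first, count matches, stop once enough are found.
function scanByMatchCount(rows, where, limit, offset = 0) {
  const matched = []
  for (const row of rows) {
    if (!where(row)) continue
    matched.push(row)
    if (matched.length >= offset + limit) break // early termination
  }
  return matched.slice(offset, offset + limit)
}

const buggy = scanByPosition(rows, where, 5)
const fixed = scanByMatchCount(rows, where, 5)
console.log(buggy.length, fixed.length) // prints "0 5"
```

The position-bounded read only ever sees the first five physical rows, all of which are 'Session', so the filter has nothing to keep.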

Fix

Gate position-based pushdown on !where instead of whereResolved. A pushed-down filter now takes the same path deletes already use: emit up to offset + limit matched rows and let the engine apply the final slice. The per-row remaining break still terminates early, so later files/row groups are skipped once enough matches are found.

Physical row position and match count line up only when every physical row is also a result row (no WHERE), so !where is the precise condition.
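
A minimal sketch of the gate described above, under stated assumptions: `planLimitPushdown`, `applyFinalSlice`, and the `emitUpTo` field are hypothetical names invented for illustration, and `appliedLimitOffset` is assumed to tell the engine whether the scan has already applied the slice. This is not the actual icebird implementation.

```javascript
// Hypothetical planner: decide whether LIMIT/OFFSET may be applied by row position.
function planLimitPushdown({ where, limit, offset = 0 }) {
  // Position equals match count only when there is no WHERE at all.
  const byPosition = !where
  return {
    byPosition,
    // Assumed meaning: true when the scan has already applied the final slice.
    appliedLimitOffset: byPosition,
    // With a filter, emit up to offset + limit matched rows and slice later.
    emitUpTo: byPosition ? limit : offset + limit,
  }
}

// Hypothetical engine step: slice only if the scan did not already do it.
function applyFinalSlice(rows, plan, limit, offset = 0) {
  return plan.appliedLimitOffset ? rows : rows.slice(offset, offset + limit)
}

const filtered = planLimitPushdown({ where: row => row.node_type === 'File', limit: 5, offset: 2 })
const unfiltered = planLimitPushdown({ where: undefined, limit: 5, offset: 2 })
console.log(filtered.byPosition, filtered.emitUpTo)     // prints "false 7"
console.log(unfiltered.byPosition, unfiltered.emitUpTo) // prints "true 5"
```

Gating on the absence of a WHERE, rather than on whether the WHERE was resolved, keeps the fast path for unfiltered scans and routes every filtered scan through the count-based path.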

Tests

New regression test: a match that sorts after the LIMIT window must still be returned (fails on master, passes here).
Corrected two tests that asserted the buggy appliedLimitOffset === true contract for a pushed WHERE, and strengthened the LIMIT/OFFSET case to verify the actual rows against a full-scan slice oracle (not just the count).
Full suite green: 577 passed.
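
The full-scan slice oracle mentioned above can be sketched as follows. This is an illustration of the testing idea only: `sliceOracle`, `findMismatches`, and the sample table are hypothetical and are not taken from the icebird test suite.

```javascript
// Oracle: filter the whole table, then slice. Slow, but easy to trust.
function sliceOracle(allRows, where, limit, offset) {
  return allRows.filter(where).slice(offset, offset + limit)
}

// Return the { limit, offset } cases where `scan` disagrees with the oracle.
// Comparing serialized rows checks the actual rows, not just the count.
function findMismatches(scan, allRows, where, cases) {
  return cases.filter(({ limit, offset }) => {
    const expected = sliceOracle(allRows, where, limit, offset)
    const actual = scan(allRows, where, limit, offset)
    return JSON.stringify(actual) !== JSON.stringify(expected)
  })
}

// Illustrative data: matches are sparse early on and dense at the end.
const table = Array.from({ length: 50 }, (_, id) => ({
  id,
  node_type: id >= 40 || id % 7 === 3 ? 'File' : 'Session',
}))
const isFile = row => row.node_type === 'File'
const cases = [
  { limit: 5, offset: 0 },
  { limit: 3, offset: 2 },
  { limit: 10, offset: 8 },
]

// Stand-in for the pre-fix behavior: slice by position, then filter.
const positionScan = (rows, where, limit, offset) =>
  rows.slice(offset, offset + limit).filter(where)

const mismatches = findMismatches(positionScan, table, isFile, cases)
console.log(mismatches)
```

Because the oracle never uses pushdown, any disagreement points at the scan path rather than at the expected value.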

icebergDataSource.scan() pushed LIMIT/OFFSET down by physical row position (seeking past `offset` rows and bounding the per-file read at `fileRowStart + remaining`) whenever the WHERE was *resolved*, which includes a WHERE fully pushed into the parquet read. But a pushed-down WHERE is matched per row, so the first N physical rows of a file may contain fewer than N (or zero) matches. Bounding by position then reads only the leading rows and silently drops every match that sorts later in the file.

Example: `SELECT ... WHERE node_type = 'File' LIMIT 5` over a file whose leading 1000 rows are all `Session` reads physical rows [0,5), matches nothing, and returns 0 rows, while `COUNT(*)` (no LIMIT) returns the true count. Any `WHERE <pushable predicate> LIMIT n` can under-return.

Gate position-based pushdown on `!where` instead of `whereResolved`, so a pushed-down filter takes the same path deletes already use: emit up to `offset + limit` matched rows and let the engine apply the final slice. Early termination via the per-row `remaining` break is preserved.

Adds a regression test (a match that sorts after the LIMIT window must still be returned) and corrects two tests that asserted the buggy `appliedLimitOffset === true` contract for a pushed WHERE.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

platypii approved these changes Jun 23, 2026

platypii merged commit 8242857 into master Jun 23, 2026
6 checks passed

platypii merged 1 commit into master from fix/scan-limit-with-pushed-filter
