Background
Part of the Sort Pushdown epic (#23036) — specifically a follow-up to #22450 (runtime RG-level early stop via TopK dynamic filter).
#22450 closes the "dynamic mid-file" prune gap by dropping whole row groups whose stats prove they can't beat the current TopK threshold. There is one further, cleanly separable optimization on top of that mechanism, intentionally not in the merged PR because it depends on arrow-rs upstream:
For row groups that are fully_matched (every row passes the user's WHERE predicate after static pruning — typical when the predicate is selective on metadata stats), the per-row RowFilter evaluation is pure overhead. We can suppress the RowFilter for those RGs and re-install it for partially-matched RGs at the next boundary.
Why this is a follow-up, not part of #22450
The toggle requires asking the decoder which row group it will emit next so the toggle targets the correct RG. arrow-rs's internal page-index pruning (ColumnIndex path in try_build) can silently skip RGs that DataFusion's rg_plan still contains, so rg_plan.front() drifts off-by-N from the decoder's actual frontier. Without a way to sync, the fully_matched toggle decision lands on the wrong RG.
The clean fix is ParquetPushDecoder::peek_next_row_group() in arrow-rs (apache/arrow-rs#10158), which exposes the decoder's actual next-RG frontier as a public API. Until that lands, this optimization can't be shipped.
Design (already prototyped locally)
Three coordinated pieces, all extensions of #22450's PushDecoderStreamState:
-
Track fully_matched: Vec<bool> per RG in PreparedAccessPlan. Populated by the existing page-index pruning path (when every page of an RG passes the static predicate, mark it fully_matched).
-
Carry fully_matched through rg_plan as RgPlanEntry { rg_index, fully_matched }. Pass through reorder_by_statistics and reverse so the flag stays aligned with the RG index after Inexact runtime reorder.
-
Toggle the RowFilter at each RG boundary in PushDecoderStreamState::transition:
peek_next_row_group() → align rg_plan.front() with the decoder's actual next RG (handles arrow-rs's silent skips).
- If
rg_plan.front().fully_matched: rebuild decoder via into_builder().with_row_filter(empty).build() to suppress per-row eval.
- Otherwise: rebuild with the saved
RowFilterContext to restore the filter.
- Track skips with a new metric
row_filter_skipped_fully_matched.
into_builder preserves buffered bytes across rebuilds, so toggling the filter is free of extra IO.
Acceptance criteria
Architectural note
This optimization is a workaround for the lack of a fully_matched concept inside arrow-rs's own decode path. Long term, a cleaner design would move the optimization into arrow-rs: when the decoder builds the next RG's reader and finds it fully-matched via ColumnIndex, it could automatically skip the per-row RowFilter evaluation, and DataFusion would not need to mirror state at all. If that direction is taken upstream, the fully_matched field on ParquetAccessPlan, the toggle in PushDecoderStreamState, and the peek_next_row_group sync can all be deleted from DataFusion.
For now, the DataFusion-side toggle is the pragmatic path.
References
Background
Part of the Sort Pushdown epic (#23036) — specifically a follow-up to #22450 (runtime RG-level early stop via TopK dynamic filter).
#22450 closes the "dynamic mid-file" prune gap by dropping whole row groups whose stats prove they can't beat the current TopK threshold. There is one further, cleanly separable optimization on top of that mechanism, intentionally not in the merged PR because it depends on arrow-rs upstream:
For row groups that are
fully_matched(every row passes the user'sWHEREpredicate after static pruning — typical when the predicate is selective on metadata stats), the per-rowRowFilterevaluation is pure overhead. We can suppress theRowFilterfor those RGs and re-install it for partially-matched RGs at the next boundary.Why this is a follow-up, not part of #22450
The toggle requires asking the decoder which row group it will emit next so the toggle targets the correct RG. arrow-rs's internal page-index pruning (
ColumnIndexpath intry_build) can silently skip RGs that DataFusion'srg_planstill contains, sorg_plan.front()drifts off-by-N from the decoder's actual frontier. Without a way to sync, thefully_matchedtoggle decision lands on the wrong RG.The clean fix is
ParquetPushDecoder::peek_next_row_group()in arrow-rs (apache/arrow-rs#10158), which exposes the decoder's actual next-RG frontier as a public API. Until that lands, this optimization can't be shipped.Design (already prototyped locally)
Three coordinated pieces, all extensions of #22450's
PushDecoderStreamState:Track
fully_matched: Vec<bool>per RG inPreparedAccessPlan. Populated by the existing page-index pruning path (when every page of an RG passes the static predicate, mark itfully_matched).Carry
fully_matchedthroughrg_planasRgPlanEntry { rg_index, fully_matched }. Pass throughreorder_by_statisticsandreverseso the flag stays aligned with the RG index after Inexact runtime reorder.Toggle the
RowFilterat each RG boundary inPushDecoderStreamState::transition:peek_next_row_group()→ alignrg_plan.front()with the decoder's actual next RG (handles arrow-rs's silent skips).rg_plan.front().fully_matched: rebuild decoder viainto_builder().with_row_filter(empty).build()to suppress per-row eval.RowFilterContextto restore the filter.row_filter_skipped_fully_matched.into_builderpreserves buffered bytes across rebuilds, so toggling the filter is free of extra IO.Acceptance criteria
peek_next_row_group) is merged and released.parquetdependency to the release that includespeek_next_row_group.push_decoder.rs(RowFilterContext,filter_installed, the toggle block guarded bypeek_next_row_group()).fully_matchedtracking inParquetAccessPlan/PreparedAccessPlan(population, reorder, reverse, strip-empty).row_filter_skipped_fully_matched: CountonParquetFileMetrics.fully_matched_rgs_skip_row_filter— 4 RGs withWHERE v >= 3 ORDER BY v DESC LIMIT 3, assertrow_filter_skipped_fully_matched >= 1.EXPLAIN ANALYZEfor the new metric.sort_pushdown_inexactnarrow-projection queries (Q1 / Q2) — already at −46% / −60% with feat(parquet): intra-file early stopping via statistics + dynamic filters #22450 — confirm no regression and ideally a further trim from the saved row-filter CPU.Architectural note
This optimization is a workaround for the lack of a
fully_matchedconcept inside arrow-rs's own decode path. Long term, a cleaner design would move the optimization into arrow-rs: when the decoder builds the next RG's reader and finds it fully-matched viaColumnIndex, it could automatically skip the per-rowRowFilterevaluation, and DataFusion would not need to mirror state at all. If that direction is taken upstream, thefully_matchedfield onParquetAccessPlan, the toggle inPushDecoderStreamState, and thepeek_next_row_groupsync can all be deleted from DataFusion.For now, the DataFusion-side toggle is the pragmatic path.
References