Skip to content

[Follow-up #22450] Per-RG fully_matched RowFilter skip · needs arrow-rs#10158 (peek_next_row_group) #23067

@zhuqi-lucas

Description

@zhuqi-lucas

Background

Part of the Sort Pushdown epic (#23036) — specifically a follow-up to #22450 (runtime RG-level early stop via TopK dynamic filter).

#22450 closes the "dynamic mid-file" prune gap by dropping whole row groups whose stats prove they can't beat the current TopK threshold. There is one further, cleanly separable optimization on top of that mechanism, intentionally not in the merged PR because it depends on arrow-rs upstream:

For row groups that are fully_matched (every row passes the user's WHERE predicate after static pruning — typical when the predicate is selective on metadata stats), the per-row RowFilter evaluation is pure overhead. We can suppress the RowFilter for those RGs and re-install it for partially-matched RGs at the next boundary.

Why this is a follow-up, not part of #22450

The toggle requires asking the decoder which row group it will emit next so the toggle targets the correct RG. arrow-rs's internal page-index pruning (ColumnIndex path in try_build) can silently skip RGs that DataFusion's rg_plan still contains, so rg_plan.front() drifts off-by-N from the decoder's actual frontier. Without a way to sync, the fully_matched toggle decision lands on the wrong RG.

The clean fix is ParquetPushDecoder::peek_next_row_group() in arrow-rs (apache/arrow-rs#10158), which exposes the decoder's actual next-RG frontier as a public API. Until that lands, this optimization can't be shipped.

Design (already prototyped locally)

Three coordinated pieces, all extensions of #22450's PushDecoderStreamState:

  1. Track fully_matched: Vec<bool> per RG in PreparedAccessPlan. Populated by the existing page-index pruning path (when every page of an RG passes the static predicate, mark it fully_matched).

  2. Carry fully_matched through rg_plan as RgPlanEntry { rg_index, fully_matched }. Pass through reorder_by_statistics and reverse so the flag stays aligned with the RG index after Inexact runtime reorder.

  3. Toggle the RowFilter at each RG boundary in PushDecoderStreamState::transition:

    • peek_next_row_group() → align rg_plan.front() with the decoder's actual next RG (handles arrow-rs's silent skips).
    • If rg_plan.front().fully_matched: rebuild decoder via into_builder().with_row_filter(empty).build() to suppress per-row eval.
    • Otherwise: rebuild with the saved RowFilterContext to restore the filter.
    • Track skips with a new metric row_filter_skipped_fully_matched.

into_builder preserves buffered bytes across rebuilds, so toggling the filter is free of extra IO.

Acceptance criteria

  • arrow-rs#10158 (peek_next_row_group) is merged and released.
  • Pin DataFusion's parquet dependency to the release that includes peek_next_row_group.
  • Restore the locally-prototyped toggle code into push_decoder.rs (RowFilterContext, filter_installed, the toggle block guarded by peek_next_row_group()).
  • Restore the fully_matched tracking in ParquetAccessPlan / PreparedAccessPlan (population, reorder, reverse, strip-empty).
  • Register and expose row_filter_skipped_fully_matched: Count on ParquetFileMetrics.
  • Integration test: fully_matched_rgs_skip_row_filter — 4 RGs with WHERE v >= 3 ORDER BY v DESC LIMIT 3, assert row_filter_skipped_fully_matched >= 1.
  • SLT coverage on EXPLAIN ANALYZE for the new metric.
  • Bench check: sort_pushdown_inexact narrow-projection queries (Q1 / Q2) — already at −46% / −60% with feat(parquet): intra-file early stopping via statistics + dynamic filters #22450 — confirm no regression and ideally a further trim from the saved row-filter CPU.

Architectural note

This optimization is a workaround for the lack of a fully_matched concept inside arrow-rs's own decode path. Long term, a cleaner design would move the optimization into arrow-rs: when the decoder builds the next RG's reader and finds it fully-matched via ColumnIndex, it could automatically skip the per-row RowFilter evaluation, and DataFusion would not need to mirror state at all. If that direction is taken upstream, the fully_matched field on ParquetAccessPlan, the toggle in PushDecoderStreamState, and the peek_next_row_group sync can all be deleted from DataFusion.

For now, the DataFusion-side toggle is the pragmatic path.

References

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions