Investigate backfill under-fetching in resource-based job claiming #267

@daniel-thom

Problem

Copilot noted a remaining edge case in PR #266: the backfill query currently uses
LIMIT remaining_limit, but the Rust packing pass can still skip some of those returned rows.

As a result, the server may return fewer jobs than it could. If the top remaining_limit backfill
candidates do not pack together, lower-ranked candidates outside that limited window are never
fetched, even though they would fit the remaining resources.

Example shape:

remaining CPU = 4
remaining_limit = 4

backfill candidates returned by SQL:
  job A: 3 CPU
  job B: 3 CPU
  job C: 3 CPU
  job D: 3 CPU

Rust claims A, then skips B/C/D because only 1 CPU remains.

Lower-ranked 1-CPU jobs may exist, but the backfill query did not fetch them.
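
The example shape can be reproduced with a minimal sketch. Everything here is hypothetical and simplified for illustration: the `Candidate` struct, the `pack` function, and the CPU-only resource model are assumptions, not the server's actual types or packing code. The sketch compares packing over the `LIMIT remaining_limit` window with packing over the full ranking.

```rust
/// Hypothetical, simplified stand-in for a backfill candidate row (CPU only).
#[derive(Debug, Clone, Copy)]
struct Candidate {
    name: &'static str,
    cpus: u32,
}

/// Greedy pass in rank order: claim a candidate only if it still fits.
/// This is an assumed model of the packing pass, not the real implementation.
fn pack(candidates: &[Candidate], mut remaining_cpus: u32, limit: usize) -> Vec<&'static str> {
    let mut claimed = Vec::new();
    for c in candidates {
        if claimed.len() >= limit {
            break;
        }
        if c.cpus <= remaining_cpus {
            remaining_cpus -= c.cpus;
            claimed.push(c.name);
        }
    }
    claimed
}

/// Ranked candidates: four 3-CPU jobs followed by two lower-ranked 1-CPU jobs.
fn ranked_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "A", cpus: 3 },
        Candidate { name: "B", cpus: 3 },
        Candidate { name: "C", cpus: 3 },
        Candidate { name: "D", cpus: 3 },
        Candidate { name: "E", cpus: 1 },
        Candidate { name: "F", cpus: 1 },
    ]
}

fn main() {
    let ranked = ranked_candidates();
    let remaining_limit = 4;

    // Emulates `LIMIT remaining_limit`: only the top four rows reach the packing pass.
    let window = &ranked[..remaining_limit];
    println!("limited window: {:?}", pack(window, 4, remaining_limit)); // ["A"]

    // With the lower-ranked rows visible, the leftover CPU is used as well.
    println!("full ranking:   {:?}", pack(&ranked, 4, remaining_limit)); // ["A", "E"]
}
```

In this model the limited window claims only job A and leaves 1 CPU idle, while the full ranking also claims job E.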

Scope

This is distinct from the GPU-saturation paging fix in PR #266. That PR keeps the query bounded and
addresses the observed case where a primary page is dominated by higher-priority GPU jobs and
lower-priority CPU jobs can fill leftover CPU capacity.

Possible approaches

  • Over-fetch a bounded multiple of remaining_limit, with a reasonable cap.
  • Make the backfill pass iterative/page-based until either the claim limit is met, resources are
    saturated, or a maximum number of backfill pages has been scanned.
  • Add instrumentation first to see whether skips in the backfill pass are common enough to justify a
    broader heuristic.
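
The page-based option might look roughly like the sketch below. It is only an illustration under stated assumptions: `Candidate`, `backfill_paged`, `ranked_jobs`, and the `fetch_page(offset, limit)` closure are hypothetical names, resources are modeled as CPU only, and offset paging assumes the ranking stays stable between pages. A real query that also filters on remaining resources would probably need keyset-style paging instead.

```rust
/// Hypothetical, simplified stand-in for a backfill candidate row (CPU only).
#[derive(Debug, Clone, Copy)]
struct Candidate {
    name: &'static str,
    cpus: u32,
}

/// Scan backfill pages until the claim limit is met, CPUs are saturated,
/// the candidates run out, or `max_pages` pages have been scanned.
fn backfill_paged<F>(
    mut fetch_page: F,
    mut remaining_cpus: u32,
    claim_limit: usize,
    page_size: usize,
    max_pages: usize,
) -> Vec<&'static str>
where
    F: FnMut(usize, usize) -> Vec<Candidate>,
{
    let mut claimed = Vec::new();
    for page in 0..max_pages {
        if claimed.len() >= claim_limit || remaining_cpus == 0 {
            break;
        }
        // One bounded query per page; total SQL work is at most max_pages * page_size rows.
        let rows = fetch_page(page * page_size, page_size);
        if rows.is_empty() {
            break;
        }
        for c in rows {
            if claimed.len() >= claim_limit {
                break;
            }
            if c.cpus <= remaining_cpus {
                remaining_cpus -= c.cpus;
                claimed.push(c.name);
            }
        }
    }
    claimed
}

/// Ranked candidates: four 3-CPU jobs followed by two lower-ranked 1-CPU jobs.
fn ranked_jobs() -> Vec<Candidate> {
    vec![
        Candidate { name: "A", cpus: 3 },
        Candidate { name: "B", cpus: 3 },
        Candidate { name: "C", cpus: 3 },
        Candidate { name: "D", cpus: 3 },
        Candidate { name: "E", cpus: 1 },
        Candidate { name: "F", cpus: 1 },
    ]
}

fn main() {
    let ranked = ranked_jobs();
    // In-memory stand-in for a `LIMIT ? OFFSET ?` query over the ranked candidates.
    let fetch = |offset: usize, limit: usize| -> Vec<Candidate> {
        ranked.iter().skip(offset).take(limit).copied().collect()
    };
    // remaining CPU = 4, claim limit = 4, page size = 4, at most 3 pages.
    println!("{:?}", backfill_paged(fetch, 4, 4, 4, 3)); // ["A", "E"]
}
```

With `max_pages = 1` this reduces to the current single-window behavior. The over-fetch option corresponds to keeping one page but raising `page_size` to a capped multiple of remaining_limit.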

Acceptance criteria

  • Add a regression test where the first backfill window contains candidates that individually fit the
    SQL remaining-resource filters but do not pack together, while lower-ranked candidates would fit.
  • Keep total SQL work bounded.
  • Preserve existing priority ordering and scheduler fallback behavior.