Reduce SQLite write contention by introducing in-memory job buffering and batched database updates in torc server #277

@nkeilbart

Description

Summary

The torc server currently handles job requests from many job runners by interacting with the database for each request and job state update. With a large number of runners and short job durations (around 5 seconds), this creates heavy pressure on SQLite’s single-writer model and leads to lock contention.

This issue proposes changing the torc server to maintain a bounded in-memory set of ready jobs and to batch state updates back to the database on a periodic interval, instead of performing database writes for each individual runner interaction.

Problem

Today, job runners contact the torc server for work, and the server interacts with SQLite to claim and update jobs. Since SQLite only allows one writer at a time, a workload with many runners and short-lived jobs causes contention around that write path.

This is especially noticeable when:

many job runners are polling for work
jobs complete quickly, around 5 seconds
each request results in DB activity for claim/update operations
priority ordering must still be respected on each claim

The current model amplifies writes and lock contention because the effective write rate scales with runner activity.

Proposed change

Move job dispatch and near-term state tracking into the torc server process:

maintain a bounded in-memory queue or pool of ready-to-run jobs
periodically poll the database to refill that in-memory pool
hand jobs to runners directly from memory when they request work
accept runner status updates immediately in memory
periodically flush accumulated state changes back to the database in batches

Under this model, the torc server remains the only writer to the database, but instead of writing once per runner interaction, it writes on a controlled periodic cadence.
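
To make that shape concrete, here is a minimal sketch of the two in-memory structures this implies: a bounded pool of ready jobs that runners draw from, and a buffer of state transitions awaiting the next flush. Python is used purely for illustration, and the class and method names are hypothetical rather than part of the existing torc code.

```python
# Sketch only: hypothetical class and method names, not the existing torc API.
import collections
import threading


class InMemoryDispatcher:
    """Bounded pool of ready jobs plus a buffer of pending state updates.

    The torc server stays the only database writer; runners touch these
    structures only through the server's request handlers.
    """

    def __init__(self, max_ready_jobs: int = 500):
        self._lock = threading.Lock()
        self._ready = collections.deque()    # jobs prefetched from SQLite
        self._pending_updates = []           # state transitions awaiting flush
        self.max_ready_jobs = max_ready_jobs

    def remaining_capacity(self) -> int:
        with self._lock:
            return self.max_ready_jobs - len(self._ready)

    def extend_ready(self, job_ids) -> None:
        """Add newly prefetched jobs to the bounded in-memory pool."""
        with self._lock:
            self._ready.extend(job_ids)

    def assign_next(self):
        """Hand a job to a runner without an immediate database write."""
        with self._lock:
            return self._ready.popleft() if self._ready else None

    def record_update(self, job_id, new_state) -> None:
        """Accept a runner status update immediately, in memory."""
        with self._lock:
            self._pending_updates.append((job_id, new_state))

    def drain_updates(self):
        """Hand back all buffered updates so they can be flushed in one batch."""
        with self._lock:
            updates, self._pending_updates = self._pending_updates, []
            return updates
```
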
Goals

reduce SQLite write contention
reduce lock contention caused by many runners competing through the server
decouple runner request rate from database write rate
preserve existing priority behavior
support bounded prefetching so memory usage and staleness remain controlled

Non-goals

changing the runner-to-server communication pattern
changing job semantics beyond temporary in-memory buffering
guaranteeing that in-memory state survives a torc server crash

Expected behavior

A possible implementation would look like this:

Server periodically queries SQLite for a bounded set of ready jobs.
The number prefetched is based on estimated runnable capacity over the next polling interval, plus a configurable buffer.
Jobs are ordered or selected in a way that preserves current priority semantics.
Runners request work from the server.
The server assigns jobs from the in-memory pool without requiring an immediate DB write for each request.
Runner updates, such as started, completed, or failed, are recorded in memory immediately.
The server flushes accumulated state transitions back to SQLite on a periodic interval.
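
The refill and flush steps above could look roughly like the following sketch. It assumes the hypothetical InMemoryDispatcher from the earlier sketch and a jobs table with id, status, and priority columns; the intervals, column names, and limits are illustrative and not taken from the actual torc schema.

```python
# Sketch only: assumes the hypothetical InMemoryDispatcher above and a
# jobs(id, status, priority) table; intervals and limits are illustrative.
import sqlite3

POLL_INTERVAL_S = 2.0   # refill cadence
PREFETCH_BUFFER = 20    # extra jobs beyond the estimated runnable capacity


def refill_ready_jobs(conn: sqlite3.Connection, dispatcher, recent_jobs_per_sec: float):
    """Prefetch a bounded, priority-ordered batch of ready jobs."""
    # Estimated runnable capacity over the next polling interval, plus a buffer.
    want = int(recent_jobs_per_sec * POLL_INTERVAL_S) + PREFETCH_BUFFER
    want = min(want, dispatcher.remaining_capacity())
    if want <= 0:
        return
    # NOTE: a real implementation must also exclude jobs already held in memory
    # (an in-memory ID set or persisted reservation); see the open questions below.
    rows = conn.execute(
        "SELECT id FROM jobs WHERE status = 'ready' "
        "ORDER BY priority DESC, id LIMIT ?",
        (want,),
    ).fetchall()
    dispatcher.extend_ready(row[0] for row in rows)


def flush_updates(conn: sqlite3.Connection, dispatcher):
    """Persist all buffered state transitions in a single write transaction."""
    updates = dispatcher.drain_updates()
    if not updates:
        return
    with conn:  # one transaction per flush instead of one write per runner call
        conn.executemany(
            "UPDATE jobs SET status = ? WHERE id = ?",
            [(state, job_id) for job_id, state in updates],
        )
```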

Rationale

This should shift the SQLite write rate from scaling with the number of active runners to being driven primarily by the torc server’s flush interval. That should significantly reduce contention on SQLite’s single-writer path.

For workloads with many runners and short jobs, this should improve scheduler throughput and responsiveness while keeping SQLite as the durable source of truth, updated on a delayed basis.

Tradeoffs and considerations

Crash recovery: If the torc server crashes before flushing in-memory updates, some jobs may not be marked complete in SQLite and may be re-run. This is acceptable for this proposal.
Temporary DB staleness: SQLite will lag behind the server’s in-memory state until the next flush.
Priority correctness: Any prefetching logic must preserve existing priority behavior.
Bounded memory: The in-memory ready queue should be capped and not attempt to load all ready jobs.
Flush interval tuning: There is a tradeoff between lower contention and longer persistence delay.
Observability: It may be useful to expose metrics for prefetched jobs, flush latency, batch sizes, and queue depth.

Possible implementation details

configurable polling interval for refilling ready jobs
configurable flush interval for persisting updates
configurable prefetch window and buffer size
adaptive prefetch sizing based on recent runner throughput
batching of state transitions in a single DB write transaction where possible
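
A possible shape for these knobs, including the adaptive prefetch sizing, is sketched below. All option names and defaults are placeholders rather than existing torc configuration.

```python
# Sketch only: placeholder option names and defaults, not existing torc settings.
from dataclasses import dataclass


@dataclass
class BufferingConfig:
    poll_interval_s: float = 2.0    # how often the ready pool is refilled
    flush_interval_s: float = 2.0   # how often buffered updates are persisted
    max_ready_jobs: int = 500       # hard cap on the in-memory ready pool
    prefetch_buffer: int = 20       # extra jobs beyond the estimated capacity
    adaptive_prefetch: bool = True  # scale prefetch with recent throughput


def prefetch_target(cfg: BufferingConfig, recent_jobs_per_sec: float) -> int:
    """Adaptive prefetch sizing based on recent runner throughput."""
    if not cfg.adaptive_prefetch:
        return cfg.prefetch_buffer
    estimated = int(recent_jobs_per_sec * cfg.poll_interval_s)
    return min(estimated + cfg.prefetch_buffer, cfg.max_ready_jobs)
```
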

Open questions

Should claimed jobs be marked in the DB only on flush, or should some lightweight reservation metadata be persisted earlier?
Should the prefetch window be fixed, configurable, or adaptive?
How should jobs be handled if they are prefetched but not assigned before the next refill cycle?
Are there any fairness concerns across priorities or queues when jobs are prefetched in batches?

Success criteria

fewer SQLite lock contention events under high runner counts
reduced time runners spend waiting for job assignment
reduced database write frequency relative to runner request frequency
no regression in priority scheduling behavior
