Skip to content

feat: process-wide durable upload architecture — foundations (#184)#233

Open
javi11 wants to merge 22 commits into
mainfrom
session/silly-proskuriakova-39f357
Open

feat: process-wide durable upload architecture — foundations (#184)#233
javi11 wants to merge 22 commits into
mainfrom
session/silly-proskuriakova-39f357

Conversation

@javi11

@javi11 javi11 commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

Foundational, independently-tested layers of the process-wide durable upload architecture (issue #184). Builds on top of the per-poster shared worker pool already merged in #228/#230 and extends it to be truly process-wide, plus adds the durable persistence + verification machinery for crash-safe re-posting.

Phases 0–4 + re-poster + partial observability — every reusable component, each -race tested. The remaining live cutover (writing manifests into the upload path, the verification service Run loop + ownership, upload-slot release, removing the job-local checkLoop, crash recovery, cleanup lifecycle, frontend stats) is intentionally not in this PR.

Why this is still needed after #228/#230

#228 shares one worker pool within a single poster, but the processor still creates a new poster per job, so workers and read-ahead buffers still multiply with max_concurrent_uploads. This PR introduces a transfer Runtime (owned by the processor, shared across jobs) and a poster.Engine that gates effective concurrency and in-flight memory globally, layered onto the existing per-poster pump — not replacing it.

What's included

  • Phase 0 — config (a3d2e58): posting.upload_buffer_memory_limit, par2.max_concurrent_jobs, post_check.max_concurrent_checks + defaults/validation + Wails models + settings UI + en/es/fr/tr i18n. No behavior change.
  • Phase 1 — PAR2 scheduler (f580a87): process-wide par2.Scheduler so memory_limit applies per active job. New postie.Runtime owns it; postie.New retained as standalone wrapper.
  • Phase 2 — upload engine (084c018): poster.Engine (worker-slot semaphore + buffer-byte budget) injected via NewWithEngine. Two gates on the existing pump: read-ahead reserves per-article budget; runUploadJob acquires a global worker slot. Nil engine = standalone.
  • Phase 3 — durable persistence (b170c19): streaming zstd JSONL manifest store (atomic temp+rename); migration 007 (transfer_files + lease-based verification_failures, no per-article rows); transferstore DAO; stable FileJob.TransferID surviving retries.
  • Phase 4 — verification service core (b8f2e1c): lease-based verification.Service (VerifyFile STATs a manifest persisting only misses; ProcessDueFailures re-posts then STAT-rechecks with backoff). Decoupled via Stater/Reposter.
  • Phase 5 — re-poster (61a5ccf): extracted behavior-preserving poster.postYenc; poster.Reposter re-posts from a manifest record reusing the exact Message-ID/headers.
  • Phase 7 partial — observability (d056c75): Runtime.Metrics() / Processor.TransferRuntimeMetrics().

Test plan

  • go test ./internal/... ./pkg/... — all green
  • -race on all new packages
  • golangci-lint clean on new code (2 remaining findings are pre-existing in main)
  • migration 007 applies via MigrateUp
  • Live cutover + real-NNTP verification — follow-up PR

Defaults are conservative and backward compatible; no runtime behavior change for existing users yet (engine wiring is additive and nil-safe).

javi11 added 22 commits June 14, 2026 16:47
Introduces three conservative, backward-compatible config fields that
underpin the durable process-wide upload architecture:

- posting.upload_buffer_memory_limit (int64, 0=auto): caps total upload
  engine buffer memory independent of queue concurrency.
- par2.max_concurrent_jobs (int, default 1): bounds concurrent PAR2 jobs
  so memory_limit applies per active job, not per queue job.
- post_check.max_concurrent_checks (int, 0=auto): caps global STAT
  verification concurrency.

Wires defaults (par2 normalised to 1 on load), Validate() guards against
negative values, the generated Wails models, the Par2/PostCheck/Posting
settings sections, and en/es/fr/tr translations. Adds config tests for
defaults, load normalisation, and validation. No behavior change yet;
later phases consume these fields.
…184 phase 1)

Introduces par2.Scheduler, a process-wide concurrency gate that allows at
most par2.max_concurrent_jobs PAR2 operations to run at once; excess work
waits for a slot (cancellable via context) rather than running concurrently.
par2.ScheduledExecutor wraps each per-job executor so its progress/config are
preserved while execution is gated. Adds metrics (Active/Queued/Capacity) for
later observability.

A new postie.Runtime owns the shared scheduler. Processor creates one Runtime
during init and closes it on shutdown; jobs are now built via
postie.NewWithRuntime so all jobs share the gate. postie.New is retained as a
compatibility wrapper that creates a private scheduler for CLI/standalone use.

This prevents multiple concurrent queue jobs from each reserving the full
configured PAR2 memory_limit. Tests cover the concurrency bound, cancellation
while queued, nil-runtime fallback, and config-driven capacity, all under -race.
#184 phase 2)

Builds on the per-poster shared worker pool from #228 to make limits truly
process-wide: a new poster.Engine (owned by the transfer Runtime, injected via
NewWithEngine) bounds effective posting concurrency and in-flight buffer memory
across ALL jobs, not just within one poster.

ComputeEngineBudget derives limits from article size, posting.upload_buffer_memory_limit
(0 = auto) and total upload connection capacity:
  per_article  = article_size + ~yenc + 256 KiB
  auto_budget  = clamp(per_article * min(conn_capacity, 32), 64MiB, 512MiB)
  worker_count = min(conn_capacity, floor(budget / per_article))

The per-poster pump is unchanged; integration is two gates: the read-ahead
goroutine reserves per-article budget before allocating/reading (blocking when
the global budget is exhausted), and runUploadJob acquires a global worker slot
before posting. Both reservations are released on every buffer lifecycle exit
(post done, read error, submit cancel). Nil engine = standalone, no gate.

The Runtime sizes one engine from upload-pool connection capacity and shares it
with every job via postie.NewWithEngine. Tests cover the budget formula,
bounded buffers, bounded workers, cancellation and nil-safety under -race.
…phase 7 partial)

Adds Runtime.Metrics() returning a RuntimeMetrics snapshot (active/queued
upload workers, worker count, reserved/configured buffer bytes, active/queued
PAR2 jobs and capacity) and Processor.TransferRuntimeMetrics() to surface it.
This makes the scheduler metrics introduced in phases 1-2 queryable so memory
growth can be attributed to a subsystem. Nil-safe throughout; tested under -race.

Note: queue-stat fields (pendingVerification/verificationFailed) and full
frontend dashboard wiring are deferred to the durable-verification phases.
…table transfer_id (#184 phase 3)

Adds the persistence foundation for durable uploads and verification:

- internal/manifest: streaming, zstd-compressed JSON Lines manifests written
  via temp-file + atomic rename. One ArticleRecord per posted segment captures
  everything needed to re-post byte-for-byte with the same Message-ID (offset,
  body size, message-id, subject, from, groups, date, custom headers, X-Nxg,
  yenc part metadata). Versioned header; streaming reader rejects unknown
  versions. RecordFromArticle bridges from article.Article.

- migration 007: transfer_files (one row per source/PAR2 file -> manifest,
  with upload/verification state + next_check_at + cleanup_policy) and
  verification_failures (ONLY failed articles, with database leases). Successful
  articles are never persisted to avoid millions of SQLite rows.

- internal/transferstore: DAO over both tables — UpsertFile/GetFile/List,
  Set{Upload,Verification}State, AddFailure (idempotent), lease-based
  ClaimDueFailures, UpdateFailureAfterCheck, ReclaimExpiredLeases, CountFailures.

- queue: FileJob gains a stable TransferID assigned at creation, persisted in
  the job body so it survives retries and crash recovery.

All tested under -race (manifest round-trip/atomicity/streaming, store CRUD +
leasing/reclaim, transfer_id stability across retry). Not yet wired into the
posting path — phases 4-6 consume this.
… phase 4)

Adds internal/verification.Service, the durable post-check engine that operates
on the transfer store + manifests, decoupled from NNTP via Stater/Reposter
interfaces (so it is unit-testable and wireable later):

- VerifyFile streams a file's manifest, STATs every article with bounded
  concurrency (post_check.max_concurrent_checks), and persists ONLY the misses.
  No misses -> file marked verified; misses -> verifying + failure rows.
- ProcessDueFailures claims a due batch via database leases and, per failure,
  re-posts (reusing the manifest record's exact Message-ID/headers) while under
  MaxReposts, else STAT-rechecks with exponential backoff up to MaxBackoff,
  giving up after MaxDeferredChecks. Resolving the last failure flips the file
  to verified; exhaustion flips it to verification_failed.
- Manifest records for re-posts are resolved per-file (each manifest streamed
  at most once per batch), never loading a whole large upload into memory.

Tested under -race: all-present, persisted misses, repost-then-resolve,
exhaustion->failed, and empty-batch. Not yet wired into the processor or the
posting path — the live cutover (write manifests during upload, release slot
before verification, remove job-local checkLoop, crash recovery) remains.
…phase 5)

Extracts the shared posting primitive postYenc (header build + yEnc meta +
stale-connection retry + throttle + stats) from postArticleWithBody so the
normal upload path and the durable re-post path produce byte-identical posts.
Behavior-preserving; covered by existing poster tests.

Adds poster.Reposter, a process-wide re-poster that reconstructs an article
from a manifest.ArticleRecord, reads its body from the original source file at
the recorded offset, and re-posts reusing the exact Message-ID and headers
through the shared upload pool + engine gates. Implements verification.Reposter.

Tested under -race: Message-ID/body reuse, missing-source error, engine
resource drain, no-pool guard.
…184 phase 5)

Adds internal/transferwriter.Recorder, the bridge from the poster's article
list to the durable manifest store + transfer persistence. Per job (bound to a
transfer_id), RecordFile writes an immutable manifest for a source/PAR2 file
(atomic temp+rename) and upserts its transfer_files row in the planned state,
before the file's articles are posted — so an interrupted upload can be resumed
and verified with the exact Message-IDs that were posted.

File role is inferred from the path (par2 -> generated_par2); the file id is a
stable hash of the source path so retries/recovery map to the same manifest.
Tested under -race: manifest+row round-trip, par2 role, deterministic id.
#184 phase 5)

Adds an optional poster.ManifestSink, invoked in addPost right after a file's
articles are built and BEFORE any network posting. When set, the file's
manifest is recorded first; if recording fails the file is failed (the source
file handle is closed and an error returned) so nothing is posted without
durable recovery data. A nil sink preserves standalone behaviour.

Wired through NewWithEngine (new optional param); New passes nil. Postie passes
nil for now — the per-job sink instance (bound to transfer_id + store) is wired
in the next step. Tested under -race: sink invoked once with the file's
articles; sink error aborts before PostYenc is ever called.
…184 phase 5)

Activates durable manifest recording end-to-end:

- Runtime now owns a *transferstore.Store + manifest base dir, and exposes
  NewManifestRecorder(transferID) which returns a transferwriter-backed
  poster.ManifestSink (or untyped nil when no store / empty transferID).
- queue.Queue.DB() exposes the shared sql.DB so the transfer store reuses the
  same connection.
- Processor builds the store from the queue DB and a manifest dir alongside the
  database file, passes them to NewRuntime, and threads job.TransferID into
  postie.NewWithRuntime.
- Postie builds the per-job recorder from the runtime and hands it to the
  poster, which writes each file's manifest + transfer_files row before posting.

With this, every queue job writes durable manifests keyed by its stable
transfer_id; standalone/CLI use (nil runtime/store) is unaffected. Tested under
-race (recorder construction, nil-safety) plus the existing poster sink tests.
… scheduling (#184 phase 5)

MarkUploaded advances a transfer file to the uploaded state (both upload_state
and verification_state), recording posted_at and the first-check due time
(posted_at + post_check.delay). ListDueFiles returns files awaiting their first
verification check (verification_state=uploaded, next_check_at<=now), oldest
first, so the durable verification service can sleep-until-due and pick work.
Tested under -race.
…phase 5)

Adds Service.Run(ctx): a background loop (PollInterval, default 15s) that each
cycle reclaims expired leases, runs the first verification check for due files
(VerifyDueFiles), then processes due verification failures. VerifyDueFiles lists
files whose check is due and verifies each, logging and skipping per-file errors
so one bad file never stalls the batch. Tested under -race (only due uploaded
files are verified; not-due files are left untouched).
…re verify (#184 phase 5)

Flips verification to the durable service when running with a store (durable
mode), wiring the cutover end to end:

- poster: the legacy in-poster checkLoop is skipped in durable mode (manifest
  sink set) via immediateCheckEnabled(); posting returns as soon as articles are
  posted, so the queue upload slot is released before propagation delay.
- Postie: after a successful Post, marks the transfer's files uploaded
  (Recorder.CompleteUpload) with next_check_at = now + post_check.delay, so the
  durable service picks them up. No-op in standalone mode.
- transferwriter.Recorder.CompleteUpload advances every file of the transfer to
  the uploaded state.
- Runtime builds the verification.Service (verify-pool Stater + engine-backed
  Reposter, config mapped from post_check) and exposes RunVerification; the
  Processor starts it once as a background goroutine.

Result: completed uploads no longer hold a queue slot during propagation;
verification + re-posting run durably in the background and survive restarts.
Standalone/CLI use (no store) keeps the in-poster checkLoop. All packages pass
under -race.
… mode (#184 phase 6)

Data-safety guard for the durable upload path: previously delete_original ran
in processFile immediately after posting — i.e. BEFORE the background service
verified the articles propagated, destroying the only recovery source if
verification later failed.

Now, when durable verification is active (Runtime.DurableVerificationEnabled),
the processor retains originals at upload completion and marks the completed
item pending_verification instead of verified. Deferred cleanup (running the
post-upload script + delete_original + manifest removal once a transfer is
verified) is the follow-up; until it lands, durable-mode originals are retained
rather than deleted — a safe failure mode. Standalone mode is unchanged.

Adds Runtime.DurableVerificationEnabled() (tested) and Processor.durableMode().
…ion (#184 phase 6)

Completes the durable cleanup lifecycle: cleanup now happens only AFTER the
background verification service confirms a transfer, never before.

- Upload completion persists the delete-original intent into the transfer's
  cleanup_policy (Recorder.CompleteUpload(deleteOriginal); store.SetCleanupPolicy).
  The processor resolves the policy (job override > global) and hands it to
  Postie via SetDeleteOriginal before posting.
- New internal/transfercleaner.Cleaner.CleanupTransfer runs ONLY when every file
  of a transfer is verified and NONE failed: it deletes originals whose policy
  requests it, removes generated PAR2 (unless maintain_par2), removes manifests,
  and drops the transfer rows. If any file failed verification it retains
  everything (originals, PAR2, manifests, rows) for operator recovery. Idempotent.
- The verification service triggers the cleaner whenever a file reaches verified
  (VerifyFile no-miss path and reconcileFileState); the Runtime builds the
  cleaner (maintain_par2 from config) and attaches it via SetCleaner.
- store.DeleteFilesByTransfer added.

Net: durable mode is now safe end to end — originals survive until verified,
are deleted only on success per policy, retained on failure, and manifests no
longer accumulate. (Known deviation: the post-upload script still runs at
upload completion, not post-verification; tracked as follow-up — it is not a
data-loss path.) All packages pass under -race; cleaner deletion logic TDD'd
with real temp files.
… item (#184 phase 6)

Links the durable transfer state back to the user-facing queue row so the UI no
longer shows pending_verification forever.

- store: SetCompletedItemForTransfer (stamps transfer_files.completed_item_id)
  and SetCompletedItemVerificationStatus (updates completed_items.verification_status).
- processor: on durable completion, links the completed item id to the transfer
  (alongside the existing pending_verification mark).
- verification service: finalizeTransfer replaces the per-file cleanup trigger —
  when every file of a transfer is terminal it sets the completed item to
  "verified" (then cleans up) or "verification_failed" (and retains everything).

Tested under -race: store updates, and VerifyFile flipping the completed item
to verified once all articles are confirmed.
…posted articles (#184 phase 6)

On restart, an interrupted upload's job is re-queued with its stable transfer_id,
so a manifest already exists for each file. Previously addPost regenerated fresh
articles (new Message-IDs), orphaning the partially-posted articles and creating
duplicate NZB segments, and re-uploaded everything from scratch.

Now addPost detects an existing manifest (ManifestSink.ExistingArticles) and
takes a recovery branch (addRecoveredPost): it reconstructs the articles from
the manifest records — preserving the exact Message-IDs, offsets and headers —
STATs each (filterMissing, bounded concurrency) and re-posts ONLY the articles
still missing on the server, without rewriting the manifest. If all articles are
already present the file completes immediately. A STAT error treats the article
as missing (re-post is idempotent); with no verify pool, all are re-posted.

This avoids duplicate segments and re-uploading terabytes after a crash. The
durable verification service already resumes on restart (reclaims expired leases
+ picks up due files/failures). transferwriter.Recorder.ExistingArticles added.
Tested under -race (existing-articles round-trip; filterMissing keep/skip/order
and no-pool fallback).
… verification (#184 phase 6)

On upgrade, any rows from the pre-durable pending_article_checks table are
migrated into verification_failures as STAT-only records: transfer_id =
completed_item_id, file_id = "legacy", article_index = -1. Lacking a manifest,
they cannot be re-posted (per the design) but the durable verification service
continues STAT-rechecking them with backoff. The legacy rows are deleted in the
same transaction; the migration is idempotent and a no-op when the table is
absent or empty.

The processor runs the migration once at startup when a durable store is
available. Tested under -race: rows mapped correctly, legacy table emptied,
claimable as failures, and second run migrates nothing.
…phase 7)

Adds pendingVerification and verificationFailed counts end to end:
- queue.GetQueueStats counts completed_items by verification_status.
- backend QueueStats struct gains the two fields (also flow through the web
  /queue/stats JSON and Wails bindings/models automatically).
- QueueStats.svelte shows a verification row (ShieldCheck / ShieldAlert) only
  when there is verification activity, so non-durable setups are unaffected.
- en/es/fr/tr i18n keys added.

Tested: queue stats counts under the Go suite; svelte-check passes.
Surfaces the process-wide upload engine + PAR2 scheduler metrics
(Processor.TransferRuntimeMetrics) in the diagnostics UI:
- backend App.GetTransferRuntimeMetrics returns a backend TransferRuntimeMetrics
  (active/queued upload workers, worker count, reserved/budget buffer bytes,
  active/queued PAR2 jobs, capacity).
- web GET /api/v1/metrics/transfer-runtime + Wails binding + models.ts type +
  unified/web clients.
- /metrics route shows a "Transfer Runtime" section (workers, buffer memory,
  PAR2 jobs), polled every 5s, hidden when no engine is active.
- en/es/fr/tr i18n.

svelte-check clean; Go build/tests green.
…mode (#184 phase 6)

Completes the durable cleanup lifecycle: the post-upload script now fires only
once a transfer is verified, not at upload completion.

- Extracts runPostUploadScript from Postie.ExecutePostUploadScript (preserving
  retry-status tracking via the queue) so it can run from the cleanup path too.
- transfercleaner runs the script (when configured) before deleting sources, and
  only when all files verified — never on verification failure.
- Runtime builds the cleaner's ScriptRunner from the post-upload-script config +
  queue, resolving the NZB path (store.GetCompletedItemNZBPath) and source path
  from the transfer's files.
- Processor skips the completion-time script call in durable mode (standalone
  mode unchanged).

Tested under -race: script runs once on full verify, never on failure.
The scriptQueue parameter added to NewRuntime in 05d2b68 updated the production
call sites but runtime_test.go was not staged in that commit, so CI failed to
build pkg/postie (5-arg calls vs the new 6-arg signature). Pass nil scriptQueue
in the three test calls.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant