feat: process-wide durable upload architecture — foundations (#184)#233
Open
javi11 wants to merge 22 commits into
Open
feat: process-wide durable upload architecture — foundations (#184)#233javi11 wants to merge 22 commits into
javi11 wants to merge 22 commits into
Conversation
Introduces three conservative, backward-compatible config fields that underpin the durable process-wide upload architecture: - posting.upload_buffer_memory_limit (int64, 0=auto): caps total upload engine buffer memory independent of queue concurrency. - par2.max_concurrent_jobs (int, default 1): bounds concurrent PAR2 jobs so memory_limit applies per active job, not per queue job. - post_check.max_concurrent_checks (int, 0=auto): caps global STAT verification concurrency. Wires defaults (par2 normalised to 1 on load), Validate() guards against negative values, the generated Wails models, the Par2/PostCheck/Posting settings sections, and en/es/fr/tr translations. Adds config tests for defaults, load normalisation, and validation. No behavior change yet; later phases consume these fields.
…184 phase 1) Introduces par2.Scheduler, a process-wide concurrency gate that allows at most par2.max_concurrent_jobs PAR2 operations to run at once; excess work waits for a slot (cancellable via context) rather than running concurrently. par2.ScheduledExecutor wraps each per-job executor so its progress/config are preserved while execution is gated. Adds metrics (Active/Queued/Capacity) for later observability. A new postie.Runtime owns the shared scheduler. Processor creates one Runtime during init and closes it on shutdown; jobs are now built via postie.NewWithRuntime so all jobs share the gate. postie.New is retained as a compatibility wrapper that creates a private scheduler for CLI/standalone use. This prevents multiple concurrent queue jobs from each reserving the full configured PAR2 memory_limit. Tests cover the concurrency bound, cancellation while queued, nil-runtime fallback, and config-driven capacity, all under -race.
#184 phase 2) Builds on the per-poster shared worker pool from #228 to make limits truly process-wide: a new poster.Engine (owned by the transfer Runtime, injected via NewWithEngine) bounds effective posting concurrency and in-flight buffer memory across ALL jobs, not just within one poster. ComputeEngineBudget derives limits from article size, posting.upload_buffer_memory_limit (0 = auto) and total upload connection capacity: per_article = article_size + ~yenc + 256 KiB auto_budget = clamp(per_article * min(conn_capacity, 32), 64MiB, 512MiB) worker_count = min(conn_capacity, floor(budget / per_article)) The per-poster pump is unchanged; integration is two gates: the read-ahead goroutine reserves per-article budget before allocating/reading (blocking when the global budget is exhausted), and runUploadJob acquires a global worker slot before posting. Both reservations are released on every buffer lifecycle exit (post done, read error, submit cancel). Nil engine = standalone, no gate. The Runtime sizes one engine from upload-pool connection capacity and shares it with every job via postie.NewWithEngine. Tests cover the budget formula, bounded buffers, bounded workers, cancellation and nil-safety under -race.
…phase 7 partial) Adds Runtime.Metrics() returning a RuntimeMetrics snapshot (active/queued upload workers, worker count, reserved/configured buffer bytes, active/queued PAR2 jobs and capacity) and Processor.TransferRuntimeMetrics() to surface it. This makes the scheduler metrics introduced in phases 1-2 queryable so memory growth can be attributed to a subsystem. Nil-safe throughout; tested under -race. Note: queue-stat fields (pendingVerification/verificationFailed) and full frontend dashboard wiring are deferred to the durable-verification phases.
…table transfer_id (#184 phase 3) Adds the persistence foundation for durable uploads and verification: - internal/manifest: streaming, zstd-compressed JSON Lines manifests written via temp-file + atomic rename. One ArticleRecord per posted segment captures everything needed to re-post byte-for-byte with the same Message-ID (offset, body size, message-id, subject, from, groups, date, custom headers, X-Nxg, yenc part metadata). Versioned header; streaming reader rejects unknown versions. RecordFromArticle bridges from article.Article. - migration 007: transfer_files (one row per source/PAR2 file -> manifest, with upload/verification state + next_check_at + cleanup_policy) and verification_failures (ONLY failed articles, with database leases). Successful articles are never persisted to avoid millions of SQLite rows. - internal/transferstore: DAO over both tables — UpsertFile/GetFile/List, Set{Upload,Verification}State, AddFailure (idempotent), lease-based ClaimDueFailures, UpdateFailureAfterCheck, ReclaimExpiredLeases, CountFailures. - queue: FileJob gains a stable TransferID assigned at creation, persisted in the job body so it survives retries and crash recovery. All tested under -race (manifest round-trip/atomicity/streaming, store CRUD + leasing/reclaim, transfer_id stability across retry). Not yet wired into the posting path — phases 4-6 consume this.
… phase 4) Adds internal/verification.Service, the durable post-check engine that operates on the transfer store + manifests, decoupled from NNTP via Stater/Reposter interfaces (so it is unit-testable and wireable later): - VerifyFile streams a file's manifest, STATs every article with bounded concurrency (post_check.max_concurrent_checks), and persists ONLY the misses. No misses -> file marked verified; misses -> verifying + failure rows. - ProcessDueFailures claims a due batch via database leases and, per failure, re-posts (reusing the manifest record's exact Message-ID/headers) while under MaxReposts, else STAT-rechecks with exponential backoff up to MaxBackoff, giving up after MaxDeferredChecks. Resolving the last failure flips the file to verified; exhaustion flips it to verification_failed. - Manifest records for re-posts are resolved per-file (each manifest streamed at most once per batch), never loading a whole large upload into memory. Tested under -race: all-present, persisted misses, repost-then-resolve, exhaustion->failed, and empty-batch. Not yet wired into the processor or the posting path — the live cutover (write manifests during upload, release slot before verification, remove job-local checkLoop, crash recovery) remains.
…phase 5) Extracts the shared posting primitive postYenc (header build + yEnc meta + stale-connection retry + throttle + stats) from postArticleWithBody so the normal upload path and the durable re-post path produce byte-identical posts. Behavior-preserving; covered by existing poster tests. Adds poster.Reposter, a process-wide re-poster that reconstructs an article from a manifest.ArticleRecord, reads its body from the original source file at the recorded offset, and re-posts reusing the exact Message-ID and headers through the shared upload pool + engine gates. Implements verification.Reposter. Tested under -race: Message-ID/body reuse, missing-source error, engine resource drain, no-pool guard.
…184 phase 5) Adds internal/transferwriter.Recorder, the bridge from the poster's article list to the durable manifest store + transfer persistence. Per job (bound to a transfer_id), RecordFile writes an immutable manifest for a source/PAR2 file (atomic temp+rename) and upserts its transfer_files row in the planned state, before the file's articles are posted — so an interrupted upload can be resumed and verified with the exact Message-IDs that were posted. File role is inferred from the path (par2 -> generated_par2); the file id is a stable hash of the source path so retries/recovery map to the same manifest. Tested under -race: manifest+row round-trip, par2 role, deterministic id.
#184 phase 5) Adds an optional poster.ManifestSink, invoked in addPost right after a file's articles are built and BEFORE any network posting. When set, the file's manifest is recorded first; if recording fails the file is failed (the source file handle is closed and an error returned) so nothing is posted without durable recovery data. A nil sink preserves standalone behaviour. Wired through NewWithEngine (new optional param); New passes nil. Postie passes nil for now — the per-job sink instance (bound to transfer_id + store) is wired in the next step. Tested under -race: sink invoked once with the file's articles; sink error aborts before PostYenc is ever called.
…184 phase 5) Activates durable manifest recording end-to-end: - Runtime now owns a *transferstore.Store + manifest base dir, and exposes NewManifestRecorder(transferID) which returns a transferwriter-backed poster.ManifestSink (or untyped nil when no store / empty transferID). - queue.Queue.DB() exposes the shared sql.DB so the transfer store reuses the same connection. - Processor builds the store from the queue DB and a manifest dir alongside the database file, passes them to NewRuntime, and threads job.TransferID into postie.NewWithRuntime. - Postie builds the per-job recorder from the runtime and hands it to the poster, which writes each file's manifest + transfer_files row before posting. With this, every queue job writes durable manifests keyed by its stable transfer_id; standalone/CLI use (nil runtime/store) is unaffected. Tested under -race (recorder construction, nil-safety) plus the existing poster sink tests.
… scheduling (#184 phase 5) MarkUploaded advances a transfer file to the uploaded state (both upload_state and verification_state), recording posted_at and the first-check due time (posted_at + post_check.delay). ListDueFiles returns files awaiting their first verification check (verification_state=uploaded, next_check_at<=now), oldest first, so the durable verification service can sleep-until-due and pick work. Tested under -race.
…phase 5) Adds Service.Run(ctx): a background loop (PollInterval, default 15s) that each cycle reclaims expired leases, runs the first verification check for due files (VerifyDueFiles), then processes due verification failures. VerifyDueFiles lists files whose check is due and verifies each, logging and skipping per-file errors so one bad file never stalls the batch. Tested under -race (only due uploaded files are verified; not-due files are left untouched).
…re verify (#184 phase 5) Flips verification to the durable service when running with a store (durable mode), wiring the cutover end to end: - poster: the legacy in-poster checkLoop is skipped in durable mode (manifest sink set) via immediateCheckEnabled(); posting returns as soon as articles are posted, so the queue upload slot is released before propagation delay. - Postie: after a successful Post, marks the transfer's files uploaded (Recorder.CompleteUpload) with next_check_at = now + post_check.delay, so the durable service picks them up. No-op in standalone mode. - transferwriter.Recorder.CompleteUpload advances every file of the transfer to the uploaded state. - Runtime builds the verification.Service (verify-pool Stater + engine-backed Reposter, config mapped from post_check) and exposes RunVerification; the Processor starts it once as a background goroutine. Result: completed uploads no longer hold a queue slot during propagation; verification + re-posting run durably in the background and survive restarts. Standalone/CLI use (no store) keeps the in-poster checkLoop. All packages pass under -race.
… mode (#184 phase 6) Data-safety guard for the durable upload path: previously delete_original ran in processFile immediately after posting — i.e. BEFORE the background service verified the articles propagated, destroying the only recovery source if verification later failed. Now, when durable verification is active (Runtime.DurableVerificationEnabled), the processor retains originals at upload completion and marks the completed item pending_verification instead of verified. Deferred cleanup (running the post-upload script + delete_original + manifest removal once a transfer is verified) is the follow-up; until it lands, durable-mode originals are retained rather than deleted — a safe failure mode. Standalone mode is unchanged. Adds Runtime.DurableVerificationEnabled() (tested) and Processor.durableMode().
…ion (#184 phase 6) Completes the durable cleanup lifecycle: cleanup now happens only AFTER the background verification service confirms a transfer, never before. - Upload completion persists the delete-original intent into the transfer's cleanup_policy (Recorder.CompleteUpload(deleteOriginal); store.SetCleanupPolicy). The processor resolves the policy (job override > global) and hands it to Postie via SetDeleteOriginal before posting. - New internal/transfercleaner.Cleaner.CleanupTransfer runs ONLY when every file of a transfer is verified and NONE failed: it deletes originals whose policy requests it, removes generated PAR2 (unless maintain_par2), removes manifests, and drops the transfer rows. If any file failed verification it retains everything (originals, PAR2, manifests, rows) for operator recovery. Idempotent. - The verification service triggers the cleaner whenever a file reaches verified (VerifyFile no-miss path and reconcileFileState); the Runtime builds the cleaner (maintain_par2 from config) and attaches it via SetCleaner. - store.DeleteFilesByTransfer added. Net: durable mode is now safe end to end — originals survive until verified, are deleted only on success per policy, retained on failure, and manifests no longer accumulate. (Known deviation: the post-upload script still runs at upload completion, not post-verification; tracked as follow-up — it is not a data-loss path.) All packages pass under -race; cleaner deletion logic TDD'd with real temp files.
… item (#184 phase 6) Links the durable transfer state back to the user-facing queue row so the UI no longer shows pending_verification forever. - store: SetCompletedItemForTransfer (stamps transfer_files.completed_item_id) and SetCompletedItemVerificationStatus (updates completed_items.verification_status). - processor: on durable completion, links the completed item id to the transfer (alongside the existing pending_verification mark). - verification service: finalizeTransfer replaces the per-file cleanup trigger — when every file of a transfer is terminal it sets the completed item to "verified" (then cleans up) or "verification_failed" (and retains everything). Tested under -race: store updates, and VerifyFile flipping the completed item to verified once all articles are confirmed.
…posted articles (#184 phase 6) On restart, an interrupted upload's job is re-queued with its stable transfer_id, so a manifest already exists for each file. Previously addPost regenerated fresh articles (new Message-IDs), orphaning the partially-posted articles and creating duplicate NZB segments, and re-uploaded everything from scratch. Now addPost detects an existing manifest (ManifestSink.ExistingArticles) and takes a recovery branch (addRecoveredPost): it reconstructs the articles from the manifest records — preserving the exact Message-IDs, offsets and headers — STATs each (filterMissing, bounded concurrency) and re-posts ONLY the articles still missing on the server, without rewriting the manifest. If all articles are already present the file completes immediately. A STAT error treats the article as missing (re-post is idempotent); with no verify pool, all are re-posted. This avoids duplicate segments and re-uploading terabytes after a crash. The durable verification service already resumes on restart (reclaims expired leases + picks up due files/failures). transferwriter.Recorder.ExistingArticles added. Tested under -race (existing-articles round-trip; filterMissing keep/skip/order and no-pool fallback).
… verification (#184 phase 6) On upgrade, any rows from the pre-durable pending_article_checks table are migrated into verification_failures as STAT-only records: transfer_id = completed_item_id, file_id = "legacy", article_index = -1. Lacking a manifest, they cannot be re-posted (per the design) but the durable verification service continues STAT-rechecking them with backoff. The legacy rows are deleted in the same transaction; the migration is idempotent and a no-op when the table is absent or empty. The processor runs the migration once at startup when a durable store is available. Tested under -race: rows mapped correctly, legacy table emptied, claimable as failures, and second run migrates nothing.
…phase 7) Adds pendingVerification and verificationFailed counts end to end: - queue.GetQueueStats counts completed_items by verification_status. - backend QueueStats struct gains the two fields (also flow through the web /queue/stats JSON and Wails bindings/models automatically). - QueueStats.svelte shows a verification row (ShieldCheck / ShieldAlert) only when there is verification activity, so non-durable setups are unaffected. - en/es/fr/tr i18n keys added. Tested: queue stats counts under the Go suite; svelte-check passes.
Surfaces the process-wide upload engine + PAR2 scheduler metrics (Processor.TransferRuntimeMetrics) in the diagnostics UI: - backend App.GetTransferRuntimeMetrics returns a backend TransferRuntimeMetrics (active/queued upload workers, worker count, reserved/budget buffer bytes, active/queued PAR2 jobs, capacity). - web GET /api/v1/metrics/transfer-runtime + Wails binding + models.ts type + unified/web clients. - /metrics route shows a "Transfer Runtime" section (workers, buffer memory, PAR2 jobs), polled every 5s, hidden when no engine is active. - en/es/fr/tr i18n. svelte-check clean; Go build/tests green.
…mode (#184 phase 6) Completes the durable cleanup lifecycle: the post-upload script now fires only once a transfer is verified, not at upload completion. - Extracts runPostUploadScript from Postie.ExecutePostUploadScript (preserving retry-status tracking via the queue) so it can run from the cleanup path too. - transfercleaner runs the script (when configured) before deleting sources, and only when all files verified — never on verification failure. - Runtime builds the cleaner's ScriptRunner from the post-upload-script config + queue, resolving the NZB path (store.GetCompletedItemNZBPath) and source path from the transfer's files. - Processor skips the completion-time script call in durable mode (standalone mode unchanged). Tested under -race: script runs once on full verify, never on failure.
The scriptQueue parameter added to NewRuntime in 05d2b68 updated the production call sites but runtime_test.go was not staged in that commit, so CI failed to build pkg/postie (5-arg calls vs the new 6-arg signature). Pass nil scriptQueue in the three test calls.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Foundational, independently-tested layers of the process-wide durable upload architecture (issue #184). Builds on top of the per-poster shared worker pool already merged in #228/#230 and extends it to be truly process-wide, plus adds the durable persistence + verification machinery for crash-safe re-posting.
Phases 0–4 + re-poster + partial observability — every reusable component, each
-racetested. The remaining live cutover (writing manifests into the upload path, the verification serviceRunloop + ownership, upload-slot release, removing the job-localcheckLoop, crash recovery, cleanup lifecycle, frontend stats) is intentionally not in this PR.Why this is still needed after #228/#230
#228 shares one worker pool within a single poster, but the processor still creates a new poster per job, so workers and read-ahead buffers still multiply with
max_concurrent_uploads. This PR introduces a transferRuntime(owned by the processor, shared across jobs) and aposter.Enginethat gates effective concurrency and in-flight memory globally, layered onto the existing per-poster pump — not replacing it.What's included
a3d2e58):posting.upload_buffer_memory_limit,par2.max_concurrent_jobs,post_check.max_concurrent_checks+ defaults/validation + Wails models + settings UI + en/es/fr/tr i18n. No behavior change.f580a87): process-widepar2.Schedulersomemory_limitapplies per active job. Newpostie.Runtimeowns it;postie.Newretained as standalone wrapper.084c018):poster.Engine(worker-slot semaphore + buffer-byte budget) injected viaNewWithEngine. Two gates on the existing pump: read-ahead reserves per-article budget;runUploadJobacquires a global worker slot. Nil engine = standalone.b170c19): streaming zstd JSONLmanifeststore (atomic temp+rename); migration 007 (transfer_files+ lease-basedverification_failures, no per-article rows);transferstoreDAO; stableFileJob.TransferIDsurviving retries.b8f2e1c): lease-basedverification.Service(VerifyFileSTATs a manifest persisting only misses;ProcessDueFailuresre-posts then STAT-rechecks with backoff). Decoupled viaStater/Reposter.61a5ccf): extracted behavior-preservingposter.postYenc;poster.Reposterre-posts from a manifest record reusing the exact Message-ID/headers.d056c75):Runtime.Metrics()/Processor.TransferRuntimeMetrics().Test plan
go test ./internal/... ./pkg/...— all green-raceon all new packagesgolangci-lintclean on new code (2 remaining findings are pre-existing in main)MigrateUpDefaults are conservative and backward compatible; no runtime behavior change for existing users yet (engine wiring is additive and nil-safe).