
feat: Resilient Background Job Retry & Monitoring (#130) #466

Open
qiridigital wants to merge 1 commit into rohitdash08:main from qiridigital:feat/resilient-job-retry


@qiridigital

Summary

Implements Resilient Background Job Retry & Monitoring as described in #130.

Acceptance Criteria

  • Production ready implementation
  • Includes tests
  • Documentation updated (schema.sql)

Model: BackgroundJob

Persistent job queue entry stored in the background_jobs table:

Field | Type | Description
--- | --- | ---
job_type | str | Named handler identifier
payload | JSON text | Arbitrary job data
status | PENDING / RUNNING / SUCCEEDED / DEAD | Current state
attempts | int | How many times we've tried
max_attempts | int | Cap before marking DEAD (default 5)
last_error | text | Last exception message
next_run_at | datetime | When to next attempt
last_run_at | datetime | When last attempt occurred
finished_at | datetime | When job succeeded
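A rough sketch of this table, using Python's built-in sqlite3 so it runs standalone, shows how the status + next_run_at composite index mentioned in the commit notes lines up with the scheduler's "runnable PENDING jobs" lookup. The column types, the index name idx_background_jobs_status_next_run, storing timestamps as ISO-8601 text, and the runnable_jobs() helper are all assumptions; the PR's real DDL is in schema.sql.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical DDL approximating background_jobs; schema.sql is authoritative.
DDL = """
CREATE TABLE background_jobs (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    job_type     TEXT    NOT NULL,
    payload      TEXT    NOT NULL DEFAULT '{}',
    status       TEXT    NOT NULL DEFAULT 'PENDING',
    attempts     INTEGER NOT NULL DEFAULT 0,
    max_attempts INTEGER NOT NULL DEFAULT 5,
    last_error   TEXT,
    next_run_at  TEXT    NOT NULL,
    last_run_at  TEXT,
    finished_at  TEXT
);
-- Equality on status first, then a range scan on next_run_at.
CREATE INDEX idx_background_jobs_status_next_run
    ON background_jobs (status, next_run_at);
"""


def ts(dt: datetime) -> str:
    """Fixed-width ISO-8601 UTC text, so string order equals time order."""
    return dt.astimezone(timezone.utc).isoformat(timespec="microseconds")


def runnable_jobs(conn: sqlite3.Connection, now: datetime) -> list:
    """Ids of PENDING jobs whose next_run_at is due, oldest first."""
    rows = conn.execute(
        "SELECT id FROM background_jobs "
        "WHERE status = 'PENDING' AND next_run_at <= ? "
        "ORDER BY next_run_at",
        (ts(now),),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
now = datetime.now(timezone.utc)
for due in (now - timedelta(seconds=1), now + timedelta(minutes=5)):
    conn.execute(
        "INSERT INTO background_jobs (job_type, next_run_at) VALUES (?, ?)",
        ("send_reminder", ts(due)),
    )
print(runnable_jobs(conn, now))  # [1]: only the job that is already due
```

Putting status first in the index matches the query's equality predicate, leaving next_run_at free for the range comparison and the ORDER BY.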

Service: services/jobs.py

  • @register_job_handler("name") decorator — register any callable as a handler for a named job type
  • enqueue(job_type, payload) — persist a new PENDING job (with optional run_at scheduling)
  • Exponential backoff: delay = 10s * 2^(attempt-1) (10s, 20s, 40s, 80s, 160s)
  • start_scheduler(app) — launches a daemon thread polling every 30s for runnable PENDING jobs
  • stop_scheduler() — signals graceful shutdown
  • Built-in send_reminder handler wired to existing email/WhatsApp delivery
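The registry, enqueue, and retry-with-backoff flow listed above can be approximated in memory as follows. This is a hypothetical sketch, not the PR's services/jobs.py: the Job dataclass, the run_job(job, now) signature, and treating an unknown job_type as an ordinary failure are assumptions. Here a job is marked DEAD on the failure that brings attempts up to max_attempts.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, Optional

BASE_DELAY_SECONDS = 10
MAX_ATTEMPTS = 5

# job_type -> handler; plays the role of the PR's handler registry.
_HANDLERS: Dict[str, Callable[[dict], Any]] = {}


def register_job_handler(name: str):
    """Decorator registering the wrapped callable as the handler for `name`."""
    def decorator(fn):
        _HANDLERS[name] = fn
        return fn
    return decorator


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a BackgroundJob row."""
    job_type: str
    payload: str = "{}"  # JSON text, as in the payload column
    status: str = "PENDING"
    attempts: int = 0
    max_attempts: int = MAX_ATTEMPTS
    last_error: Optional[str] = None
    next_run_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    finished_at: Optional[datetime] = None


def enqueue(job_type: str, payload: dict, run_at: Optional[datetime] = None) -> Job:
    """Create a PENDING job, optionally scheduled for a later run_at."""
    job = Job(job_type=job_type, payload=json.dumps(payload))
    if run_at is not None:
        job.next_run_at = run_at
    return job


def run_job(job: Job, now: datetime) -> None:
    """Make one attempt and move the job to SUCCEEDED, PENDING (retry) or DEAD."""
    job.status = "RUNNING"
    job.attempts += 1
    try:
        handler = _HANDLERS.get(job.job_type)
        if handler is None:
            # Assumption: an unknown job_type is treated like any other failure.
            raise LookupError(f"no handler registered for {job.job_type!r}")
        handler(json.loads(job.payload))
    except Exception as exc:
        job.last_error = str(exc)
        if job.attempts >= job.max_attempts:
            job.status = "DEAD"  # give up; an admin can retry it later
        else:
            job.status = "PENDING"
            delay = BASE_DELAY_SECONDS * 2 ** (job.attempts - 1)
            job.next_run_at = now + timedelta(seconds=delay)
    else:
        job.status = "SUCCEEDED"
        job.finished_at = now


@register_job_handler("always_fails")
def always_fails(payload: dict) -> None:
    raise RuntimeError("boom")


job = enqueue("always_fails", {"reminder_id": 1})
for _ in range(MAX_ATTEMPTS):
    run_job(job, datetime.now(timezone.utc))
print(job.status, job.attempts, job.last_error)  # DEAD 5 boom
```

A persistent version would load and save the row around each run_job() call instead of mutating a dataclass.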

Routes: /jobs (admin-only)

Endpoint | Method | Description
--- | --- | ---
/jobs | GET | List jobs (filterable by ?status=)
/jobs/stats | GET | Aggregate counts by status
/jobs/enqueue | POST | Manually enqueue a job
/jobs/:id/retry | POST | Reset DEAD job back to PENDING
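The logic behind /jobs/stats and /jobs/:id/retry can be sketched without Flask as two plain functions. The JobRow type, the function names, reporting 0 for empty statuses, raising ValueError for non-DEAD jobs, and resetting attempts on retry are all assumptions; the PR's exact response shape is not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional

STATUSES = ("PENDING", "RUNNING", "SUCCEEDED", "DEAD")


@dataclass
class JobRow:
    """Hypothetical stand-in for a background_jobs row."""
    id: int
    status: str
    attempts: int = 0
    next_run_at: Optional[datetime] = None


def job_stats(jobs: Iterable[JobRow]) -> Dict[str, int]:
    """Count jobs per status, reporting 0 for statuses with no rows."""
    counts = Counter(job.status for job in jobs)
    return {status: counts[status] for status in STATUSES}


def retry_dead_job(job: JobRow, now: datetime) -> JobRow:
    """Reset a DEAD job to PENDING so the next poll picks it up."""
    if job.status != "DEAD":
        # Assumed to become an HTTP error response in the route
        # (the "retry non-dead fails" test case).
        raise ValueError(f"job {job.id} is {job.status}, not DEAD")
    job.status = "PENDING"
    job.attempts = 0       # assumption: restore the full retry budget
    job.next_run_at = now  # eligible immediately
    return job


jobs = [JobRow(1, "DEAD", attempts=5), JobRow(2, "SUCCEEDED"), JobRow(3, "PENDING")]
print(job_stats(jobs))
# {'PENDING': 1, 'RUNNING': 0, 'SUCCEEDED': 1, 'DEAD': 1}
retry_dead_job(jobs[0], datetime.now(timezone.utc))
print(job_stats(jobs))
# {'PENDING': 2, 'RUNNING': 0, 'SUCCEEDED': 1, 'DEAD': 0}
```

Resetting attempts matters under the DEAD-on-max rule: without it, a retried job that fails once more would be marked DEAD again immediately.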

Tests (test_jobs.py) — 13 tests

  • Unit: enqueue creates PENDING job, success flow, retry on failure, dead after max attempts, unknown handler
  • API: stats/list require admin (403 for regular users), admin stats format, list with enqueued job, retry dead job, retry non-dead fails, enqueue missing job_type, auth required

/claim #130

- Model: BackgroundJob with status (PENDING/RUNNING/SUCCEEDED/DEAD),
  attempts, max_attempts, last_error, next_run_at, exponential backoff
- Service: services/jobs.py
  - register_job_handler() decorator for named handlers
  - enqueue() to persist jobs
  - Exponential backoff: BASE_DELAY_SECONDS * 2^(attempt-1) seconds between retries
  - MAX_ATTEMPTS=5 before marking job DEAD
  - Background polling thread with configurable POLL_INTERVAL_SECONDS
  - Built-in send_reminder handler wired to existing reminder delivery
  - start_scheduler() / stop_scheduler() for lifecycle management
- Routes: GET /jobs (list), GET /jobs/stats (counts), POST /jobs/enqueue,
  POST /jobs/:id/retry (reset DEAD jobs) — all admin-only
- Schema: background_jobs table with status+next_run_at composite index
- App: scheduler auto-starts on app creation (skipped in TESTING mode)
- Tests: test_jobs.py — 13 tests covering enqueue, success, retry,
  dead-letter, unknown handler, admin-only access, stats, manual enqueue,
  retry endpoint, auth

Closes rohitdash08#130
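The scheduler lifecycle described in these notes (a daemon thread that wakes every POLL_INTERVAL_SECONDS, plus start_scheduler() and stop_scheduler()) might look roughly like this using only threading. In the PR, start_scheduler() receives the Flask app; here a plain tick callable stands in for "process runnable jobs", and the guard that makes a second start_scheduler() call a no-op is an assumption added for illustration.

```python
import logging
import threading
from typing import Callable, Optional

log = logging.getLogger(__name__)

POLL_INTERVAL_SECONDS = 30

_stop_event = threading.Event()
_thread: Optional[threading.Thread] = None
_lock = threading.Lock()


def _poll_loop(tick: Callable[[], None], interval: float) -> None:
    # Run one pass immediately, then wait up to `interval` seconds.
    # Event.wait() returns True as soon as stop_scheduler() sets the event.
    while not _stop_event.is_set():
        try:
            tick()
        except Exception:
            # One failed pass must not kill the daemon thread.
            log.exception("job polling pass failed")
        if _stop_event.wait(interval):
            break


def start_scheduler(tick: Callable[[], None],
                    interval: float = POLL_INTERVAL_SECONDS) -> threading.Thread:
    """Start the polling thread once; later calls return the running thread."""
    global _thread
    with _lock:
        if _thread is not None and _thread.is_alive():
            return _thread
        _stop_event.clear()
        _thread = threading.Thread(
            target=_poll_loop, args=(tick, interval),
            name="job-scheduler", daemon=True,
        )
        _thread.start()
        return _thread


def stop_scheduler(timeout: float = 5.0) -> None:
    """Signal the loop to exit and wait briefly for the thread to finish."""
    global _thread
    _stop_event.set()
    with _lock:
        if _thread is not None:
            _thread.join(timeout)
            _thread = None
```

Waiting on a threading.Event instead of calling time.sleep() lets stop_scheduler() interrupt a 30-second wait at once rather than after the next wake-up.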

Copilot AI left a comment


Pull request overview

Implements a DB-backed background job queue with retry/backoff, plus admin endpoints to monitor/enqueue/retry jobs, and adds corresponding schema + tests (per issue #130).

Changes:

  • Added BackgroundJob model + background_jobs table/index in schema.sql.
  • Introduced app.services.jobs for enqueueing, running, retry/backoff, and a polling scheduler thread.
  • Added /jobs admin endpoints and a new test_jobs.py suite covering unit + API behaviors.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.

File | Description
--- | ---
`packages/backend/app/models.py` | Adds BackgroundJob ORM model.
`packages/backend/app/db/schema.sql` | Adds background_jobs table + index.
`packages/backend/app/services/jobs.py` | Implements enqueue/run/retry logic and scheduler thread; registers built-in send_reminder handler.
`packages/backend/app/routes/jobs.py` | Adds admin-only monitoring/enqueue/retry endpoints.
`packages/backend/app/routes/__init__.py` | Registers the new jobs blueprint.
`packages/backend/app/__init__.py` | Starts the scheduler on app startup (non-testing).
`packages/backend/tests/test_jobs.py` | Adds unit + API tests for the job system.


Comment on lines +18 to +19

```python
from .extensions import db
from .models import BackgroundJob
import json
import logging
import threading
import time
```
Comment on lines +6 to +8

```
- APScheduler-based scheduler that polls and dispatches jobs
- Prometheus metrics for monitoring
- Admin monitoring endpoint at GET /jobs
```
Comment on lines +55 to +58

```python
# Start background job scheduler (skip in test environments)
if not app.config.get("TESTING"):
    from .services.jobs import start_scheduler
    start_scheduler(app)
```
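Because this guard only skips TESTING, every process that builds the app outside tests (for example, each worker of a multi-process WSGI server) starts its own poller, so two pollers can see the same PENDING row. A common safeguard, sketched here as an assumption rather than something the PR is shown to do, is to claim a job with a conditional UPDATE and run it only if that UPDATE changed exactly one row. The claim_job() helper and the cut-down table are hypothetical.

```python
import sqlite3


def claim_job(conn: sqlite3.Connection, job_id: int) -> bool:
    """Atomically move one job from PENDING to RUNNING.

    Returns True only for the caller whose UPDATE actually changed the row,
    so a second poller racing on the same id gets False and skips it.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE background_jobs SET status = 'RUNNING' "
            "WHERE id = ? AND status = 'PENDING'",
            (job_id,),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE background_jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO background_jobs (id, status) VALUES (1, 'PENDING')")
conn.commit()

print(claim_job(conn, 1))  # True: first poller wins
print(claim_job(conn, 1))  # False: row is no longer PENDING
```

On PostgreSQL, SELECT ... FOR UPDATE SKIP LOCKED is another common way to let several pollers share one queue table without picking up the same row.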
```
@@ -0,0 +1,118 @@
"""Background job monitoring endpoints (admin + self-service)."""

import json
```
Comment on lines +3 to +4

```python
import json
from datetime import datetime, timedelta
```
Comment on lines +25 to +27

```python
MAX_ATTEMPTS = 5
BASE_DELAY_SECONDS = 10  # first retry waits 10 s, doubles each time
POLL_INTERVAL_SECONDS = 30
```
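With these constants the wait before the next attempt is BASE_DELAY_SECONDS * 2^(attempt-1). The sketch below tabulates it, assuming that the failure which brings attempts up to MAX_ATTEMPTS marks the job DEAD instead of scheduling another retry. Under that assumption only four waits (10 s + 20 s + 40 s + 80 s = 150 s) ever elapse, and the 160 s step listed in the PR summary would not be reached; if the PR counts attempts differently, the cut-off shifts by one. The helper names are hypothetical.

```python
from typing import List

MAX_ATTEMPTS = 5
BASE_DELAY_SECONDS = 10


def retry_delay(attempt: int, base: int = BASE_DELAY_SECONDS) -> int:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return base * 2 ** (attempt - 1)


def retry_schedule(max_attempts: int = MAX_ATTEMPTS) -> List[int]:
    """Delays actually used if failure number max_attempts marks the job DEAD."""
    return [retry_delay(n) for n in range(1, max_attempts)]


print([retry_delay(n) for n in range(1, 6)])  # [10, 20, 40, 80, 160]
print(retry_schedule())                       # [10, 20, 40, 80]
print(sum(retry_schedule()))                  # 150
```

Because the poller only wakes every POLL_INTERVAL_SECONDS, a job becomes eligible at next_run_at but actually reruns on the following poll, so in practice the 10 s and 20 s delays are rounded up toward the 30 s poll interval.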
Comment on lines +166 to +167

```python
from .services.reminders import send_reminder
from .models import Reminder as ReminderModel
```