PAYMENTS-11567 Resque latency metrics by WillemHoman · Pull Request #30 · bigcommerce/bc-prometheus-ruby

WillemHoman · 2026-05-28T03:39:16Z

What? Why?

Bigpay's domain-event publishing was suspected of being slow. To diagnose this, a previous PR (bigpay#10597, PAYMENTS-11567) added per-job Prometheus histograms (queue latency and publish duration) directly inside the Resque worker child process. The synchronous metric-flush required to ensure observations weren't lost when the fork-per-job child exited had to wait on the bc-prom client's background drain cycle, adding ~480 ms per job. With 20 worker pods each handling jobs serially, fleet throughput on the scheduled_action queue collapsed from ~1000 jobs/sec to ~40 jobs/sec, queues backed up, and the change was reverted.
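The throughput figures above follow from serial-worker arithmetic, sketched here in Ruby. The ~20 ms baseline per-job time is an inference from the quoted ~1000 jobs/sec across 20 pods, not a measured value, and `fleet_throughput` is a hypothetical helper.

```ruby
# Back-of-envelope check of the fleet throughput numbers quoted above.
# Assumption: each pod runs one job at a time, so fleet throughput is
# pods divided by seconds per job.
def fleet_throughput(pods:, seconds_per_job:)
  pods / seconds_per_job.to_f
end

PODS = 20

# Inferred baseline: ~1000 jobs/sec over 20 pods implies ~20 ms per job.
BASELINE_SECONDS = PODS / 1000.0
# The synchronous flush added ~480 ms on top of that.
REGRESSED_SECONDS = BASELINE_SECONDS + 0.480

baseline_rate  = fleet_throughput(pods: PODS, seconds_per_job: BASELINE_SECONDS)
regressed_rate = fleet_throughput(pods: PODS, seconds_per_job: REGRESSED_SECONDS)

puts format('baseline: %.0f jobs/sec, regressed: %.0f jobs/sec, slowdown: %.0fx',
            baseline_rate, regressed_rate, baseline_rate / regressed_rate)
```

Under these assumptions the fixed flush cost dominates: the faster the job, the larger the relative slowdown.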

This PR provides the same diagnostic visibility — but implemented in the parent Resque worker process where no synchronous flush is needed. Bigpay (and any other service using bc-prometheus-ruby with Resque) opts in via an env var, gets per-job metrics with bounded cardinality, and pays zero per-job latency cost.

Two new histograms are exposed:

- resque_job_queue_latency_seconds{job_class}: time from scheduled_at (or enqueued_at) until the worker picks the job up. Tells you whether the queue is backing up. Only emitted for ActiveJob-shaped payloads (see below).
- resque_job_perform_duration_seconds{job_class}: total child-process lifetime (fork → Process.waitpid return). A proxy for "how long is this job class actually taking?". It includes fork overhead and exit, but at the seconds-scale used for incident diagnosis the ~15 ms of fixed overhead is noise. Works for every Resque job regardless of payload shape.

Off by default. Opt in per service by setting PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 on the worker pods.

ActiveJob-shaped payload contract for queue_latency: the gem reads three fields from payload['args'][0] (which must be a Hash):

- job_class: the user's actual job class name; used as the metric label.
- enqueued_at: ISO 8601 string; used as the queue-latency anchor when scheduled_at is absent.
- scheduled_at: ISO 8601 string; preferred over enqueued_at when present (e.g. retries-with-backoff).

ActiveJob (jobs inheriting from ApplicationJob, enqueued via .perform_later) produces this shape natively — the payload is wrapped by ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps those three keys into args[0]. Vanilla Resque jobs (class MyJob; @queue = :foo; def self.perform; end, enqueued via Resque.enqueue) do not produce this shape — their args are raw primitive values, not a wrapping hash — so queue_latency silently no-ops for them. perform_duration is unaffected. See Supporting queue_latency for vanilla Resque jobs below for the options if this matters for your service.
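The contract above can be illustrated with a small extractor. This is a sketch, not the gem's JobPayload: `parse_iso8601`, `queue_latency_fields`, and the sample payloads (including the `PublishJob` name) are hypothetical, and falling back to enqueued_at when scheduled_at fails to parse is an assumption.

```ruby
require 'time'

# Hypothetical helper: returns a Time for a valid ISO 8601 string, else nil.
def parse_iso8601(value)
  return nil unless value.is_a?(String)

  Time.iso8601(value)
rescue ArgumentError
  nil
end

# Hypothetical helper: extracts the label and latency anchor from a Resque
# payload hash, or returns nil when the payload is not ActiveJob-shaped.
def queue_latency_fields(payload)
  first_arg = payload.is_a?(Hash) ? Array(payload['args']).first : nil
  return nil unless first_arg.is_a?(Hash)

  job_class = first_arg['job_class']
  # scheduled_at is preferred; enqueued_at is the fallback anchor.
  anchor = parse_iso8601(first_arg['scheduled_at']) ||
           parse_iso8601(first_arg['enqueued_at'])
  return nil unless job_class.is_a?(String) && anchor

  { job_class: job_class, anchor: anchor }
end

# Illustrative payloads; real ActiveJob payloads carry more keys than this.
ACTIVE_JOB_PAYLOAD = {
  'class' => 'ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper',
  'args' => [{ 'job_class' => 'PublishJob', 'enqueued_at' => '2026-05-28T03:39:16Z' }]
}.freeze
VANILLA_PAYLOAD = { 'class' => 'MyJob', 'args' => [42, 'foo'] }.freeze

p queue_latency_fields(ACTIVE_JOB_PAYLOAD)
p queue_latency_fields(VANILLA_PAYLOAD)
```

The vanilla payload yields nil because its first argument is a primitive rather than a wrapping hash, which is the silent no-op case described above.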

Why in bc-prometheus-ruby (not in each client service)

The instrumentation hooks live at the Resque::Worker level — Resque.before_fork for queue latency, Module#prepend around Resque::Worker#perform_with_fork for perform duration. These are gem-level concerns: every BigCommerce service using Resque has the same Worker class and the same fork-per-job lifecycle. Adding the metrics here means:

One implementation, used by all services, with consistent metric names and label semantics for cross-service comparison in Grafana.
No per-service code change required to adopt — just flip the env var on worker pods.
The architectural lesson from the bigpay incident is encoded once. Future services don't need to rediscover that in-child metric collection requires a synchronous flush and that the flush is incompatible with fork-per-job latency budgets. The gem just does the right thing.
Opt-in via env var protects services that don't want the per-job cardinality cost (one histogram observation per job per worker pod can be a lot at fleet scale).

If this lived in each service's repo, every team would write similar Resque.before_fork glue, with subtle inconsistencies in label naming, anchor-time logic, and error handling.

What was implemented

New env var

PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED (default 0). Read once at boot via the existing Bigcommerce::Prometheus.configure machinery (lib/bigcommerce/prometheus/configuration.rb). Mirrors the existing PROMETHEUS_ENABLED opt-in pattern.
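A minimal sketch of how such a flag might be read. `PerJobMetricsFlag` is hypothetical, and treating only the literal value 1 as "on" is an assumption; the gem's configuration code may accept other spellings.

```ruby
# Hypothetical flag reader, not the gem's configuration machinery.
module PerJobMetricsFlag
  ENV_KEY = 'PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED'

  # Accepts any object responding to fetch (ENV or a plain Hash), which
  # keeps the check easy to exercise in specs.
  def self.enabled?(env = ENV)
    env.fetch(ENV_KEY, '0').to_s.strip == '1'
  end
end

puts PerJobMetricsFlag.enabled?({ PerJobMetricsFlag::ENV_KEY => '1' })
puts PerJobMetricsFlag.enabled?({})
```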

New module: `Bigcommerce::Prometheus::Integrations::Resque::JobMetrics`

Lives at lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb. Class-method API:

- JobMetrics.start(client:): no-op unless the env var is on. Prepends WorkerInstrumentation onto Resque::Worker. Idempotent.
- JobMetrics.record_queue_latency(payload): pushes a resque_job envelope with metric: 'queue_latency', value = seconds elapsed from scheduled_at (or enqueued_at when scheduled_at is absent) to now, label job_class. Takes a JobPayload (built once per job in the prepend).
- JobMetrics.record_perform_duration(payload, duration): pushes a resque_job envelope with metric: 'perform_duration', value = the supplied duration, label job_class. Takes the same JobPayload instance.

Payload parsing is its own class — Bigcommerce::Prometheus::Integrations::Resque::JobPayload — which wraps a Resque::Job in initialize, eagerly extracts job_class and anchor_time via type-checking extractors (no exception-based control flow), and exposes them as plain attr_readers. The JobMetrics recording methods take this prebuilt payload rather than re-parsing the job on each call.

The WorkerInstrumentation submodule is the prepend that wraps perform_with_fork. It builds the JobPayload once, records the queue latency before super, and records the perform duration in ensure:

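```ruby
def perform_with_fork(job, &block)
  # Monotonic clock: unaffected by wall-clock adjustments during the job.
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  payload    = JobPayload.new(job)
  JobMetrics.record_queue_latency(payload)
  super # bare super forwards job and block unchanged
ensure
  # Runs whether the job succeeds or raises, so every job is timed.
  JobMetrics.record_perform_duration(
    payload,
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  )
end
```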

Both record_* methods rescue StandardError and log a warning — metric push failures never propagate into the publish/perform path.
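The rescue-and-warn behaviour can be sketched as follows. `SketchJobMetrics` and the two client classes are hypothetical stand-ins; the `type` and `metric` values come from this description, while the `value` and `labels` envelope keys and the `push` method name are assumptions.

```ruby
require 'logger'

# Hypothetical stand-in for JobMetrics.record_perform_duration.
module SketchJobMetrics
  module_function

  def record_perform_duration(client, job_class, duration, logger: Logger.new($stderr))
    envelope = {
      type: 'resque_job',
      metric: 'perform_duration',
      value: duration,
      labels: { job_class: job_class }
    }
    client.push(envelope)
    true
  rescue StandardError => e
    # A failed push is logged and swallowed so the job path is unaffected.
    logger.warn("resque job metric push failed: #{e.class}: #{e.message}")
    false
  end
end

# Test double that records every envelope it is given.
class RecordingClient
  attr_reader :envelopes

  def initialize
    @envelopes = []
  end

  def push(envelope)
    @envelopes << envelope
  end
end

# Test double that simulates a broken transport.
class FailingClient
  def push(_envelope)
    raise IOError, 'exporter unreachable'
  end
end

quiet = Logger.new(nil)
puts SketchJobMetrics.record_perform_duration(RecordingClient.new, 'PublishJob', 0.12, logger: quiet)
puts SketchJobMetrics.record_perform_duration(FailingClient.new, 'PublishJob', 0.12, logger: quiet)
```

Returning a boolean is a convenience for the sketch only; the point is that the IOError never reaches the caller.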

Type collectors

Two type collectors, one per envelope shape, registered side-by-side in Instrumentors::Resque#start. The upstream PrometheusExporter::Server::Collector routes each envelope to whichever collector's type matches envelope['type'] — no in-collector dispatch needed.

- Bigcommerce::Prometheus::TypeCollectors::Resque (lib/bigcommerce/prometheus/type_collectors/resque.rb) continues to own the aggregate worker/queue gauges (resque_workers_total, jobs_failed_total, jobs_pending_total, jobs_processed_total, queues_total, queue_sizes) fed by Collectors::Resque#collect.
- Bigcommerce::Prometheus::TypeCollectors::ResqueJob (lib/bigcommerce/prometheus/type_collectors/resque_job.rb) owns the two new histograms:
  - resque_job_queue_latency_seconds with buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).
  - resque_job_perform_duration_seconds with buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).

  It calls super(type: 'resque_job', …) in initialize so the registered type string matches the type: 'resque_job' envelopes that JobMetrics.record_* emits.
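The "one collector per envelope type" routing can be pictured with a toy router. This is a hypothetical illustration of the idea, not PrometheusExporter's implementation; `SketchRouter` and `SketchCollector` are invented names.

```ruby
# Hypothetical collector: remembers the envelopes routed to it.
class SketchCollector
  attr_reader :type, :seen

  def initialize(type)
    @type = type
    @seen = []
  end

  def collect(envelope)
    @seen << envelope
  end
end

# Hypothetical router: looks the collector up by the envelope's type string,
# so no collector needs its own dispatch logic.
class SketchRouter
  def initialize
    @collectors = {}
  end

  def register(collector)
    @collectors[collector.type] = collector
  end

  def process(envelope)
    collector = @collectors[envelope['type']]
    collector&.collect(envelope)
  end
end

router    = SketchRouter.new
aggregate = SketchCollector.new('resque')
per_job   = SketchCollector.new('resque_job')
router.register(aggregate)
router.register(per_job)

router.process({ 'type' => 'resque_job', 'metric' => 'queue_latency', 'value' => 0.3 })
puts "aggregate saw #{aggregate.seen.size}, per-job saw #{per_job.seen.size}"
```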

Wiring

Integrations::Resque.start(client:) now also calls JobMetrics.start(client:). If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.

Docs

README updated with a new "Per-job metrics (opt-in)" subsection under Resque, and a row added to the Configuration table for resque_per_job_metrics_enabled.
CHANGELOG entry under Pending Release.

Specs

spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb — JobMetrics-specific responsibilities (envelope shape, error rescue, integration with JobPayload).
spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb — payload-parsing edge cases for JobPayload (anchor selection between scheduled_at and enqueued_at, ActiveJob unwrapping, vanilla-Resque fallback, time-parsing edge cases, field independence under partial failure).
spec/bigcommerce/prometheus/type_collectors/resque_spec.rb — aggregate type collector; existing aggregate behaviour preserved.
spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb — per-job type collector, including a #type assertion that the registered string is 'resque_job' (the production-routing contract).

Design decisions

Parent-side, not child-side

This is the central decision and the whole reason this PR exists. In-child metric collection requires a synchronous flush before exit! (Resque's default child-exit), and the flush is bounded by the bc-prom worker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes. The parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without anyone synchronously waiting on it. Moving the instrumentation to the parent eliminates the per-job latency tax entirely.

`Module#prepend` for `Resque::Worker#perform_with_fork`

Resque exposes before_fork for the queue-latency timing (parent-side, runs just before fork), but it doesn't have an "after-perform-in-parent" hook to capture fork-to-waitpid duration. Module#prepend is the cleanest Ruby idiom for adding ensure-block timing around a method we don't own. Less fragile than alias_method chains, doesn't shadow inheritance behaviour for subclasses.
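A self-contained sketch of the prepend technique, using a stand-in class so it runs without Resque. `SketchWorker` and `SketchTiming` are hypothetical; only the method name is borrowed from Resque::Worker.

```ruby
# Stand-in for a class we don't own.
class SketchWorker
  def perform_with_fork(job)
    job.call
  end
end

# Prepended module: its method runs first and reaches the original via super.
module SketchTiming
  DURATIONS = []

  def perform_with_fork(job)
    started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    # ensure fires on success and on failure; the method's return value is
    # still the result of super.
    DURATIONS << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  end
end

SketchWorker.prepend(SketchTiming)

result = SketchWorker.new.perform_with_fork(-> { :done })
puts "result=#{result.inspect} timed_jobs=#{SketchTiming::DURATIONS.size}"
```

Because the module sits ahead of the class in the ancestor chain, subclasses of the wrapped class pick up the timing without any aliasing.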

`job_class` as the only label (this PR)

Per-job histograms can blow up Prometheus cardinality if labels are unbounded (e.g. user IDs, store IDs). job_class is bounded by the number of ActiveJob classes in the service — typically tens, not thousands. ActiveJob payloads are unwrapped (payload.dig('args', 0, 'job_class')) to label by the user's actual class name rather than ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper. Falls back to the raw Resque payload class for non-ActiveJob jobs.

A follow-up PR will add an optional protocol — Class.additional_resque_metric_labels(payload) — letting jobs add extra labels (e.g. event_name for bigpay's domain-event publish jobs). Deferred to keep this PR focused on the generic baseline.

Opt-in via env var, default off

Two reasons:

Cardinality budget: per-job histogram observations across the fleet add up. Services that don't need this signal shouldn't pay the cost.
Operational caution: the previous in-child attempt regressed throughput 25× in prod. Even though the parent-side approach doesn't have the same failure mode, defaulting off and letting each service opt in on its own timeline reduces risk.

Pattern mirrors PROMETHEUS_ENABLED. Toggling requires a worker-pod restart, which matches the env-var-at-boot pattern used elsewhere in the gem.

Idempotent `start`

start is called from Integrations::Resque.start, which itself is invoked from before_first_fork. The hook system is designed to be called once per worker process lifetime, but the prepend should still be safe under repeated calls. Implementation guards against double-registration of both the before_fork hook and the Resque::Worker prepend.
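One way to get this kind of idempotence is a mutex-guarded flag, sketched below. `SketchInstaller` is hypothetical and the real guard in JobMetrics.start may be implemented differently; the lambda stands in for a registered before_fork hook.

```ruby
# Hypothetical installer: repeated start calls register the hook only once.
module SketchInstaller
  HOOKS = []
  LOCK = Mutex.new
  @installed = false

  def self.start(enabled:)
    return false unless enabled

    LOCK.synchronize do
      # A second caller sees the flag and leaves without re-registering.
      return false if @installed

      HOOKS << -> { :before_fork_hook }
      @installed = true
    end
    true
  end
end

3.times { SketchInstaller.start(enabled: true) }
puts "hooks registered: #{SketchInstaller::HOOKS.size}"
```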

Errors never propagate

Both record_queue_latency and record_perform_duration rescue StandardError and log a warning. A misbehaving Prometheus client, a network blip, or a bug in the bc-prom transport should never break a Resque job's execution path. The metric is observational; the job is operational.

Per-job histograms live in their own TypeCollector

The two new histograms get a dedicated TypeCollectors::ResqueJob rather than being grafted onto the existing TypeCollectors::Resque. Reasons:

One *TypeCollector per envelope type matches the upstream gem convention (PrometheusExporter::Server::ResqueCollector#type returns the literal "resque", ActiveRecordCollector#type returns "active_record", etc.). Two envelope shapes → two collectors → the upstream router does the dispatch. Putting a case data['type'] inside one collector would duplicate routing logic the framework already provides, and hide the per-job collector behind the aggregate collector's name.
Explicit, legible type string. ResqueJob#initialize calls super(type: 'resque_job', …). So 'resque_job' is what appears in the wire envelopes from JobMetrics, in the registered @collectors[...] slot, in router lookups, and in any debugging output. Without the override, TypeCollectors::Base's auto-derivation would yield bigcommerce_prometheus_s_resquejob (the s is the leftover from typecollectorS, plus the flattened ResqueJob class name) — an ugly accidental string leaking into the wire format. One line of override keeps the type human-readable and decouples routing from the class name.
Doesn't lean on the symmetric-gsub coincidence that makes the existing aggregate pair work. Collectors::Resque → TypeCollectors::Resque routes today only because the auto-derivation on both sides happens to produce the same string (bigcommerce_prometheus_s_resque). Renaming either side silently breaks routing; the upstream router still finds something, but it's the wrong something. The new pair asserts its type explicitly and is robust to class renames.
Single-responsibility within the file. Aggregate gauges and per-job histograms have different sources, different update cadences, and different cardinality profiles. Keeping them in separate files makes each one's surface easier to read and to reason about.
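The auto-derived strings mentioned above can be reproduced with a pair of gsub calls. This is a hypothetical reconstruction that happens to yield the strings quoted in this description; `derived_type` and its `strip:` parameter are invented, and the actual Base class code may differ.

```ruby
# Hypothetical reconstruction of the type auto-derivation: flatten the
# namespace, strip a fixed substring, downcase.
def derived_type(class_name, strip:)
  class_name.gsub('::', '_').gsub(strip, '').downcase
end

# The aggregate pair matches only because both sides leave the same residue.
puts derived_type('Bigcommerce::Prometheus::Collectors::Resque', strip: 'Collector')
puts derived_type('Bigcommerce::Prometheus::TypeCollectors::Resque', strip: 'TypeCollector')

# Without an explicit type, the per-job collector would get this string.
puts derived_type('Bigcommerce::Prometheus::TypeCollectors::ResqueJob', strip: 'TypeCollector')
```

Renaming either class in the aggregate pair changes one derived string but not the other, which is the silent-breakage risk the explicit 'resque_job' override avoids.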

Resque is not added as a development dependency (and the testing gap that leaves)

An initial attempt added gem 'resque', '>= 2.0' to the dev Gemfile to enable faithful testing of the new integration code. It was reverted when CI surfaced a dependency-resolution conflict:

- The gemspec requires rack >= 3.0.
- gem 'resque' pulls in sinatra as a transitive dep.
- Older Sinatra versions cap rack < 3.
- Under cold bundle resolution (CI has no Gemfile.lock because it's gitignored), the resolver picks an older Sinatra to satisfy Resque's loose sinatra >= 0.9.2 constraint, which then conflicts with the gemspec's rack >= 3.0.
- → bundle install fails on CI before specs even start.

The fix is straightforward (gem 'sinatra', '>= 4.0' forces a rack-3-compatible Sinatra), but that's a separate concern that deserves its own PR. It changes the gem's dev-bundle composition independently of the Resque-metrics work and should be revertable on its own if it causes unrelated friction. Bundling it into this PR couples two distinct changes.

Testing gap in this PR: the spec for Integrations::Resque::JobMetrics does not invoke .start, because .start calls ::Resque.before_fork and ::Resque::Worker.prepend — both of which need Resque loaded. Instead, the spec sets @client directly via instance_variable_set and exercises the pure logic (record_queue_latency, record_perform_duration — anchor selection, payload unwrapping, label assembly, error rescue) without touching Resque's API surface.

What's covered: the value-shaping logic (where the majority of bugs would live).

What's not covered: the install side — does .start actually hook before_fork and prepend the Worker module idempotently? Not testable without Resque in dev deps.

This is consistent with the gem's existing pattern. Collectors::Resque, Integrations::Resque.start, and Instrumentors::Resque#start are also entirely untested for the same reason — Resque isn't in dev deps. The "optional integration" framing made this look intentional; in practice it's accumulated testing debt.

Follow-up PRs (separate from this work):

Bump the gem's Sinatra dev dep to >= 4.0 so Resque can be added to dev deps without breaking rack-3 resolution. Small, independently reviewable, independently revertable.
Once Resque is available in dev: backfill specs for the install-side behaviour of JobMetrics.start (idempotent prepend, hook registration), plus the existing untested code paths in Collectors::Resque#collect, Integrations::Resque.start, and Instrumentors::Resque#start.

Both deferred to keep this PR scoped to the metrics feature itself.

Supporting `queue_latency` for vanilla Resque jobs

This PR's queue_latency metric only emits for ActiveJob-shaped payloads. ActiveJob jobs produce this shape automatically; vanilla Resque jobs do not, and the metric silently no-ops for them. perform_duration works regardless.

The contract — for queue_latency to emit, the Resque payload's args[0] must be a Hash containing:

- job_class (string): used as the metric label.
- enqueued_at (ISO 8601 string): anchor when scheduled_at is absent.
- scheduled_at (ISO 8601 string, optional): preferred anchor when present.

If a service has vanilla Resque jobs where queue_latency observability matters, there are three rough paths. None are in scope for this PR. Each service should evaluate which (if any) fits its constraints.

Option A (recommended): a service-local wrapper that produces an ActiveJob-shaped payload

Introduce a small wrapper class in the service that mediates Resque.enqueue calls — it stamps job_class, enqueued_at, and (optionally) scheduled_at into args[0] using the exact key names listed above, and delegates perform to the target job. Because the wrapper produces an ActiveJob-shaped payload, the gem reads it without modification — no gem changes needed.

Callers opt in by changing their enqueue calls; the job class itself is untouched. Coexists safely with un-migrated callers.
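A rough sketch of what such a wrapper might look like. Everything here is hypothetical: `FakeQueue` stands in for Resque so the example runs without the gem (a real wrapper would call Resque.enqueue), `ReportJob` and `MeasuredJobWrapper` are invented names, and the `arguments` key is an assumed place to carry the original args.

```ruby
require 'time'

# Stand-in for Resque's enqueue API.
module FakeQueue
  JOBS = []

  def self.enqueue(klass, *args)
    JOBS << { 'class' => klass.name, 'args' => args }
  end
end

# Existing vanilla-style job, left untouched.
class ReportJob
  def self.perform(store_id)
    "report for #{store_id}"
  end
end

# Hypothetical wrapper: stamps the contract keys into args[0] at enqueue
# time and delegates perform to the target job.
class MeasuredJobWrapper
  def self.enqueue(target, *args, scheduled_at: nil, now: Time.now.utc)
    envelope = {
      'job_class' => target.name,
      'enqueued_at' => now.iso8601,
      'arguments' => args
    }
    envelope['scheduled_at'] = scheduled_at.iso8601 if scheduled_at
    FakeQueue.enqueue(self, envelope)
  end

  def self.perform(envelope)
    Object.const_get(envelope['job_class']).perform(*envelope['arguments'])
  end
end

MeasuredJobWrapper.enqueue(ReportJob, 7)
payload = FakeQueue::JOBS.last
puts payload['args'][0]['job_class']
puts MeasuredJobWrapper.perform(payload['args'][0])
```

Note that the queued class is the wrapper, so any Resque hooks defined on ReportJob would not fire in this arrangement.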

Things to think about:

Per-call-site change (every enqueue you want metric coverage for).
The wrapper must use the exact key names above; renaming any of them means the gem won't read the payload.
Resque-level per-job hooks (before_perform_*, on_failure_*) fire on the wrapper, not the target. Verify the target jobs don't rely on them.
resque-scheduler YAML entries don't fit this pattern cleanly; recurring scheduled jobs need a different approach.
No deploy ordering risk — old (non-ActiveJob-shaped) and new (ActiveJob-shaped) payloads coexist; non-ActiveJob-shaped payloads silently skip the metric.

Option B: migrate the job to ActiveJob

Change the job's class to inherit from ApplicationJob, convert def self.perform to instance def perform, update callers from Resque.enqueue(...) to .perform_later(...). ActiveJob produces the right payload shape automatically.

Things to think about:

Larger per-job change (signature, callers, possibly retry/hook conversions).
Argument serialisation differs — ActiveRecord records become GlobalIDs, custom types may need serialisers. Existing args might come through subtly differently.
In-flight deploy hazard: payloads already in Redis at deploy time were enqueued under the old shape and will fail when dequeued under the new ActiveJob class. Plan a drain or two-step deploy under a transitional class name.
Buys you more than just queue_latency — retry_on, lifecycle callbacks, backend portability.

Option C: do nothing, accept the gap

Many Resque jobs (maintenance, backfill, ad-hoc) aren't latency-sensitive. Leave them without the metric.

Things to think about:

Grafana dashboards aggregating across all job types will have gaps without obvious explanation.
Operators looking at a specific raw Resque job's queue latency will see nothing.

Decision guidance

The wrapper approach is the smallest change per job and has no deploy risk, making it the right default for one-off observability needs. ActiveJob migration is a bigger investment that buys more than just queue_latency, justified when a job is being modernised more broadly. Accepting the gap is the right call for low-volume, low-stakes jobs where the metric wouldn't drive any operational decision.

This is intentionally signposted, not prescribed — implementers should work through the trade-offs for their own jobs. If there's reviewer appetite to land a wrapper helper in the gem itself, it can come as a follow-up PR.

Rollout

After release, consuming services opt in by:

- Bumping their Gemfile to the new gem version.
- Setting PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1 on worker pods only (not web, gRPC, etc., since they don't run Resque::Worker).
- Verifying metrics appear at /metrics (or wherever scraped) with the resque_job_* names and job_class labels.

For bigpay specifically, this replaces the reverted PAYMENTS-11567 instrumentation. A follow-up bigpay PR will:

- Set the env var on worker pods via Launchbay.
- Delete the in-child PublishJobInstrumentation machinery introduced and reverted in the original attempt.
- Optionally implement additional_resque_metric_labels on BasePublishToDomainEventingJob once the follow-up gem PR adds the label-extension protocol.

How was it tested?

It hasn't :p

WillemHoman force-pushed the PAYMENTS-11567-resque_latency branch 2 times, most recently from 16680cb to fe8acb7 Compare May 28, 2026 10:01

feat(platform): PAYMENTS-11567 Resque metrics for time spent on queue…

12b01d8

… on worker processing time

WillemHoman force-pushed the PAYMENTS-11567-resque_latency branch from fe8acb7 to 12b01d8 Compare May 29, 2026 00:11

WillemHoman wants to merge 1 commit into main from PAYMENTS-11567-resque_latency
