PAYMENTS-11567 Resque latency metrics#30
Draft
WillemHoman wants to merge 1 commit into
Draft
Conversation
16680cb to
fe8acb7
Compare
… on worker processing time
fe8acb7 to
12b01d8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What? Why?
Bigpay's domain-event publishing was suspected of being slow. To diagnose this, a previous PR (bigpay#10597, PAYMENTS-11567) added per-job Prometheus histograms (queue latency and publish duration) directly inside the Resque worker child process. The synchronous metric-flush required to ensure observations weren't lost when the fork-per-job child exited had to wait on the bc-prom client's background drain cycle, adding ~480 ms per job. With 20 worker pods each handling jobs serially, fleet throughput on the
scheduled_actionqueue collapsed from ~1000 jobs/sec to ~40 jobs/sec, queues backed up, and the change was reverted.This PR provides the same diagnostic visibility — but implemented in the parent Resque worker process where no synchronous flush is needed. Bigpay (and any other service using bc-prometheus-ruby with Resque) opts in via an env var, gets per-job metrics with bounded cardinality, and pays zero per-job latency cost.
Two new histograms are exposed:
resque_job_queue_latency_seconds{job_class}— time fromscheduled_at(orenqueued_at) until the worker picks the job up. Tells you whether the queue is backing up. Only emits for ActiveJob-shaped payloads — see below.resque_job_perform_duration_seconds{job_class}— total child-process lifetime (fork →Process.waitpidreturn). A proxy for "how long is this job class actually taking?" — includes fork overhead and exit, but at the seconds-scale used for incident diagnosis the ~15 ms of fixed overhead is noise. Works for every Resque job regardless of payload shape.Off by default. Opt in per service by setting
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1on the worker pods.ActiveJob-shaped payload contract for
queue_latency: the gem reads three fields frompayload['args'][0](which must be a Hash):job_class— the user's actual job class name; used as the metric label.enqueued_at— ISO 8601 string; used as the queue-latency anchor whenscheduled_atis absent.scheduled_at— ISO 8601 string; preferred overenqueued_atwhen present (e.g. retries-with-backoff).ActiveJob (jobs inheriting from
ApplicationJob, enqueued via.perform_later) produces this shape natively — the payload is wrapped byActiveJob::QueueAdapters::ResqueAdapter::JobWrapper, which stamps those three keys intoargs[0]. Vanilla Resque jobs (class MyJob; @queue = :foo; def self.perform; end, enqueued viaResque.enqueue) do not produce this shape — their args are raw primitive values, not a wrapping hash — soqueue_latencysilently no-ops for them.perform_durationis unaffected. See Supportingqueue_latencyfor vanilla Resque jobs below for the options if this matters for your service.Why in bc-prometheus-ruby (not in each client service)
The instrumentation hooks live at the
Resque::Workerlevel —Resque.before_forkfor queue latency,Module#prependaroundResque::Worker#perform_with_forkfor perform duration. These are gem-level concerns: every BigCommerce service using Resque has the same Worker class and the same fork-per-job lifecycle. Adding the metrics here means:If this lived in each service's repo, every team would write similar
Resque.before_forkglue, with subtle inconsistencies in label naming, anchor-time logic, and error handling.What was implemented
New env var
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED(default0). Read once at boot via the existingBigcommerce::Prometheus.configuremachinery (lib/bigcommerce/prometheus/configuration.rb). Mirrors the existingPROMETHEUS_ENABLEDopt-in pattern.New module:
Bigcommerce::Prometheus::Integrations::Resque::JobMetricsLives at
lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb. Class-method API:JobMetrics.start(client:)— no-op unless the env var is on. PrependsWorkerInstrumentationontoResque::Worker. Idempotent.JobMetrics.record_queue_latency(payload)— pushes aresque_jobenvelope withmetric: 'queue_latency', value = seconds fromscheduled_at ∨ enqueued_atto now, labeljob_class. Takes aJobPayload(built once per job in the prepend).JobMetrics.record_perform_duration(payload, duration)— pushes aresque_jobenvelope withmetric: 'perform_duration', value = the supplied duration, labeljob_class. SameJobPayloadinstance.Payload parsing is its own class —
Bigcommerce::Prometheus::Integrations::Resque::JobPayload— which wraps aResque::Jobininitialize, eagerly extractsjob_classandanchor_timevia type-checking extractors (no exception-based control flow), and exposes them as plain attr_readers. TheJobMetricsrecording methods take this prebuilt payload rather than re-parsing the job on each call.The
WorkerInstrumentationsubmodule is the prepend that wrapsperform_with_fork. It builds theJobPayloadonce, records the queue latency beforesuper, and records the perform duration inensure:Both
record_*methods rescueStandardErrorand log a warning — metric push failures never propagate into the publish/perform path.Type collectors
Two type collectors, one per envelope shape, registered side-by-side in
Instrumentors::Resque#start. The upstreamPrometheusExporter::Server::Collectorroutes each envelope to whichever collector'stypematchesenvelope['type']— no in-collector dispatch needed.Bigcommerce::Prometheus::TypeCollectors::Resque(lib/bigcommerce/prometheus/type_collectors/resque.rb) continues to own the aggregate worker/queue gauges (resque_workers_total,jobs_failed_total,jobs_pending_total,jobs_processed_total,queues_total,queue_sizes) fed byCollectors::Resque#collect.Bigcommerce::Prometheus::TypeCollectors::ResqueJob(lib/bigcommerce/prometheus/type_collectors/resque_job.rb) owns the two new histograms:resque_job_queue_latency_secondswith buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).resque_job_perform_duration_secondswith buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).It calls
super(type: 'resque_job', …)ininitializeso the registered type string matches thetype: 'resque_job'envelopes thatJobMetrics.record_*emits.Wiring
Integrations::Resque.start(client:)now also callsJobMetrics.start(client:). If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.Docs
resque_per_job_metrics_enabled.Specs
spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb—JobMetrics-specific responsibilities (envelope shape, error rescue, integration withJobPayload).spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb— payload-parsing edge cases forJobPayload(anchor selection betweenscheduled_atandenqueued_at, ActiveJob unwrapping, vanilla-Resque fallback, time-parsing edge cases, field independence under partial failure).spec/bigcommerce/prometheus/type_collectors/resque_spec.rb— aggregate type collector; existing aggregate behaviour preserved.spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb— per-job type collector, including a#typeassertion that the registered string is'resque_job'(the production-routing contract).Design decisions
Parent-side, not child-side
This is the central decision and the whole reason this PR exists. In-child metric collection requires a synchronous flush before
exit!(Resque's default child-exit), and the flush is bounded by the bc-prom worker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes. The parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without anyone synchronously waiting on it. Moving the instrumentation to the parent eliminates the per-job latency tax entirely.Module#prependforResque::Worker#perform_with_forkResque exposes
before_forkfor the queue-latency timing (parent-side, runs just before fork), but it doesn't have an "after-perform-in-parent" hook to capture fork-to-waitpid duration.Module#prependis the cleanest Ruby idiom for addingensure-block timing around a method we don't own. Less fragile thanalias_methodchains, doesn't shadow inheritance behaviour for subclasses.job_classas the only label (this PR)Per-job histograms can blow up Prometheus cardinality if labels are unbounded (e.g. user IDs, store IDs).
job_classis bounded by the number of ActiveJob classes in the service — typically tens, not thousands. ActiveJob payloads are unwrapped (payload.dig('args', 0, 'job_class')) to label by the user's actual class name rather thanActiveJob::QueueAdapters::ResqueAdapter::JobWrapper. Falls back to the raw Resque payload class for non-ActiveJob jobs.A follow-up PR will add an optional protocol —
Class.additional_resque_metric_labels(payload)— letting jobs add extra labels (e.g.event_namefor bigpay's domain-event publish jobs). Deferred to keep this PR focused on the generic baseline.Opt-in via env var, default off
Two reasons:
Pattern mirrors
PROMETHEUS_ENABLED. Toggling requires a worker-pod restart, which matches the env-var-at-boot pattern used elsewhere in the gem.Idempotent
startstartis called fromIntegrations::Resque.start, which itself is invoked frombefore_first_fork. The hook system is designed to be called once per worker process lifetime, but the prepend should still be safe under repeated calls. Implementation guards against double-registration of both thebefore_forkhook and theResque::Workerprepend.Errors never propagate
Both
record_queue_latencyandrecord_perform_durationrescueStandardErrorand log a warning. A misbehaving Prometheus client, a network blip, or a bug in the bc-prom transport should never break a Resque job's execution path. The metric is observational; the job is operational.Per-job histograms live in their own TypeCollector
The two new histograms get a dedicated
TypeCollectors::ResqueJobrather than being grafted onto the existingTypeCollectors::Resque. Reasons:*TypeCollectorper envelope type matches the upstream gem convention (PrometheusExporter::Server::ResqueCollector#typereturns the literal"resque",ActiveRecordCollector#typereturns"active_record", etc.). Two envelope shapes → two collectors → the upstream router does the dispatch. Putting acase data['type']inside one collector would duplicate routing logic the framework already provides, and hide the per-job collector behind the aggregate collector's name.ResqueJob#initializecallssuper(type: 'resque_job', …). So'resque_job'is what appears in the wire envelopes fromJobMetrics, in the registered@collectors[...]slot, in router lookups, and in any debugging output. Without the override,TypeCollectors::Base's auto-derivation would yieldbigcommerce_prometheus_s_resquejob(thesis the leftover fromtypecollectorS, plus the flattenedResqueJobclass name) — an ugly accidental string leaking into the wire format. One line of override keeps the type human-readable and decouples routing from the class name.Collectors::Resque→TypeCollectors::Resqueroutes today only because the auto-derivation on both sides happens to produce the same string (bigcommerce_prometheus_s_resque). Renaming either side silently breaks routing; the upstream router still finds something, but it's the wrong something. The new pair asserts its type explicitly and is robust to class renames.Resque is not added as a development dependency (and the testing gap that leaves)
An initial attempt added
gem 'resque', '>= 2.0'to the devGemfileto enable faithful testing of the new integration code. It was reverted when CI surfaced a dependency-resolution conflict:rack >= 3.0.gem 'resque'pulls insinatraas a transitive dep.rack < 3.Gemfile.lock— it's gitignored), the resolver picks an older Sinatra to satisfy Resque's loosesinatra >= 0.9.2constraint, which then conflicts with the gemspec'srack >= 3.0.bundle installfails on CI before specs even start.The fix is straightforward (
gem 'sinatra', '>= 4.0'forces a rack-3-compatible Sinatra), but that's a separate concern that deserves its own PR. It changes the gem's dev-bundle composition independently of the Resque-metrics work and should be revertable on its own if it causes unrelated friction. Bundling it into this PR couples two distinct changes.Testing gap in this PR: the spec for
Integrations::Resque::JobMetricsdoes not invoke.start, because.startcalls::Resque.before_forkand::Resque::Worker.prepend— both of which need Resque loaded. Instead, the spec sets@clientdirectly viainstance_variable_setand exercises the pure logic (record_queue_latency,record_perform_duration— anchor selection, payload unwrapping, label assembly, error rescue) without touching Resque's API surface.What's covered: the value-shaping logic (where the majority of bugs would live).
What's not covered: the install side — does
.startactually hookbefore_forkand prepend the Worker module idempotently? Not testable without Resque in dev deps.This is consistent with the gem's existing pattern.
Collectors::Resque,Integrations::Resque.start, andInstrumentors::Resque#startare also entirely untested for the same reason — Resque isn't in dev deps. The "optional integration" framing made this look intentional; in practice it's accumulated testing debt.Follow-up PRs (separate from this work):
>= 4.0so Resque can be added to dev deps without breaking rack-3 resolution. Small, independently reviewable, independently revertable.JobMetrics.start(idempotent prepend, hook registration), plus the existing untested code paths inCollectors::Resque#collect,Integrations::Resque.start, andInstrumentors::Resque#start.Both deferred to keep this PR scoped to the metrics feature itself.
Supporting
queue_latencyfor vanilla Resque jobsThis PR's
queue_latencymetric only emits for ActiveJob-shaped payloads. ActiveJob jobs produce this shape automatically; vanilla Resque jobs do not, and the metric silently no-ops for them.perform_durationworks regardless.The contract — for
queue_latencyto emit, the Resque payload'sargs[0]must be a Hash containing:job_class(string) — used as the metric label.enqueued_at(ISO 8601 string) — anchor whenscheduled_atis absent.scheduled_at(ISO 8601 string, optional) — preferred anchor when present.If a service has vanilla Resque jobs where
queue_latencyobservability matters, there are three rough paths. None are in scope for this PR. Each service should evaluate which (if any) fits its constraints.Option A (recommended): a service-local wrapper that produces an ActiveJob-shaped payload
Introduce a small wrapper class in the service that mediates
Resque.enqueuecalls — it stampsjob_class,enqueued_at, and (optionally)scheduled_atintoargs[0]using the exact key names listed above, and delegatesperformto the target job. Because the wrapper produces an ActiveJob-shaped payload, the gem reads it without modification — no gem changes needed.Callers opt in by changing their enqueue calls; the job class itself is untouched. Coexists safely with un-migrated callers.
Things to think about:
before_perform_*,on_failure_*) fire on the wrapper, not the target. Verify the target jobs don't rely on them.Option B: migrate the job to ActiveJob
Change the job's class to inherit from
ApplicationJob, convertdef self.performto instancedef perform, update callers fromResque.enqueue(...)to.perform_later(...). ActiveJob produces the right payload shape automatically.Things to think about:
queue_latency—retry_on, lifecycle callbacks, backend portability.Option C: do nothing, accept the gap
Many Resque jobs (maintenance, backfill, ad-hoc) aren't latency-sensitive. Leave them without the metric.
Things to think about:
Decision guidance
The wrapper approach is the smallest change per job and has no deploy risk, making it the right default for one-off observability needs. ActiveJob migration is a bigger investment that buys more than just
queue_latency, justified when a job is being modernised more broadly. Accepting the gap is the right call for low-volume, low-stakes jobs where the metric wouldn't drive any operational decision.This is intentionally signposted, not prescribed — implementers should work through the trade-offs for their own jobs. If there's reviewer appetite to land a wrapper helper in the gem itself, it can come as a follow-up PR.
Rollout
After release, consuming services opt in by:
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1on worker pods only (not web, gRPC, etc. — they don't runResque::Worker)./metrics(or wherever scraped) with theresque_job_*names andjob_classlabels.For bigpay specifically, this replaces the reverted PAYMENTS-11567 instrumentation. A follow-up bigpay PR will:
PublishJobInstrumentationmachinery introduced and reverted in the original attempt.additional_resque_metric_labelsonBasePublishToDomainEventingJobonce the follow-up gem PR adds the label-extension protocol.How was it tested?
It hasn't :p