Skip to content

fix(notifier): detach ScheduleImmediate async work from request context#774

Merged
notandy merged 7 commits into
mainfrom
fix/notifier-schedule-immediate-context
May 20, 2026
Merged

fix(notifier): detach ScheduleImmediate async work from request context#774
notandy merged 7 commits into
mainfrom
fix/notifier-schedule-immediate-context

Conversation

@notque
Copy link
Copy Markdown
Contributor

@notque notque commented May 18, 2026

Summary

  • Notifier.ScheduleImmediate schedules a gocron one-shot job that runs after the HTTP handler returns. It captured params.HTTPRequest.Context(), which net/http cancels at handler return — so the async DB lookup and Campfire HTTP call both failed with context canceled. Every immediate endpoint notification was silently dropped in production.
  • Use gocron's native func(ctx context.Context) task signature so the task receives a job-scoped context from the scheduler. That context is independent of the caller's HTTP request lifetime AND is cancelled when the scheduler shuts down, so in-flight tasks short-circuit cleanly on shutdown.
  • Wrap the task ctx with a 30s context.WithTimeout so a stuck DB query or Campfire HTTP call cannot run unbounded and stall scheduler shutdown.
  • Add a regression test that cancels the parent context before the gocron task fires. Without the fix, the test fails with async notification job did not complete after parent context was cancelled.
  • Rewrite the existing ScheduleImmediate test to use a buffered channel instead of assert.Eventually over a shared variable, which was tripping -race.

Why this matters

Found during review of #772. Affects every endpoint approval path that calls ScheduleImmediate — the user-visible symptom is missing Campfire/email notifications with no error returned to the API caller (the handler had already responded 200).

Review-driven evolution

Initial fix used context.WithoutCancel(ctx) to detach cancellation while preserving request-scoped values. Per @notandy's review the implementation was simplified to gocron's job context, which already provides caller-detachment AND adds graceful shutdown cancellation that WithoutCancel did not. Verified via grep that no caller in internal/controller/ or internal/middlewares/ propagates request-scoped ctx values into the notifier task body, so the change is value-lossless.

Test plan

  • go test -race -count=3 ./internal/notifier/... — passes
  • Regression test (TestNotifier_ScheduleImmediate_ParentContextCancelled) fails on main, passes here — RED-GREEN verified
  • make check linters: shellcheck, golangci-lint (0 issues), go-licence-detector, addlicense, reuse — all green

…task

The ScheduleImmediate gocron task ran asynchronously after the HTTP
handler returned, but captured the request-scoped context. net/http
cancels that context on handler return, so the async task's DB lookup
and Campfire HTTP call both failed with "context canceled" — every
immediate notification was silently dropped in production.

Detach the async task from request-scope cancellation with
context.WithoutCancel, preserving any request-scoped values (trace IDs,
log fields). Add a regression test that cancels the parent context
before the gocron task fires; without the fix, the test fails with
"async notification job did not complete after parent context was
cancelled".

Also rewrite the existing ScheduleImmediate test to use a buffered
channel instead of assert.Eventually over a shared variable, which was
tripping the race detector.
Copilot AI review requested due to automatic review settings May 18, 2026 19:54
@notque notque requested review from a team, m-kratochvil, notandy and ronchi-oss as code owners May 18, 2026 19:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates notifier immediate scheduling so asynchronous notification jobs are detached from HTTP request cancellation while preserving request-scoped context values, and adds regression coverage around parent context cancellation.

Changes:

  • Uses context.WithoutCancel for ScheduleImmediate DB lookup and notification send.
  • Reworks the existing immediate notification test to use a channel.
  • Adds a cancellation regression test for async notification jobs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
internal/notifier/notifier.go Detaches the async notification context from caller cancellation.
internal/notifier/notifier_test.go Updates immediate notification testing and adds parent-cancellation coverage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/notifier/notifier.go Outdated
Comment thread internal/notifier/notifier_test.go Outdated
Address two findings from Copilot review on PR #774:

1. context.WithoutCancel drops the parent's deadline along with its
   cancellation, so the gocron task's DB query and Campfire HTTP call
   could run unbounded and stall scheduler shutdown. Wrap the detached
   context with context.WithTimeout(asyncCtx, 30s) inside the task and
   defer cancel().

2. The TestNotifier_ScheduleImmediate_ParentContextCancelled regression
   test cancelled the parent context AFTER calling ScheduleImmediate,
   leaving a race: gocron could pick up the task and complete the
   in-memory pgxmock call before cancel() fired, letting the test pass
   on broken code. Cancel BEFORE scheduling so the task is guaranteed
   to observe an already-cancelled parent. Verified: 5/5 RED on the
   un-fixed implementation, GREEN with -race on the fix.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread internal/notifier/notifier_test.go Outdated
notque added 2 commits May 18, 2026 13:15
…ancelled header

Rewrite the test's doc comment to describe the durable invariant — that
ScheduleImmediate's async work must not be tied to the caller's context lifetime —
and the production scenario it guards against, instead of referencing project
history. Also explain why the parent ctx is cancelled before scheduling rather
than after, so future readers do not misread the setup as a sequencing bug.

Comment-only; no behavior change. Addresses Copilot review comment on PR #774
about the prior header contradicting the actual test sequence.
Comment thread internal/notifier/notifier.go Outdated
Per @notandy review on PR #774, swap the closure-captured
`context.WithoutCancel(ctx)` pattern for gocron's native
`func(ctx context.Context)` task signature.

gocron passes a job-scoped context that is already independent of
the caller's request lifetime, so WithoutCancel was redundant. The
job context also gets cancelled when the scheduler shuts down,
which gives in-flight tasks a cooperative cancellation signal that
WithoutCancel did not — a small but real shutdown-semantics
improvement on top of the existing 30s WithTimeout bound.

The outer ctx parameter is preserved (as `_`) for API symmetry
with other notifier methods and to keep the controller call site
natural; verified via grep that no caller relies on
request-scoped ctx values flowing through.

Tests unchanged and still pass under -race -count=3, including
the cancelled-parent-context invariant test.
Copilot AI review requested due to automatic review settings May 19, 2026 21:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

internal/notifier/notifier_test.go:185

  • The test HTTP handler ignores JSON decode errors (_ = Decode(...)), which can hide failures and make this regression test non-diagnostic. Prefer require.NoError/assert.NoError on Decode so a malformed request body fails the test immediately.
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req CampfireRequest
		_ = json.NewDecoder(r.Body).Decode(&req)
		w.WriteHeader(http.StatusOK)
		received <- req
	}))

Comment thread internal/notifier/notifier.go
Comment thread internal/notifier/notifier_test.go Outdated
Comment thread internal/notifier/notifier_test.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 19, 2026 22:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread internal/notifier/notifier.go
Per Copilot review on PR #774, the two channel-driven test handlers
in notifier_test.go silently dropped JSON decode errors with `_ =`,
which would mask body-shape regressions: a malformed request would
produce a zero-valued CampfireRequest, fail the downstream assertion
on req.ProjectID, and surface as a confusing equality mismatch
instead of "decode error".

Use assert.NoError (not require.NoError) because the handler runs
on a goroutine spawned by httptest.Server. require's t.FailNow
calls runtime.Goexit, which only kills the calling goroutine and
leaves the test main running on stale state; assert marks the test
failed and lets the goroutine finish cleanly.

Tests pass under -race -count=3.
@notque notque changed the title fix(notifier): use context.WithoutCancel for ScheduleImmediate async task fix(notifier): detach ScheduleImmediate async work from request context May 19, 2026
@github-actions
Copy link
Copy Markdown

Merging this branch will not change overall coverage

Impacted Packages Coverage Δ 🤖
github.com/sapcc/archer/internal/notifier 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/sapcc/archer/internal/notifier/notifier.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/sapcc/archer/internal/notifier/notifier_test.go

@notque notque requested a review from notandy May 19, 2026 22:20
@notandy notandy merged commit 61cf749 into main May 20, 2026
6 checks passed
notandy pushed a commit that referenced this pull request May 20, 2026
Address two findings from Copilot review on PR #774:

1. context.WithoutCancel drops the parent's deadline along with its
   cancellation, so the gocron task's DB query and Campfire HTTP call
   could run unbounded and stall scheduler shutdown. Wrap the detached
   context with context.WithTimeout(asyncCtx, 30s) inside the task and
   defer cancel().

2. The TestNotifier_ScheduleImmediate_ParentContextCancelled regression
   test cancelled the parent context AFTER calling ScheduleImmediate,
   leaving a race: gocron could pick up the task and complete the
   in-memory pgxmock call before cancel() fired, letting the test pass
   on broken code. Cancel BEFORE scheduling so the task is guaranteed
   to observe an already-cancelled parent. Verified: 5/5 RED on the
   un-fixed implementation, GREEN with -race on the fix.
@notandy notandy deleted the fix/notifier-schedule-immediate-context branch May 20, 2026 13:56
notandy pushed a commit that referenced this pull request May 20, 2026
…ancelled header

Rewrite the test's doc comment to describe the durable invariant — that
ScheduleImmediate's async work must not be tied to the caller's context lifetime —
and the production scenario it guards against, instead of referencing project
history. Also explain why the parent ctx is cancelled before scheduling rather
than after, so future readers do not misread the setup as a sequencing bug.

Comment-only; no behavior change. Addresses Copilot review comment on PR #774
about the prior header contradicting the actual test sequence.
notandy pushed a commit that referenced this pull request May 20, 2026
Per @notandy review on PR #774, swap the closure-captured
`context.WithoutCancel(ctx)` pattern for gocron's native
`func(ctx context.Context)` task signature.

gocron passes a job-scoped context that is already independent of
the caller's request lifetime, so WithoutCancel was redundant. The
job context also gets cancelled when the scheduler shuts down,
which gives in-flight tasks a cooperative cancellation signal that
WithoutCancel did not — a small but real shutdown-semantics
improvement on top of the existing 30s WithTimeout bound.

The outer ctx parameter is preserved (as `_`) for API symmetry
with other notifier methods and to keep the controller call site
natural; verified via grep that no caller relies on
request-scoped ctx values flowing through.

Tests unchanged and still pass under -race -count=3, including
the cancelled-parent-context invariant test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants