[User OBO Propagation] Per-user identity + access token propagation to worker tool handlers#51

Open
ChrisKrawczyk wants to merge 41 commits into main from feature/user-obo-propagation

ChrisKrawczyk commented Jun 9, 2026

Collaborator

Summary

Six-phase delivery enabling downstream OBO consumers to perform OAuth2 On-Behalf-Of flows from worker tool handlers as the signed-in portal engineer rather than as the worker UAMI. ADO is the first anticipated consumer; future Graph/other consumers benefit from the same surface.

PilotSwarm now propagates a per-RPC user envelope from portal sign-in (MSAL acquisition with configurable downstream scope + offline_access) through an envelope-encrypted carrier (AKV-wrapped DEK + AES-256-GCM ciphertext) into a worker-side synchronous lookup (getUserContextForSession(sessionId)) plus a structured tool-outcome family (interactionRequired / serviceUnavailable) for principled error propagation back to the portal UI, including transport-level auto re-auth.

Subsequent compartmentalization work generalized the in-process tool extension surface: plugin.json now supports a tools field for app-tier tool plugins (atomic, fail-fast collision policy; PluginManifest exported as a public SDK type). The OBO live-smoke harness consumes this contract and is fully detached from the default deploy surface — production deploys carry zero smoke code, zero smoke env keys, and zero smoke-only configuration.

Architecture invariants

Codified in .github/copilot-instructions.md under "User OBO (User-On-Behalf-Of) Propagation":

  • Wire field is envelope (carrying plaintext principal claims plus optional accessTokenCipher), not envelopeCipher. Plaintext principal claims flow on every worker-bound RPC; only the access token is encrypted.
  • Three crypto backends in packages/sdk/src/envelope-crypto.ts selected by selectEnvelopeCrypto(env): AkvEnvelopeCrypto (production; AKV SDKs lazy-loaded so non-OBO consumers don't pull deps), InMemoryEnvelopeCrypto (tests), PlaintextEnvelopeCrypto (dev-only, with loud startup warning, sentinel kekKid: "plaintext-mode").
  • KEK rotation safety: cipher records wrapResult.keyID (versioned URL) so prior-version retention covers in-flight envelopes when the KEK is rotated.
  • accessToken: null is the universal absence signal (no token configured, system/orchestration session, AKV unwrap failure).
  • Structured outcomes (packages/sdk/src/tool-outcomes.ts) with pinned reason codes (reauth_required | mfa_refresh | conditional_access | consent_required) — three-way machine-distinguishable from generic tool failure. The claims blob never reaches the LLM transcript; portal re-auth UI keys off reasonCode.
  • Portal-side ~5-min near-expiry refresh; worker never persists refresh tokens.
  • Single-tenant assumption; scope minimization (only configured PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE is acquired).
  • FR-014 trust boundary: worker tools must not synthesize principals from CMS owner when an envelope is absent — refuse or emit serviceUnavailable/interactionRequired.
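The envelope-encryption invariant above (a per-message DEK, AES-256-GCM ciphertext, and a KEK-wrapped DEK that records which key version wrapped it) can be sketched as follows. This is a minimal illustration in the spirit of the in-memory test backend, not the shipped `envelope-crypto.ts`: the `SketchCipher` field names, `encryptToken`, `decryptToken`, and the `KEK_KID` placeholder are all assumptions, and an in-process RSA-2048 keypair stands in for the Key Vault KEK.

```typescript
import {
  createCipheriv,
  createDecipheriv,
  generateKeyPairSync,
  privateDecrypt,
  publicEncrypt,
  randomBytes,
} from "node:crypto";

// Hypothetical record shape; the real EnvelopeCipher fields may differ.
interface SketchCipher {
  kekKid: string; // which KEK (version) wrapped the DEK
  wrappedDek: string; // base64
  iv: string; // base64, 12-byte GCM nonce
  tag: string; // base64 GCM auth tag
  ciphertext: string; // base64
}

// An in-process RSA-2048 keypair stands in for the Key Vault KEK.
const kek = generateKeyPairSync("rsa", { modulusLength: 2048 });
const KEK_KID = "in-memory-kek/v1"; // placeholder, not a real key URL

function encryptToken(token: string): SketchCipher {
  const dek = randomBytes(32); // fresh data-encryption key per message
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const ciphertext = Buffer.concat([cipher.update(token, "utf8"), cipher.final()]);
  return {
    kekKid: KEK_KID,
    // Node's publicEncrypt defaults to RSA-OAEP padding.
    wrappedDek: publicEncrypt(kek.publicKey, dek).toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    ciphertext: ciphertext.toString("base64"),
  };
}

// Returns null on any unwrap or authentication failure, mirroring the
// "accessToken: null" absence signal described above.
function decryptToken(c: SketchCipher): string | null {
  try {
    const dek = privateDecrypt(kek.privateKey, Buffer.from(c.wrappedDek, "base64"));
    const decipher = createDecipheriv("aes-256-gcm", dek, Buffer.from(c.iv, "base64"));
    decipher.setAuthTag(Buffer.from(c.tag, "base64"));
    const plain = Buffer.concat([
      decipher.update(Buffer.from(c.ciphertext, "base64")),
      decipher.final(),
    ]);
    return plain.toString("utf8");
  } catch {
    return null;
  }
}
```

Because the cipher record carries `kekKid`, a decryptor can target the exact wrapping key version, which is what makes prior-version retention sufficient during rotation.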

Plugin contract & smoke compartmentalization

Folded in late in the cycle as a generalization of the OBO live-smoke harness:

  • plugin.json gains an optional tools field: a string path to a JS module that exports registerTools(worker). The worker plugin loader invokes this hook at start() for app-tier plugins (pluginDirs); for system- and management-tier plugins the field is ignored with a warning.
  • Tool registration is atomic and collision-safe: a ToolNameCollisionError from any contributor names both contributors, and no partial set of the offending plugin's tools remains registered. Loader is fully fail-closed (missing dirs, missing modules, missing exports, sync throw, async reject all halt startup).
  • PluginManifest is exported from pilotswarm-sdk as a public TypeScript type so plugin authors can validate plugin.json at compile time. See docs/plugin-architecture-guide.md §7.
  • The OBO live-smoke harness moved from examples/obo-smoke/ to packages/obo-smoke-plugin/ and consumes the same contract as any other app-tier plugin — no special-casing.
  • Triple-conjunction opt-in to activate smoke on a stamp:
    1. --variant smoke worker image build (selects Dockerfile.worker's runtime-smoke target);
    2. deploy/envs/template.smoke.env overlay composed into the per-stamp env;
    3. PLUGIN_DIRS=/app/packages/obo-smoke-plugin set on the worker.
  • Default deploys carry zero smoke code/config/deps. OBO_SMOKE_ENABLED is no longer a worker boot gate — it's a stamp marker the smoke driver preflights.
  • Static-parse and directory-walk tests in deploy/scripts/test/ pin the multi-stage Dockerfile contract and prevent reintroduction of smoke env keys into the default surface.
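The "atomic and collision-safe" registration rule above can be read as validate-then-commit. The sketch below shows that reading under stated assumptions: only the `ToolNameCollisionError` name comes from this PR, while `ToolRegistry`, `registerAll`, and the `Tool` shape are hypothetical stand-ins for the loader internals.

```typescript
// Hypothetical tool shape for illustration only.
interface Tool {
  name: string;
  handler: (input: unknown) => unknown;
}

class ToolNameCollisionError extends Error {
  constructor(toolName: string, existingOwner: string, newOwner: string) {
    // The message names both contributors, as the contract requires.
    super(`tool "${toolName}" contributed by both "${existingOwner}" and "${newOwner}"`);
    this.name = "ToolNameCollisionError";
  }
}

class ToolRegistry {
  private readonly owners = new Map<string, string>();
  private readonly tools = new Map<string, Tool>();

  // Validate the whole batch first, then commit, so a collision
  // never leaves a partial set of the offending plugin's tools behind.
  registerAll(owner: string, batch: Tool[]): void {
    const seen = new Set<string>();
    for (const tool of batch) {
      const existing = this.owners.get(tool.name);
      if (existing !== undefined) {
        throw new ToolNameCollisionError(tool.name, existing, owner);
      }
      if (seen.has(tool.name)) {
        throw new ToolNameCollisionError(tool.name, owner, owner);
      }
      seen.add(tool.name);
    }
    for (const tool of batch) {
      this.owners.set(tool.name, owner);
      this.tools.set(tool.name, tool);
    }
  }

  has(toolName: string): boolean {
    return this.tools.has(toolName);
  }
}
```

In a fail-closed loader the thrown error would propagate out of startup rather than being caught and logged.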

Implementation history

  • 2b7a80f envelope-crypto backends + UserContextStore foundation
  • 62987b9 orchestration plumbing + portal/CLI envelope wiring on every worker-bound RPC; f9a7a95 carrier JSDoc fix
  • 0f40d84 worker-affined getUserContextForSession lookup with chain resolution (sub-agent → root portal-bound parent at lookup time, with re-rooting support per FR-021)
  • 72de0d1 portal MSAL downstream-scope acquisition + envelope encryption end-to-end
  • 42d995e interactionRequired / serviceUnavailable structured tool outcomes (FR-008, FR-008b)
  • c1c9bc4 examples/obo-smoke/ reference plugin (whoami via Graph /me, force_reauth against CA-protected scope) + docs/operations/obo-kek-runbook.md live-tenant smoke checklist
  • c5c48df AKV KEK provisioning Bicep + deploy wiring + version bump 0.1.36; c327312 review fixes (missing @azure/keyvault-keys dep, worker overlay scope key, KID versioning, builder template)
  • Final review fixes (5c5bae0): smoke principal-shape access (F1), worker.stop finally block (F4), portal transport-level auto re-auth wire (F3), debounced ~30s/session with a global in-flight guard; fires getDownstreamToken({ interactive: true }) on interaction_required outcomes
  • Skills + contributor docs (3338fc8): new "User OBO Propagation" section in copilot-instructions.md; pilotswarm-tui SKILL updated to capture portal auto-reauth wire; pilotswarm-release SKILL adds OBO live-tenant smoke gate
  • Plaintext-mode warning regression test (f05e1a4): 3 new unit cases asserting console.warn fires from selectEnvelopeCrypto with the expected payload, and that AKV / null paths stay quiet
  • Repeatable live-smoke primitives (e74f996): AKS-compatible smoke plugin backend (auto-selects client-secret vs workload-identity FIC), OBO_SMOKE_ENABLED deploy toggle, pilotswarm smoke <stamp> --profile obo CLI driver with JSON pass/fail, .github/workflows/live-smoke-obo.yml workflow_dispatch scaffold, and consolidated docs/operations/live-smoke.md. Closes the FR-015 "FIC out of scope for smoke plugin" gap and makes the npm-publish release gate (FR-018) a one-command operation
  • Final-review must-fixes — bundled into e74f996: (1) AKV unwrap in-activity retry loop with [500ms, 2s, 5s] backoff before falling through to service_unavailable (FR-024 transient-retry compliance); (2) portal transport eventType/type normalization in browser-transport.js maybeTriggerInteractiveReauth with 6 new regression tests in portal-interactive-reauth.test.js
  • Final-review should-fix + consider (c37d978): pinned InteractionRequiredReasonCode union + INTERACTION_REQUIRED_REASON_CODES set with reject-on-unknown enforcement in the interactionRequired() helper; .env.example OBO_KEK_KID example normalized to un-versioned form with explanatory comment matching docs/configuration.md and docs/operations/obo-kek-runbook.md
  • Live-smoke deploy-pipeline plumbing (6c97b7b): project OBO_SMOKE_WORKER_APP_TENANT_ID / _CLIENT_ID / _GRAPH_SCOPE + OBO_SMOKE_TEST_USER_UPN through deploy/envs/template.env, compose-env.mjs sentinel-fallback, and the worker overlay so flipping OBO_SMOKE_ENABLED=true on a stamp lands the smoke plugin's per-stamp downstream-app config in the worker ConfigMap automatically (no manual Secret edits, no worker image rebuild). On AKS the plugin uses workload-identity FIC; CLIENT_SECRET stays a local-dev escape hatch. pilotswarm-npm-deployer.agent.md and pilotswarm-new-env-deploy/SKILL.md updated with toggle workflow + post-deploy verification snippet
  • Final docs sync (fdd41c7): CHANGELOG.md / docs/sdk/user-context.md / docs/operations/live-smoke.md updated to capture pinned reason-code enforcement and the deploy-pipeline plumbing for the smoke harness
  • Deploy-scripts docs surface (698213c): deploy/scripts/README.md documents OBO_KEK_KID + OBO_SMOKE_* knobs so operators see them in the canonical script reference, not just the per-stamp env
  • Auto-provision OBO smoke worker AAD app (23b731c): new deploy/scripts/auth/Setup-OboSmokeWorkerApp.ps1 (idempotent app-registration + service-principal + workload-identity FIC + un-versioned scope) takes the FR-018 release-gate smoke from "operator manually pre-creates the worker app" to a one-command bring-up for a fresh stamp. Wired into the npm-deployer agent + pilotswarm-new-env-deploy SKILL as a pre-step gated on OBO_SMOKE_ENABLED=true. New deploy/scripts/test/setup-obo-smoke-worker-app.test.mjs exercises 18 invariants (idempotency, FIC subject shape, scope un-versioned form, cross-file bicep contract). Fixes a real cross-file bug found by impl-review cycle 1: deploy/services/base-infra/bicep/main.bicep consumed Aks.outputs.oidcIssuerUrl internally but never emitted it as a top-level output, so the script's Resolve-OidcIssuerFromEnv would have thrown on every fresh stamp; INV-9 pins the contract going forward
  • Final-review polish (8d0aa9b): applied two findings from paw-final-review (PASS-WITH-OBSERVATIONS): (1) .github/copilot-instructions.md self-inflicted drift to the early-draft name getCurrentUserContextForSession corrected to the locked, shipped API name getUserContextForSession; (2) packages/sdk/src/types.ts InteractionRequiredReasonCode union and INTERACTION_REQUIRED_REASON_CODES ReadonlySet re-derived from a single private as const tuple so the type and runtime check can no longer silently drift. Public export shape unchanged (still ReadonlySet), so no breaking change to downstream consumers
  • Internal phase-label cleanup (d807965): rewrites comments, docs, skills, and test describe blocks that referenced internal PAW workflow phases (e.g. "Phase 6", "Phase 7 — FR-026") to use the feature name ("User OBO"), the relevant spec FR, or nothing. Two test files renamed (phase3-* → obo-*). Surfaces unrelated to this PR (enterprise deploy roadmap in main.bicep, portal-authz roadmap, Foundry proposal phases, algorithm-internal dehydrate() step labels) deliberately left intact
  • Persist OBO spec + neutralize internal-product references (df01c88): persists the User OBO Propagation spec to docs/specs/user-obo-propagation.md so the FR-XXX / SC-XXX citations scattered across source comments, docs, tests, and skills resolve to a real document in this repo. Replaces internal-product/cluster identifiers across 18 files (docs, skills, proposals, fixtures, builder templates, source comments) with neutral placeholders (downstream consumer / ExampleApp / <aks-cluster> / <resource-group>) so the OSS surface stays product-neutral
  • Remove non-runnable live-smoke GHA workflow scaffold (71a812e): the shipped .github/workflows/live-smoke-obo.yml could not actually run (loaded gitignored per-stamp .env, required Azure subscription / OIDC trust this repo does not provision). Removed the workflow + its actionlint-shape test, reworded FR-028 to "deferred — future work" with the gating prerequisites documented inline, and added a brief "CI workflow (future work)" section to docs/operations/live-smoke.md explaining how an operator can add one later. The CLI driver remains the supported local-operator path
  • Plugin tools contract + OBO smoke plugin migration (22fd8f9): introduces the tools field on plugin.json and the registerTools(worker) export contract (atomic, collision-safe, fail-closed); migrates the live-smoke harness from examples/obo-smoke/ to packages/obo-smoke-plugin/ consuming the new contract end-to-end. 17 new unit cases pin the contract's failure modes
  • Default surface cleanup (4d4b484): removes OBO_SMOKE_* blocks from deploy/envs/template.env, deploy/gitops/worker/overlays/default/.env, and the deploy/scripts/lib/compose-env.mjs sentinel-fallback. Adds deploy/envs/template.smoke.env as the opt-in overlay (operators copy these keys into a per-stamp env when running smoke). 3 new compose-env tests including a directory-walk invariant that prevents OBO_SMOKE_* from being reintroduced into deploy/scripts/lib/
  • Multi-stage Dockerfile with opt-in smoke variant (cb0a915): splits Dockerfile.worker into base → runtime-smoke → runtime with runtime as the LAST stage, so bare docker build produces a smoke-free image. deploy/scripts/lib/build-image.mjs accepts variant: "default" | "smoke"; the smoke variant appends -smoke to the image tag and passes --target runtime-smoke to buildx. 5 static-parse tests (dockerfile-worker.test.mjs) + 1 call-site audit test (build-call-sites.test.mjs) pin the convention against future drift
  • Documentation sweep (f28459f): 22-file sweep across operator docs (docs/operations/live-smoke.md, docs/specs/user-obo-propagation.md FR-025/026/027 reframing), schema-doc surfaces (plugin-architecture-guide, configuration, system-reference, building-apps for cli + sdk), skills (pilotswarm-new-env-deploy, pilotswarm-obo-smoke-app-reg, pilotswarm-aks-deploy, pilotswarm-release), agents (pilotswarm-npm-deployer), setup READMEs, plugin-internal docs, CHANGELOG, and .github/copilot-instructions.md. All examples/obo-smoke references updated to packages/obo-smoke-plugin; OBO_SMOKE_ENABLED reframed as a stamp marker rather than a worker boot gate
  • Public PluginManifest type (12fe822): re-exports the typed manifest interface from pilotswarm-sdk so plugin authors get TypeScript-level validation of their plugin.json contents. Corrects a pre-existing doc bug in docs/plugin-architecture-guide.md §7 (the example used object form for tools; the loader only accepts string form). 2 new unit cases verify the public re-export and validate the runtime shape of every checked-in plugin.json
  • Drop internal phase-numbering leak (878113a): rewrites a comment in packages/obo-smoke-plugin/tools.js that referenced internal PAW phase numbering (Phase-4 outcome family); the shipped source carries no internal workflow phase labels
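The AKV unwrap retry mentioned in the final-review must-fixes (backoff of 500ms, 2s, 5s before falling through to service_unavailable) might look roughly like the sketch below. It assumes one initial attempt plus one retry per backoff entry, which the PR text does not spell out; `unwrapWithRetry`, the `UnwrapResult` shape, and the injectable `sleep` are hypothetical.

```typescript
const BACKOFF_MS = [500, 2000, 5000] as const;

type UnwrapResult =
  | { ok: true; dek: string }
  | { ok: false; outcome: "service_unavailable" };

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: one initial attempt, then one retry after each backoff delay.
async function unwrapWithRetry(
  unwrap: () => Promise<string>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<UnwrapResult> {
  // null marks the final attempt, after which there is nothing left to wait for.
  const schedule: Array<number | null> = [...BACKOFF_MS, null];
  for (const delay of schedule) {
    try {
      return { ok: true, dek: await unwrap() };
    } catch {
      if (delay === null) break;
      await sleep(delay);
    }
  }
  return { ok: false, outcome: "service_unavailable" };
}
```

Injecting `sleep` keeps the loop testable without real timers; a caller that gets `ok: false` would continue the turn with principal-only context.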

Coordination with downstream consumer specs

Lookup contract and outcome taxonomy were locked through coordinated review with downstream consumer per-user-delegation work:

  • Lookup return shape ({ principal, accessToken, accessTokenExpiresAt }) and accessToken: null semantics confirmed.
  • Outcome reason-code taxonomy pinned; portal re-auth keys off reasonCode.
  • Bicep oboKekUamiPrincipalIds array contract supports both single-UAMI and dual-UAMI (PilotSwarm reference shape) deployments via parameter file alone — no template fork.
  • KEK shape locked: RSA 2048, wrapKey/unwrapKey only, 365-day auto-rotate with prior-version retention; un-versioned OBO_KEK_KID accepted, encrypt-time records versioned URL.
  • Re-rooting (sub-agent → portal entry) covered by chain-resolution-at-lookup-time; downstream consumer tools require no special handling.
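The pinned reason-code taxonomy lends itself to a single-source-of-truth pattern: derive both the compile-time union and the runtime set from one `as const` tuple so they cannot drift. The sketch below illustrates that pattern only; the four codes and the two exported names come from this PR, while the helper's return shape and the `kind` field are assumptions.

```typescript
// Single private source of truth for the pinned reason codes.
const REASON_CODES = [
  "reauth_required",
  "mfa_refresh",
  "conditional_access",
  "consent_required",
] as const;

// Compile-time union derived from the tuple.
type InteractionRequiredReasonCode = (typeof REASON_CODES)[number];

// Runtime set derived from the same tuple.
const INTERACTION_REQUIRED_REASON_CODES: ReadonlySet<string> = new Set(REASON_CODES);

function isReasonCode(value: string): value is InteractionRequiredReasonCode {
  return INTERACTION_REQUIRED_REASON_CODES.has(value);
}

// Hypothetical helper shape; the shipped interactionRequired() signature
// is not shown in this PR description.
function interactionRequired(reasonCode: string): {
  kind: "interaction_required";
  reasonCode: InteractionRequiredReasonCode;
} {
  if (!isReasonCode(reasonCode)) {
    throw new TypeError(`unknown reasonCode: ${reasonCode}`); // reject-on-unknown
  }
  return { kind: "interaction_required", reasonCode };
}
```

Adding a fifth code then means editing the tuple once; the union and the runtime check both follow.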

Operator-visible config

  • Portal: PORTAL_AUTH_PROVIDER=entra, PORTAL_AUTH_ENTRA_TENANT_ID, PORTAL_AUTH_ENTRA_CLIENT_ID, PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE (e.g. api://<worker-app>/.default offline_access).
  • Worker: OBO_KEK_KID (versioned or un-versioned AKV key URL), WORKLOAD_IDENTITY_CLIENT_ID.
  • Both pods need Key Vault Crypto User on the OBO KEK AKV.
  • Live-smoke (opt-in only): build worker image with --variant smoke, compose deploy/envs/template.smoke.env into the per-stamp env, set PLUGIN_DIRS=/app/packages/obo-smoke-plugin. See docs/operations/live-smoke.md.

Testing

  • 70+ OBO-specific unit tests across envelope-crypto, obo-envelope-shape, obo-envelope-roundtrip, obo-no-plaintext-in-queue, obo-smoke-plugin-loadable, plugin-tools-contract, plugin-manifest-type — all pass.
  • 215/215 deploy-script tests (including new dockerfile-worker.test.mjs, build-call-sites.test.mjs, and the directory-walk invariant guarding the default deploy surface against OBO_SMOKE_* reintroduction) — all pass.
  • tsc --noEmit clean across packages.
  • Live-tenant smoke checklist (docs/operations/live-smoke.md) — operator-run release gate; must be exercised on the publishing stamp before npm publish for any OBO-touching release. The pilotswarm-release SKILL now codifies this. Full feature spec lives at docs/specs/user-obo-propagation.md.

Pre-existing flakes (NOT caused by this PR)

  • force-module.test.mjs — Node 24 mock.module API change, pre-dates this branch.
  • 2 cases in obo-envelope-roundtrip.test.js — require live LLM provider env, expected to be skipped/red in offline CI.

Breaking changes

None. The envelope is additive on the wire (existing fields unchanged; new accessTokenCipher is optional). Workers without OBO crypto config fall back gracefully (selectEnvelopeCrypto returns null, plaintext principal-only envelopes continue to flow). Existing non-portal callers (local TUI hosts) are unaffected: getUserContextForSession returns null. Public getDownstreamToken({ interactive }) option is backwards-compatible (existing callers pass no args). The plugin tools field is purely additive; existing plugins without it keep working unchanged. The smoke harness path moved (examples/obo-smoke/ → packages/obo-smoke-plugin/), but operators are not expected to import from the old path; the Setup-OboSmokeWorkerApp.ps1 paste-block emits the new PLUGIN_DIRS line automatically.

Deployment

  • Version bump to 0.1.36 across pilotswarm-sdk, pilotswarm-cli, pilotswarm-web.
  • New runtime dep: @azure/keyvault-keys: ^4.10.0 (required when OBO is active; lazy-loaded otherwise).
  • New env vars in worker overlay: OBO_KEK_KID, PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE (sentinel __PS_UNSET__ keeps non-OBO deployments unchanged).
  • Bicep AKV-KEK provisioning + RBAC parameterized via oboKekUamiPrincipalIds.
  • Worker image is now multi-stage (Dockerfile.worker); default docker build resolves to the smoke-free runtime target. Smoke stamps must build with --variant smoke (docker buildx build --target runtime-smoke).
  • Operator runbook: docs/operations/obo-kek-runbook.md.
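To make the variant convention concrete, here is a rough sketch of how a build helper could map a variant onto an image tag and buildx arguments. It is not the contents of `build-image.mjs`: `buildPlan` and its return shape are hypothetical, and only the `-smoke` tag suffix, the `--target runtime-smoke` flag, and the "default build passes no target" behaviour are taken from the description above.

```typescript
type Variant = "default" | "smoke";

interface BuildPlan {
  tag: string;
  args: string[]; // arguments that would follow `docker`
}

// Hypothetical helper; the real build-image.mjs may be organised differently.
function buildPlan(imageTag: string, variant: Variant = "default"): BuildPlan {
  const tag = variant === "smoke" ? `${imageTag}-smoke` : imageTag;
  const args = ["buildx", "build", "-f", "Dockerfile.worker", "-t", tag];
  if (variant === "smoke") {
    // Only the smoke variant names a stage explicitly.
    args.push("--target", "runtime-smoke");
  }
  // With no --target, the build resolves to the last stage (runtime).
  args.push(".");
  return { tag, args };
}
```

Keeping `runtime` as the last stage is what lets the default branch omit `--target` entirely and still produce a smoke-free image.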

Deferred (intentional, tracked in plan candidates)

  • Structured PII redaction in queue payloads & log capture — separate follow-on workstream. Plaintext principal claims (email, displayName) ride on every worker-bound RPC envelope; data category is not new (already in cms.sessions.owner at rest), but the new wire surface increases the audit footprint from 3 → ~50 sites. Deferred so a future workstream can land redaction uniformly across both old and new sites rather than scope-creeping the OBO feature.
  • Smoke driver extraction into a standalone pilotswarm-smoke package — deferred to its own spec; the driver currently lives inside pilotswarm-cli. Compartmentalization at the worker-image level (this PR) is sufficient for the OBO release.
  • Generic plugin lifecycle hooks (init(worker) / dispose()) — deferred to its own spec; the tools field landed in this PR covers the immediate OBO need without committing to a broader lifecycle API.

🐾 Generated with PAW

chkraw and others added 30 commits June 3, 2026 11:41
User OBO Propagation Phase 1A foundation. Type-checks clean; remaining
Phase 1B work (orchestration plumbing, client.ts, portal runtime, CLI
transport, sendMessage bug fix, 8 test files) deferred to a fresh
session for careful focus on orchestration replay-determinism.

New modules:
- packages/sdk/src/envelope-crypto.ts: EnvelopeCrypto interface with
  three backends. AkvEnvelopeCrypto lazy-loads @azure/keyvault-keys and
  @azure/identity (no hard dep for non-OBO consumers); supports KEK
  rotation by building per-message clients targeting the cipher's
  exact key version. InMemoryEnvelopeCrypto uses an in-process RSA-2048
  keypair for tests/local-dev. PlaintextEnvelopeCrypto refuses to start
  when NODE_ENV=production. selectEnvelopeCrypto(env) factory enforces
  the four selection rules from the implementation plan.
- packages/sdk/src/user-context-store.ts: minimal in-memory store
  (setUserContext/clear/getRaw). Phase 2 will add parent-map and
  chain-walking lookup.

Wiring:
- types.ts: UserEnvelope, EnvelopeCipher, UserEnvelopeCarrier, UserContext,
  PrincipalClaims types added.
- session-manager.ts: SessionManager constructor accepts EnvelopeCrypto |
  null and owns a UserContextStore; getter methods exposed.
- worker.ts: calls selectEnvelopeCrypto(process.env) at startup and
  threads the result to SessionManager.
- session-proxy.ts: runTurn activity input gains optional envelope
  field (UserEnvelopeCarrier); handler reads principal claims directly
  (always plaintext) and decrypts accessTokenCipher only when present.
  Decrypt failures populate principal-only so identity-aware tools still
  function. Population happens regardless of accessTokenCipher presence
  (satisfies Spec P1 scenario 2: no-OBO-scope deployments still get
  principal claims via the lookup). destroySession clears the
  UserContextStore entry.
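The session-proxy behaviour described above (principal always populated, token decrypted only when a cipher is present, decrypt failure degrading to principal-only) could be expressed as a small pure function. Everything named here is an assumption for illustration: the `Carrier` shape, the placement of `accessTokenExpiresAt` on the carrier, the epoch-millisecond expiry, and `toUserContext` itself.

```typescript
// Hypothetical shapes; the shipped UserEnvelopeCarrier and UserContext may differ.
interface PrincipalClaims {
  provider: string;
  subject: string;
  email?: string;
  displayName?: string;
}

interface Carrier {
  principal: PrincipalClaims;
  accessTokenCipher?: string | null;
  accessTokenExpiresAt?: number | null; // assumed epoch milliseconds
}

interface UserContext {
  principal: PrincipalClaims;
  accessToken: string | null;
  accessTokenExpiresAt: number | null;
}

type Decrypt = (cipher: string) => string;

function toUserContext(carrier: Carrier, decrypt: Decrypt | null): UserContext {
  const principalOnly: UserContext = {
    principal: carrier.principal,
    accessToken: null,
    accessTokenExpiresAt: null,
  };
  // No cipher or no crypto backend: principal claims still flow.
  if (!carrier.accessTokenCipher || decrypt === null) return principalOnly;
  try {
    return {
      principal: carrier.principal,
      accessToken: decrypt(carrier.accessTokenCipher),
      accessTokenExpiresAt: carrier.accessTokenExpiresAt ?? null,
    };
  } catch {
    // Decrypt failure: identity-aware tools keep working without a token.
    return principalOnly;
  }
}
```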

Plaintext token material is held only in pod memory; never logged,
never persisted, never written to durable queue or activity history
(FR-020 / FR-023; full no-plaintext-in-queue test coverage lands in
Phase 1B).

Refs:
- Spec FR-002, FR-007, FR-020, FR-023, FR-024
- ImplementationPlan.md Phase 1
- coordination-notes.md (Waldemort alignment ce27995, a2eb20d)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thread the user envelope (plaintext principal + optional accessTokenCipher)
through the SDK transport boundary so a future Phase 2 lookup can resolve
user context for tool handlers.

Orchestration (replay-safe via conditional spread - frozen orchestration_1_0_*
versions intentionally untouched):
- queue.ts captures envelope at all 4 stash sites; FIFO batch tracks last-wins
  merged envelope; processAnswer threads envelope into processPrompt.
- turn.ts processPrompt accepts optional envelope; runTurn yield includes
  envelope only when present (old histories with no envelope replay clean).
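One way to read "replay-safe via conditional spread" is that the activity input must serialize identically to pre-change histories whenever no envelope exists. The sketch below illustrates that reading together with the last-wins batch merge; `RunTurnInput`, `buildRunTurnInput`, and `mergeBatchEnvelope` are hypothetical names, not the orchestration code.

```typescript
// Hypothetical, trimmed-down shapes for illustration.
interface Envelope {
  principal: { subject: string };
}

interface RunTurnInput {
  sessionId: string;
  prompt: string;
  envelope?: Envelope;
}

// The envelope key is added only when a value exists, so inputs built
// without one serialize exactly as they did before the field was introduced.
function buildRunTurnInput(sessionId: string, prompt: string, envelope?: Envelope): RunTurnInput {
  return { sessionId, prompt, ...(envelope ? { envelope } : {}) };
}

// Last-wins merge over a FIFO batch: later messages override earlier ones,
// and messages without an envelope leave the merged value untouched.
function mergeBatchEnvelope(batch: Array<{ envelope?: Envelope }>): Envelope | undefined {
  let merged: Envelope | undefined;
  for (const item of batch) {
    if (item.envelope !== undefined) merged = item.envelope;
  }
  return merged;
}
```

Writing `envelope: undefined` instead of omitting the key would be harmless after JSON serialization, but omitting it keeps in-memory comparisons and any non-JSON serializers consistent too.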

Public API:
- client.ts: send/sendAndWait/createSessionForAgent accept envelope opt;
  carried into the enqueue JSON payload.
- management-client.ts: sendMessage/sendAnswer accept envelope.

Portal:
- runtime.js: buildUserEnvelope(authContext) helper; attached on the 3
  prompt-bearing RPCs (sendMessage, sendAnswer, createSessionForAgent).
  Phase 3 will populate accessTokenCipher; today it ships as null.

CLI:
- node-sdk-transport.js: envelope passthrough on createSessionForAgent /
  sendMessage / sendAnswer; fixes long-standing bug where sendMessage's
  normal path dropped sendOptions on session.send().

Tests (6 files, 33 unit-style tests pass; SDK + repo build clean):
- envelope-crypto, obo-envelope-shape, obo-no-plaintext-in-queue (FR-020
  sentinel guard), runtime-envelope-completeness, sendmessage-options-flow,
  obo-envelope-roundtrip (integration; deferred to live env).

Live smoke deferred to integration env (local .env lacks GITHUB_TOKEN).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flagged by paw-impl-review for Phase 1. Stray ESC (0x1B) bytes had
been inserted into the carrier-shape JSDoc, producing rendered text
'velope (NOT velopeCipher)' and 'unTurn'. Replaced with the correct
'envelope (NOT envelopeCipher)' and 'runTurn'.

No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Formalizes the UserContextStore with two maps (parent map + entries),
exposes getUserContextForSession() from pilotswarm-sdk, and wires the
lifecycle so tool handlers can synchronously resolve the active user
for any session (including sub-agents) without leaking token material.

UserContextStore (extended from Phase 1):
- parentMap (sessionId -> {parentSessionId, isSystem}): structural
  metadata only, no token material; persists across dehydrate so
  descendants can still resolve to the portal-bound ancestor.
- entries (sessionId -> UserContext): plaintext, cleared on dehydrate
  AND terminal state; never persisted, never dehydrated.
- bindParent(), hasParentBinding(), lookup() with bounded chain walk
  (FR-008/009/021/022 honored: system root -> null, broken chain ->
  null, depth cap 32 -> warn + null).
- clear() vs clearParent() with documented semantics.
- Defensive copies on lookup/getRaw so handlers cannot mutate state.
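A bounded chain walk with the properties listed above might be sketched like this. It assumes the walk returns the nearest ancestor that holds an entry and that a system binding anywhere on the path yields null; `UserContextStoreSketch` and its trimmed `UserContext` shape are illustrative, and only the method names, the depth cap of 32, and the null-returning cases come from the commit message.

```typescript
interface UserContext {
  principal: { subject: string };
  accessToken: string | null;
  accessTokenExpiresAt: number | null;
}

interface ParentBinding {
  parentSessionId: string | null;
  isSystem: boolean;
}

const MAX_DEPTH = 32;

class UserContextStoreSketch {
  private readonly parentMap = new Map<string, ParentBinding>();
  private readonly entries = new Map<string, UserContext>();

  bindParent(sessionId: string, parentSessionId: string | null, isSystem = false): void {
    this.parentMap.set(sessionId, { parentSessionId, isSystem });
  }

  setUserContext(sessionId: string, ctx: UserContext): void {
    this.entries.set(sessionId, ctx);
  }

  lookup(sessionId: string): UserContext | null {
    const visited = new Set<string>();
    let current: string | null = sessionId;
    for (let depth = 0; current !== null; depth++) {
      if (depth >= MAX_DEPTH) {
        console.warn(`user-context chain exceeded depth ${MAX_DEPTH}`);
        return null;
      }
      if (visited.has(current)) return null; // cycle defense
      visited.add(current);
      const binding: ParentBinding | undefined = this.parentMap.get(current);
      if (binding !== undefined && binding.isSystem) return null; // system root
      const entry: UserContext | undefined = this.entries.get(current);
      if (entry !== undefined) {
        // Defensive copy so handlers cannot mutate stored state.
        return { ...entry, principal: { ...entry.principal } };
      }
      if (binding === undefined) return null; // broken chain
      current = binding.parentSessionId;
    }
    return null;
  }
}
```

Resolving at lookup time rather than copying the context into each child means a token refreshed on the root is visible to sub-agents immediately.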

SessionManager.getOrCreate now performs a one-shot bounded CMS chain
walk on first-time-on-this-worker session activation, populating
parent-map entries for every ancestor. NO orchestration change — the
plan's original 'thread isSystem through runTurn input' would have
broken replay for existing system sessions; CMS walk avoids that.

worker-registry.ts (new): AsyncLocalStorage-based active worker resolution.
- registerSessionManager / unregisterSessionManager (tied to
  PilotSwarmWorker start success / stop finally).
- runWithSessionManager() wraps the runTurn activity body so tool
  handlers resolve to the worker hosting them.
- getUserContextForSession() — public, synchronous, importable.
  Returns null on no worker, multi-worker ambiguity outside ALS,
  system root, broken chain, unknown session.
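The registry behaviour described above (scoped resolution inside the activity body, single-worker fallback outside it, null on ambiguity) maps naturally onto Node's `AsyncLocalStorage`. The following is a minimal sketch of that idea rather than the contents of `worker-registry.ts`: the four function names come from the commit message, while `SessionManagerLike` and the trimmed `UserContext` are assumptions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Trimmed, hypothetical shapes.
interface UserContext {
  principal: { subject: string };
  accessToken: string | null;
}

interface SessionManagerLike {
  lookup(sessionId: string): UserContext | null;
}

const activeManager = new AsyncLocalStorage<SessionManagerLike>();
const registered = new Set<SessionManagerLike>();

function registerSessionManager(manager: SessionManagerLike): void {
  registered.add(manager);
}

function unregisterSessionManager(manager: SessionManagerLike): void {
  registered.delete(manager);
}

// Wraps an activity body so lookups inside it resolve to the hosting worker.
function runWithSessionManager<T>(manager: SessionManagerLike, fn: () => T): T {
  return activeManager.run(manager, fn);
}

function getUserContextForSession(sessionId: string): UserContext | null {
  const scoped = activeManager.getStore();
  if (scoped !== undefined) return scoped.lookup(sessionId);
  // Outside any scope: fall back only when exactly one worker is registered.
  if (registered.size !== 1) return null;
  for (const only of registered) return only.lookup(sessionId);
  return null;
}
```

Because `AsyncLocalStorage` follows the async call chain, tool handlers awaited inside the wrapped activity body see the same manager without it being passed explicitly.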

Lifecycle hooks added:
- _dehydrateUnlocked clears the user-context entry; parent-map
  binding persists.
- destroySession activity clears both entry and parent binding.
- worker.stop() unregisters in finally so a crashing shutdown still
  drops the registry slot.

Tests (3 unit files, 36 tests pass; 33 Phase 1 tests still pass):
- user-context-store.test.js (21): FR coverage, chain walk including
  intermediate-evicted (Gemini #1), token refresh propagation, cycle
  defense, child-becomes-root, clear/clearParent semantics, input
  normalization.
- user-context-registry.test.js (12): ALS resolution under multi-worker,
  single-worker fallback, ambiguity-returns-null, defensive copy,
  cross-worker isolation.
- user-context-dehydration.test.js (3): blob-store.ts / session-store.ts
  static guard + JSON.stringify exclusion.

Public API locked for Waldemort:
  import { getUserContextForSession } from 'pilotswarm-sdk';
  (sessionId: string) => UserContext | null
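A tool handler consuming this lookup could follow the pattern below. This is a sketch under several assumptions: the lookup is injected rather than imported so the example runs standalone, the `ToolResult` shapes are simplified stand-ins for the SDK's outcome helpers, and mapping a null token to `reauth_required` is one possible tool-level choice, not a rule from the spec.

```typescript
// Shapes follow the lookup contract described in this PR; optional fields trimmed.
interface UserContext {
  principal: { provider: string; subject: string; email?: string; displayName?: string };
  accessToken: string | null;
  accessTokenExpiresAt: number | null;
}

// Stand-in for `import { getUserContextForSession } from 'pilotswarm-sdk'`.
type Lookup = (sessionId: string) => UserContext | null;

// Simplified, hypothetical result shapes.
type ToolResult =
  | { kind: "ok"; subject: string }
  | { kind: "service_unavailable"; message: string }
  | { kind: "interaction_required"; reasonCode: "reauth_required" };

function whoAmI(sessionId: string, lookup: Lookup): ToolResult {
  const ctx = lookup(sessionId);
  if (ctx === null) {
    // No envelope for this session: refuse rather than synthesize a principal.
    return { kind: "service_unavailable", message: "no user context for session" };
  }
  if (ctx.accessToken === null) {
    // Identity is known but no usable token is available.
    return { kind: "interaction_required", reasonCode: "reauth_required" };
  }
  // A real handler would run its OBO exchange with ctx.accessToken here.
  // The token itself is never logged or returned.
  return { kind: "ok", subject: ctx.principal.subject };
}
```

Note that claims are read from `ctx.principal`, not from the top level of the returned object.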

Live integration (obo-lookup-integration.test.js, obo-replay-expiry.test.js)
deferred to env with GITHUB_TOKEN.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Acquire downstream-scope token (api://<worker-app>/.default + offline_access) at portal sign-in via MSAL; cache {accessToken, accessTokenExpiresAt}; refresh forceRefresh=true within 5 min of expiry. Forward via /api/rpc body's auth field (TLS-only, never headers/WS). Server middleware stamps onto req.auth.principal. PortalRuntime owns its own EnvelopeCrypto via selectEnvelopeCrypto(env) and encrypts the token at envelope-build time so plaintext never lands on the durable queue (FR-020). Encryption failure or absent crypto -> ship principal-only; safe-by-default.
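The near-expiry refresh rule can be isolated into a small decision function. The sketch below assumes `accessTokenExpiresAt` is an epoch-millisecond timestamp and uses a synchronous injected `acquire` callback purely to keep the example short; real MSAL acquisition is asynchronous, and `needsRefresh` and `getTokenSketch` are hypothetical names.

```typescript
const REFRESH_WINDOW_MS = 5 * 60 * 1000;

interface CachedToken {
  accessToken: string;
  accessTokenExpiresAt: number; // assumed epoch milliseconds
}

// True when nothing is cached or the token expires within the refresh window.
function needsRefresh(cached: CachedToken | null, now: number = Date.now()): boolean {
  return cached === null || cached.accessTokenExpiresAt - now <= REFRESH_WINDOW_MS;
}

// Hypothetical cache wrapper around an injected acquisition callback.
function getTokenSketch(
  cache: { current: CachedToken | null },
  acquire: (opts: { forceRefresh: boolean }) => CachedToken,
  now: number = Date.now(),
): CachedToken {
  const cached = cache.current;
  if (cached !== null && !needsRefresh(cached, now)) return cached;
  // Force a refresh only when replacing a near-expiry token.
  const fresh = acquire({ forceRefresh: cached !== null });
  cache.current = fresh;
  return fresh;
}
```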

Spec A-8 misconfiguration handled: MSAL rejection logs metadata-only and admission still succeeds with principal-only envelope.

Public API: selectEnvelopeCrypto + EnvelopeCrypto/UserEnvelope/EnvelopeCipher types now exported from pilotswarm-sdk; re-exported from pilotswarm-cli/portal so portal package can use the factory without a direct sdk dep.

Tests: 12 new unit tests (phase3-runtime-envelope-encrypt + phase3-server-auth-body); all 77 OBO unit tests across Phase 1+2+3 pass; full repo build clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ailable)

Implements the return-side carriers for the Structured tool outcome
family (Spec FR-010, SC-005, FR-024) so worker tools can signal
machine-distinguishable IdP re-auth requirements and transport-layer
service failures back through the SDK to the portal.

- New helpers interactionRequired() / serviceUnavailable() in
  packages/sdk/src/tool-outcomes.ts; types + marker constant in types.ts;
  exports threaded through index.ts.
- session-proxy.ts enrichToolCompletionEventData runs in the onEvent
  callback on every tool.execution_complete event: detects the marker
  at top level / data.result / data.toolResult, populates
  data.outcome + data.outcome_payload (sanitized to per-kind allow-list),
  and strips the raw marker before CMS persistence (FR-020).
- FR-024 auto-emission: persistent envelope-decrypt failure in runTurn
  records a synthetic system.tool_outcome with reasonCode
  akv_unwrap_failure; turn continues with principal-only context.
- Shared UI history.js renders the two new outcome kinds with distinct
  icons/colors (interaction_required -> reauth, service_unavailable ->
  warn); system.tool_outcome rendered as labeled row.
- Tuner observability: getStructuredOutcomeEvents() +
  getFleetStructuredOutcomeStats() on PilotSwarmManagementClient; paired
  inspect-tools read_session_structured_outcomes +
  read_fleet_structured_outcome_stats inside the !isTuner guard.
- TUI SKILL.md updated with the new rendering rules.
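The enrichment step in the list above (find the marker at one of three sites, copy an allow-listed payload, strip the raw marker) can be sketched as a pure function. The marker key, the `kind` field, and the allow-list contents below are all placeholders chosen for illustration; the shipped marker constant and per-kind allow-lists are not shown in this commit message.

```typescript
type Rec = Record<string, unknown>;
type Kind = "interaction_required" | "service_unavailable";

const MARKER = "__structuredOutcome"; // placeholder marker key

// Placeholder allow-lists; the real per-kind fields may differ.
const ALLOW: Record<Kind, readonly string[]> = {
  interaction_required: ["reasonCode"],
  service_unavailable: ["reasonCode", "message"],
};

function isRec(v: unknown): v is Rec {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isKind(v: unknown): v is Kind {
  return v === "interaction_required" || v === "service_unavailable";
}

// Splits a holder into its marker and a copy with the marker removed.
function takeMarker(holder: unknown): { marker: Rec; rest: Rec } | null {
  if (!isRec(holder)) return null;
  const marker = holder[MARKER];
  if (!isRec(marker)) return null;
  const rest: Rec = { ...holder };
  delete rest[MARKER];
  return { marker, rest };
}

function enrichToolCompletion(data: Rec): Rec {
  let out: Rec = { ...data };
  let found: Rec | null = null;
  const top = takeMarker(out);
  if (top !== null) {
    out = top.rest;
    found = top.marker;
  }
  for (const key of ["result", "toolResult"]) {
    const nested = takeMarker(out[key]);
    if (nested !== null) {
      out[key] = nested.rest; // raw marker never reaches persistence
      if (found === null) found = nested.marker;
    }
  }
  if (found === null) return out;
  const kind: unknown = found.kind;
  if (!isKind(kind)) return out;
  const payload: Rec = {};
  for (const field of ALLOW[kind]) {
    if (field in found) payload[field] = found[field];
  }
  return { ...out, outcome: kind, outcome_payload: payload };
}
```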

Tests: 3 new unit files / 32 tests pass (helpers, enrichment, stats);
100 OBO unit tests across Phase 1+2+3+4 pass. Full repo build clean.

Live integration tests deferred to env with GITHUB_TOKEN per the same
constraint that deferred Phase 1/2/3 live smoke.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ships the reference smoke plugin in examples/obo-smoke/ (Spec FR-018)
as the release gate for the User OBO Propagation feature surface.

- examples/obo-smoke/index.js: obo_smoke_whoami + obo_smoke_force_reauth
  with the OBO_SMOKE_WORKER_APP_* env namespace; env read at handler
  invocation (never at module import) so the module cannot accidentally
  activate real-OBO when imported into a non-smoke worker.
- examples/obo-smoke/README.md: install snippet, mode decision matrix,
  FR-015 note, no-token-logging guarantee.
- examples/obo-smoke/SMOKE_CHECKLIST.md: live-tenant + local-developer
  variants with token-leak grep step and post-smoke secret cleanup.
- packages/sdk/test/local/obo-smoke-plugin-loadable.test.js: 10 tests
  asserting module imports, tool registration, marker shape on
  force_reauth, structured mode values on whoami, env-import-time
  invariant, doc presence.

110 OBO unit tests pass across Phase 1+2+3+4+5; full repo build clean.

Live-tenant smoke is the manual maintainer-executed release gate per
FR-018 and is documented in SMOKE_CHECKLIST.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Bicep: conditional OBO KEK in base-infra AKV (RSA-2048, 365d rotation), array-shaped oboKekUamiPrincipalIds role-assignment loop (Key Vault Crypto User), oboKekKid output (emits __PS_UNSET__ sentinel when OBO disabled)

- Deploy wiring: OUTPUT_ALIAS, compose-env fallback, overlay-contracts, template.env + overlay .env files for OBO_ENABLED / OBO_KEK_KID / PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE

- Versions: pilotswarm-sdk 0.1.35 -> 0.1.36 (cli + portal already at 0.1.36)

- Docs: new docs/operations/obo-kek-runbook.md, new docs/sdk/user-context.md; updates to configuration.md, builder-agents.md, devops sample README

- Deferred: skills updates and dev-plaintext-mode-warning unit test (behavior already covered in envelope-crypto.ts)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…der template

Addresses paw-impl-review findings on commit c5c48df:

1. Add missing @azure/keyvault-keys dep to packages/sdk (dynamically imported by AkvEnvelopeCrypto but absent from package.json)

2. Add PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE to worker overlay .env and compose-env fallback. Worker's selectEnvelopeCrypto requires the scope to engage the AKV backend; without it, OBO decrypt path stays disabled even when OBO_KEK_KID is wired

3. Pin EnvelopeCipher.kekKid to the versioned key URL returned by wrapResult.keyID instead of the un-versioned env value. KEK rotation with prior-version retention requires ciphertext to record the exact wrapping version so decrypt can target it

4. Add lookup + outcome-helper guidance to pilotswarm-sdk-builder SKILL.md (templates that already reference tool-handler patterns)
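
Item 3 above (pinning `kekKid` to the versioned key URL) can be illustrated with a minimal sketch. The interfaces, the `buildEnvelope` helper, and the injected `wrapKey` function are hypothetical stand-ins so the example runs offline; only the idea of recording `wrapResult.keyID` instead of the configured un-versioned value comes from the commit text.

```typescript
// Hypothetical shapes for illustration; not the SDK's actual EnvelopeCipher definition.
interface WrapResultSketch {
  keyID: string; // versioned key URL reported by the wrap operation
  result: Uint8Array; // wrapped DEK bytes
}

interface EnvelopeSketch {
  kekKid: string;
  wrappedDek: Uint8Array;
}

// `wrapKey` stands in for the Key Vault wrap call and is injected (assumption).
function buildEnvelope(
  configuredKid: string,
  dek: Uint8Array,
  wrapKey: (kid: string, dek: Uint8Array) => WrapResultSketch,
): EnvelopeSketch {
  const wrapped = wrapKey(configuredKid, dek);
  // Record the versioned URL from the wrap result, not the un-versioned
  // configured value, so decrypt can target the exact key version after rotation.
  return { kekKid: wrapped.keyID, wrappedDek: wrapped.result };
}

// Fake wrapper that pretends the vault resolved the key to version "v2".
const fakeWrap = (kid: string, dek: Uint8Array): WrapResultSketch => ({
  keyID: `${kid}/v2`,
  result: dek,
});

const envelope = buildEnvelope(
  "https://example-vault.vault.azure.net/keys/obo-kek", // placeholder URL
  new Uint8Array([1, 2, 3]),
  fakeWrap,
);
console.log(envelope.kekKid);
```

With this shape, a later decrypt can read `kekKid` from the envelope and unwrap against the same key version even after the KEK has rotated.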

Tests: 70 OBO unit tests pass, 33 deploy tests pass, SDK typecheck clean

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ly, FR-011 portal auto-reauth

F1 (must-fix): obo-smoke whoami tool was reading userContext.provider/subject/email/displayName, but the lookup contract returns { principal: { provider, ... }, accessToken, accessTokenExpiresAt }. Without this fix the SC-007 release-gate Graph round-trip would have shipped principal-undefined fields. Loadable test only exercised the null path so the bug slipped past it.
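
The shape mismatch behind F1 can be sketched as follows. The `PrincipalSketch` and `UserContextSketch` types and the `describeUser` helper are hypothetical and inferred from the contract quoted above; the real SDK types may differ (for example in how `accessTokenExpiresAt` is represented).

```typescript
// Hypothetical shapes inferred from the lookup contract described in F1.
interface PrincipalSketch {
  provider: string;
  subject: string;
  email?: string;
  displayName?: string;
}

interface UserContextSketch {
  principal: PrincipalSketch;
  accessToken: string | null; // null signals that no token is available
  accessTokenExpiresAt: number | null; // representation is an assumption
}

// Identity fields live under `principal`, not on the top-level object.
function describeUser(ctx: UserContextSketch | null): string {
  if (ctx === null) return "no user context";
  const { provider, subject } = ctx.principal;
  const tokenState = ctx.accessToken === null ? "no token" : "token present";
  return `${provider}:${subject} (${tokenState})`;
}

const sampleContext: UserContextSketch = {
  principal: { provider: "entra", subject: "user-123" },
  accessToken: null,
  accessTokenExpiresAt: null,
};

console.log(describeUser(sampleContext)); // entra:user-123 (no token)
```

Reading `ctx.provider` directly, as the pre-fix tool did, would yield `undefined` for every identity field even though the lookup succeeded.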

F4 (should-fix): PilotSwarmWorker.stop() left _started=true and _provider!=null when this.runtime.shutdown threw, since the Phase 2 try/finally refactor left those assignments inside try. Move them into finally alongside unregisterSessionManager so a failing shutdown still hard-resets the worker state and registry slot.
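
A minimal sketch of the F4 pattern, assuming a simplified worker. `SketchWorker` and its fields are hypothetical stand-ins for the real class; the point is only that state resets placed in `finally` run even when shutdown rejects.

```typescript
// Simplified, hypothetical stand-in for the worker described in F4.
class SketchWorker {
  started = true;
  provider: object | null = {};
  sessionManagerRegistered = true;
  shutdownRuntime: () => Promise<void>;

  constructor(shutdownRuntime: () => Promise<void>) {
    this.shutdownRuntime = shutdownRuntime;
  }

  async stop(): Promise<void> {
    try {
      await this.shutdownRuntime();
    } finally {
      // Runs whether or not shutdown threw, so the worker never stays half-stopped.
      this.started = false;
      this.provider = null;
      this.sessionManagerRegistered = false; // stands in for unregisterSessionManager
    }
  }
}

const worker = new SketchWorker(async () => {
  throw new Error("shutdown failed");
});

worker.stop().catch(() => {
  console.log(worker.started, worker.provider); // false null
});
```

The rejection still propagates to the caller; only the bookkeeping is made unconditional.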

F3 (should-fix): close the FR-011 wire by parameterising the auth provider's getDownstreamToken to accept { interactive } and adding a transport-level subscriber in browser-transport.js that observes interaction_required outcomes (tool.execution_complete and synthetic system.tool_outcome) on incoming session events and fire-and-forgets an interactive downstream-token acquisition. Debounced per session id (~30s) with a global in-flight guard to prevent popup storms; errors swallowed; existing manual sign-out/sign-in path remains. Next worker-bound RPC carries the freshly-acquired token via the existing cache-aware getDownstreamToken path (SC-006).
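
The per-session debounce plus global in-flight guard from F3 might look roughly like this. All names, the injectable clock, and the exact 30 000 ms window are assumptions for illustration; this is not the code in browser-transport.js.

```typescript
// Hypothetical sketch of a debounced, single-flight interactive re-auth trigger.
const REAUTH_DEBOUNCE_MS = 30000;

function createReauthTrigger(
  acquireInteractive: () => Promise<unknown>,
  now: () => number = Date.now,
) {
  const lastTriggered = new Map<string, number>();
  let inFlight = false; // global guard: at most one interactive acquisition at a time

  return function maybeTrigger(sessionId: string): boolean {
    const t = now();
    const last = lastTriggered.get(sessionId);
    // Per-session debounce: ignore repeats inside the window.
    if (last !== undefined && t - last < REAUTH_DEBOUNCE_MS) return false;
    // Global guard: another acquisition is already running.
    if (inFlight) return false;

    lastTriggered.set(sessionId, t);
    inFlight = true;
    // Fire-and-forget: success and failure both just release the guard.
    const release = (): void => {
      inFlight = false;
    };
    acquireInteractive().then(release, release);
    return true;
  };
}

// Usage: a fake acquisition that resolves immediately.
const trigger = createReauthTrigger(() => Promise.resolve("token"));
console.log(trigger("session-a")); // true
console.log(trigger("session-a")); // false
```

The second call returns `false` because the same session id was triggered within the debounce window; a different session would be blocked only while the first acquisition is still in flight.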

F2 / F5 / F6 / F7: F2 was a false positive (all three packages already at 0.1.36). F5 (fail-fast on malformed OBO_KEK_KID) intentionally kept — A-8 covers acquisition failure, not operator misconfig. F6 / F7 are consider-severity, deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
copilot-instructions.md: new `User OBO (User-On-Behalf-Of) Propagation` section codifying the architecture invariants — wire field is `envelope` (not `envelopeCipher`), AKV-wrapped DEK + AES-256-GCM, three crypto backends with lazy AKV imports, worker-side synchronous lookup contract, `accessToken: null` as universal absence signal, structured outcome family with pinned reason codes, portal-side ~5-min near-expiry refresh, single-tenant assumption, FR-014 trust boundary (worker tools must not synthesize principals from CMS owner), Bicep `oboKekUamiPrincipalIds` array contract supporting both single- and dual-UAMI deployments, and the live-tenant smoke gate.

pilotswarm-tui/SKILL.md: update line 52 — portal hosts no longer `may add` interactive re-auth affordances; the auto-reauth wire is implemented at the WebSocket transport layer (browser-transport.js) with per-session ~30s debounce and a global in-flight guard. Shared activity rendering remains identical across hosts.

pilotswarm-release/SKILL.md: new `OBO Live-Tenant Smoke Gate` section — when a release touches the OBO surface, the examples/obo-smoke/ round-trip becomes a release-gate artifact alongside the standard checklist. Confirms KEK rotation safety (versioned wrapResult.keyID), AKV RBAC, and that PlaintextEnvelopeCrypto never appears in a production envelope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pto)

Closes the gap flagged during final-review wrap-up: the loud console.warn fired by `selectEnvelopeCrypto` (envelope-crypto.ts:321) when a deployment selects `PlaintextEnvelopeCrypto` via OBO_ENVELOPE_PLAINTEXT_MODE=1 was previously only validated by the live-tenant smoke gate, post-build. A regression silencing that warning at the factory layer would have shipped to npm before any operator noticed.

Adds three new cases to envelope-crypto.test.js:

  1. Selecting plaintext backend emits exactly one console.warn whose payload mentions 'envelope-crypto', 'OBO_ENVELOPE_PLAINTEXT_MODE', and 'NOT encrypted'.

  2. Selecting the AKV backend emits NO plaintext-mode warning (filters out unrelated warnings just in case).

  3. Returning null (OBO disabled, no scope) emits no warnings at all.

Pure unit tests, no live worker / no DB. 17 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sanity-checked the deploy code itself end-to-end on this branch — wiring is complete (template.env defaults, base-infra Bicep param + keyvault module + OUTPUT_ALIAS, overlay-contracts whitelisting, compose-env sentinel fallback, worker + 3 portal overlay .env defaults, configMapGenerator/envFrom flow). 33 deploy tests pass. The only gap was the operator-facing skills + agent quick-reference.

pilotswarm-new-env-deploy/SKILL.md: T2 inventory line now mentions the conditional OBO KEK; new `User OBO Propagation` block in the Step 2 input-surface table covers OBO_ENABLED + PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE with explanatory text on the oboKekUamiPrincipalIds array contract (single-UAMI vs split-UAMI) and a pointer to docs/operations/obo-kek-runbook.md; Step 6 verification block adds az + kubectl checks for the KEK, the Crypto-User role assignment, and the OBO_KEK_KID ConfigMap projection on both portal and worker.

pilotswarm-aks-deploy/SKILL.md: Core Learnings bullet documents that OBO lives on the npm/Bicep path, not the legacy bash path; rolling forward via scripts/deploy-aks.sh leaves the worker in non-OBO mode (FR-002 backwards-compat) until the operator manually wires OBO_KEK_KID + PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE into .env.remote.

pilotswarm-npm-deployer.agent.md: service-redeploy quick-reference table gains a row for `Toggle OBO User Context on a stamp`, calling out the base-infra bicep step + manifests,rollout re-render needed after editing OBO_ENABLED in the per-stamp .env.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 7 (FR-025/026/027/028, SC-017/018/019):
- examples/obo-smoke rewritten to use @azure/msal-node CCA with
  handler-time backend selection (FIC > client-secret precedence),
  per-call FIC clientAssertion re-read.
- New pilotswarm smoke <stamp> CLI driver (packages/cli/src/smoke/)
  with injectable deps, --auth {device-code,from-env},
  --skip-kube-bootstrap, structured JSON result envelope,
  exit-code semantics (0 pass / 1 smoke-failure / 2 preflight).
- New workflow_dispatch GitHub Actions scaffold for live OBO smoke
  (.github/workflows/live-smoke-obo.yml) with explicit kubeconfig
  bootstrap before az aks get-credentials.
- Worker bootstrap loads obo-smoke plugin BEFORE worker.start() when
  OBO_SMOKE_ENABLED=true (avoids tool-registration race with poller).
- Deploy plumbing: OBO_SMOKE_ENABLED propagated through template.env,
  worker overlay sentinel, compose-env sentinel-fill, and Dockerfile.
- Ops doc docs/operations/live-smoke.md + cross-link from OBO KEK
  runbook.
- Test fixtures: stage-manifests + foundry-substitute now seed
  OBO_KEK_KID / PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE / OBO_SMOKE_ENABLED
  to keep substituteOverlayEnv fail-closed gate passing.
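
The exit-code semantics listed for the smoke CLI driver can be sketched as a small lookup. The status names, the `finishSmoke` helper, and the JSON fields are hypothetical; only the 0 / 1 / 2 mapping comes from the text above.

```typescript
// Hypothetical status names; the numeric codes mirror the commit text.
type SmokeStatus = "pass" | "smoke-failure" | "preflight-failure";

const SMOKE_EXIT_CODES: Record<SmokeStatus, number> = {
  pass: 0,
  "smoke-failure": 1,
  "preflight-failure": 2,
};

// Illustrative result envelope; the real driver's JSON fields are not shown in the source.
function finishSmoke(stamp: string, status: SmokeStatus): { json: string; exitCode: number } {
  return {
    json: JSON.stringify({ stamp, status }),
    exitCode: SMOKE_EXIT_CODES[status],
  };
}

const smokeResult = finishSmoke("<stamp>", "preflight-failure");
console.log(smokeResult.json, smokeResult.exitCode);
```

Keeping preflight failures on a distinct code lets a caller tell "the environment was not ready" apart from "the smoke ran and failed".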

Final-review must-fixes:
- Finding 1 (FR-024 spec violation): session-proxy.ts envelope
  decrypt now uses bounded transient-retry-with-backoff (3 attempts,
  500ms/2s/5s) before falling through to the structured
  service_unavailable outcome. Previously every decrypt failure
  immediately became service_unavailable with no retry, contradicting
  the spec's "transient retry, then structured outcome" requirement.
- Finding 2 (FR-011 / SC-006 live-path bug): browser-transport
  maybeTriggerInteractiveReauth() now reads
  sessionEvent.eventType || sessionEvent.type. Previously it only
  read .type, silently missing every interaction_required event
  delivered over the live websocket (canonical SDK shape uses
  .eventType). New regression test
  packages/sdk/test/local/portal-interactive-reauth.test.js (6 tests)
  pins both shapes plus debounce and outcome-filtering behavior.
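
The bounded retry from Finding 1 could be sketched as below. This is an assumption-laden illustration, not the session-proxy.ts code: it reads "3 attempts, 500ms/2s/5s" as three retries after an initial call, treats every failure as transient, and uses hypothetical names throughout.

```typescript
// Hypothetical retry helper; the pairing of attempts and delays is an assumption.
const RETRY_DELAYS_MS = [500, 2000, 5000];

type DecryptOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; outcome: "service_unavailable" };

async function decryptWithRetry<T>(
  decrypt: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<DecryptOutcome<T>> {
  try {
    return { ok: true, value: await decrypt() };
  } catch {
    // Fall through to the bounded retries below.
  }
  for (const delay of RETRY_DELAYS_MS) {
    await sleep(delay); // back off before the next attempt
    try {
      return { ok: true, value: await decrypt() };
    } catch {
      // Keep retrying until the delay schedule is exhausted.
    }
  }
  // Only after the retries are spent does the structured outcome surface.
  return { ok: false, outcome: "service_unavailable" };
}
```

Injecting `sleep` keeps the helper testable without real timers; production code would likely also distinguish transient from permanent errors before retrying.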

Test status:
- packages/sdk/test/local/portal-interactive-reauth.test.js: 6/6
- packages/sdk/test/local/obo-smoke-driver.test.js: 8/8
- packages/sdk/test/local/obo-smoke-auth-backend.test.js: 10/10
- packages/sdk/test/local/obo-smoke-plugin-loadable.test.js: 10/10
- packages/sdk/test/local/tool-outcomes-*.test.js + structured-outcomes-stats: clean
- deploy-scripts npm script: 197/197

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…K URL doc consistency

Finding 4: Add INTERACTION_REQUIRED_REASON_CODES set and InteractionRequiredReasonCode union; interactionRequired() now rejects reason codes outside the pinned taxonomy (reauth_required, mfa_refresh, conditional_access, consent_required) at helper-call time, preventing downstream consumers from fragmenting the portal's behavior contract.

Finding 6: Update .env.example to show un-versioned OBO_KEK_KID with explanatory comment, aligning with docs/configuration.md and docs/operations/obo-kek-runbook.md (which were already correct).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r overlay + agent/skill docs

Closes the gap where flipping OBO_SMOKE_ENABLED=true on a stamp would register the obo_smoke_* tools but the per-stamp downstream-app identity was not projected into the worker ConfigMap, leaving the smoke plugin's auth backend to fast-fail with serviceUnavailable({ reasonCode: 'smoke_misconfigured' }).

- deploy/envs/template.env: add OBO_SMOKE_WORKER_APP_TENANT_ID/_CLIENT_ID/_GRAPH_SCOPE + OBO_SMOKE_TEST_USER_UPN with empty defaults and explanatory comments. AKS path is workload-identity FIC; CLIENT_SECRET is local-dev only.

- deploy/gitops/worker/overlays/default/.env: project the four new keys with __PS_UNSET__ sentinel defaults so substitute-env passes on non-smoke stamps.

- deploy/scripts/lib/compose-env.mjs: extend the OBO_SMOKE_ENABLED fallback loop to cover the new keys.

- deploy/scripts/test/foundry-substitute.test.mjs: add the new keys to both fixture envs to keep substitute-env happy.

- .github/agents/pilotswarm-npm-deployer.agent.md: new task row for OBO_SMOKE_ENABLED toggle workflow.

- .github/skills/pilotswarm-new-env-deploy/SKILL.md: add the new env keys to the per-stamp .env table, an explanatory paragraph mirroring the OBO Phase 6 paragraph, and post-deploy verification kubectl/pilotswarm-smoke snippet.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lumbing notes

- CHANGELOG.md: clarify that interactionRequired reason codes are pinned (helper throws on unknown values), and document the Phase 7 deploy-pipeline plumbing (template.env / compose-env / worker overlay / npm-deployer agent).

- docs/operations/live-smoke.md: add a paragraph after the per-stamp env table explaining how compose-env + worker overlay project the OBO_SMOKE_WORKER_APP_* keys into the worker ConfigMap, and call out the production-stamp safety invariant.

- docs/sdk/user-context.md: clarify reason-code enforcement, mention the new exported INTERACTION_REQUIRED_REASON_CODES set + InteractionRequiredReasonCode union.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add env-file schema rows for OBO_KEK_KID (sourced from base-infra bicep output via the FR-022 alias map) and the optional OBO_SMOKE_* worker-overlay block (default off, AKS uses workload-identity FIC, never enable on production stamps).

- Add cross-reference links to docs/operations/obo-kek-runbook.md and docs/operations/live-smoke.md so operators discover the runbooks from the canonical npm-deploy README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Setup-OboSmokeWorkerApp.ps1 (sidecar-only, never edits .env), new pilotswarm-obo-smoke-app-reg skill, npm-deployer Step 0.b wiring + tightened POSIX-ERE grep gate, and docs sync (live-smoke, obo-kek-runbook, auth/README, SMOKE_CHECKLIST, CHANGELOG, new-env-deploy skill). Closes the last manual gap in the Phase 7 live-smoke harness -- OBO_SMOKE_ENABLED=true is now a true one-line opt-in.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dReasonCode to single source of truth

Two findings from paw-final-review on the user-obo-propagation work:

1. .github/copilot-instructions.md:179 still referenced the early-draft name getCurrentUserContextForSession; the locked, shipped API is getUserContextForSession (verified across packages/sdk/src/index.ts, docs/sdk/user-context.md, docs/configuration.md, examples/obo-smoke/index.js, CHANGELOG.md). Self-inflicted drift in this PR. Fix renames the contributor doc to the correct symbol — downstream consumers (microsoft/waldemort) read PilotSwarm's contributor docs as the source-of-truth handshake.

2. packages/sdk/src/types.ts declared InteractionRequiredReasonCode (union type) and INTERACTION_REQUIRED_REASON_CODES (ReadonlySet) as independent literals. They agreed today but were vulnerable to future drift — adding a code to one but not the other would silently break the contract (the helper's runtime set.has() would reject codes the type system accepts, or vice versa). Refactored to derive both from a single private as-const tuple. Public export shape (ReadonlySet) preserved — no docs/CHANGELOG updates needed, no breaking change to downstream consumers.
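
The single-source-of-truth refactor in item 2 can be sketched as follows. The four reason codes come from the earlier commit text; the tuple name, the type guard, and the outcome shape are hypothetical, and the set is typed as `ReadonlySet<string>` here purely to keep the sketch short.

```typescript
// One private tuple; the union type and the runtime set are both derived from it.
const REASON_CODE_TUPLE = [
  "reauth_required",
  "mfa_refresh",
  "conditional_access",
  "consent_required",
] as const;

type InteractionRequiredReasonCode = (typeof REASON_CODE_TUPLE)[number];

const INTERACTION_REQUIRED_REASON_CODES: ReadonlySet<string> = new Set<string>(REASON_CODE_TUPLE);

function isReasonCode(value: string): value is InteractionRequiredReasonCode {
  return INTERACTION_REQUIRED_REASON_CODES.has(value);
}

// Hypothetical outcome shape; the real helper's return type is not shown in the source.
function interactionRequired(reasonCode: string): {
  kind: "interaction_required";
  reasonCode: InteractionRequiredReasonCode;
} {
  if (!isReasonCode(reasonCode)) {
    // Reject codes outside the pinned taxonomy at helper-call time.
    throw new Error(`Unknown interactionRequired reason code: ${reasonCode}`);
  }
  return { kind: "interaction_required", reasonCode };
}

console.log(interactionRequired("mfa_refresh").reasonCode); // mfa_refresh
```

Adding a code now means editing the tuple only; the type and the runtime check cannot drift apart.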

Tests: tool-outcomes-helpers + tool-outcomes-enrichment + structured-outcomes-stats all green (33/33). Build clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: comments and docs referring to internal PAW phases (e.g., 'Phase 6', 'Phase 7 — FR-026') don't mean anything to future readers or downstream consumers, who have no context on the planning artifacts where those phases were defined.

Rewrites all PAW-phase references in shipped surfaces to one of: the underlying feature name ('User OBO'), the relevant spec FR (e.g., 'FR-011', 'FR-026'), or nothing when the label was the only content. Two test files renamed: phase3-runtime-envelope-encrypt → obo-runtime-envelope-encrypt and phase3-server-auth-body → obo-server-auth-body.

Left intentionally unchanged:

  - main.bicep 'Phase 4' references — enterprise deploy roadmap, not PAW

  - deploy/scripts/README.md Foundry-Entra 'Phase 2/3' note — proposal phases tracked in docs/proposals/, not PAW

  - docs/configuration.md + CHANGELOG.md '(Phase 1)' on Authorization engine — pre-existing portal-authz roadmap, not OBO

  - session-manager.ts dehydrate() 'Phase 1: Destroy / Phase 2: Persist' — algorithm-internal step labels, not PAW

Build clean. 166 tests pass (142 OBO unit + 18 deploy invariants + 6 portal-reauth).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Persist User OBO Propagation spec to docs/specs/user-obo-propagation.md
  so FR-XXX / SC-XXX citations across source and docs resolve to a real
  document in the repo.
- Add spec pointers from live-smoke.md and the new-env deploy skill.
- Replace internal-product/cluster identifiers across docs, skills,
  proposals, fixtures, builder templates, and source comments with
  neutral placeholders (downstream consumer / ExampleApp / <aks-cluster>
  / <resource-group>) so the OSS surface stays product-neutral.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shipped .github/workflows/live-smoke-obo.yml could not actually run:
it loaded deploy/envs/local/<stamp>/.env (gitignored, not present on the
branch) and required AZURE_* OIDC secrets that this repo does not
currently provision. Keeping a workflow that can't run was a foot-gun.

Removed:
  - .github/workflows/live-smoke-obo.yml
  - deploy/scripts/test/live-smoke-workflow.test.mjs (its actionlint-shape test)
  - SC-019 (asserted the workflow parsed) from docs/specs/user-obo-propagation.md
  - P7 acceptance-scenario #4 (workflow_dispatch run)
  - 'Workflow trigger surface stays narrow' invariant from live-smoke.md
  - 'Workflow scaffold' section from live-smoke.md (replaced with brief
     'CI workflow (future work)' note explaining the gating prerequisites
     for adding one later)

Reworded FR-028 to 'deferred — future work' explicitly calling out the
gitignored-env and no-CI-subscription prerequisites. The CLI driver
('pilotswarm smoke <stamp> --profile obo') remains the supported
local-operator path; adding a workflow_dispatch-only workflow later when
operators have a CI environment is still a one-file addition.

Updated CHANGELOG, npm-deployer agent, new-env-deploy skill, package.json
deploy-scripts test list to match. 205/205 deploy-script tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SMOKE_CHECKLIST.md collapsed 220→75 lines. The old Steps 1–8 (manual
AAD app registration, manual portal/worker .env edits, manual tool
invocation, manual CMS-event inspection, manual token-leak grep) are
all superseded by the shipped automation:
  - Setup-OboSmokeWorkerApp.ps1 provisions the worker app + FIC
  - Setup-PortalAuth.ps1 provisions the portal app
  - deploy.mjs projects the env into the worker ConfigMap
  - pilotswarm smoke <stamp> --profile obo drives the tools and
    emits the JSON pass/fail record

Checklist is now a release-gate sign-off form pointing at
docs/operations/live-smoke.md for operational detail. Kept:
pre-flight (unit tests + build), AKS-deployed flow (promoted to the
canonical path), local-developer variant, sign-off, after-smoke
cleanup.

README.md: dropped the 'Backend auto-selection' note that restated
the env-presence table directly above it (87→78 lines).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the worker plugin contract with a `tools` field so plugins can
declaratively register their tools at worker.start(), and migrate the
OBO smoke harness to a first-class workspace package that loads via
the new contract.

Plugin contract (PilotSwarmWorker):
- Add `tools` field to PluginManifest (plugin.json) alongside
  forward-compat `agents` / `skills` fields.
- registerTools(tools, contributor?) is now atomic — pre-validates the
  whole batch for collisions before mutating the registry. Tags every
  registration with a contributor label (`worker-builtin`,
  `app-inline`, or the plugin name) so collision errors name the
  source on both sides.
- New ToolNameCollisionError (exported) with contributor-aware message.
- `tools` field is app-tier only; warns + ignores on system/management
  tier.
- Missing pluginDirs entries now hard-fail at construction (was
  warn-and-skip) — operator misconfig should not silently no-op.
- New _registerPluginTools() runs at worker.start() before duroxide init,
  with wrapped errors for all five failure modes (missing module, import
  failure, no export, sync throw, async reject).
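
The atomic, contributor-tagged registration described above can be sketched with a two-pass registry. The class names, the default contributor label, and the error message wording are hypothetical; the real `registerTools` and `ToolNameCollisionError` carry more detail.

```typescript
// Hypothetical minimal tool shape and registry for illustration only.
interface ToolSketch {
  name: string;
}

class ToolNameCollisionSketchError extends Error {
  constructor(toolName: string, existing: string, incoming: string) {
    super(`Tool "${toolName}" from "${incoming}" collides with the one registered by "${existing}"`);
    this.name = "ToolNameCollisionSketchError";
  }
}

class ToolRegistrySketch {
  private readonly owners = new Map<string, string>();

  registerTools(tools: ToolSketch[], contributor: string = "app-inline"): void {
    // Pass 1: validate the whole batch (against the registry and against itself)
    // before mutating anything.
    const seen = new Set<string>();
    for (const tool of tools) {
      const existing = this.owners.get(tool.name);
      if (existing !== undefined) {
        throw new ToolNameCollisionSketchError(tool.name, existing, contributor);
      }
      if (seen.has(tool.name)) {
        throw new ToolNameCollisionSketchError(tool.name, contributor, contributor);
      }
      seen.add(tool.name);
    }
    // Pass 2: commit, tagging each tool with its contributor.
    for (const tool of tools) this.owners.set(tool.name, contributor);
  }

  has(name: string): boolean {
    return this.owners.has(name);
  }
}

const registry = new ToolRegistrySketch();
registry.registerTools([{ name: "whoami" }], "worker-builtin");
try {
  registry.registerTools([{ name: "fresh_tool" }, { name: "whoami" }], "obo-smoke");
} catch (err) {
  console.log((err as Error).message);
}
console.log(registry.has("fresh_tool")); // false
```

Because validation finishes before any write, the colliding batch leaves no partial registrations behind.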

OBO smoke plugin (packages/obo-smoke-plugin/):
- Moved from examples/obo-smoke/ to a workspace package picked up by
  the existing packages/* glob.
- plugin.json declares tools: `./tools.js`; loads end-to-end via
  PilotSwarmWorker({ pluginDirs: [...] }).
- tools.js exports the new registerTools(worker) contract entry point
  while preserving all legacy exports.
- README rewritten to teach the PLUGIN_DIRS / pluginDirs contract and
  cross-reference Setup-OboSmokeWorkerApp.ps1.
- Setup-OboSmokeWorkerApp.ps1 paste-block now emits
  PLUGIN_DIRS=/app/packages/obo-smoke-plugin alongside the smoke AAD
  app env keys (4 lines -> 5).
- packages/sdk/examples/worker.js: removed OBO_SMOKE_ENABLED conditional
  dynamic-import block — smoke is loaded via the plugin contract now.
- deploy/Dockerfile.worker: COPY updated to packages/obo-smoke-plugin
  (multi-stage refactor to follow in a later change).

Tests:
- New plugin-tools-contract.test.js (17 tests) covers atomic batch
  registration, collision diagnostics, all five plugin-load failure
  modes, app-tier-only enforcement, and worker-builtin auto-registration.
- obo-smoke-plugin-loadable.test.js rewritten to load the smoke plugin
  end-to-end through the real plugin contract and assert tools land on
  worker.toolRegistry tagged with contributor `obo-smoke`.
- obo-smoke-auth-backend.test.js: import path updated.
- deploy-scripts test: paste-block bumped to 5 lines + PLUGIN_DIRS regex
  assertion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the OBO smoke harness fully opt-in by removing its env keys from
the default deploy surface and giving operators a clearly-scoped
opt-in template instead.

Default surface (smoke-free):
- deploy/envs/template.env: dropped the 5-key OBO_SMOKE_* block
  (TENANT_ID, CLIENT_ID, GRAPH_SCOPE, TEST_USER_UPN, OBO_SMOKE_ENABLED)
  and its surrounding documentation block.
- deploy/gitops/worker/overlays/default/.env: dropped the matching
  OBO_SMOKE_* sentinel block; default-deploy worker ConfigMap no
  longer references smoke keys.
- deploy/scripts/lib/compose-env.mjs: removed the OBO_SMOKE_*
  sentinel-fallback loop; compose-env no longer injects smoke keys
  on any deploy. Core OBO sentinels (OBO_KEK_KID,
  PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE) remain — those are the default
  OBO surface, not the smoke surface.
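
The resulting compose-env behavior (core OBO sentinels filled, smoke keys never injected) might be sketched like this. The function is illustrative rather than the code in compose-env.mjs, and treating an empty value as unset is an assumption.

```typescript
// Sentinel and key names come from the text above; the function itself is hypothetical.
const UNSET_SENTINEL = "__PS_UNSET__";
const CORE_OBO_KEYS = ["OBO_KEK_KID", "PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE"];

function fillCoreSentinels(env: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = { ...env };
  for (const key of CORE_OBO_KEYS) {
    const current = out[key];
    // Assumption: a missing or empty value counts as unset.
    if (current === undefined || current === "") out[key] = UNSET_SENTINEL;
  }
  // Smoke keys are never added here; operator-provided ones pass through untouched.
  return out;
}

console.log(fillCoreSentinels({ OBO_ENABLED: "false" }));
```

A default stamp therefore ends up with both core keys set to the sentinel and no `OBO_SMOKE_*` entries at all.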

Opt-in path:
- deploy/envs/template.smoke.env (new): contains the 5 OBO_SMOKE_*
  keys + PLUGIN_DIRS=/app/packages/obo-smoke-plugin with explicit
  opt-in documentation. Consumed only by operators running the OBO
  smoke harness against a dedicated smoke stamp; not loaded by
  default deploys.

Tests:
- deploy/scripts/test/foundry-substitute.test.mjs: dropped the now-
  unneeded OBO_SMOKE_* __PS_UNSET__ placeholders from the two
  stage-manifests test inputs.
- deploy/scripts/test/compose-env.test.mjs: added 3 invariant tests:
  * compose-env never injects OBO_SMOKE_* keys into a default env
  * OBO_SMOKE_* keys provided by the operator pass through untouched
  * no file in deploy/scripts/lib/ contains an OBO_SMOKE_ string
    literal (directory-walk invariant — guards against regression)

Verification: 209/209 deploy-script tests pass (was 206 + 3 new);
lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split Dockerfile.worker into a shared `base` stage plus two runtime targets:

- `runtime` (LAST stage, the implicit default for bare docker build): no obo-smoke-plugin directory in the image. Production-safe by default.

- `runtime-smoke`: adds COPY packages/obo-smoke-plugin. Callers must opt in explicitly via --target runtime-smoke.

build-image.mjs gains a `variant` parameter (default | smoke). Smoke variant is worker-only, appends -smoke to the image tag (no registry collisions), and passes --target runtime-smoke to buildx. Existing callers (deploy.mjs, deploy-aks.sh, reset-local.sh, docker-local-start.sh) use the default and continue to produce smoke-free images unchanged.

Two new static tests enforce the convention since live docker builds aren't part of unit-test CI:

- dockerfile-worker.test.mjs (5 cases): both stages exist, `runtime` is last, `runtime` doesn't COPY the smoke plugin, `runtime-smoke` places it at the canonical PLUGIN_DIRS path, `runtime` doesn't inherit from `runtime-smoke`.

- build-call-sites.test.mjs: walks scripts/, deploy/scripts/, .github/workflows/, .github/skills/ for docker build invocations against Dockerfile.worker and asserts every --target is empty/`runtime`/`runtime-smoke`. Asserts invocation count > 0 to prevent silent no-op.

Defense in depth: even if an operator sets PLUGIN_DIRS to the smoke plugin path against a default image, the plugin loader hard-fails at startup because the directory isn't present. Smoke cannot accidentally activate against production.
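
The variant handling described for build-image.mjs can be sketched as a pure planning function. `planWorkerBuild` and its return shape are hypothetical; the `-smoke` tag suffix and the `--target runtime-smoke` argument come from the commit text, and the default case relies on the stated convention that the production stage is the last one in the Dockerfile.

```typescript
// Hypothetical planner; not the actual build-image.mjs implementation.
type BuildVariant = "default" | "smoke";

function planWorkerBuild(
  baseTag: string,
  variant: BuildVariant = "default",
): { tag: string; buildxArgs: string[] } {
  if (variant === "smoke") {
    // Distinct tag avoids registry collisions; the smoke stage must be requested explicitly.
    return { tag: `${baseTag}-smoke`, buildxArgs: ["--target", "runtime-smoke"] };
  }
  // Default: no --target, so the last Dockerfile stage is built.
  return { tag: baseTag, buildxArgs: [] };
}

console.log(planWorkerBuild("worker:1.0.0"));
console.log(planWorkerBuild("worker:1.0.0", "smoke"));
```

Existing callers that never pass a variant keep producing the smoke-free image under the unchanged tag.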

Tests: 215/215 deploy-script tests pass (was 209, +5 dockerfile-worker, +1 build-call-sites). Lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t-in flow

Sweep all operator-facing and plugin-author-facing docs, skills, and agent prompts to match the compartmentalized OBO smoke flow:

- New 'In-process tool plugins' section in docs/plugin-architecture-guide.md describing the plugin.json `tools` field, registerTools(worker) export, atomic registration, and fail-closed loader behavior. Reference example points at packages/obo-smoke-plugin/.

- docs/operations/live-smoke.md rewritten: smoke is opt-in via three pieces — --variant smoke worker image build, deploy/envs/template.smoke.env overlay composed into the per-stamp .env, and PLUGIN_DIRS containing /app/packages/obo-smoke-plugin. OBO_SMOKE_ENABLED is no longer a worker boot gate; it's a stamp marker the smoke driver preflight checks.

- docs/specs/user-obo-propagation.md FR-025/026/027 reworded to reference the new contract path.

- Path renames everywhere: examples/obo-smoke/ → packages/obo-smoke-plugin/.

- Schema-doc surfaces (configuration, system-reference, getting-started, building-apps for cli + sdk) gain a brief `tools` field mention next to the existing `agents`/`skills` mentions, with a forward link to the plugin architecture guide.

- Skills updated: pilotswarm-new-env-deploy, pilotswarm-obo-smoke-app-reg, pilotswarm-aks-deploy, pilotswarm-release. Agent updated: pilotswarm-npm-deployer.

- Setup script READMEs updated: deploy/scripts/README.md, deploy/scripts/auth/README.md (Setup-OboSmokeWorkerApp.ps1 emitted env block now includes PLUGIN_DIRS=/app/packages/obo-smoke-plugin).

- CHANGELOG.md gains an entry calling out the operator-visible posture change; historical entry path references corrected.

- packages/obo-smoke-plugin/README.md and SMOKE_CHECKLIST.md path/version refreshed; README adds an opt-in section noting the smoke image variant.

- .github/copilot-instructions.md path reference updated.

Verified: zero `examples/obo-smoke` matches in shipped surfaces, zero internal/consumer-name leaks, no spec-only references (FR-XXX/SC-XXX/'Phase N') outside docs/specs/user-obo-propagation.md itself. tsc lint clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Promote the PluginManifest interface from internal-only to a public SDK type. Plugin authors writing in TypeScript can now import the typed manifest shape from pilotswarm-sdk to validate their plugin.json contents at compile time:

    import type { PluginManifest } from "pilotswarm-sdk";

    const manifest: PluginManifest = { name, version, tools: "./tools.js" };

Changes:

- packages/sdk/src/types.ts: drop the @internal JSDoc tag on PluginManifest and add a public-facing JSDoc with an @example block. Field shape is unchanged so the existing internal user (worker.ts loader) is unaffected.

- packages/sdk/src/index.ts: re-export the type next to the existing defineTool re-export, with a comment pointing authors at the plugin architecture guide.

- docs/plugin-architecture-guide.md (§7 In-process Tool Plugins): add a TypeScript snippet showing PluginManifest usage. Also fix a pre-existing doc bug: the example used object form (`tools: { module: "./index.js" }`), which the loader rejects; the loader only accepts string form (`tools: "./tools.js"`), now corrected to match worker.ts behavior and the in-repo reference plugin.

- packages/sdk/test/local/plugin-manifest-type.test.js (new, 2 tests): verifies the public re-export from index.ts and validates the runtime shape of every checked-in plugin.json under packages/ and examples/ against the typed contract. Anti-no-op assertion guards against silently no-op'ing if the discovery walk breaks.

Tests: 31/31 pass on the targeted plugin-contract test suites. tsc --noEmit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…js comment

Final-review nit: a comment in the smoke plugin handler referenced internal PAW phase numbering that's not meaningful to users or future agent sessions reading the shipped source. Reword to describe the structured-outcome family without the numeric tag.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
chkraw and others added 11 commits June 10, 2026 23:07
The new-env-deploy skill, npm-deployer agent, live-smoke.md, and template.smoke.env all implied the OBO live smoke required dedicated test-user tokens. In practice the smoke driver's default --auth device-code mode does an interactive Entra sign-in: the operator signs in as themselves and that token is what flows through the OBO chain. The --auth from-env mode (with OBO_SMOKE_USER_*_TOKEN env vars) is a CI-only fallback, not the default.

Reframe consistently:

- new-env-deploy SKILL: drop `dedicated test-user tokens` framing; describe device-code default and OBO_SMOKE_TEST_USER_UPN as an optional UPN-assertion knob

- live-smoke.md: rename `Test user` section to `Sign-in user`; lead with `you sign in as yourself`; call out dedicated test users as optional/situational

- template.smoke.env: reword TEST_USER_UPN comment as an optional canary (empty = accept whichever user signs in)

- pilotswarm-npm-deployer agent: correct the service-redeploy table row; bump version 1.1.0 -> 1.1.1 (patch, wording correction)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PilotSwarm itself does not target any specific downstream resource; consumer apps that build on PilotSwarm decide which Entra-protected resource to call. Reframe copilot-instructions.md and the OBO spec accordingly:

- copilot-instructions.md: replace `ADO is the first consumer` with `PilotSwarm itself does not call any specific downstream resource; consumer apps that build on PilotSwarm do.` Examples list Microsoft Graph, Azure DevOps, etc. neutrally.

- specs/user-obo-propagation.md: drop the `Azure DevOps is the first anticipated consumer` sentence and the parenthetical `(covered in the consumer spec for ADO)`.

Remaining ADO mentions are illustrative (alongside Microsoft Graph), out-of-scope clarifications, or pre-existing unrelated docs (bug reports, child-contract proposal).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two cleanups bundled:

1. Restore pre-existing internal-product references that an earlier commit (df01c88) over-eagerly neutralized. The intent of this PR was to avoid *adding* new references in OBO content, not to scrub historical references repo-wide. Files restored to origin/main wording where they only differed by neutralization: bug reports, proposals, builder templates, a portal-builder agent/skill, a session-refresh test, and three lines in the aks-deploy skill (cluster context, resource group, downstream-deployment guard).

2. Genericize local test-env identifiers introduced by this PR. live-smoke.md JSON examples and the OBO spec narrative used a real stamp name; Setup-OboSmokeWorkerApp.ps1 .EXAMPLE blocks used real local paths. Reframed as <stamp> / <env-name> placeholders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split Setup-OboSmokeWorkerApp.ps1 into two phases so nothing has to wait for bicep:

- `-Mode app-shell` (no OIDC dependency): creates the app, mints OAuth2 scope, declares Graph User.Read, pre-authorizes the portal app, emits .env paste block. Runs alongside portal app-reg, before bicep.

- `-Mode patch-fic` (requires bicep outputs): looks up existing app, create-or-patches the AKS workload-identity FIC against the emitted OIDC issuer. No .env changes.

- `-Mode all` (default; back-compat): both phases in one shot, requires bicep to have already run.

Mirrors how Setup-PortalAuth.ps1 patches SPA redirect URIs after AFD is known. Sidecar is phase-aware: app-shell writes ficIssuer=null, patch-fic merges it in.

Doc surfaces updated: pilotswarm-obo-smoke-app-reg/SKILL.md, pilotswarm-new-env-deploy/SKILL.md, pilotswarm-npm-deployer.agent.md (bump 1.1.1 -> 1.2.0), deploy/scripts/auth/README.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AKS workload-identity FIC lives on the Entra app, not in k8s. The worker pod boots fine without it; AAD only consults the FIC at OBO-exchange time. So patch-fic can simply run at the end of the deploy pipeline (after bicep + manifests + rollout), right before 'pilotswarm smoke', with no pod restart required.

Updated sequencing wording in pilotswarm-obo-smoke-app-reg/SKILL.md, pilotswarm-new-env-deploy/SKILL.md, pilotswarm-npm-deployer.agent.md, and the script's own stdout hints. The mid-deploy sandwich is gone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e skill

The OBO smoke worker app skill is the auto-provisioning path; the npm-deployer agent should drive it hands-free. Telling the agent to skip the wrapper whenever the operator had pre-pasted .env values created a manual escape hatch and risked stamps whose .env points at an app that still lacks the FIC, scope, or pre-authorization.

Reframed both surfaces: the wrapper is idempotent (re-runs are no-ops). To point at a manually-managed app, use -ExistingAppId rather than skipping the wrapper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ional

The prior docs implied tenant admin consent was required and per-user consent was an edge case. Reality is the opposite for most tenants (including Microsoft's): each user can consent to Graph User.Read on the worker SP at portal sign-in, and OBO works for that user thereafter. Admin consent is just a shortcut that skips the per-user prompt for shared stamps.

Reframed consent text across: pilotswarm-obo-smoke-app-reg/SKILL.md (Consent section + AADSTS65001 troubleshooting row + Mode help), pilotswarm-npm-deployer.agent.md, docs/operations/live-smoke.md, deploy/scripts/auth/README.md, and Setup-OboSmokeWorkerApp.ps1 (param help, failure warning, and final stdout note).

Also clarified that the Cloud Application Administrator role can grant consent for a single app without being Global Admin, since not every team has GA delegation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep keyvault.bicep focused on the vault + secret-tier RBAC. Move OBO Key Encryption Key creation, the Key Vault Crypto User role-assignment loop, and the OBO KEK output into a new single-responsibility module obo-kek.bicep. main.bicep instantiates it conditionally on oboEnabled and routes the existing oboKekKid output through it (preserving the __PS_UNSET__ sentinel for non-OBO stamps so the overlay .env substitution stays satisfied).

Behavior is unchanged: same role definition (Key Vault Crypto User), same role-assignment GUID (guid(keyVault.id, principalId, kvCryptoUserRoleId, 'obo-kek')), same key shape (RSA-2048, wrapKey/unwrapKey, 365-day rotation, 30-day notify, 730-day expiry), same OBO_KEK_KID output URL. No deployment-name change for the AKV module; the new module deploys as <prefix>-obo-kek-<dTime>.

Bicep build clean: exit 0, no new warnings (pre-existing Rotate/Notify casing + postgres password warnings unchanged).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Setup-OboSmokeWorkerApp.ps1 -Mode patch-fic now writes an eSTS-issuer
  FIC with the worker pod's UAMI as subject, instead of an AKS-direct
  FIC. The Microsoft CORP tenant rejects AKS-direct FIC on 3P apps despite
  the published aka.ms/PTMFICWiki doc. MSI-as-FIC is the supported
  CORP path and is also the SFI-aligned pattern for non-CORP tenants,
  so it's the new default. -FicPattern aks-direct fallback retained
  for tenants that allow it.
- Split combined Graph PATCH (define-scope + pre-authorize) into two
  PATCHes so a fresh worker app can be provisioned in one wrapper run
  (Graph rejects the combined call because the new scope id isn't yet
  persisted when pre-auth references it).
- Centralize --variant {default|smoke} tag suffixing in
  effectiveImageTag() (lib/service-info.mjs) so the build, push, and
  manifest steps all agree on the smoke-tagged image.
- Smoke plugin auth-backend selector switches FIC detection from
  AZURE_FEDERATED_TOKEN_FILE to WORKLOAD_IDENTITY_CLIENT_ID; MSAL
  clientAssertion callback now resolves a fresh UAMI access token via
  ManagedIdentityCredential against api://AzureADTokenExchange/.default
  on every invocation.
- Update spec FR-025 / FR-026 / SC-018, smoke checklist, deploy and
  auth READMEs, new-env / obo-smoke-app-reg skills, npm-deployer agent
  (1.3.0), and live-smoke ops doc to describe the MSI-as-FIC topology.
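
The variant tag centralization above can be illustrated with a minimal sketch. This is not the actual `effectiveImageTag()` from lib/service-info.mjs: the signature, the `-smoke` suffix format, and the error on unknown variants are all assumptions made for illustration. The point is only that build, push, and manifest steps call one helper instead of each appending a suffix on their own.

```javascript
// Hypothetical sketch; the real helper in lib/service-info.mjs may differ.
const KNOWN_VARIANTS = new Set(["default", "smoke"]);

function effectiveImageTag(baseTag, variant = "default") {
  if (!KNOWN_VARIANTS.has(variant)) {
    throw new Error(`unknown variant: ${variant}`);
  }
  // Assumed convention: "default" keeps the base tag, other variants append "-<variant>".
  return variant === "default" ? baseTag : `${baseTag}-${variant}`;
}

// Every deploy step derives the tag from the same call, so they cannot drift apart.
const buildTag = effectiveImageTag("1.2.3", "smoke");
const manifestTag = effectiveImageTag("1.2.3", "smoke");
console.log(buildTag === manifestTag, buildTag);
```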

Live-validated end-to-end on a fresh chkrawps10 stamp in CORP tenant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… FIC limitation docs, deploy ergonomics

Live-tenant validation on chkrawps10 (CORP) proved the User OBO Propagation
contract end-to-end: portal MSAL downstream-scope acquisition, envelope
encryption + per-RPC forwarding, getCurrentUserContextForSession lookup,
and the structured tool-outcome contract. This commit captures the fixes
discovered during that validation pass.

Tool reachability (native overlay pattern)
- packages/obo-smoke-plugin/agents/default.agent.md: declares the smoke
  tools in the agent overlay. Picked up by the plugin loader via
  _appDefaultToolNames and auto-inherited by every chat session on
  --variant smoke worker builds (canonical session-manager.ts visibility
  pipeline; pre-existing, not introduced here).
- packages/sdk/src/worker.ts: _warnOrphanPluginTools() warns at startup
  when a plugin registers a handler whose tool name is not claimed by
  any agent overlay. Catches the "registered but invisible" gap.
- packages/sdk/test/local/plugin-tools-contract.test.js + new
  plugin-with-claimed-tools fixture: regression coverage for the
  warn-on-orphan and silent-when-claimed paths (19/19 tests pass).
- docs/plugin-architecture-guide.md, packages/obo-smoke-plugin/README.md:
  document the two-half plugin contract — handler registration AND
  name declaration in an overlay.
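
As a rough illustration of the warn-on-orphan check described above, the sketch below compares registered handler names against the names declared by agent overlays. It is not the SDK's `_warnOrphanPluginTools()`: the function names, the `tools` array on each overlay, and the warning text are assumptions, and `example_unclaimed_tool` is a made-up tool name.

```javascript
// Hypothetical sketch; data shapes are assumed, not taken from packages/sdk/src/worker.ts.
function findOrphanPluginTools(registeredToolNames, agentOverlays) {
  const claimed = new Set();
  for (const overlay of agentOverlays) {
    for (const name of overlay.tools ?? []) claimed.add(name);
  }
  // A tool is an orphan when a handler exists but no overlay declares its name.
  return registeredToolNames.filter((name) => !claimed.has(name));
}

function warnOrphanPluginTools(registeredToolNames, agentOverlays, warn = console.warn) {
  const orphans = findOrphanPluginTools(registeredToolNames, agentOverlays);
  for (const name of orphans) {
    warn(`plugin tool "${name}" has a handler but no agent overlay declares it`);
  }
  return orphans;
}

const overlays = [{ name: "default", tools: ["obo_smoke_whoami"] }];
const collected = [];
warnOrphanPluginTools(
  ["obo_smoke_whoami", "example_unclaimed_tool"],
  overlays,
  (message) => collected.push(message),
);
console.log(collected);
```

Passing the `warn` sink as a parameter keeps the check easy to exercise in a test for both the warning path and the silent path.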

OBO + FIC limitation documentation
- docs/operations/obo-fic-limitations.md: new operator runbook capturing
  that MSI-as-FIC works for direct downstream resource access from a
  workload UAMI, but NOT as client_assertion for an OBO grant when the
  source UAMI is itself federated via AKS workload identity. AAD
  rejects FIC-derived tokens (xms_ficinfo claim) on the next federation
  with AADSTS700231, independent of FIC config correctness. FIC
  audience must be the URI form (api://AzureADTokenExchange); the GUID
  form yields AADSTS700214. Includes diagnostic recipe and unblock
  options (client secret, AKS-direct FIC where tenant policy allows,
  certificate). This is an AAD policy, not a PR #51 concern — the
  propagation contract is agnostic to client-credential shape.
- .github/skills/pilotswarm-obo-smoke-app-reg/SKILL.md: warning callout
  pointing operators at the new runbook.

Plugin OBO error enrichment
- packages/obo-smoke-plugin/tools.js: catch block for the OBO grant
  surfaces errorCode + subError + correlationId + truncated
  errorMessage so AADSTS codes are visible in the structured tool
  result rather than collapsed to a generic message.
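
A minimal sketch of this kind of enrichment follows. The four field names come from the description above and the `obo_failed` outcome name appears later in this commit message; the overall result shape, the 200-character limit, and the `acquireOboToken` stand-in are assumptions rather than the plugin's actual code.

```javascript
// Hypothetical sketch; not the catch block from packages/obo-smoke-plugin/tools.js.
const MAX_ERROR_MESSAGE_LENGTH = 200; // assumed truncation limit

function toOboFailedOutcome(err) {
  const message = String(err?.errorMessage ?? err?.message ?? "");
  return {
    outcome: "obo_failed",
    errorCode: err?.errorCode ?? "unknown_error",
    subError: err?.subError ?? null,
    correlationId: err?.correlationId ?? null,
    // Truncate so a long AADSTS message cannot bloat the structured tool result.
    errorMessage:
      message.length > MAX_ERROR_MESSAGE_LENGTH
        ? message.slice(0, MAX_ERROR_MESSAGE_LENGTH) + "..."
        : message,
  };
}

// acquireOboToken stands in for whatever call performs the OBO grant.
async function runOboGrant(acquireOboToken) {
  try {
    const token = await acquireOboToken();
    return { outcome: "ok", token };
  } catch (err) {
    return toOboFailedOutcome(err);
  }
}
```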

Deploy ergonomics carried with this validation pass
- deploy/scripts/lib/common.mjs: unique-per-build dirty image tags
  (timestamp suffix) so each rebuild produces a fresh tag, eliminating
  the "image push appears to have no effect" pitfall.
- deploy/scripts/lib/substitute-env.mjs + test: __PS_UNSET__ sentinel
  removes a key from a generated env file rather than emitting a blank.
- deploy/gitops/worker/overlays/default/.env: smoke-profile env keys
  added (driven from .env.remote via the substitute-env path above).
- packages/portal/src/auth/providers/entra.js: MSAL admission accepts
  id_token-only sign-ins for portal auth; OBO acquireTokenSilent uses
  the configured downstream scope list with consent-prompt hint when
  the cache is empty.
- packages/cli/src/smoke/cli.js: backtick fix in usage banner.
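
The `__PS_UNSET__` behavior listed above can be sketched as follows. This is not the code in deploy/scripts/lib/substitute-env.mjs: the `KEY=${NAME}` placeholder syntax, the function name, and the error on a missing value are assumptions. Only the sentinel string and the "remove the key instead of emitting a blank" rule come from the commit text.

```javascript
// Hypothetical sketch; the real substitute-env.mjs may use a different template syntax.
const UNSET_SENTINEL = "__PS_UNSET__";
const PLACEHOLDER_LINE = /^([A-Za-z_][A-Za-z0-9_]*)=\$\{([A-Za-z_][A-Za-z0-9_]*)\}$/;

function substituteEnv(template, values) {
  const out = [];
  for (const line of template.split("\n")) {
    const match = PLACEHOLDER_LINE.exec(line);
    if (!match) {
      out.push(line); // comments, blanks, and literal assignments pass through
      continue;
    }
    const [, key, source] = match;
    const value = values[source];
    if (value === undefined) throw new Error(`missing value for ${source}`);
    if (value === UNSET_SENTINEL) continue; // drop the key rather than emit "KEY="
    out.push(`${key}=${value}`);
  }
  return out.join("\n");
}

const rendered = substituteEnv(
  "PORT=${PORT}\nOBO_KEK_KID=${OBO_KEK_KID}",
  { PORT: "8080", OBO_KEK_KID: UNSET_SENTINEL },
);
console.log(rendered);
```

Under these assumptions a sentinel value still counts as "provided", so a non-OBO stamp satisfies the substitution without a blank key reaching the generated env file.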

Validation summary
- All 19 plugin-tools-contract tests pass.
- TypeScript build clean.
- Live smoke on chkrawps10: P-1..P-4 propagation chain green;
  obo_smoke_whoami returns the user principal + access token to the
  plugin and emits the structured obo_failed outcome with the AAD
  error code visible. The OBO grant itself is gated by tenant FIC
  policy (see new runbook) — not in PR #51 scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Brings in v0.2.0 release: VPN P2S ingress (#53), dual redirect URIs,
client-profile helper, gitignore .claude/, proposals doc revisions.

Conflict resolution (all 10 conflicts):
- package.json x3, packages/{cli,sdk}/package.json: took main's 0.2.0
  version bumps (our 0.1.36 bump was branch-local, never released).
- package.json (test:deploy-scripts): merged — kept our 3 test additions
  (build-call-sites, dockerfile-worker, setup-obo-smoke-worker-app) and
  picked up main's appgw-waf-rules.test.mjs.
- package-lock.json: took theirs + npm install reconcile.
- CHANGELOG.md: our entry restructured under '## Unreleased'; main's
  '## 0.2.0 — 2026-06-19' entry inserted below.
- deploy/envs/template.env, both base-infra bicep files, two skill
  markdowns: pure additive blocks (OBO vs VPN sections / params /
  bullets); both halves retained, ours-first ordering.

Validation:
- pilotswarm-sdk@0.2.0: tsc build clean.
- packages/sdk/test/local/plugin-tools-contract.test.js: 19/19 pass.
- npm run test:deploy-scripts: 281/281 pass (was 248 on main; +33
  from this branch's smoke/OBO test additions).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>