[User OBO Propagation] Per-user identity + access token propagation to worker tool handlers #51
Open
ChrisKrawczyk wants to merge 41 commits into
User OBO Propagation Phase 1A foundation. Type-checks clean; remaining Phase 1B work (orchestration plumbing, client.ts, portal runtime, CLI transport, sendMessage bug fix, 8 test files) deferred to a fresh session for careful focus on orchestration replay-determinism.
New modules:
- packages/sdk/src/envelope-crypto.ts: EnvelopeCrypto interface with three backends. AkvEnvelopeCrypto lazy-loads @azure/keyvault-keys and @azure/identity (no hard dep for non-OBO consumers); supports KEK rotation by building per-message clients targeting the cipher's exact key version. InMemoryEnvelopeCrypto uses an in-process RSA-2048 keypair for tests/local-dev. PlaintextEnvelopeCrypto refuses to start when NODE_ENV=production. selectEnvelopeCrypto(env) factory enforces the four selection rules from the implementation plan.
- packages/sdk/src/user-context-store.ts: minimal in-memory store (setUserContext/clear/getRaw). Phase 2 will add parent-map and chain-walking lookup.
Wiring:
- types.ts: UserEnvelope, EnvelopeCipher, UserEnvelopeCarrier, UserContext, PrincipalClaims types added.
- session-manager.ts: SessionManager constructor accepts EnvelopeCrypto | null and owns a UserContextStore; getter methods exposed.
- worker.ts: calls selectEnvelopeCrypto(process.env) at startup and threads the result to SessionManager.
- session-proxy.ts: runTurn activity input gains optional envelope field (UserEnvelopeCarrier); handler reads principal claims directly (always plaintext) and decrypts accessTokenCipher only when present. Decrypt failures populate principal-only so identity-aware tools still function. Population happens regardless of accessTokenCipher presence (satisfies Spec P1 scenario 2: no-OBO-scope deployments still get principal claims via the lookup). destroySession clears the UserContextStore entry.
Plaintext token material is held only in pod memory; never logged, never persisted, never written to durable queue or activity history (FR-020 / FR-023; full no-plaintext-in-queue test coverage lands in Phase 1B).
Refs:
- Spec FR-002, FR-007, FR-020, FR-023, FR-024
- ImplementationPlan.md Phase 1
- coordination-notes.md (Waldemort alignment ce27995, a2eb20d)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
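As a rough illustration of the in-process RSA-2048 backend described in this commit, the sketch below wraps a random AES-256-GCM data key with a locally generated RSA keypair. It is not the shipped InMemoryEnvelopeCrypto: the names `SketchCipher`, `encryptToken`, and `decryptToken`, the field layout, and the choice of RSA-OAEP with SHA-256 are all assumptions made for the example.

```typescript
import {
  constants,
  createCipheriv,
  createDecipheriv,
  generateKeyPairSync,
  privateDecrypt,
  publicEncrypt,
  randomBytes,
} from "node:crypto";

// Hypothetical cipher shape; the real EnvelopeCipher in types.ts may differ.
interface SketchCipher {
  wrappedDek: string; // data key wrapped by the RSA public key (base64)
  iv: string;         // 96-bit GCM nonce (base64)
  tag: string;        // GCM authentication tag (base64)
  ciphertext: string; // encrypted token (base64)
}

// In-process keypair, standing in for a KEK that never leaves the process.
const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });
const oaep = { padding: constants.RSA_PKCS1_OAEP_PADDING, oaepHash: "sha256" };

function encryptToken(plaintext: string): SketchCipher {
  const dek = randomBytes(32); // fresh data key per message
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", dek, iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const wrapped = publicEncrypt({ key: publicKey, ...oaep }, dek);
  return {
    wrappedDek: wrapped.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    ciphertext: body.toString("base64"),
  };
}

function decryptToken(c: SketchCipher): string {
  const dek = privateDecrypt({ key: privateKey, ...oaep }, Buffer.from(c.wrappedDek, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", dek, Buffer.from(c.iv, "base64"));
  decipher.setAuthTag(Buffer.from(c.tag, "base64"));
  const body = Buffer.concat([
    decipher.update(Buffer.from(c.ciphertext, "base64")),
    decipher.final(), // throws if the tag does not verify
  ]);
  return body.toString("utf8");
}
```

Only the `SketchCipher` value would travel on the wire in this sketch; the plaintext token exists solely inside `encryptToken` and `decryptToken`.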
Thread the user envelope (plaintext principal + optional accessTokenCipher) through the SDK transport boundary so a future Phase 2 lookup can resolve user context for tool handlers.
Orchestration (replay-safe via conditional spread; frozen orchestration_1_0_* versions intentionally untouched):
- queue.ts captures envelope at all 4 stash sites; FIFO batch tracks last-wins merged envelope; processAnswer threads envelope into processPrompt.
- turn.ts processPrompt accepts optional envelope; runTurn yield includes envelope only when present (old histories with no envelope replay clean).
Public API:
- client.ts: send/sendAndWait/createSessionForAgent accept envelope opt; carried into the enqueue JSON payload.
- management-client.ts: sendMessage/sendAnswer accept envelope.
Portal:
- runtime.js: buildUserEnvelope(authContext) helper; attached on the 3 prompt-bearing RPCs (sendMessage, sendAnswer, createSessionForAgent). Phase 3 will populate accessTokenCipher; today it ships as null.
CLI:
- node-sdk-transport.js: envelope passthrough on createSessionForAgent / sendMessage / sendAnswer; fixes long-standing bug where sendMessage's normal path dropped sendOptions on session.send().
Tests (6 files, 33 unit-style tests pass; SDK + repo build clean):
- envelope-crypto, obo-envelope-shape, obo-no-plaintext-in-queue (FR-020 sentinel guard), runtime-envelope-completeness, sendmessage-options-flow, obo-envelope-roundtrip (integration; deferred to live env).
Live smoke deferred to integration env (local .env lacks GITHUB_TOKEN).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
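The "conditional spread" mentioned for the runTurn yield can be sketched as follows. The `RunTurnInputSketch` shape and `buildRunTurnInput` helper are hypothetical stand-ins, not the code in turn.ts; the point is only that the `envelope` key is absent, rather than present with an undefined value, when no envelope was supplied.

```typescript
// Hypothetical shapes for illustration; the real activity input has more fields.
interface EnvelopeSketch {
  principal: { subject: string };
  accessTokenCipher: unknown | null;
}

interface RunTurnInputSketch {
  sessionId: string;
  prompt: string;
  envelope?: EnvelopeSketch;
}

function buildRunTurnInput(
  sessionId: string,
  prompt: string,
  envelope?: EnvelopeSketch,
): RunTurnInputSketch {
  return {
    sessionId,
    prompt,
    // Add the key only when an envelope exists, so envelope-less turns
    // produce exactly the same input object as before the change.
    ...(envelope ? { envelope } : {}),
  };
}
```

Because the object for an envelope-less turn is unchanged, a history recorded before the field existed should see the same yielded input on replay.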
Flagged by paw-impl-review for Phase 1. Stray ESC (0x1B) bytes had been inserted into the carrier-shape JSDoc, producing rendered text 'velope (NOT velopeCipher)' and 'unTurn'. Replaced with the correct 'envelope (NOT envelopeCipher)' and 'runTurn'. No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Formalizes the UserContextStore with two maps (parent map + entries),
exposes getUserContextForSession() from pilotswarm-sdk, and wires the
lifecycle so tool handlers can synchronously resolve the active user
for any session (including sub-agents) without leaking token material.
UserContextStore (extended from Phase 1):
- parentMap (sessionId -> {parentSessionId, isSystem}): structural
metadata only, no token material; persists across dehydrate so
descendants can still resolve to the portal-bound ancestor.
- entries (sessionId -> UserContext): plaintext, cleared on dehydrate
AND terminal state; never persisted, never dehydrated.
- bindParent(), hasParentBinding(), lookup() with bounded chain walk
(FR-008/009/021/022 honored: system root -> null, broken chain ->
null, depth cap 32 -> warn + null).
- clear() vs clearParent() with documented semantics.
- Defensive copies on lookup/getRaw so handlers cannot mutate state.
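A minimal sketch of the bounded chain walk described in the list above, under stated assumptions: the function signature, the `UserContextSketch` shape, and the decision to stop at any session flagged `isSystem` are illustrative guesses, not the contents of user-context-store.ts. Only the two-map layout, the depth cap of 32, and the null-on-failure outcomes come from the commit message.

```typescript
// Hypothetical shapes; field names beyond parentSessionId/isSystem are assumptions.
interface ParentBinding {
  parentSessionId: string | null;
  isSystem: boolean;
}

interface UserContextSketch {
  principal: { subject: string };
  accessToken: string | null;
}

const MAX_CHAIN_DEPTH = 32;

function lookup(
  sessionId: string,
  parentMap: Map<string, ParentBinding>,
  entries: Map<string, UserContextSketch>,
): UserContextSketch | null {
  let current = sessionId;
  for (let depth = 0; depth < MAX_CHAIN_DEPTH; depth++) {
    const entry = entries.get(current);
    if (entry) {
      // Defensive copy so callers cannot mutate the stored context.
      return { ...entry, principal: { ...entry.principal } };
    }
    const binding = parentMap.get(current);
    // Unknown session or broken chain, system root, or a root with no entry.
    if (!binding || binding.isSystem || binding.parentSessionId === null) {
      return null;
    }
    current = binding.parentSessionId;
  }
  // The cap also terminates accidental cycles in the parent map.
  console.warn(`user-context lookup exceeded depth cap ${MAX_CHAIN_DEPTH}`);
  return null;
}
```

An evicted intermediate session simply has no entry, so the walk continues to its parent as long as the parent binding is still present.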
SessionManager.getOrCreate now performs a one-shot bounded CMS chain
walk on first-time-on-this-worker session activation, populating
parent-map entries for every ancestor. NO orchestration change — the
plan's original 'thread isSystem through runTurn input' would have
broken replay for existing system sessions; CMS walk avoids that.
worker-registry.ts (new): AsyncLocalStorage-based active worker resolution.
- registerSessionManager / unregisterSessionManager (tied to
PilotSwarmWorker start success / stop finally).
- runWithSessionManager() wraps the runTurn activity body so tool
handlers resolve to the worker hosting them.
- getUserContextForSession() — public, synchronous, importable.
Returns null on no worker, multi-worker ambiguity outside ALS,
system root, broken chain, unknown session.
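The resolution order for the registry might look roughly like the sketch below. It reuses the function names from the list above, but the bodies, the `SketchManager` interface, and the single-worker fallback rule are assumptions reconstructed from the description, not the code in worker-registry.ts.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-in for SessionManager; only a lookup method is modelled.
interface SketchManager {
  lookup(sessionId: string): { subject: string } | null;
}

const activeManager = new AsyncLocalStorage<SketchManager>();
const registered = new Set<SketchManager>();

function registerSessionManager(manager: SketchManager): void {
  registered.add(manager);
}

function unregisterSessionManager(manager: SketchManager): void {
  registered.delete(manager);
}

// Wraps an activity body so lookups inside it resolve to the hosting worker.
function runWithSessionManager<T>(manager: SketchManager, body: () => T): T {
  return activeManager.run(manager, body);
}

function getUserContextForSession(sessionId: string): { subject: string } | null {
  const scoped = activeManager.getStore();
  if (scoped) return scoped.lookup(sessionId);
  // Outside ALS: resolve only when exactly one worker is registered.
  if (registered.size !== 1) return null;
  return Array.from(registered)[0].lookup(sessionId);
}
```

The lookup stays synchronous because `AsyncLocalStorage.getStore()` is itself synchronous, which fits the "public, synchronous, importable" contract stated above.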
Lifecycle hooks added:
- _dehydrateUnlocked clears the user-context entry; parent-map
binding persists.
- destroySession activity clears both entry and parent binding.
- worker.stop() unregisters in finally so a crashing shutdown still
drops the registry slot.
Tests (3 unit files, 36 tests pass; 33 Phase 1 tests still pass):
- user-context-store.test.js (21): FR coverage, chain walk including
intermediate-evicted (Gemini #1), token refresh propagation, cycle
defense, child-becomes-root, clear/clearParent semantics, input
normalization.
- user-context-registry.test.js (12): ALS resolution under multi-worker,
single-worker fallback, ambiguity-returns-null, defensive copy,
cross-worker isolation.
- user-context-dehydration.test.js (3): blob-store.ts / session-store.ts
static guard + JSON.stringify exclusion.
Public API locked for Waldemort:
import { getUserContextForSession } from 'pilotswarm-sdk';
(sessionId: string) => UserContext | null
Live integration (obo-lookup-integration.test.js, obo-replay-expiry.test.js)
deferred to env with GITHUB_TOKEN.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Acquire downstream-scope token (api://<worker-app>/.default + offline_access) at portal sign-in via MSAL; cache {accessToken, accessTokenExpiresAt}; refresh forceRefresh=true within 5 min of expiry. Forward via /api/rpc body's auth field (TLS-only, never headers/WS). Server middleware stamps onto req.auth.principal. PortalRuntime owns its own EnvelopeCrypto via selectEnvelopeCrypto(env) and encrypts the token at envelope-build time so plaintext never lands on the durable queue (FR-020). Encryption failure or absent crypto -> ship principal-only; safe-by-default.
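The near-expiry rule above reduces to a small predicate. In this sketch, `needsRefresh` is a hypothetical helper and `accessTokenExpiresAt` is assumed to be epoch milliseconds; the commit message does not state the unit.

```typescript
const REFRESH_WINDOW_MS = 5 * 60 * 1000; // "within 5 min of expiry"

interface CachedTokenSketch {
  accessToken: string;
  accessTokenExpiresAt: number; // assumption: epoch milliseconds
}

// True when no token is cached or the cached one is inside the refresh window.
function needsRefresh(cached: CachedTokenSketch | null, nowMs: number): boolean {
  if (cached === null) return true;
  return cached.accessTokenExpiresAt - nowMs <= REFRESH_WINDOW_MS;
}
```

A caller would pass `Date.now()` as `nowMs` and request a forced refresh from MSAL when the predicate returns true.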
Spec A-8 misconfiguration handled: MSAL rejection logs metadata-only and admission still succeeds with principal-only envelope.
Public API: selectEnvelopeCrypto + EnvelopeCrypto/UserEnvelope/EnvelopeCipher types now exported from pilotswarm-sdk; re-exported from pilotswarm-cli/portal so portal package can use the factory without a direct sdk dep.
Tests: 12 new unit tests (phase3-runtime-envelope-encrypt + phase3-server-auth-body); all 77 OBO unit tests across Phase 1+2+3 pass; full repo build clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ailable)
Implements the return-side carriers for the Structured tool outcome family (Spec FR-010, SC-005, FR-024) so worker tools can signal machine-distinguishable IdP re-auth requirements and transport-layer service failures back through the SDK to the portal.
- New helpers interactionRequired() / serviceUnavailable() in packages/sdk/src/tool-outcomes.ts; types + marker constant in types.ts; exports threaded through index.ts.
- session-proxy.ts enrichToolCompletionEventData runs in the onEvent callback on every tool.execution_complete event: detects the marker at top level / data.result / data.toolResult, populates data.outcome + data.outcome_payload (sanitized to per-kind allow-list), and strips the raw marker before CMS persistence (FR-020).
- FR-024 auto-emission: persistent envelope-decrypt failure in runTurn records a synthetic system.tool_outcome with reasonCode akv_unwrap_failure; turn continues with principal-only context.
- Shared UI history.js renders the two new outcome kinds with distinct icons/colors (interaction_required -> reauth, service_unavailable -> warn); system.tool_outcome rendered as labeled row.
- Tuner observability: getStructuredOutcomeEvents() + getFleetStructuredOutcomeStats() on PilotSwarmManagementClient; paired inspect-tools read_session_structured_outcomes + read_fleet_structured_outcome_stats inside the !isTuner guard.
- TUI SKILL.md updated with the new rendering rules.
Tests: 3 new unit files / 32 tests pass (helpers, enrichment, stats); 100 OBO unit tests across Phase 1+2+3+4 pass. Full repo build clean. Live integration tests deferred to env with GITHUB_TOKEN per the same constraint that deferred Phase 1/2/3 live smoke.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
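The detect, sanitize, and strip steps described for enrichToolCompletionEventData could be sketched like this. Everything specific here is hypothetical: the marker key `__sketch_outcome_marker`, the allow-listed field names, and the `enrich` function itself are invented for illustration, since the commit message names neither the marker constant nor the allowed fields.

```typescript
const MARKER = "__sketch_outcome_marker"; // hypothetical marker key

type OutcomeKind = "interaction_required" | "service_unavailable";
type Dict = Record<string, unknown>;

// Hypothetical per-kind allow-lists; the real lists are not given in the source.
const ALLOWED_FIELDS: Record<OutcomeKind, readonly string[]> = {
  interaction_required: ["reasonCode", "message"],
  service_unavailable: ["reasonCode", "message"],
};

function isDict(value: unknown): value is Dict {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isOutcomeKind(value: unknown): value is OutcomeKind {
  return value === "interaction_required" || value === "service_unavailable";
}

function enrich(data: Dict): Dict {
  // Probe the three locations named in the commit message, in order.
  for (const holder of [data, data.result, data.toolResult]) {
    if (!isDict(holder)) continue;
    const marker = holder[MARKER];
    if (!isDict(marker) || !isOutcomeKind(marker.kind)) continue;
    const payload: Dict = {};
    for (const field of ALLOWED_FIELDS[marker.kind]) {
      if (field in marker) payload[field] = marker[field];
    }
    delete holder[MARKER]; // strip the raw marker before persistence
    data.outcome = marker.kind;
    data.outcome_payload = payload;
    break;
  }
  return data;
}
```

Copying only allow-listed fields means anything else a tool places on the marker, including token material by mistake, is dropped rather than persisted.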
Ships the reference smoke plugin in examples/obo-smoke/ (Spec FR-018) as the release gate for the User OBO Propagation feature surface. - examples/obo-smoke/index.js: obo_smoke_whoami + obo_smoke_force_reauth with the OBO_SMOKE_WORKER_APP_* env namespace; env read at handler invocation (never at module import) so the module cannot accidentally activate real-OBO when imported into a non-smoke worker. - examples/obo-smoke/README.md: install snippet, mode decision matrix, FR-015 note, no-token-logging guarantee. - examples/obo-smoke/SMOKE_CHECKLIST.md: live-tenant + local-developer variants with token-leak grep step and post-smoke secret cleanup. - packages/sdk/test/local/obo-smoke-plugin-loadable.test.js: 10 tests asserting module imports, tool registration, marker shape on force_reauth, structured mode values on whoami, env-import-time invariant, doc presence. 110 OBO unit tests pass across Phase 1+2+3+4+5; full repo build clean. Live-tenant smoke is the manual maintainer-executed release gate per FR-018 and is documented in SMOKE_CHECKLIST.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Bicep: conditional OBO KEK in base-infra AKV (RSA-2048, 365d rotation), array-shaped oboKekUamiPrincipalIds role-assignment loop (Key Vault Crypto User), oboKekKid output (emits __PS_UNSET__ sentinel when OBO disabled)
- Deploy wiring: OUTPUT_ALIAS, compose-env fallback, overlay-contracts, template.env + overlay .env files for OBO_ENABLED / OBO_KEK_KID / PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE
- Versions: pilotswarm-sdk 0.1.35 -> 0.1.36 (cli + portal already at 0.1.36)
- Docs: new docs/operations/obo-kek-runbook.md, new docs/sdk/user-context.md; updates to configuration.md, builder-agents.md, devops sample README
- Deferred: skills updates and dev-plaintext-mode-warning unit test (behavior already covered in envelope-crypto.ts)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…der template
Addresses paw-impl-review findings on commit c5c48df:
1. Add missing @azure/keyvault-keys dep to packages/sdk (dynamically imported by AkvEnvelopeCrypto but absent from package.json).
2. Add PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE to worker overlay .env and compose-env fallback. Worker's selectEnvelopeCrypto requires the scope to engage the AKV backend; without it, OBO decrypt path stays disabled even when OBO_KEK_KID is wired.
3. Pin EnvelopeCipher.kekKid to the versioned key URL returned by wrapResult.keyID instead of the un-versioned env value. KEK rotation with prior-version retention requires ciphertext to record the exact wrapping version so decrypt can target it.
4. Add lookup + outcome-helper guidance to pilotswarm-sdk-builder SKILL.md (templates that already reference tool-handler patterns).
Tests: 70 OBO unit tests pass, 33 deploy tests pass, SDK typecheck clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ly, FR-011 portal auto-reauth
F1 (must-fix): obo-smoke whoami tool was reading userContext.provider/subject/email/displayName, but the lookup contract returns { principal: { provider, ... }, accessToken, accessTokenExpiresAt }. Without this fix the SC-007 release-gate Graph round-trip would have shipped principal-undefined fields. Loadable test only exercised the null path so the bug slipped past it.
F4 (should-fix): PilotSwarmWorker.stop() left _started=true and _provider!=null when this.runtime.shutdown threw, since the Phase 2 try/finally refactor left those assignments inside try. Move them into finally alongside unregisterSessionManager so a failing shutdown still hard-resets the worker state and registry slot.
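The shape of the F4 fix, resetting state in `finally` rather than inside `try`, can be sketched as below. `SketchWorker` and its fields are simplified stand-ins for PilotSwarmWorker's `_started`, `_provider`, and registry slot; the real class is assumed to be considerably larger.

```typescript
class SketchWorker {
  started = true;
  provider: object | null = {};
  registered = true; // stands in for the worker-registry slot
  private readonly shutdown: () => Promise<void>;

  constructor(shutdown: () => Promise<void>) {
    this.shutdown = shutdown;
  }

  async stop(): Promise<void> {
    try {
      await this.shutdown(); // may throw
    } finally {
      // Runs on success and on failure, so a throwing shutdown
      // still hard-resets the worker state and drops the registry slot.
      this.started = false;
      this.provider = null;
      this.registered = false;
    }
  }
}
```

The error from `shutdown` still propagates to the caller; `finally` only guarantees the cleanup runs first.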
F3 (should-fix): close the FR-011 wire by parameterising the auth provider's getDownstreamToken to accept { interactive } and adding a transport-level subscriber in browser-transport.js that observes interaction_required outcomes (tool.execution_complete and synthetic system.tool_outcome) on incoming session events and fire-and-forgets an interactive downstream-token acquisition. Debounced per session id (~30s) with a global in-flight guard to prevent popup storms; errors swallowed; existing manual sign-out/sign-in path remains. Next worker-bound RPC carries the freshly-acquired token via the existing cache-aware getDownstreamToken path (SC-006).
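The per-session debounce combined with a global in-flight guard described in F3 might be structured as follows. `ReauthGate`, `tryBegin`, and `end` are hypothetical names, and the exact 30-second window is taken from the "~30s" in the text; browser-transport.js may implement this differently.

```typescript
const DEBOUNCE_MS = 30 * 1000; // "~30s" per session id

class ReauthGate {
  private readonly lastAttempt = new Map<string, number>();
  private inFlight = false;

  // Returns true when the caller may start an interactive acquisition.
  tryBegin(sessionId: string, nowMs: number): boolean {
    if (this.inFlight) return false; // global guard: one popup at a time
    const last = this.lastAttempt.get(sessionId);
    if (last !== undefined && nowMs - last < DEBOUNCE_MS) return false;
    this.lastAttempt.set(sessionId, nowMs);
    this.inFlight = true;
    return true;
  }

  // Call when the acquisition settles, whether it succeeded or failed.
  end(): void {
    this.inFlight = false;
  }
}
```

A transport subscriber would call `tryBegin(sessionId, Date.now())` on each interaction_required outcome and invoke `end()` in a `finally` around the token acquisition, swallowing errors as the commit describes.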
F2 / F5 / F6 / F7: F2 was a false positive (all three packages already at 0.1.36). F5 (fail-fast on malformed OBO_KEK_KID) intentionally kept — A-8 covers acquisition failure, not operator misconfig. F6 / F7 are consider-severity, deferred.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
copilot-instructions.md: new `User OBO (User-On-Behalf-Of) Propagation` section codifying the architecture invariants — wire field is `envelope` (not `envelopeCipher`), AKV-wrapped DEK + AES-256-GCM, three crypto backends with lazy AKV imports, worker-side synchronous lookup contract, `accessToken: null` as universal absence signal, structured outcome family with pinned reason codes, portal-side ~5-min near-expiry refresh, single-tenant assumption, FR-014 trust boundary (worker tools must not synthesize principals from CMS owner), Bicep `oboKekUamiPrincipalIds` array contract supporting both single- and dual-UAMI deployments, and the live-tenant smoke gate. pilotswarm-tui/SKILL.md: update line 52 — portal hosts no longer `may add` interactive re-auth affordances; the auto-reauth wire is implemented at the WebSocket transport layer (browser-transport.js) with per-session ~30s debounce and a global in-flight guard. Shared activity rendering remains identical across hosts. pilotswarm-release/SKILL.md: new `OBO Live-Tenant Smoke Gate` section — when a release touches the OBO surface, the examples/obo-smoke/ round-trip becomes a release-gate artifact alongside the standard checklist. Confirms KEK rotation safety (versioned wrapResult.keyID), AKV RBAC, and that PlaintextEnvelopeCrypto never appears in a production envelope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pto) Closes the gap flagged during final-review wrap-up: the loud console.warn fired by `selectEnvelopeCrypto` (envelope-crypto.ts:321) when a deployment selects `PlaintextEnvelopeCrypto` via OBO_ENVELOPE_PLAINTEXT_MODE=1 was previously only validated by the live-tenant smoke gate, post-build. A regression silencing that warning at the factory layer would have shipped to npm before any operator noticed. Adds three new cases to envelope-crypto.test.js: 1. Selecting plaintext backend emits exactly one console.warn whose payload mentions 'envelope-crypto', 'OBO_ENVELOPE_PLAINTEXT_MODE', and 'NOT encrypted'. 2. Selecting the AKV backend emits NO plaintext-mode warning (filters out unrelated warnings just in case). 3. Returning null (OBO disabled, no scope) emits no warnings at all. Pure unit tests, no live worker / no DB. 17 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sanity-checked the deploy code itself end-to-end on this branch — wiring is complete (template.env defaults, base-infra Bicep param + keyvault module + OUTPUT_ALIAS, overlay-contracts whitelisting, compose-env sentinel fallback, worker + 3 portal overlay .env defaults, configMapGenerator/envFrom flow). 33 deploy tests pass. The only gap was the operator-facing skills + agent quick-reference. pilotswarm-new-env-deploy/SKILL.md: T2 inventory line now mentions the conditional OBO KEK; new `User OBO Propagation` block in the Step 2 input-surface table covers OBO_ENABLED + PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE with explanatory text on the oboKekUamiPrincipalIds array contract (single-UAMI vs split-UAMI) and a pointer to docs/operations/obo-kek-runbook.md; Step 6 verification block adds az + kubectl checks for the KEK, the Crypto-User role assignment, and the OBO_KEK_KID ConfigMap projection on both portal and worker. pilotswarm-aks-deploy/SKILL.md: Core Learnings bullet documents that OBO lives on the npm/Bicep path, not the legacy bash path; rolling forward via scripts/deploy-aks.sh leaves the worker in non-OBO mode (FR-002 backwards-compat) until the operator manually wires OBO_KEK_KID + PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE into .env.remote. pilotswarm-npm-deployer.agent.md: service-redeploy quick-reference table gains a row for `Toggle OBO User Context on a stamp`, calling out the base-infra bicep step + manifests,rollout re-render needed after editing OBO_ENABLED in the per-stamp .env. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 7 (FR-025/026/027/028, SC-017/018/019):
- examples/obo-smoke rewritten to use @azure/msal-node CCA with
handler-time backend selection (FIC > client-secret precedence),
per-call FIC clientAssertion re-read.
- New pilotswarm smoke <stamp> CLI driver (packages/cli/src/smoke/)
with injectable deps, --auth {device-code,from-env},
--skip-kube-bootstrap, structured JSON result envelope,
exit-code semantics (0 pass / 1 smoke-failure / 2 preflight).
- New workflow_dispatch GitHub Actions scaffold for live OBO smoke
(.github/workflows/live-smoke-obo.yml) with explicit kubeconfig
bootstrap before az aks get-credentials.
- Worker bootstrap loads obo-smoke plugin BEFORE worker.start() when
OBO_SMOKE_ENABLED=true (avoids tool-registration race with poller).
- Deploy plumbing: OBO_SMOKE_ENABLED propagated through template.env,
worker overlay sentinel, compose-env sentinel-fill, and Dockerfile.
- Ops doc docs/operations/live-smoke.md + cross-link from OBO KEK
runbook.
- Test fixtures: stage-manifests + foundry-substitute now seed
OBO_KEK_KID / PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE / OBO_SMOKE_ENABLED
to keep substituteOverlayEnv fail-closed gate passing.
Final-review must-fixes:
- Finding 1 (FR-024 spec violation): session-proxy.ts envelope
decrypt now uses bounded transient-retry-with-backoff (3 attempts,
500ms/2s/5s) before falling through to the structured
service_unavailable outcome. Previously every decrypt failure
immediately became service_unavailable with no retry, contradicting
the spec's "transient retry, then structured outcome" requirement.
- Finding 2 (FR-011 / SC-006 live-path bug): browser-transport
maybeTriggerInteractiveReauth() now reads
sessionEvent.eventType || sessionEvent.type. Previously it only
read .type, silently missing every interaction_required event
delivered over the live websocket (canonical SDK shape uses
.eventType). New regression test
packages/sdk/test/local/portal-interactive-reauth.test.js (6 tests)
pins both shapes plus debounce and outcome-filtering behavior.
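For Finding 1, a bounded retry-with-backoff wrapper could look like the sketch below. `withRetry` is a hypothetical helper, and two details are assumptions: the source lists three delays for three attempts without saying how they pair up, so this version waits only between attempts (using 500 ms and 2 s, leaving 5 s unused), and it treats every error as transient.

```typescript
const BACKOFF_MS = [500, 2000, 5000]; // delays quoted in the commit message

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the operation's result, or null once all attempts have failed.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts: number = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T | null> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch {
      // Assumption: wait only if another attempt follows.
      if (attempt < attempts - 1) {
        await sleep(BACKOFF_MS[attempt] ?? 0);
      }
    }
  }
  // Caller falls through to the structured service_unavailable outcome.
  return null;
}
```

Injecting `sleep` keeps the helper testable without real timers; a production caller would also want to distinguish transient from permanent failures before retrying.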
Test status:
- packages/sdk/test/local/portal-interactive-reauth.test.js: 6/6
- packages/sdk/test/local/obo-smoke-driver.test.js: 8/8
- packages/sdk/test/local/obo-smoke-auth-backend.test.js: 10/10
- packages/sdk/test/local/obo-smoke-plugin-loadable.test.js: 10/10
- packages/sdk/test/local/tool-outcomes-*.test.js + structured-outcomes-stats: clean
- deploy-scripts npm script: 197/197
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…K URL doc consistency
Finding 4: Add INTERACTION_REQUIRED_REASON_CODES set and InteractionRequiredReasonCode union; interactionRequired() now rejects reason codes outside the pinned taxonomy (reauth_required, mfa_refresh, conditional_access, consent_required) at helper-call time, preventing downstream consumers from fragmenting the portal's behavior contract.
Finding 6: Update .env.example to show un-versioned OBO_KEK_KID with explanatory comment, aligning with docs/configuration.md and docs/operations/obo-kek-runbook.md (which were already correct).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r overlay + agent/skill docs
Closes the gap where flipping OBO_SMOKE_ENABLED=true on a stamp would register the obo_smoke_* tools but the per-stamp downstream-app identity was not projected into the worker ConfigMap, leaving the smoke plugin's auth backend to fast-fail with serviceUnavailable({ reasonCode: 'smoke_misconfigured' }).
- deploy/envs/template.env: add OBO_SMOKE_WORKER_APP_TENANT_ID/_CLIENT_ID/_GRAPH_SCOPE + OBO_SMOKE_TEST_USER_UPN with empty defaults and explanatory comments. AKS path is workload-identity FIC; CLIENT_SECRET is local-dev only.
- deploy/gitops/worker/overlays/default/.env: project the four new keys with __PS_UNSET__ sentinel defaults so substitute-env passes on non-smoke stamps.
- deploy/scripts/lib/compose-env.mjs: extend the OBO_SMOKE_ENABLED fallback loop to cover the new keys.
- deploy/scripts/test/foundry-substitute.test.mjs: add the new keys to both fixture envs to keep substitute-env happy.
- .github/agents/pilotswarm-npm-deployer.agent.md: new task row for OBO_SMOKE_ENABLED toggle workflow.
- .github/skills/pilotswarm-new-env-deploy/SKILL.md: add the new env keys to the per-stamp .env table, an explanatory paragraph mirroring the OBO Phase 6 paragraph, and post-deploy verification kubectl/pilotswarm-smoke snippet.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lumbing notes - CHANGELOG.md: clarify that interactionRequired reason codes are pinned (helper throws on unknown values), and document the Phase 7 deploy-pipeline plumbing (template.env / compose-env / worker overlay / npm-deployer agent). - docs/operations/live-smoke.md: add a paragraph after the per-stamp env table explaining how compose-env + worker overlay project the OBO_SMOKE_WORKER_APP_* keys into the worker ConfigMap, and call out the production-stamp safety invariant. - docs/sdk/user-context.md: clarify reason-code enforcement, mention the new exported INTERACTION_REQUIRED_REASON_CODES set + InteractionRequiredReasonCode union. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add env-file schema rows for OBO_KEK_KID (sourced from base-infra bicep output via the FR-022 alias map) and the optional OBO_SMOKE_* worker-overlay block (default off, AKS uses workload-identity FIC, never enable on production stamps). - Add cross-reference links to docs/operations/obo-kek-runbook.md and docs/operations/live-smoke.md so operators discover the runbooks from the canonical npm-deploy README. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds Setup-OboSmokeWorkerApp.ps1 (sidecar-only, never edits .env), new pilotswarm-obo-smoke-app-reg skill, npm-deployer Step 0.b wiring + tightened POSIX-ERE grep gate, and docs sync (live-smoke, obo-kek-runbook, auth/README, SMOKE_CHECKLIST, CHANGELOG, new-env-deploy skill). Closes the last manual gap in the Phase 7 live-smoke harness -- OBO_SMOKE_ENABLED=true is now a true one-line opt-in. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dReasonCode to single source of truth
Two findings from paw-final-review on the user-obo-propagation work:
1. .github/copilot-instructions.md:179 still referenced the early-draft name getCurrentUserContextForSession; the locked, shipped API is getUserContextForSession (verified across packages/sdk/src/index.ts, docs/sdk/user-context.md, docs/configuration.md, examples/obo-smoke/index.js, CHANGELOG.md). Self-inflicted drift in this PR. Fix renames the contributor doc to the correct symbol, since downstream consumers (microsoft/waldemort) read PilotSwarm's contributor docs as the source-of-truth handshake.
2. packages/sdk/src/types.ts declared InteractionRequiredReasonCode (union type) and INTERACTION_REQUIRED_REASON_CODES (ReadonlySet) as independent literals. They agreed today but were vulnerable to future drift: adding a code to one but not the other would silently break the contract (the helper's runtime set.has() would reject codes the type system accepts, or vice versa). Refactored to derive both from a single private as-const tuple. Public export shape (ReadonlySet) preserved; no docs/CHANGELOG updates needed, no breaking change to downstream consumers.
Tests: tool-outcomes-helpers + tool-outcomes-enrichment + structured-outcomes-stats all green (33/33). Build clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
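The single-source-of-truth refactor in the second finding follows a common TypeScript pattern, sketched here. The four reason codes are the ones pinned earlier in this PR; the tuple name `REASON_CODES`, the `isReasonCode` guard, and the `interactionRequiredSketch` helper with its return shape are hypothetical and are not the shipped types.ts or tool-outcomes.ts source.

```typescript
// One private tuple; both the type and the runtime set derive from it.
const REASON_CODES = [
  "reauth_required",
  "mfa_refresh",
  "conditional_access",
  "consent_required",
] as const;

type InteractionRequiredReasonCode = (typeof REASON_CODES)[number];

const INTERACTION_REQUIRED_REASON_CODES: ReadonlySet<string> = new Set<string>(REASON_CODES);

function isReasonCode(value: string): value is InteractionRequiredReasonCode {
  return INTERACTION_REQUIRED_REASON_CODES.has(value);
}

// Hypothetical helper shape: rejects unknown codes at call time.
function interactionRequiredSketch(reasonCode: string): {
  kind: "interaction_required";
  reasonCode: InteractionRequiredReasonCode;
} {
  if (!isReasonCode(reasonCode)) {
    throw new TypeError(`Unknown interaction_required reason code: ${reasonCode}`);
  }
  return { kind: "interaction_required", reasonCode };
}
```

Adding a fifth code to the tuple updates the union and the set together, which is the drift the finding was guarding against.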
User feedback: comments and docs referring to internal PAW phases (e.g., 'Phase 6', 'Phase 7 — FR-026') don't mean anything to future readers or downstream consumers, who have no context on the planning artifacts where those phases were defined.
Rewrites all PAW-phase references in shipped surfaces to one of: the underlying feature name ('User OBO'), the relevant spec FR (e.g., 'FR-011', 'FR-026'), or nothing when the label was the only content. Two test files renamed: phase3-runtime-envelope-encrypt → obo-runtime-envelope-encrypt and phase3-server-auth-body → obo-server-auth-body.
Left intentionally unchanged:
- main.bicep 'Phase 4' references — enterprise deploy roadmap, not PAW
- deploy/scripts/README.md Foundry-Entra 'Phase 2/3' note — proposal phases tracked in docs/proposals/, not PAW
- docs/configuration.md + CHANGELOG.md '(Phase 1)' on Authorization engine — pre-existing portal-authz roadmap, not OBO
- session-manager.ts dehydrate() 'Phase 1: Destroy / Phase 2: Persist' — algorithm-internal step labels, not PAW
Build clean. 166 tests pass (142 OBO unit + 18 deploy invariants + 6 portal-reauth).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Persist User OBO Propagation spec to docs/specs/user-obo-propagation.md so FR-XXX / SC-XXX citations across source and docs resolve to a real document in the repo. - Add spec pointers from live-smoke.md and the new-env deploy skill. - Replace internal-product/cluster identifiers across docs, skills, proposals, fixtures, builder templates, and source comments with neutral placeholders (downstream consumer / ExampleApp / <aks-cluster> / <resource-group>) so the OSS surface stays product-neutral. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shipped .github/workflows/live-smoke-obo.yml could not actually run: it loaded deploy/envs/local/<stamp>/.env (gitignored, not present on the branch) and required AZURE_* OIDC secrets that this repo does not currently provision. Keeping a workflow that can't run was a foot-gun.
Removed:
- .github/workflows/live-smoke-obo.yml
- deploy/scripts/test/live-smoke-workflow.test.mjs (its actionlint-shape test)
- SC-019 (asserted the workflow parsed) from docs/specs/user-obo-propagation.md
- P7 acceptance-scenario #4 (workflow_dispatch run)
- 'Workflow trigger surface stays narrow' invariant from live-smoke.md
- 'Workflow scaffold' section from live-smoke.md (replaced with brief 'CI workflow (future work)' note explaining the gating prerequisites for adding one later)
Reworded FR-028 to 'deferred — future work' explicitly calling out the gitignored-env and no-CI-subscription prerequisites. The CLI driver ('pilotswarm smoke <stamp> --profile obo') remains the supported local-operator path; adding a workflow_dispatch-only workflow later when operators have a CI environment is still a one-file addition.
Updated CHANGELOG, npm-deployer agent, new-env-deploy skill, package.json deploy-scripts test list to match. 205/205 deploy-script tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SMOKE_CHECKLIST.md collapsed 220→75 lines. The old Steps 1–8 (manual
AAD app registration, manual portal/worker .env edits, manual tool
invocation, manual CMS-event inspection, manual token-leak grep) are
all superseded by the shipped automation:
- Setup-OboSmokeWorkerApp.ps1 provisions the worker app + FIC
- Setup-PortalAuth.ps1 provisions the portal app
- deploy.mjs projects the env into the worker ConfigMap
- pilotswarm smoke <stamp> --profile obo drives the tools and
emits the JSON pass/fail record
Checklist is now a release-gate sign-off form pointing at
docs/operations/live-smoke.md for operational detail. Kept:
pre-flight (unit tests + build), AKS-deployed flow (promoted to the
canonical path), local-developer variant, sign-off, after-smoke
cleanup.
README.md: dropped the 'Backend auto-selection' note that restated
the env-presence table directly above it (87→78 lines).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extend the worker plugin contract with a `tools` field so plugins can
declaratively register their tools at worker.start(), and migrate the
OBO smoke harness to a first-class workspace package that loads via
the new contract.
Plugin contract (PilotSwarmWorker):
- Add `tools` field to PluginManifest (plugin.json) alongside
forward-compat `agents` / `skills` fields.
- registerTools(tools, contributor?) is now atomic — pre-validates the
whole batch for collisions before mutating the registry. Tags every
registration with a contributor label (`worker-builtin`,
`app-inline`, or the plugin name) so collision errors name the
source on both sides.
- New ToolNameCollisionError (exported) with contributor-aware message.
- `tools` field is app-tier only; warns + ignores on system/management
tier.
- Missing pluginDirs entries now hard-fail at construction (was
warn-and-skip) — operator misconfig should not silently no-op.
- New _registerPluginTools() runs at worker.start() before duroxide init,
with wrapped errors for all five failure modes (missing module, import
failure, no export, sync throw, async reject).
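The atomic, contributor-tagged registration described in the plugin contract above can be sketched as a two-pass operation: validate the whole batch, then mutate. `SketchRegistry` and `SketchTool` are hypothetical, the `app-inline` default is an assumption, and the error class here only mirrors the exported ToolNameCollisionError by name; the real message format is not given in the source.

```typescript
class ToolNameCollisionError extends Error {
  readonly toolName: string;

  constructor(toolName: string, existing: string, incoming: string) {
    super(`Tool "${toolName}" from "${incoming}" collides with a registration from "${existing}"`);
    this.name = "ToolNameCollisionError";
    this.toolName = toolName;
  }
}

interface SketchTool {
  name: string;
}

class SketchRegistry {
  private readonly tools = new Map<string, { tool: SketchTool; contributor: string }>();

  // Assumption: inline app tools default to the "app-inline" label.
  registerTools(batch: SketchTool[], contributor: string = "app-inline"): void {
    // Pass 1: validate the whole batch without touching the registry.
    const seen = new Set<string>();
    for (const tool of batch) {
      const existing = this.tools.get(tool.name);
      if (existing) {
        throw new ToolNameCollisionError(tool.name, existing.contributor, contributor);
      }
      if (seen.has(tool.name)) {
        throw new ToolNameCollisionError(tool.name, contributor, contributor);
      }
      seen.add(tool.name);
    }
    // Pass 2: commit. Nothing above threw, so the batch lands as a unit.
    for (const tool of batch) {
      this.tools.set(tool.name, { tool, contributor });
    }
  }

  contributorOf(name: string): string | undefined {
    return this.tools.get(name)?.contributor;
  }
}
```

Because validation finishes before any `set` call, a collision on the last tool of a batch leaves none of the batch's earlier tools registered.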
OBO smoke plugin (packages/obo-smoke-plugin/):
- Moved from examples/obo-smoke/ to a workspace package picked up by
the existing packages/* glob.
- plugin.json declares tools: `./tools.js`; loads end-to-end via
PilotSwarmWorker({ pluginDirs: [...] }).
- tools.js exports the new registerTools(worker) contract entry point
while preserving all legacy exports.
- README rewritten to teach the PLUGIN_DIRS / pluginDirs contract and
cross-reference Setup-OboSmokeWorkerApp.ps1.
- Setup-OboSmokeWorkerApp.ps1 paste-block now emits
PLUGIN_DIRS=/app/packages/obo-smoke-plugin alongside the smoke AAD
app env keys (4 lines -> 5).
- packages/sdk/examples/worker.js: removed OBO_SMOKE_ENABLED conditional
dynamic-import block — smoke is loaded via the plugin contract now.
- deploy/Dockerfile.worker: COPY updated to packages/obo-smoke-plugin
(multi-stage refactor to follow in a later change).
Tests:
- New plugin-tools-contract.test.js (17 tests) covers atomic batch
registration, collision diagnostics, all five plugin-load failure
modes, app-tier-only enforcement, and worker-builtin auto-registration.
- obo-smoke-plugin-loadable.test.js rewritten to load the smoke plugin
end-to-end through the real plugin contract and assert tools land on
worker.toolRegistry tagged with contributor `obo-smoke`.
- obo-smoke-auth-backend.test.js: import path updated.
- deploy-scripts test: paste-block bumped to 5 lines + PLUGIN_DIRS regex
assertion.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the OBO smoke harness fully opt-in by removing its env keys from
the default deploy surface and giving operators a clearly-scoped
opt-in template instead.
Default surface (smoke-free):
- deploy/envs/template.env: dropped the 5-key OBO_SMOKE_* block
(TENANT_ID, CLIENT_ID, GRAPH_SCOPE, TEST_USER_UPN, OBO_SMOKE_ENABLED)
and its surrounding documentation block.
- deploy/gitops/worker/overlays/default/.env: dropped the matching
OBO_SMOKE_* sentinel block; default-deploy worker ConfigMap no
longer references smoke keys.
- deploy/scripts/lib/compose-env.mjs: removed the OBO_SMOKE_*
sentinel-fallback loop; compose-env no longer injects smoke keys
on any deploy. Core OBO sentinels (OBO_KEK_KID,
PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE) remain — those are the default
OBO surface, not the smoke surface.
Opt-in path:
- deploy/envs/template.smoke.env (new): contains the 5 OBO_SMOKE_*
keys + PLUGIN_DIRS=/app/packages/obo-smoke-plugin with explicit
opt-in documentation. Consumed only by operators running the OBO
smoke harness against a dedicated smoke stamp; not loaded by
default deploys.
Tests:
- deploy/scripts/test/foundry-substitute.test.mjs: dropped the now-
unneeded OBO_SMOKE_* __PS_UNSET__ placeholders from the two
stage-manifests test inputs.
- deploy/scripts/test/compose-env.test.mjs: added 3 invariant tests:
* compose-env never injects OBO_SMOKE_* keys into a default env
* OBO_SMOKE_* keys provided by the operator pass through untouched
* no file in deploy/scripts/lib/ contains an OBO_SMOKE_ string
literal (directory-walk invariant — guards against regression)
Verification: 209/209 deploy-script tests pass (was 206 + 3 new);
lint clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split Dockerfile.worker into a shared `base` stage plus two runtime targets:
- `runtime` (LAST stage, the implicit default for bare docker build): no obo-smoke-plugin directory in the image. Production-safe by default.
- `runtime-smoke`: adds COPY packages/obo-smoke-plugin. Callers must opt in explicitly via --target runtime-smoke.
build-image.mjs gains a `variant` parameter (default | smoke). Smoke variant is worker-only, appends -smoke to the image tag (no registry collisions), and passes --target runtime-smoke to buildx. Existing callers (deploy.mjs, deploy-aks.sh, reset-local.sh, docker-local-start.sh) use the default and continue to produce smoke-free images unchanged.
Two new static tests enforce the convention since live docker builds aren't part of unit-test CI:
- dockerfile-worker.test.mjs (5 cases): both stages exist, `runtime` is last, `runtime` doesn't COPY the smoke plugin, `runtime-smoke` places it at the canonical PLUGIN_DIRS path, `runtime` doesn't inherit from `runtime-smoke`.
- build-call-sites.test.mjs: walks scripts/, deploy/scripts/, .github/workflows/, .github/skills/ for docker build invocations against Dockerfile.worker and asserts every --target is empty, `runtime`, or `runtime-smoke`. Asserts invocation count > 0 to prevent silent no-op.
Defense in depth: even if an operator sets PLUGIN_DIRS to the smoke plugin path against a default image, the plugin loader hard-fails at startup because the directory isn't present. Smoke cannot accidentally activate against production.
Tests: 215/215 deploy-script tests pass (was 209, +5 dockerfile-worker, +1 build-call-sites). Lint clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
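A static check in the spirit of the dockerfile-worker test described above might look like the sketch below. It is an illustration only: the helper names, the simplified `FROM ... AS ...` parsing, and the sample Dockerfile (including its base image) are assumptions, not the contents of dockerfile-worker.test.mjs or of the real Dockerfile.worker.

```typescript
interface Stage {
  name: string;
  base: string;
  body: string[];
}

// Split a Dockerfile into named stages using its `FROM <base> AS <name>` lines.
// Simplification: FROM lines carrying flags such as --platform are not handled.
function parseStages(dockerfile: string): Stage[] {
  const stages: Stage[] = [];
  for (const line of dockerfile.split(/\r?\n/)) {
    const match = /^\s*FROM\s+(\S+)\s+AS\s+(\S+)\s*$/i.exec(line);
    if (match) {
      stages.push({ base: match[1] ?? "", name: match[2] ?? "", body: [] });
    } else if (stages.length > 0) {
      stages[stages.length - 1]!.body.push(line);
    }
  }
  return stages;
}

// Return a list of violated invariants; an empty list means the file passes.
function checkWorkerDockerfile(dockerfile: string): string[] {
  const stages = parseStages(dockerfile);
  const problems: string[] = [];
  const runtime = stages.find((s) => s.name === "runtime");
  const smoke = stages.find((s) => s.name === "runtime-smoke");
  if (!runtime) problems.push("missing stage: runtime");
  if (!smoke) problems.push("missing stage: runtime-smoke");
  if (stages[stages.length - 1]?.name !== "runtime") {
    problems.push("runtime must be the last stage");
  }
  if (runtime?.base === "runtime-smoke") {
    problems.push("runtime must not inherit from runtime-smoke");
  }
  if (runtime?.body.some((l) => /^\s*COPY\b.*obo-smoke-plugin/.test(l))) {
    problems.push("runtime must not COPY the smoke plugin");
  }
  return problems;
}

// Made-up sample that satisfies the convention.
const sampleDockerfile = [
  "FROM node:20-slim AS base",
  "COPY packages/sdk /app/packages/sdk",
  "FROM base AS runtime-smoke",
  "COPY packages/obo-smoke-plugin /app/packages/obo-smoke-plugin",
  "FROM base AS runtime",
  'CMD ["node", "worker.js"]',
].join("\n");

console.log(checkWorkerDockerfile(sampleDockerfile));
```

Keeping `runtime` last matters because a bare docker build with no --target builds the final stage, so stage order alone decides what the default image contains.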
…t-in flow
Sweep all operator-facing and plugin-author-facing docs, skills, and agent prompts to match the compartmentalized OBO smoke flow:
- New 'In-process tool plugins' section in docs/plugin-architecture-guide.md describing the plugin.json `tools` field, registerTools(worker) export, atomic registration, and fail-closed loader behavior. Reference example points at packages/obo-smoke-plugin/.
- docs/operations/live-smoke.md rewritten: smoke is opt-in via three pieces (--variant smoke worker image build, deploy/envs/template.smoke.env overlay composed into the per-stamp .env, and PLUGIN_DIRS containing /app/packages/obo-smoke-plugin). OBO_SMOKE_ENABLED is no longer a worker boot gate; it's a stamp marker the smoke driver preflight checks.
- docs/specs/user-obo-propagation.md FR-025/026/027 reworded to reference the new contract path.
- Path renames everywhere: examples/obo-smoke/ → packages/obo-smoke-plugin/.
- Schema-doc surfaces (configuration, system-reference, getting-started, building-apps for cli + sdk) gain a brief `tools` field mention next to the existing `agents`/`skills` mentions, with a forward link to the plugin architecture guide.
- Skills updated: pilotswarm-new-env-deploy, pilotswarm-obo-smoke-app-reg, pilotswarm-aks-deploy, pilotswarm-release. Agent updated: pilotswarm-npm-deployer.
- Setup script READMEs updated: deploy/scripts/README.md, deploy/scripts/auth/README.md (Setup-OboSmokeWorkerApp.ps1 emitted env block now includes PLUGIN_DIRS=/app/packages/obo-smoke-plugin).
- CHANGELOG.md gains an entry calling out the operator-visible posture change; historical entry path references corrected.
- packages/obo-smoke-plugin/README.md and SMOKE_CHECKLIST.md path/version refreshed; README adds an opt-in section noting the smoke image variant.
- .github/copilot-instructions.md path reference updated.
Verified: zero `examples/obo-smoke` matches in shipped surfaces, zero internal/consumer-name leaks, no spec-only references (FR-XXX/SC-XXX/'Phase N') outside docs/specs/user-obo-propagation.md itself. tsc lint clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Promote the PluginManifest interface from internal-only to a public SDK type. Plugin authors writing in TypeScript can now import the typed manifest shape from pilotswarm-sdk to validate their plugin.json contents at compile time:
import type { PluginManifest } from "pilotswarm-sdk";
const manifest: PluginManifest = { name, version, tools: "./tools.js" };
Changes:
- packages/sdk/src/types.ts: drop the @internal JSDoc tag on PluginManifest and add a public-facing JSDoc with an @example block. Field shape is unchanged so the existing internal user (worker.ts loader) is unaffected.
- packages/sdk/src/index.ts: re-export the type next to the existing defineTool re-export, with a comment pointing authors at the plugin architecture guide.
- docs/plugin-architecture-guide.md (§7 In-process Tool Plugins): add a TypeScript snippet showing PluginManifest usage. Also fix a pre-existing doc bug: the example used object form (`tools: { module: "./index.js" }`), which the loader rejects; the loader only accepts string form (`tools: "./tools.js"`), now corrected to match worker.ts behavior and the in-repo reference plugin.
- packages/sdk/test/local/plugin-manifest-type.test.js (new, 2 tests): verifies the public re-export from index.ts and validates the runtime shape of every checked-in plugin.json under packages/ and examples/ against the typed contract. Anti-no-op assertion guards against silently no-op'ing if the discovery walk breaks.
Tests: 31/31 pass on the targeted plugin-contract test suites. tsc --noEmit clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…js comment Final-review nit: a comment in the smoke plugin handler referenced internal PAW phase numbering that's not meaningful to users or future agent sessions reading the shipped source. Reword to describe the structured-outcome family without the numeric tag. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The new-env-deploy skill, npm-deployer agent, live-smoke.md, and template.smoke.env all implied the OBO live smoke required dedicated test-user tokens. In practice the smoke driver's default --auth device-code mode does an interactive Entra sign-in: the operator signs in as themselves and that token is what flows through the OBO chain. The --auth from-env mode (with OBO_SMOKE_USER_*_TOKEN env vars) is a CI-only fallback, not the default. Reframe consistently: - new-env-deploy SKILL: drop `dedicated test-user tokens` framing; describe device-code default and OBO_SMOKE_TEST_USER_UPN as an optional UPN-assertion knob - live-smoke.md: rename `Test user` section to `Sign-in user`; lead with `you sign in as yourself`; call out dedicated test users as optional/situational - template.smoke.env: reword TEST_USER_UPN comment as an optional canary (empty = accept whichever user signs in) - pilotswarm-npm-deployer agent: correct the service-redeploy table row; bump version 1.1.0 -> 1.1.1 (patch, wording correction) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PilotSwarm itself does not target any specific downstream resource; consumer apps that build on PilotSwarm decide which Entra-protected resource to call. Reframe copilot-instructions.md and the OBO spec accordingly: - copilot-instructions.md: replace `ADO is the first consumer` with `PilotSwarm itself does not call any specific downstream resource; consumer apps that build on PilotSwarm do.` Examples list Microsoft Graph, Azure DevOps, etc. neutrally. - specs/user-obo-propagation.md: drop the `Azure DevOps is the first anticipated consumer` sentence and the parenthetical `(covered in the consumer spec for ADO)`. Remaining ADO mentions are illustrative (alongside Microsoft Graph), out-of-scope clarifications, or pre-existing unrelated docs (bug reports, child-contract proposal). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two cleanups bundled: 1. Restore pre-existing internal-product references that an earlier commit (df01c88) over-eagerly neutralized. The intent of this PR was to avoid *adding* new references in OBO content, not to scrub historical references repo-wide. Files restored to origin/main wording where they only differed by neutralization: bug reports, proposals, builder templates, a portal-builder agent/skill, a session-refresh test, and three lines in the aks-deploy skill (cluster context, resource group, downstream-deployment guard). 2. Genericize local test-env identifiers introduced by this PR. live-smoke.md JSON examples and the OBO spec narrative used a real stamp name; Setup-OboSmokeWorkerApp.ps1 .EXAMPLE blocks used real local paths. Reframed as <stamp> / <env-name> placeholders. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split Setup-OboSmokeWorkerApp.ps1 into two phases so nothing has to wait for bicep: - `-Mode app-shell` (no OIDC dependency): creates the app, mints OAuth2 scope, declares Graph User.Read, pre-authorizes the portal app, emits .env paste block. Runs alongside portal app-reg, before bicep. - `-Mode patch-fic` (requires bicep outputs): looks up existing app, create-or-patches the AKS workload-identity FIC against the emitted OIDC issuer. No .env changes. - `-Mode all` (default; back-compat): both phases in one shot, requires bicep to have already run. Mirrors how Setup-PortalAuth.ps1 patches SPA redirect URIs after AFD is known. Sidecar is phase-aware: app-shell writes ficIssuer=null, patch-fic merges it in. Doc surfaces updated: pilotswarm-obo-smoke-app-reg/SKILL.md, pilotswarm-new-env-deploy/SKILL.md, pilotswarm-npm-deployer.agent.md (bump 1.1.1 -> 1.2.0), deploy/scripts/auth/README.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AKS workload-identity FIC lives on the Entra app, not in k8s. The worker pod boots fine without it; AAD only consults the FIC at OBO-exchange time. So patch-fic can simply run at the end of the deploy pipeline (after bicep + manifests + rollout), right before 'pilotswarm smoke', with no pod restart required. Updated sequencing wording in pilotswarm-obo-smoke-app-reg/SKILL.md, pilotswarm-new-env-deploy/SKILL.md, pilotswarm-npm-deployer.agent.md, and the script's own stdout hints. The mid-deploy sandwich is gone. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e skill The OBO smoke worker app skill is the auto-provisioning path; the npm-deployer agent should drive it hands-free. Telling the agent to skip when the operator pre-pasted .env values created a manual escape hatch and risked stamps where the .env points at an app that still lacks the FIC / scope / pre-auth. Reframed both surfaces: the wrapper is idempotent (re-runs are no-ops). To point at a manually-managed app, use -ExistingAppId rather than skipping the wrapper. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ional The prior docs implied tenant admin consent was required and per-user consent was an edge case. Reality is the opposite for most tenants (including Microsoft's): each user can consent to Graph User.Read on the worker SP at portal sign-in, and OBO works for that user thereafter. Admin consent is just a shortcut that skips the per-user prompt for shared stamps. Reframed consent text across: pilotswarm-obo-smoke-app-reg/SKILL.md (Consent section + AADSTS65001 troubleshooting row + Mode help), pilotswarm-npm-deployer.agent.md, docs/operations/live-smoke.md, deploy/scripts/auth/README.md, and Setup-OboSmokeWorkerApp.ps1 (param help, failure warning, and final stdout note). Also clarified that Cloud Application Administrator role can grant consent for a single app without being Global Admin, since not every team has GA delegation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep keyvault.bicep focused on the vault + secret-tier RBAC. Move OBO Key Encryption Key creation, the Key Vault Crypto User role-assignment loop, and the OBO KEK output into a new single-responsibility module obo-kek.bicep. main.bicep instantiates it conditionally on oboEnabled and routes the existing oboKekKid output through it (preserving the __PS_UNSET__ sentinel for non-OBO stamps so the overlay .env substitution stays satisfied). Behavior is unchanged: same role definition (Key Vault Crypto User), same role-assignment GUID (guid(keyVault.id, principalId, kvCryptoUserRoleId, 'obo-kek')), same key shape (RSA-2048, wrapKey/unwrapKey, 365-day rotation, 30-day notify, 730-day expiry), same OBO_KEK_KID output URL. No deployment-name change for the AKV module; the new module deploys as <prefix>-obo-kek-<dTime>. Bicep build clean: exit 0, no new warnings (pre-existing Rotate/Notify casing + postgres password warnings unchanged). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Setup-OboSmokeWorkerApp.ps1 -Mode patch-fic now writes an eSTS-issuer
FIC with the worker pod's UAMI as subject, instead of an AKS-direct
FIC. Microsoft CORP tenant rejects AKS-direct FIC on 3P apps despite
the published aka.ms/PTMFICWiki doc. MSI-as-FIC is the supported
CORP path and is also the SFI-aligned pattern for non-CORP tenants,
so it's the new default. -FicPattern aks-direct fallback retained
for tenants that allow it.
- Split combined Graph PATCH (define-scope + pre-authorize) into two
PATCHes so a fresh worker app can be provisioned in one wrapper run
(Graph rejects the combined call because the new scope id isn't yet
persisted when pre-auth references it).
- Centralize --variant {default|smoke} tag suffixing in
effectiveImageTag() (lib/service-info.mjs) so the build, push, and
manifest steps all agree on the smoke-tagged image.
- Smoke plugin auth-backend selector switches FIC detection from
AZURE_FEDERATED_TOKEN_FILE to WORKLOAD_IDENTITY_CLIENT_ID; MSAL
clientAssertion callback now resolves a fresh UAMI access token via
ManagedIdentityCredential against api://AzureADTokenExchange/.default
on every invocation.
- Update spec FR-025 / FR-026 / SC-018, smoke checklist, deploy and
auth READMEs, new-env / obo-smoke-app-reg skills, npm-deployer agent
(1.3.0), and live-smoke ops doc to describe the MSI-as-FIC topology.
Live-validated end-to-end on a fresh chkrawps10 stamp in CORP tenant.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
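The centralized `--variant {default|smoke}` tag suffixing mentioned in the commit above could be sketched like this. The function name `effectiveImageTag` comes from the commit message; its signature, the `ImageVariant` type, and the `buildxTargetArgs` helper are assumptions made for illustration.

```typescript
type ImageVariant = "default" | "smoke";

// Assumed signature: derive the tag that build, push, and manifest steps share.
function effectiveImageTag(baseTag: string, variant: ImageVariant = "default"): string {
  return variant === "smoke" ? `${baseTag}-smoke` : baseTag;
}

// Hypothetical companion helper: extra buildx arguments for the chosen variant.
function buildxTargetArgs(variant: ImageVariant = "default"): string[] {
  return variant === "smoke" ? ["--target", "runtime-smoke"] : [];
}

console.log(effectiveImageTag("0.2.0", "smoke"), buildxTargetArgs("smoke"));
```

Routing every step through one function is the point of the change: if build, push, and manifest generation each appended the suffix themselves, a single missed call site would deploy a tag that was never pushed.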
… FIC limitation docs, deploy ergonomics Live-tenant validation on chkrawps10 (CORP) proved the User OBO Propagation contract end-to-end: portal MSAL downstream-scope acquisition, envelope encryption + per-RPC forwarding, getCurrentUserContextForSession lookup, and the structured tool-outcome contract. This commit captures the fixes discovered during that validation pass. Tool reachability (native overlay pattern) - packages/obo-smoke-plugin/agents/default.agent.md: declares the smoke tools in the agent overlay. Picked up by the plugin loader via _appDefaultToolNames and auto-inherited by every chat session on --variant smoke worker builds (canonical session-manager.ts visibility pipeline; pre-existing, not introduced here). - packages/sdk/src/worker.ts: _warnOrphanPluginTools() warns at startup when a plugin registers a handler whose tool name is not claimed by any agent overlay. Catches the "registered but invisible" gap. - packages/sdk/test/local/plugin-tools-contract.test.js + new plugin-with-claimed-tools fixture: regression coverage for the warn-on-orphan and silent-when-claimed paths (19/19 tests pass). - docs/plugin-architecture-guide.md, packages/obo-smoke-plugin/README.md: document the two-half plugin contract — handler registration AND name declaration in an overlay. OBO + FIC limitation documentation - docs/operations/obo-fic-limitations.md: new operator runbook capturing that MSI-as-FIC works for direct downstream resource access from a workload UAMI, but NOT as client_assertion for an OBO grant when the source UAMI is itself federated via AKS workload identity. AAD rejects FIC-derived tokens (xms_ficinfo claim) on the next federation with AADSTS700231, independent of FIC config correctness. FIC audience must be the URI form (api://AzureADTokenExchange); the GUID form yields AADSTS700214. Includes diagnostic recipe and unblock options (client secret, AKS-direct FIC where tenant policy allows, certificate). 
This is an AAD policy, not a PR #51 concern — the propagation contract is agnostic to client-credential shape. - .github/skills/pilotswarm-obo-smoke-app-reg/SKILL.md: warning callout pointing operators at the new runbook. Plugin OBO error enrichment - packages/obo-smoke-plugin/tools.js: catch block for the OBO grant surfaces errorCode + subError + correlationId + truncated errorMessage so AADSTS codes are visible in the structured tool result rather than collapsed to a generic message. Deploy ergonomics carried with this validation pass - deploy/scripts/lib/common.mjs: unique-per-build dirty image tags (timestamp suffix) so each rebuild produces a fresh tag, eliminating the "image push appears to have no effect" pitfall. - deploy/scripts/lib/substitute-env.mjs + test: __PS_UNSET__ sentinel removes a key from a generated env file rather than emitting a blank. - deploy/gitops/worker/overlays/default/.env: smoke-profile env keys added (driven from .env.remote via the substitute-env path above). - packages/portal/src/auth/providers/entra.js: MSAL admission accepts id_token-only sign-ins for portal auth; OBO acquireTokenSilent uses the configured downstream scope list with consent-prompt hint when the cache is empty. - packages/cli/src/smoke/cli.js: backtick fix in usage banner. Validation summary - All 19 plugin-tools-contract tests pass. - TypeScript build clean. - Live smoke on chkrawps10: P-1..P-4 propagation chain green; obo_smoke_whoami returns the user principal + access token to the plugin and emits the structured obo_failed outcome with the AAD error code visible. The OBO grant itself is gated by tenant FIC policy (see new runbook) — not in PR #51 scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
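The `__PS_UNSET__` behavior noted in the deploy-ergonomics bullets above (the sentinel removes a key from a generated env file rather than emitting a blank) can be illustrated with a small sketch. The `renderEnvFile` function and its input shape are hypothetical and are not the code in substitute-env.mjs.

```typescript
const UNSET_SENTINEL = "__PS_UNSET__";

// Render KEY=value lines, dropping any key whose value is the sentinel.
// An empty string is a real value and is still emitted as `KEY=`.
function renderEnvFile(values: Record<string, string>): string {
  return Object.keys(values)
    .filter((key) => values[key] !== UNSET_SENTINEL)
    .map((key) => `${key}=${values[key]}`)
    .join("\n");
}

console.log(
  renderEnvFile({
    PLUGIN_DIRS: "/app/packages/obo-smoke-plugin",
    OBO_KEK_KID: UNSET_SENTINEL,
    OBO_SMOKE_TEST_USER_UPN: "",
  })
);
```

Dropping the key keeps "not configured" distinguishable from "configured as empty" for whatever reads the generated file.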
Brings in v0.2.0 release: VPN P2S ingress (#53), dual redirect URIs, client-profile helper, gitignore .claude/, proposals doc revisions. Conflict resolution (all 10 conflicts): - package.json x3, packages/{cli,sdk}/package.json: took main's 0.2.0 version bumps (our 0.1.36 bump was branch-local, never released). - package.json (test:deploy-scripts): merged — kept our 3 test additions (build-call-sites, dockerfile-worker, setup-obo-smoke-worker-app) and picked up main's appgw-waf-rules.test.mjs. - package-lock.json: took theirs + npm install reconcile. - CHANGELOG.md: our entry restructured under '## Unreleased'; main's '## 0.2.0 — 2026-06-19' entry inserted below. - deploy/envs/template.env, both base-infra bicep files, two skill markdowns: pure additive blocks (OBO vs VPN sections / params / bullets); both halves retained, ours-first ordering. Validation: - pilotswarm-sdk@0.2.0: tsc build clean. - packages/sdk/test/local/plugin-tools-contract.test.js: 19/19 pass. - npm run test:deploy-scripts: 281/281 pass (was 248 on main; +33 from this branch's smoke/OBO test additions). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Six-phase delivery enabling downstream OBO consumers to perform OAuth2 On-Behalf-Of flows from worker tool handlers as the signed-in portal engineer rather than as the worker UAMI. ADO is the first anticipated consumer; future Graph/other consumers benefit from the same surface.
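As a rough illustration of what this enables, a tool handler can look up the signed-in user's context for its session and act on that user's token instead of the worker identity. The sketch below uses local stand-ins so it runs on its own: the `UserContext` and `PrincipalClaims` shapes, the `displayName` claim, the in-memory `contexts` map, and the `whoAmI` handler are all assumptions; only the `getUserContextForSession(sessionId)` name and the `accessToken: null` absence signal come from this description.

```typescript
// Local stand-ins; the SDK's real types are not shown here and may differ.
interface PrincipalClaims {
  displayName: string; // hypothetical claim name
}

interface UserContext {
  principal: PrincipalClaims;
  accessToken: string | null; // null: no user token available for this session
}

const contexts = new Map<string, UserContext>();

// Stand-in for the worker-side synchronous lookup.
function getUserContextForSession(sessionId: string): UserContext | null {
  return contexts.get(sessionId) ?? null;
}

type WhoAmIResult =
  | { ok: true; actingAs: string }
  | { ok: false; reason: "no_user_context" | "no_user_token" };

// Hypothetical handler: act as the signed-in user when a token is present,
// and report the absence explicitly otherwise.
function whoAmI(sessionId: string): WhoAmIResult {
  const ctx = getUserContextForSession(sessionId);
  if (ctx === null) return { ok: false, reason: "no_user_context" };
  if (ctx.accessToken === null) return { ok: false, reason: "no_user_token" };
  // A real handler would use ctx.accessToken in an OAuth2 On-Behalf-Of exchange here.
  return { ok: true, actingAs: ctx.principal.displayName };
}
```

Principal claims remain usable even when the token is absent, which is why the sketch checks the context and the token separately.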
PilotSwarm now propagates a per-RPC user envelope from portal sign-in (MSAL acquisition with configurable downstream scope + `offline_access`) through an envelope-encrypted carrier (AKV-wrapped DEK + AES-256-GCM ciphertext) into a worker-side synchronous lookup (`getUserContextForSession(sessionId)`) plus a structured tool-outcome family (`interactionRequired` / `serviceUnavailable`) for principled error propagation back to the portal UI, including transport-level auto re-auth.
Subsequent compartmentalization work generalized the in-process tool extension surface: `plugin.json` now supports a `tools` field for app-tier tool plugins (atomic, fail-fast collision policy; `PluginManifest` exported as a public SDK type). The OBO live-smoke harness consumes this contract and is fully detached from the default deploy surface: production deploys carry zero smoke code, zero smoke env keys, and zero smoke-only configuration.
Architecture invariants
Codified in `.github/copilot-instructions.md` under "User OBO (User-On-Behalf-Of) Propagation":
- `envelope` (carrying plaintext `principal` claims plus optional `accessTokenCipher`), not `envelopeCipher`. Plaintext principal claims flow on every worker-bound RPC; only the access token is encrypted.
- `packages/sdk/src/envelope-crypto.ts` selected by `selectEnvelopeCrypto(env)`: `AkvEnvelopeCrypto` (production; AKV SDKs lazy-loaded so non-OBO consumers don't pull deps), `InMemoryEnvelopeCrypto` (tests), `PlaintextEnvelopeCrypto` (dev-only, with loud startup warning, sentinel `kekKid: "plaintext-mode"`).
- … `wrapResult.keyID` (versioned URL) so prior-version retention covers in-flight envelopes when the KEK is rotated.
- `accessToken: null` is the universal absence signal (no token configured, system/orchestration session, AKV unwrap failure).
- … `packages/sdk/src/tool-outcomes.ts`) with pinned reason codes (`reauth_required` | `mfa_refresh` | `conditional_access` | `consent_required`), three-way machine-distinguishable from generic tool failure. The `claims` blob never reaches the LLM transcript; portal re-auth UI keys off `reasonCode`.
- … `PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE` is acquired).
- … `serviceUnavailable` / `interactionRequired`.
Plugin contract & smoke compartmentalization
Folded in late in the cycle as a generalization of the OBO live-smoke harness:
- `plugin.json` gains an optional `tools` field: a string path to a JS module that exports `registerTools(worker)`. The worker plugin loader invokes this hook at `start()` for app-tier plugins (`pluginDirs`); system/management tier ignore-with-warning.
- … `ToolNameCollisionError` from any contributor names both contributors, and no partial set of the offending plugin's tools remains registered. Loader is fully fail-closed (missing dirs, missing modules, missing exports, sync throw, async reject all halt startup).
- `PluginManifest` is exported from `pilotswarm-sdk` as a public TypeScript type so plugin authors can validate `plugin.json` at compile time. See `docs/plugin-architecture-guide.md` §7.
- … `examples/obo-smoke/` to `packages/obo-smoke-plugin/` and consumes the same contract as any other app-tier plugin, with no special-casing.
- … `--variant smoke` worker image build (selects `Dockerfile.worker`'s `runtime-smoke` target); `deploy/envs/template.smoke.env` overlay composed into the per-stamp env; `PLUGIN_DIRS=/app/packages/obo-smoke-plugin` set on the worker. `OBO_SMOKE_ENABLED` is no longer a worker boot gate; it's a stamp marker the smoke driver preflights.
- … `deploy/scripts/test/` pin the multi-stage Dockerfile contract and prevent reintroduction of smoke env keys into the default surface.
Implementation history
2b7a80fenvelope-crypto backends +UserContextStorefoundation62987b9orchestration plumbing + portal/CLI envelope wiring on every worker-bound RPC;f9a7a95carrier JSDoc fix0f40d84worker-affinedgetCurrentUserContextForSessionlookup with chain resolution (sub-agent → root portal-bound parent at lookup time, with re-rooting support per FR-021)72de0d1portal MSAL downstream-scope acquisition + envelope encryption end-to-end42d995einteractionRequired/serviceUnavailablestructured tool outcomes (FR-008, FR-008b)c1c9bc4examples/obo-smoke/reference plugin (whoami via Graph/me,force_reauthagainst CA-protected scope) +docs/operations/obo-kek-runbook.mdlive-tenant smoke checklistc5c48dfAKV KEK provisioning Bicep + deploy wiring + version bump 0.1.36;c327312review fixes (missing@azure/keyvault-keysdep, worker overlay scope key, KID versioning, builder template)5c5bae0smoke principal-shape access (F1), worker.stop finally block (F4), portal transport-level auto re-auth wire (F3) — debounced ~30s/session, global in-flight guard, firesgetDownstreamToken({ interactive: true })oninteraction_requiredoutcomes3338fc8new "User OBO Propagation" section incopilot-instructions.md;pilotswarm-tuiSKILL updated to capture portal auto-reauth wire;pilotswarm-releaseSKILL adds OBO live-tenant smoke gatef05e1a43 new unit cases assertingconsole.warnfires fromselectEnvelopeCryptowith the expected payload, and that AKV / null paths stay quiete74f996AKS-compatible smoke plugin backend (auto-selects client-secret vs workload-identity FIC),OBO_SMOKE_ENABLEDdeploy toggle,pilotswarm smoke <stamp> --profile oboCLI driver with JSON pass/fail,.github/workflows/live-smoke-obo.ymlworkflow_dispatchscaffold, and consolidateddocs/operations/live-smoke.md. 
Closes the FR-015 "FIC out of scope for smoke plugin" gap and makes the npm-publish release gate (FR-018) a one-command operatione74f996: (1) AKV unwrap in-activity retry loop with [500ms, 2s, 5s] backoff before falling through toservice_unavailable(FR-024 transient-retry compliance); (2) portal transporteventType/typenormalization inbrowser-transport.js maybeTriggerInteractiveReauthwith 6 new regression tests inportal-interactive-reauth.test.jsc37d978: pinnedInteractionRequiredReasonCodeunion +INTERACTION_REQUIRED_REASON_CODESset with reject-on-unknown enforcement in theinteractionRequired()helper;.env.exampleOBO_KEK_KIDexample normalized to un-versioned form with explanatory comment matchingdocs/configuration.mdanddocs/operations/obo-kek-runbook.md6c97b7b: projectOBO_SMOKE_WORKER_APP_TENANT_ID/_CLIENT_ID/_GRAPH_SCOPE+OBO_SMOKE_TEST_USER_UPNthroughdeploy/envs/template.env,compose-env.mjssentinel-fallback, and the worker overlay so flippingOBO_SMOKE_ENABLED=trueon a stamp lands the smoke plugin's per-stamp downstream-app config in the worker ConfigMap automatically (no manual Secret edits, no worker image rebuild). On AKS the plugin uses workload-identity FIC; CLIENT_SECRET stays a local-dev escape hatch.pilotswarm-npm-deployer.agent.mdandpilotswarm-new-env-deploy/SKILL.mdupdated with toggle workflow + post-deploy verification snippetfdd41c7:CHANGELOG.md/docs/sdk/user-context.md/docs/operations/live-smoke.mdupdated to capture pinned reason-code enforcement and the deploy-pipeline plumbing for the smoke harness698213c:deploy/scripts/README.mddocumentsOBO_KEK_KID+OBO_SMOKE_*knobs so operators see them in the canonical script reference, not just the per-stamp env23b731c: newdeploy/scripts/auth/Setup-OboSmokeWorkerApp.ps1(idempotent app-registration + service-principal + workload-identity FIC + un-versioned scope) takes the FR-018 release-gate smoke from "operator manually pre-creates the worker app" to a one-command bring-up for a fresh stamp. 
Wired into the npm-deployer agent +pilotswarm-new-env-deploySKILL as a pre-step gated onOBO_SMOKE_ENABLED=true. Newdeploy/scripts/test/setup-obo-smoke-worker-app.test.mjsexercises 18 invariants (idempotency, FIC subject shape, scope un-versioned form, cross-file bicep contract). Fixes a real cross-file bug found by impl-review cycle 1:deploy/services/base-infra/bicep/main.bicepconsumedAks.outputs.oidcIssuerUrlinternally but never emitted it as a top-level output, so the script'sResolve-OidcIssuerFromEnvwould have thrown on every fresh stamp — INV-9 pins the contract going forward8d0aa9b: applied two findings frompaw-final-review(PASS-WITH-OBSERVATIONS): (1).github/copilot-instructions.mdself-inflicted drift to the early-draft namegetCurrentUserContextForSessioncorrected to the locked, shipped API namegetUserContextForSession; (2)packages/sdk/src/types.tsInteractionRequiredReasonCodeunion andINTERACTION_REQUIRED_REASON_CODESReadonlySetre-derived from a single privateas consttuple so the type and runtime check can no longer silently drift. Public export shape unchanged (stillReadonlySet) — no breaking change to downstream consumersd807965: rewrites comments, docs, skills, and test describe blocks that referenced internal PAW workflow phases (e.g. "Phase 6", "Phase 7 — FR-026") to use the feature name ("User OBO"), the relevant spec FR, or nothing. Two test files renamed (phase3-*→obo-*). Surfaces unrelated to this PR (enterprise deploy roadmap inmain.bicep, portal-authz roadmap, Foundry proposal phases, algorithm-internaldehydrate()step labels) deliberately left intactdf01c88: persists the User OBO Propagation spec todocs/specs/user-obo-propagation.mdso the FR-XXX / SC-XXX citations scattered across source comments, docs, tests, and skills resolve to a real document in this repo. 
  Replaces internal-product/cluster identifiers across 18 files (docs, skills, proposals, fixtures, builder templates, source comments) with neutral placeholders (downstream consumer / ExampleApp / `<aks-cluster>` / `<resource-group>`) so the OSS surface stays product-neutral
- 71a812e: the shipped `.github/workflows/live-smoke-obo.yml` could not actually run (loaded gitignored per-stamp `.env`, required Azure subscription / OIDC trust this repo does not provision). Removed the workflow + its actionlint-shape test, reworded FR-028 to "deferred — future work" with the gating prerequisites documented inline, and added a brief "CI workflow (future work)" section to `docs/operations/live-smoke.md` explaining how an operator can add one later. The CLI driver remains the supported local-operator path
- 22fd8f9: introduces the `tools` field on `plugin.json` and the `registerTools(worker)` export contract (atomic, collision-safe, fail-closed); migrates the live-smoke harness from `examples/obo-smoke/` to `packages/obo-smoke-plugin/` consuming the new contract end-to-end. 17 new unit cases pin the contract's failure modes
- 4d4b484: removes `OBO_SMOKE_*` blocks from `deploy/envs/template.env`, `deploy/gitops/worker/overlays/default/.env`, and the `deploy/scripts/lib/compose-env.mjs` sentinel-fallback. Adds `deploy/envs/template.smoke.env` as the opt-in overlay (operators copy these keys into a per-stamp env when running smoke). 3 new compose-env tests including a directory-walk invariant that prevents `OBO_SMOKE_*` from being reintroduced into `deploy/scripts/lib/`
- cb0a915: splits `Dockerfile.worker` into `base` → `runtime-smoke` → `runtime` with `runtime` as the LAST stage, so bare `docker build` produces a smoke-free image. `deploy/scripts/lib/build-image.mjs` accepts `variant: "default" | "smoke"`; the smoke variant appends `-smoke` to the image tag and passes `--target runtime-smoke` to buildx.
  5 static-parse tests (`dockerfile-worker.test.mjs`) + 1 call-site audit test (`build-call-sites.test.mjs`) pin the convention against future drift
- f28459f: 22-file sweep across operator docs (`docs/operations/live-smoke.md`, `docs/specs/user-obo-propagation.md` FR-025/026/027 reframing), schema-doc surfaces (plugin-architecture-guide, configuration, system-reference, building-apps for cli + sdk), skills (`pilotswarm-new-env-deploy`, `pilotswarm-obo-smoke-app-reg`, `pilotswarm-aks-deploy`, `pilotswarm-release`), agents (`pilotswarm-npm-deployer`), setup READMEs, plugin-internal docs, CHANGELOG, and `.github/copilot-instructions.md`. All `examples/obo-smoke` references updated to `packages/obo-smoke-plugin`; `OBO_SMOKE_ENABLED` reframed as a stamp marker rather than a worker boot gate
- `PluginManifest` type (12fe822): re-exports the typed manifest interface from `pilotswarm-sdk` so plugin authors get TypeScript-level validation of their `plugin.json` contents. Corrects a pre-existing doc bug in `docs/plugin-architecture-guide.md` §7 (the example used object form for `tools`; the loader only accepts string form). 2 new unit cases verify the public re-export and validate the runtime shape of every checked-in `plugin.json`
- 878113a: rewrites a comment in `packages/obo-smoke-plugin/tools.js` that referenced internal PAW phase numbering (Phase-4 outcome family); the shipped source carries no internal workflow phase labels

Coordination with downstream consumer specs
Lookup contract and outcome taxonomy were locked through coordinated review with downstream consumer per-user-delegation work:
- `{ principal, accessToken, accessTokenExpiresAt }` and `accessToken: null` semantics confirmed.
- `reasonCode`.
- `oboKekUamiPrincipalIds` array contract supports both single-UAMI and dual-UAMI (PilotSwarm reference shape) deployments via parameter file alone; no template fork.
- `wrapKey`/`unwrapKey` only, 365-day auto-rotate with prior-version retention; un-versioned `OBO_KEK_KID` accepted, encrypt-time records versioned URL.

Operator-visible config
- `PORTAL_AUTH_PROVIDER=entra`, `PORTAL_AUTH_ENTRA_TENANT_ID`, `PORTAL_AUTH_ENTRA_CLIENT_ID`, `PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE` (e.g. `api://<worker-app>/.default offline_access`).
- `OBO_KEK_KID` (full or unversioned AKV key URL), `WORKLOAD_IDENTITY_CLIENT_ID`.
- `Key Vault Crypto User` on the OBO KEK AKV.
- `--variant smoke`, compose `deploy/envs/template.smoke.env` into the per-stamp env, set `PLUGIN_DIRS=/app/packages/obo-smoke-plugin`. See `docs/operations/live-smoke.md`.

Testing
- `envelope-crypto`, `obo-envelope-shape`, `obo-envelope-roundtrip`, `obo-no-plaintext-in-queue`, `obo-smoke-plugin-loadable`, `plugin-tools-contract`, `plugin-manifest-type`: all pass.
- `dockerfile-worker.test.mjs`, `build-call-sites.test.mjs`, and the directory-walk invariant guarding the default deploy surface against `OBO_SMOKE_*` reintroduction: all pass.
- `tsc --noEmit` clean across packages.
- `docs/operations/live-smoke.md`: operator-run release gate; must be exercised on the publishing stamp before npm publish for any OBO-touching release. The `pilotswarm-release` SKILL now codifies this. Full feature spec lives at `docs/specs/user-obo-propagation.md`.

Pre-existing flakes (NOT caused by this PR)
- `force-module.test.mjs`: Node 24 `mock.module` API change, pre-dates this branch.
- `obo-envelope-roundtrip.test.js`: requires live LLM provider env, expected to be skipped/red in offline CI.

Breaking changes
None. The envelope is additive on the wire (existing fields unchanged; new `accessTokenCipher` is optional). Workers without OBO crypto config fall back gracefully (`selectEnvelopeCrypto` returns `null`, plaintext principal-only envelopes continue to flow). Existing non-portal callers (local TUI hosts) are unaffected: `getUserContextForSession` returns `null`. Public `getDownstreamToken({ interactive })` option is backwards-compatible (existing callers pass no args). The plugin `tools` field is purely additive; existing plugins without it keep working unchanged. The smoke harness path moved (`examples/obo-smoke/` → `packages/obo-smoke-plugin/`); operators are not expected to import from the old path, and the `Setup-OboSmokeWorkerApp.ps1` paste-block emits the new `PLUGIN_DIRS` line automatically.

Deployment
- `pilotswarm-sdk`, `pilotswarm-cli`, `pilotswarm-web`.
- `@azure/keyvault-keys: ^4.10.0` (required when OBO is active; lazy-loaded otherwise).
- `OBO_KEK_KID`, `PORTAL_AUTH_ENTRA_DOWNSTREAM_SCOPE` (sentinel `__PS_UNSET__` keeps non-OBO deployments unchanged).
- `oboKekUamiPrincipalIds`.
- `Dockerfile.worker`; default `docker build` resolves to the smoke-free `runtime` target. Smoke stamps must build with `--variant smoke` (`docker buildx build --target runtime-smoke`).
- `docs/operations/obo-kek-runbook.md`.

Deferred (intentional, tracked in plan candidates)
- `email`, `displayName` ride on every worker-bound RPC envelope; data category is not new (already in `cms.sessions.owner` at rest), but the new wire surface increases the audit footprint from 3 → ~50 sites. Deferred so a future workstream can land redaction uniformly across both old and new sites rather than scope-creeping the OBO feature.
- `pilotswarm-smoke` package: deferred to its own spec; the driver currently lives inside `pilotswarm-cli`. Compartmentalization at the worker-image level (this PR) is sufficient for the OBO release.
- `init(worker)` / `dispose()`: deferred to its own spec; the `tools` field landed in this PR covers the immediate OBO need without committing to a broader lifecycle API.

🐾 Generated with PAW
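
Reviewer note on 8d0aa9b / c37d978: the single-source derivation of `InteractionRequiredReasonCode` and `INTERACTION_REQUIRED_REASON_CODES`, plus the reject-on-unknown check in `interactionRequired()`, can be pictured roughly as below. This is a minimal sketch, not the shipped code: the three reason-code strings, the `REASON_CODES` tuple name, the `isReasonCode` guard, and the `InteractionRequiredOutcome` shape are all assumptions made for illustration; only the union name, the set name, and the helper name come from this PR.

```typescript
// HYPOTHETICAL reason codes, for illustration only. The real list lives in
// packages/sdk/src/types.ts and is not reproduced in this description.
const REASON_CODES = ["consent_required", "token_expired", "mfa_required"] as const;

// The union type is derived from the tuple...
type InteractionRequiredReasonCode = (typeof REASON_CODES)[number];

// ...and so is the runtime set, so the type and the runtime check share one source.
const INTERACTION_REQUIRED_REASON_CODES: ReadonlySet<InteractionRequiredReasonCode> =
  new Set(REASON_CODES);

// ASSUMED outcome shape; the shipped helper may return something different.
interface InteractionRequiredOutcome {
  kind: "interaction_required";
  reasonCode: InteractionRequiredReasonCode;
}

// Hypothetical type guard: narrows an arbitrary string to the pinned union.
function isReasonCode(value: string): value is InteractionRequiredReasonCode {
  return (INTERACTION_REQUIRED_REASON_CODES as ReadonlySet<string>).has(value);
}

// Reject-on-unknown: any code outside the pinned set throws instead of flowing through.
function interactionRequired(reasonCode: string): InteractionRequiredOutcome {
  if (!isReasonCode(reasonCode)) {
    throw new Error(`Unknown interaction-required reason code: ${reasonCode}`);
  }
  return { kind: "interaction_required", reasonCode };
}
```

With this shape, adding a code means editing the tuple once; the union and the set both pick it up, and the public export can stay a `ReadonlySet` as described above.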