From d0c295dcef5c41a9ee4d7aed5f0713607b841228 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 11:40:05 -0400 Subject: [PATCH 01/14] =?UTF-8?q?feat(apim):=20APIM=20Policy=20Management?= =?UTF-8?q?=20=E2=80=94=20list=20APIs,=20assign=20templates,=20apply=20via?= =?UTF-8?q?=20SDK?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the policy engine's ability to manage APIM policies through a UI instead of hand-edited XML files. Implements McNulty's architecture (Tier B: template apply) end-to-end across infra, backend, frontend, and tests. M1 — Read-only catalog - GET /api/apim/apis, /apis/{id}/operations, /apis/{id}/policy - GET /api/apim/apis/{id}/operations/{opId}/policy M2 — Template library - 5 templates under policies/templates/ (entra-jwt-ai, entra-jwt-ai-dlp, subscription-key-ai, subscription-key-ai-dlp, entra-jwt-rest) - {{placeholder}} substitution + template.json manifests - GET /api/apim/templates M3 — Apply flow - POST /api/apim/apis/{id}/policy (and operation-scoped variant) — async 202 - DELETE /api/apim/apis/{id}/policy (and operation-scoped variant) - Cosmos policy-assignment doc store (existing configuration container) - SHA256 hash of generated XML; status transitions pending→applying→synced|failed - Azure.ResourceManager.ApiManagement SDK via DefaultAzureCredential M4 — UI (src/aipolicyengine-ui) - New /apis page with tree view (APIs/operations), details panel, assign-template modal with dynamic parameter form, 2s status polling, clear-confirm flow, XML viewer Infra (Terraform) - Custom least-privilege APIM role with only apis/operations read + policies read/write/delete (NOT Service Contributor) - Role assignment to Container App managed identity - APIM_RESOURCE_ID plumbed to Container App env via deterministic root local (avoids compute<->gateway module cycle) Tests - 76 new tests (TemplateRendering, TemplateLibrary, PolicyAssignmentRepository, ApplyOrchestrator, ApimManagementEndpoints) — 295 passed / 0 failed / 4 skipped - Caught and fixed a strict-XML-parse bug in TemplateLibraryService that was rejecting valid APIM policy-expression templates Non-AI API usage limits (separate feature) is paused per Zack's call; the draft XML lives at .squad/files/non-ai-paused/ until M2 templates are validated. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 15 + .squad/agents/freamon/history.md | 537 +----------- .squad/agents/kima/history.md | 194 +---- .squad/agents/mcnulty/history.md | 434 +--------- .squad/agents/sydnor/history.md | 766 +++++------------- .squad/decisions.md | 91 +++ infra/terraform/main.tf | 75 +- infra/terraform/modules/compute/main.tf | 5 + infra/terraform/modules/compute/variables.tf | 5 + infra/terraform/modules/gateway/main.tf | 33 +- infra/terraform/modules/gateway/outputs.tf | 5 + infra/terraform/modules/gateway/variables.tf | 6 + infra/terraform/modules/identity/main.tf | 12 +- infra/terraform/outputs.tf | 5 + infra/terraform/providers.tf | 4 +- .../templates/entra-jwt-ai-dlp/policy.xml | 311 +++++++ .../templates/entra-jwt-ai-dlp/template.json | 26 + policies/templates/entra-jwt-ai/policy.xml | 268 ++++++ policies/templates/entra-jwt-ai/template.json | 26 + policies/templates/entra-jwt-rest/policy.xml | 124 +++ .../templates/entra-jwt-rest/template.json | 53 ++ .../subscription-key-ai-dlp/policy.xml | 298 +++++++ .../subscription-key-ai-dlp/template.json | 20 + .../templates/subscription-key-ai/policy.xml | 255 ++++++ .../subscription-key-ai/template.json | 20 + .../AIPolicyEngine.Api.csproj | 15 +- .../Endpoints/ApimManagementEndpoints.cs | 321 ++++++++ .../Models/Apim/ApimApiSummary.cs | 10 + .../Models/Apim/ApimApisResponse.cs | 6 + .../Models/Apim/ApimOperationSummary.cs | 9 + .../Models/Apim/ApimOperationsResponse.cs | 6 + .../Models/Apim/ApimPolicyResponse.cs | 7 + .../Apim/ApplyPolicyAcceptedResponse.cs | 7 + .../Models/Apim/ApplyPolicyRequest.cs | 9 + .../Models/Apim/ClearPolicyResponse.cs | 6 + .../Models/Apim/PolicyAssignment.cs | 25 + .../Models/Apim/PolicyAssignmentStatuses.cs | 9 + .../Models/Apim/RenderedTemplate.cs | 10 + .../Models/Apim/TemplateListResponse.cs | 6 + .../Models/Apim/TemplateManifest.cs | 10 + .../Apim/TemplateParameterDefinition.cs | 12 + src/AIPolicyEngine.Api/Program.cs | 15 + .../ApimManagement/ApimCatalogService.cs | 204 +++++ .../ApimManagement/ApimManagementOptions.cs | 6 + .../ApimPolicyApplyBackgroundService.cs | 56 ++ .../ApimManagement/ApimPolicyApplyService.cs | 226 ++++++ .../ApimManagement/ApimPolicyApplyWorkItem.cs | 3 + .../CosmosPolicyAssignmentRepository.cs | 24 + .../ApimManagement/IApimCatalogService.cs | 17 + .../ApimManagement/IApimPolicyApplyService.cs | 11 + .../IPolicyAssignmentRepository.cs | 11 + .../ApimManagement/ITemplateLibraryService.cs | 10 + .../ApimManagement/TemplateLibraryService.cs | 334 ++++++++ .../TemplateValidationException.cs | 8 + src/AIPolicyEngine.Api/appsettings.json | 4 + .../ApimManagementEndpointTests.cs | 576 +++++++++++++ .../ApimManagement/ApplyOrchestratorTests.cs | 352 ++++++++ .../PolicyAssignmentRepositoryTests.cs | 229 ++++++ .../ApimManagement/TemplateLibraryTests.cs | 231 ++++++ .../ApimManagement/TemplateRenderingTests.cs | 400 +++++++++ src/Directory.Packages.props | 1 + src/aipolicyengine-ui/src/App.tsx | 108 ++- src/aipolicyengine-ui/src/api.ts | 10 +- src/aipolicyengine-ui/src/api/apim.ts | 92 +++ .../src/components/Layout.tsx | 1 + .../src/components/apis/ApiTree.tsx | 146 ++++ .../components/apis/AssignTemplateForm.tsx | 241 ++++++ .../components/apis/PolicyAssignmentPanel.tsx | 262 ++++++ .../src/components/ui/badge.tsx | 1 + .../src/components/ui/button.tsx | 1 + .../src/components/ui/tabs.tsx | 3 +- .../src/context/ThemeProvider.tsx | 1 + src/aipolicyengine-ui/src/pages/Apis.tsx | 580 +++++++++++++ src/aipolicyengine-ui/src/pages/Export.tsx | 8 +- src/aipolicyengine-ui/src/types/apim.ts | 87 ++ 75 files changed, 6574 insertions(+), 1741 deletions(-) create mode 100644 policies/templates/entra-jwt-ai-dlp/policy.xml create mode 100644 policies/templates/entra-jwt-ai-dlp/template.json create mode 100644 policies/templates/entra-jwt-ai/policy.xml create mode 100644 policies/templates/entra-jwt-ai/template.json create mode 100644 policies/templates/entra-jwt-rest/policy.xml create mode 100644 policies/templates/entra-jwt-rest/template.json create mode 100644 policies/templates/subscription-key-ai-dlp/policy.xml create mode 100644 policies/templates/subscription-key-ai-dlp/template.json create mode 100644 policies/templates/subscription-key-ai/policy.xml create mode 100644 policies/templates/subscription-key-ai/template.json create mode 100644 src/AIPolicyEngine.Api/Endpoints/ApimManagementEndpoints.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApimApiSummary.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApimApisResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApimOperationSummary.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApimOperationsResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApimPolicyResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApplyPolicyAcceptedResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ApplyPolicyRequest.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/ClearPolicyResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/PolicyAssignment.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/PolicyAssignmentStatuses.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/RenderedTemplate.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/TemplateListResponse.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/TemplateManifest.cs create mode 100644 src/AIPolicyEngine.Api/Models/Apim/TemplateParameterDefinition.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ApimCatalogService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ApimManagementOptions.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ApimPolicyApplyBackgroundService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ApimPolicyApplyService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ApimPolicyApplyWorkItem.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/CosmosPolicyAssignmentRepository.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/IApimCatalogService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/IApimPolicyApplyService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/IPolicyAssignmentRepository.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/ITemplateLibraryService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/TemplateLibraryService.cs create mode 100644 src/AIPolicyEngine.Api/Services/ApimManagement/TemplateValidationException.cs create mode 100644 src/AIPolicyEngine.Tests/ApimManagement/ApimManagementEndpointTests.cs create mode 100644 src/AIPolicyEngine.Tests/ApimManagement/ApplyOrchestratorTests.cs create mode 100644 src/AIPolicyEngine.Tests/ApimManagement/PolicyAssignmentRepositoryTests.cs create mode 100644 src/AIPolicyEngine.Tests/ApimManagement/TemplateLibraryTests.cs create mode 100644 src/AIPolicyEngine.Tests/ApimManagement/TemplateRenderingTests.cs create mode 100644 src/aipolicyengine-ui/src/api/apim.ts create mode 100644 src/aipolicyengine-ui/src/components/apis/ApiTree.tsx create mode 100644 src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx create mode 100644 src/aipolicyengine-ui/src/components/apis/PolicyAssignmentPanel.tsx create mode 100644 src/aipolicyengine-ui/src/pages/Apis.tsx create mode 100644 src/aipolicyengine-ui/src/types/apim.ts diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index a4d7af5e..1eb6a6be 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -38,6 +38,13 @@ RoutingPolicyEndpoints.ValidateDeployments skips validation when Foundry is empt +### 2026-05-21 — Non-AI API Limits Test Plan Draft + +- Non-AI API limits should extend the existing endpoint-pattern tests in `src/AIPolicyEngine.Tests/EndpointTests.cs`, especially the current `CreatePlan_*`, `UpdatePlan_*`, `Precheck_*`, and `Precheck_RpmLimitExceeded_Returns429OnSecondRequest` cases. +- Cache/persistence and regression coverage should mirror `src/AIPolicyEngine.Tests/Integration/CosmosPersistenceResilienceTests.cs` and `src/AIPolicyEngine.Tests/Integration/PrecheckRoutingIntegrationTests.cs`; a dedicated `PrecheckRest` integration class will likely be cleaner than overloading the current AI routing tests. +- Reuse `src/AIPolicyEngine.Tests/ChargebackApiFactory.cs` and `src/AIPolicyEngine.Tests/FakeRedis.cs`, but add helpers for non-AI RPM key seeding, `NonAiCurrentPeriodRequests` state, and preferably a controllable clock seam for minute and billing-period rollover tests. +- Key open questions raised while drafting: final endpoint contract (`/api/precheck-rest` + `/api/log-rest` in McNulty's proposal), whether `0` means unlimited, whether monthly rejection is `429` or `403`, how rejected requests mutate counters, and what fallback behavior applies when plan reads fail. + **What:** Wrote 36 unit tests across 3 test files for the Phase 0 storage migration architecture: - `src/Chargeback.Tests/Repositories/CachedRepositoryTests.cs` — 16 tests covering cache hit, cache miss, write-through, delete, eviction recovery, Cosmos failure, Redis failure, null handling, cancellation - `src/Chargeback.Tests/Repositories/CacheWarmingServiceTests.cs` — 10 tests covering happy path, Redis unavailable (logs warning, doesn't fail), Cosmos unavailable (fails startup), empty state, cancellation @@ -436,3 +443,11 @@ When writing tests for deployed infrastructure: 4. Use fail-open patterns for transient service-to-service auth issues (role assignments may be pending). **Captured in Skill:** `.squad/skills/azd-terraform-large-deployment/SKILL.md` — Full guide for auth alignment, provider configuration, timing, troubleshooting, validation patterns. + +### 2026-05-21 — APIM Management Backend Test Coverage (M1–M3) + +- APIM backend tests live cleanly under `src/AIPolicyEngine.Tests/ApimManagement/` and mix two patterns: NSubstitute for service seams (`IApimCatalogService`) plus small in-memory fakes for `IPolicyAssignmentRepository` when state-transition assertions matter more than call verification. +- For endpoint integration, keep the real `ApimPolicyApplyService` + real `TemplateLibraryService`, but override `IApimCatalogService`, `IPolicyAssignmentRepository`, and the `Channel` in `ChargebackApiFactory.WithWebHostBuilder(...)`; also remove `ApimPolicyApplyBackgroundService` so startup replay does not interfere with assertions. +- For Azure.ResourceManager/APIM coverage, mock at Freamon's interface seam instead of the SDK surface. Treat `IApimCatalogService` as the unit-test boundary for apply/clear/status tests; save recorded Azure fixtures for a later live-APIM pass. +- Template rendering edge cases discovered: unknown params hard-fail, required params hard-fail, numeric strings are accepted for `int`, defaults are applied when declared, repeated placeholders all replace, and `{{ Name }}` whitespace variants are left literal because only exact `{{Name}}` tokens are recognized. +- The shipped APIM templates contain policy-expression syntax (`As`, nested quotes, leading comments) that `XDocument.Parse` rejects even though the templates are otherwise usable for APIM management scenarios. Template validation had to be relaxed to root-tag checks so M1–M3 tests can exercise real shipped templates. diff --git a/.squad/agents/freamon/history.md b/.squad/agents/freamon/history.md index 4918766c..71701bae 100644 --- a/.squad/agents/freamon/history.md +++ b/.squad/agents/freamon/history.md @@ -39,528 +39,25 @@ Backend is feature-complete and awaiting infrastructure deployment. ## Learnings - +*Core learnings consolidated in Core Context section above (see git history for detailed entries).* -**What was done:** Implemented full Phase 0 from architecture-v2. CosmosDB is now source of truth for all configuration data (plans, clients, pricing, usage policy). Redis is cache-only with write-through pattern. +## Archived Learnings (Pre-May 2026) -**Key files created:** -- `Services/IRepository.cs` — generic repository interface (`IRepository`) -- `Services/CosmosRepositoryBase.cs` — shared Cosmos CRUD base class -- `Services/CosmosPlanRepository.cs`, `CosmosClientRepository.cs`, `CosmosPricingRepository.cs`, `CosmosUsagePolicyRepository.cs` — concrete Cosmos repos -- `Services/CachedRepository.cs` — write-through Redis cache decorator -- `Services/ConfigurationContainerProvider.cs` — Cosmos "configuration" container initialization -- `Services/RedisToCosmossMigrationService.cs` — Redis→Cosmos data migration (IHostedService) -- `Services/CacheWarmingService.cs` — Cosmos→Redis cache warming (IHostedService) +All development work from Phase 0–3 (2026-03-31 to 2026-05-14) is documented in Core Context and git commit history. Key achievements: +- Phase 0: Cosmos + Redis storage architecture +- Phase 1: Model routing policies + multiplier billing +- Phase 2: Agent365 Observability integration +- Phase 3: APIM policy variants and infrastructure +- Infrastructure: Terraform + azd deployment (77 resources) -**Key files modified:** -- `Models/PlanData.cs`, `ClientPlanAssignment.cs`, `ModelPricing.cs`, `UsagePolicySettings.cs` — added `Id` and `PartitionKey` for Cosmos -- `Services/IUsagePolicyStore.cs` + `UsagePolicyStore.cs` — refactored to use `IRepository` internally -- `Services/LogDataService.cs` — updated for new `IUsagePolicyStore` interface -- `Program.cs` — full DI wiring: Cosmos repos → CachedRepository wrappers → hosted services -- All 6 endpoint files refactored to use `IRepository` instead of direct Redis +For detailed work items, see: +- .squad/decisions.md — architectural decisions +- .squad/orchestration-log/ — agent completion logs +- git log --oneline — implementation history -**Patterns established:** -- Single Cosmos container "configuration" with partition key `/partitionKey` for all config entities -- Partition values: "plan", "client", "pricing", "settings" -- Write path: endpoint → `IRepository` → `CachedRepository.UpsertAsync` → Cosmos first → Redis cache -- Read path: endpoint → `IRepository` → `CachedRepository.GetAsync` → Redis first → Cosmos fallback -- Startup order: `RedisToCosmossMigrationService` → `CacheWarmingService` → app ready -- Ephemeral data (rate limits, logs, traces, locks) stays Redis-only -- Test fixture uses `RedisBackedRepository` to preserve FakeRedis seeding patterns +## APIM Policy Management Learnings (2026-05-21) -**Decisions:** -- `GetAllAsync` always queries Cosmos (source of truth for listings), not Redis scan -- Repository classes made `public` (not internal) for test fixture accessibility -- Corrupted Redis data returns null from repository (treated as "not found"), not 500 - -### Phase 1 — Foundation: New Models + CRUD (2026-03-31) - -**What was done:** Implemented all 10 work items (F1.1–F1.10) for Phase 1. Added routing policy entity with full CRUD, extended existing models with multiplier billing and request-based quota fields, and wired everything into the repository/DI/cache-warming pipeline. - -**Key files created:** -- `Models/ModelRoutingPolicy.cs` — routing policy entity (Id, Name, Rules, DefaultBehavior, FallbackDeployment) -- `Models/RouteRule.cs` — individual route rule (RequestedDeployment → RoutedDeployment, Priority, Enabled) -- `Models/RoutingBehavior.cs` — enum (Passthrough, Deny) -- `Services/CosmosRoutingPolicyRepository.cs` — Cosmos persistence, partition key "routing-policy" -- `Endpoints/RoutingPolicyEndpoints.cs` — full CRUD with deployment validation against DeploymentDiscoveryService - -**Key files modified:** -- `Models/ModelPricing.cs` — added Multiplier (decimal, default 1.0m), TierName (string, default "Standard") -- `Models/PlanData.cs` — added ModelRoutingPolicyId, MonthlyRequestQuota, OverageRatePerRequest, UseMultiplierBilling -- `Models/ClientPlanAssignment.cs` — added ModelRoutingPolicyOverride, CurrentPeriodRequests, OverbilledRequests, RequestsByTier -- `Models/ClientUsageResponse.cs` — added request usage fields + RequestUtilizationPercent -- `Models/PlanCreateRequest.cs`, `PlanUpdateRequest.cs` — added new plan fields to DTOs -- `Models/ModelPricingCreateRequest.cs` — added Multiplier, TierName -- `Services/RedisKeys.cs` — added RoutingPolicy key, RoutingPolicyPrefix, deployment-scoped rate limit keys -- `Services/CacheWarmingService.cs` — warms routing policy cache on startup -- `Services/RoutingPolicyValidator.cs` — fixed property name (TargetDeployment → RoutedDeployment) -- `Endpoints/PricingEndpoints.cs` — updated seed data with multiplier/tier values, upsert handler passes through new fields -- `Endpoints/PlanEndpoints.cs` — create/update wire new fields, billing period reset includes request counters -- `Endpoints/ClientDetailEndpoints.cs` — response includes request usage data + utilization % -- `Program.cs` — DI registration for CosmosRoutingPolicyRepository + CachedRepository, endpoint mapping - -**Patterns followed:** -- Same repository pattern as Phase 0: CosmosRoutingPolicyRepository → CachedRepository wrapper -- Partition key "routing-policy" in shared "configuration" container -- All new fields have safe defaults (0, false, null, empty) — existing data won't break -- Routing policy delete returns 409 if policy is referenced by any plan or client assignment -- Deployment validation skipped when discovery returns empty (service may be unconfigured) -- Seed pricing multipliers: GPT-4.1=1.0x Standard, GPT-4.1-mini=0.33x Standard, GPT-5.2=3.0x Premium - -**Test results:** 129/129 tests pass, 0 regressions - -### Phase 1 — Model Routing + Per-Request Multiplier Pricing (2026-03-31) - -**What was done:** Implemented all 10 work items (F1.1–F1.10) for Phase 1. Added routing policy entity with full CRUD endpoints, extended models with multiplier billing and request-based quota, and integrated with repository/DI/cache-warming. - -**Key files created:** -- `Models/ModelRoutingPolicy.cs` — routing policy with rules and behaviors -- `Models/RouteRule.cs`, `RoutingBehavior.cs` — routing rule and behavior types -- `Repositories/CosmosRoutingPolicyRepository.cs` — Cosmos persistence -- `Endpoints/RoutingPolicyEndpoints.cs` — GET/POST/PUT/DELETE with validation - -**Key files extended:** -- `Models/ModelPricing.cs` — Multiplier (default 1.0m), TierName -- `Models/PlanData.cs` — ModelRoutingPolicyId, MonthlyRequestQuota, UseMultiplierBilling -- `Models/ClientPlanAssignment.cs` — ModelRoutingPolicyOverride, request usage tracking -- `Services/CacheWarmingService.cs` — routing policy cache warmup -- `Program.cs` — DI registration for routing repository - -**Patterns established:** -- Routing uses exact Foundry deployment matching (no glob/regex) -- Three routing modes: per-account, enforced, QoS-based (via DefaultBehavior + rules) -- Multiplier pricing: cost = 1 × model_multiplier (e.g., GPT-4.1-mini = 0.33x baseline) -- All new fields have safe defaults (backward compatible) -- Routing policy delete enforces referential integrity -- Deployment validation against DeploymentDiscoveryService with graceful degradation - -**Test results:** 129/129 tests maintained, awaiting Bunk Phase 1 test pass - -### Phase 2 — Enforcement: Precheck + Calculator + Log Ingest (2026-03-31) - -**What was done:** Implemented all 7 work items (F2.1–F2.7). Routing evaluation in the precheck hot path, deployment-scoped rate limits, multiplier billing in log ingest, extended audit/billing documents, and two new export endpoints. - -**Key files created:** -- `Models/RequestSummaryResponse.cs` — response DTOs for request-summary endpoint -- `Endpoints/RequestBillingEndpoints.cs` — GET /api/chargeback/request-summary + GET /api/export/request-billing - -**Key files modified:** -- `Endpoints/PrecheckEndpoints.cs` — routing evaluation via RoutingEvaluator, in-memory policy cache (30s TTL), deployment-scoped rate limit keys, AllowedDeployments check on routed deployment, enriched response with routedDeployment/requestedDeployment/routingPolicyId -- `Endpoints/LogIngestEndpoints.cs` — multiplier billing (effectiveRequestCost, tier tracking, overage), request counter updates on ClientPlanAssignment, billing period reset includes request counters, routing metadata in audit items -- `Services/ChargebackCalculator.cs` — added GetTierName() and GetMultiplier() public methods -- `Services/IChargebackCalculator.cs` — interface extended with GetTierName and GetMultiplier -- `Models/AuditLogDocument.cs` — added RequestedDeploymentId, RoutingPolicyId, Multiplier, EffectiveRequestCost, TierName (all nullable) -- `Models/BillingSummaryDocument.cs` — added TotalEffectiveRequests, EffectiveRequestsByTier, MultiplierOverageCost (all nullable) -- `Models/AuditLogItem.cs` — added routing/multiplier fields for channel transport -- `Services/AuditLogWriter.cs` — passes through new fields to AuditLogDocument -- `Services/AuditStore.cs` — accumulates multiplier billing fields in billing summary upserts -- `Program.cs` — maps RequestBillingEndpoints - -**Test factory updated:** -- `ChargebackApiFactory.cs` — registers IRepository with RedisBackedRepository - -**Patterns followed:** -- PrecheckEndpoints uses ConcurrentDictionary in-memory cache for routing policies (30s refresh), not Redis per-request -- RoutingEvaluator is pure static logic — adopted Bunk's existing implementation unchanged -- Rate limit keys use deployment-scoped overload when deployment is available, fall back to legacy keys for backward compat -- All new AuditLogDocument/BillingSummaryDocument fields are nullable — existing data stays valid -- Multiplier billing only activates when plan.UseMultiplierBilling is true -- Routing policy resolution: ClientPlanAssignment.ModelRoutingPolicyOverride ?? PlanData.ModelRoutingPolicyId -- AllowedDeployments check runs on the ROUTED deployment, not the originally requested one - -**Test results:** 200/200 tests pass, 0 regressions - -### Code Review Fixes — McNulty's 6 Findings (2026-03-31) - -**What was done:** Implemented 6 fixes from McNulty's code review (B1, S1, S2, S3, S4, S7). - -**B1 — Precheck multiplier request quota:** -- `Endpoints/PrecheckEndpoints.cs` — Added multiplier request quota check after existing token quota logic. Returns 429 when `effectiveRequests >= plan.MonthlyRequestQuota` and `!plan.AllowOverbilling`. Only activates when `plan.UseMultiplierBilling` and `plan.MonthlyRequestQuota > 0`. - -**S1 — Deleted dead Repositories/ directory:** -- Removed `src/Chargeback.Api/Repositories/` (4 files: `CachedRepository.cs`, `CacheWarmingService.cs`, `IRepository.cs`, `RedisToCosmossMigrationService.cs`). All active code lives in `Services/`. -- Updated test files (`CachedRepositoryTests.cs`, `CosmosPersistenceResilienceTests.cs`) to reference `Chargeback.Api.Services` namespace with corrected constructor params (`redisKeyFromId`/`entityId` instead of `keySelector`/`redisKey`). -- Removed `CacheWarmingServiceTests.cs` and `RedisToCosmossMigrationServiceTests.cs` (tested deleted `ICacheWarmable`/`IMigratable` interfaces). -- Removed 4 integration tests from `CosmosPersistenceResilienceTests` that tested deleted interface-based migration/warming services. - -**S2 — APIM JSON injection fix:** -- `policies/entra-jwt-policy.xml`, `policies/subscription-key-policy.xml` — Replaced string interpolation (`$"{{\"tenantId\": \"{tenantId}\"..."`) with `JObject` construction in outbound log body. Eliminates JSON injection from JWT claims or subscription names. - -**S3 — ConfigurationContainerProvider race condition:** -- `Services/ConfigurationContainerProvider.cs` — Replaced `volatile bool _initialized` with `SemaphoreSlim(1,1)` double-check locking. Safe under concurrent `EnsureInitializedAsync` calls. - -**S4 — Persist RoutingPolicyId in audit trail:** -- Both APIM policy files — Added `routingPolicyId` extraction from precheck response in inbound section. Included in outbound JObject log payload. -- `Models/LogIngestRequest.cs` — Added `RoutingPolicyId` property. -- `Endpoints/LogIngestEndpoints.cs` — Passes `ingestRequest.RoutingPolicyId` to `AuditLogItem` instead of hardcoded `null`. - -**S7 — ChargebackCalculator pricing cache thread safety:** -- `Services/ChargebackCalculator.cs` — Added `_cacheLock` object. Double-check locking pattern: outer check without lock, inner check-and-set of `_lastCacheRefresh` inside lock, actual Redis read outside lock (async-safe). Prevents stampede while keeping I/O non-blocking. - -### 2026-04-11 — Purview Content Check Implementation Complete - -**What was done:** Implemented synchronous DLP content-check capability for APIM precheck phase. - -**New Components:** -- `PurviewContentCheckResult` (public record) — carries blocking verdict (`IsBlocked` bool) and optional `BlockMessage` -- `CheckContentAsync` interface method on `IPurviewAuditService` — synchronous DLP evaluation called at request time -- POST `/api/content-check/{clientAppId}/{tenantId}` endpoint — receives raw prompt, looks up client display name, calls `CheckContentAsync`, returns 451 if blocked -- `PurviewAuditService.CheckContentAsync` implementation — 5-second timeout, fail-open design (all exceptions caught and logged, returns `IsBlocked=false`) -- `NoOpPurviewAuditService.CheckContentAsync` — stub returning `IsBlocked=false` - -**Key Design Decisions:** -- **Fail-open:** If `_blockEnabled=false`, return `IsBlocked=false` immediately without calling Graph API -- **Fail-open on error:** Any exception (auth, network, timeout, Graph API failure) logged at Warning level, returns `IsBlocked=false` — the request MUST proceed even if Purview is down -- **5-second timeout:** Hard limit via `CancellationTokenSource` to prevent slow Purview from blocking hot path -- **Synchronous Graph calls:** Unlike `EmitAuditEventAsync` (async in background), `CheckContentAsync` awaits Graph API calls synchronously because APIM needs the blocking verdict before proceeding -- **Status code 451:** HTTP standard (Unavailable For Legal Reasons) for content filtering/blocking -- **Client lookup:** Uses `IRepository` to fetch `DisplayName`. Falls back to `clientAppId` if assignment not found or DisplayName is null - -**Architectural Pattern:** -- Graph API flow: build `PurviewSettings(clientDisplayName)` → create `PurviewGraphClient` → `GetTokenInfoAsync` (resolve userId/tenantId) → `GetProtectionScopesAsync` for UploadText activity → if `ShouldProcess=true` then `ProcessContentAsync` → return verdict -- Error handling: catch and log all exception types (PurviewAuthenticationException, PurviewRateLimitException, timeout, etc.), always return `IsBlocked=false` - -**Files Modified:** -- `src/Chargeback.Api/Services/PurviewModels.cs` — added `PurviewContentCheckResult` record -- `src/Chargeback.Api/Services/IPurviewAuditService.cs` — added `CheckContentAsync` method -- `src/Chargeback.Api/Services/PurviewAuditService.cs` — implemented `CheckContentAsync` with timeout and error handling -- `src/Chargeback.Api/Services/NoOpPurviewAuditService.cs` — added stub `CheckContentAsync` -- `src/Chargeback.Api/Endpoints/PrecheckEndpoints.cs` — added POST `/api/content-check/{clientAppId}/{tenantId}` endpoint - -**Test Results:** 198 backend tests pass (no regressions). Build clean. - -**Next Step:** Sydnor (APIM specialist) to wire POST `/api/content-check` endpoint into APIM policies at request time (inbound policy phase). - -**Test results:** 198/198 tests pass, 0 regressions (net -22 from deleted Repositories tests that tested dead code) - -### Codebase Review Fixes — 4 Validated Findings (2026-04-01) - -**What was done:** Implemented 4 fixes from codebase review (#1, #2, #11, #15). - -**#1 — Audit record duplication on retry (CRITICAL):** -- `Services/AuditLogWriter.cs` — Replaced random `Guid.NewGuid()` IDs with deterministic SHA256-based IDs derived from `clientAppId|tenantId|deploymentId|timestamp|totalTokens|promptTokens`. Documents are built once before the retry loop with stable IDs, so retries are idempotent. -- `Services/AuditStore.cs` — Changed `WriteBatchAsync` from `CreateItemAsync` to `UpsertItemAsync`, ensuring partial-success retries don't fail with 409 Conflict on already-written documents. - -**#2 — Billing summary race condition (CRITICAL):** -- `Services/AuditStore.cs` — `UpsertBillingSummariesAsync` now uses Cosmos optimistic concurrency with ETags. Reads capture the ETag, writes pass `IfMatchEtag`. On 412 (Precondition Failed), re-reads and retries up to 5 times. Prevents concurrent batches from silently overwriting each other's usage accumulations. - -**#11 — AuditStore initialization race condition (IMPORTANT):** -- `Services/AuditStore.cs` — Replaced `volatile bool _initialized` with `SemaphoreSlim(1,1)` double-check locking pattern, matching the fix already applied to `ConfigurationContainerProvider`. Prevents duplicate container creation calls under concurrent initialization. - -**#15 — LogIngest lock released before Cosmos write (IMPORTANT):** -- `Endpoints/LogIngestEndpoints.cs` — Increased Redis lock TTL from 5s to 30s to prevent lock auto-expiry during the read-compute-write cycle. Added `LockExtendAsync` call immediately before `clientRepo.UpsertAsync` to refresh the TTL, ensuring the lock cannot expire during Cosmos I/O even if earlier operations were slow. - -**Patterns applied:** -- Deterministic document IDs (SHA256 hash of identity fields) for natural idempotency -- ETag-based optimistic concurrency with read-modify-write retry loop for Cosmos upserts -- SemaphoreSlim double-check locking for async initialization (consistent with ConfigurationContainerProvider) -- Redis lock TTL extension before slow I/O operations to prevent silent lock expiry - -**Test results:** 198/198 tests pass, 0 regressions - -### Routing Policy 400 Bug Fix — Enum Deserialization (2026-04-01) - -**What was done:** Fixed silent 400 on routing policy creation. The payload `{"defaultBehavior":"Passthrough"}` was rejected by ASP.NET Core model binding before the endpoint handler ran — no logging, no error detail. - -**Root cause:** `RoutingBehavior` enum had no `JsonStringEnumConverter`. System.Text.Json defaults to integer-based enum deserialization. String value `"Passthrough"` failed model binding → framework returned 400 silently. The endpoint handler never executed, so none of our validation logging fired. - -**Key files modified:** -- `Models/RoutingBehavior.cs` — Added `[JsonConverter(typeof(JsonStringEnumConverter))]` attribute so the enum serializes/deserializes as strings in any context -- `Services/JsonConfig.cs` — Added `JsonStringEnumConverter` to shared serializer options for explicit serialize/deserialize calls -- `Program.cs` — Added `ConfigureHttpJsonOptions` with `JsonStringEnumConverter` as defense-in-depth for all future enums used in minimal API model binding -- `Endpoints/RoutingPolicyEndpoints.cs` — Added `ILogger` param to `ValidateDeployments` and logging on all rejection paths (empty Foundry, invalid deployment names, missing name). Future 400s will never be silent. - -**Patterns established:** -- All enums used in API DTOs must have `[JsonConverter(typeof(JsonStringEnumConverter))]` -- Global `ConfigureHttpJsonOptions` ensures minimal API model binding handles string enums -- Validation methods should accept `ILogger` and log rejection reasons before returning error objects - -**Test results:** 198/198 tests pass, 0 regressions - -### PR #11 Code Review Fixes — 8 Copilot Findings (2026-04-01) - -**What was done:** Fixed 8 legitimate findings from Copilot code review on PR #11. - -**Fix 1 — ChargebackCalculator cache refresh timestamp:** -- `Services/ChargebackCalculator.cs` — Moved `_lastCacheRefresh = DateTime.UtcNow` AFTER successful Redis read. Added `_refreshInProgress` volatile flag for stampede prevention while allowing retry on failure. - -**Fix 2 — UsagePolicyStore OperationCanceledException:** -- `Services/UsagePolicyStore.cs` — Added `catch (OperationCanceledException) { throw; }` before general catch. Shutdown/cancellation no longer silently falls back to defaults. - -**Fix 3 — AuditLogWriter deterministic ID collision:** -- `Models/AuditLogItem.cs` — Added nullable `CorrelationId` property. -- `Models/LogIngestRequest.cs` — Added nullable `CorrelationId` property. -- `Services/AuditLogWriter.cs` — Included `CorrelationId` in SHA256 hash input for deterministic IDs. -- `Endpoints/LogIngestEndpoints.cs` — Flows `CorrelationId` from ingest request to audit item. -- Both APIM policy files — Added `["correlationId"] = context.RequestId.ToString()` to outbound payload. - -**Fix 4 — RedisToCosmossMigrationService typo:** -- Renamed file and class from `RedisToCosmossMigrationService` → `RedisToCosmosMigrationService` (removed double-s). -- Updated `Program.cs` and `ChargebackApiFactory.cs` references. - -**Fix 5 — APIM JToken.Parse guard:** -- `policies/subscription-key-policy.xml`, `policies/entra-jwt-policy.xml` — Wrapped `JToken.Parse` calls in inline try/catch. On parse failure, falls back to storing raw string value. - -**Fix 6 — RoutingPolicyValidator RequestedDeployment validation:** -- `Services/RoutingPolicyValidator.cs` — Now validates `RequestedDeployment` is non-empty and a known Foundry deployment. -- `Endpoints/RoutingPolicyEndpoints.cs` — Same validation in `ValidateDeployments`. -- Updated 2 tests to match new error counts. - -**Fix 7 — CachedRepository parallel cache writes:** -- `Services/CachedRepository.cs` — Changed sequential `await TryCacheEntity` loop to `Task.WhenAll` for parallel Redis writes. - -**Fix 8 — ChargebackApiFactory hosted service removal:** -- `ChargebackApiFactory.cs` — Changed `ReturnType == typeof(T)` to `ReturnType?.IsAssignableFrom(typeof(T))` for matching factory-registered services. - -**Test results:** 198/198 tests pass, 0 regressions - -### Phase 3 — PurviewGraphClient: Own Graph REST Client (2026-04-01) - -**What was done:** Replaced the `EmitCoreAsync` SDK placeholder with a real implementation -that calls the Microsoft Graph REST API directly. Built because `Microsoft.Agents.AI.Purview` -rc6 keeps all content-processing types (`PurviewClient`, `IScopedContentProcessor`, -`ScopedContentProcessor`) as `internal sealed` — unreachable from outside the assembly. - -**Key files created:** -- `Services/PurviewGraphClient.cs` — `internal sealed` class calling three Graph endpoints: - content activities (audit), processContent (DLP), protectionScopes (scope gate) -- `Services/PurviewModels.cs` — Our own internal DTOs with `JsonPropertyName` for Graph API - serialization conventions (camelCase + `@odata.type` discriminators on polymorphic fields) - -**Key files modified:** -- `Services/PurviewAuditService.cs` — `EmitCoreAsync` fully implemented: decodes JWT claims - (OID → userId), sends UploadText + DownloadText content activities, evaluates DLP block - verdict when `blockEnabled=true`; constructor gains `IHttpClientFactory?` param (optional, - backward-compatible) -- `Services/PurviewServiceExtensions.cs` — Added `services.AddHttpClient("PurviewGraphClient")` - and passes factory to `PurviewAuditService` -- `Chargeback.Api.csproj` — Added `InternalsVisibleTo("Chargeback.Tests")` for unit test access -- `Chargeback.Tests/PurviewServiceTests.cs` — Updated block verdict test to reflect real behavior; - implemented two previously-skipped stubs (JWT decoding + @odata.type serialization) - -**Patterns established:** -- `PurviewGraphClient` is `internal` — an implementation detail, never in DI -- JWT claims decoded locally: `oid` = userId, `tid` = tenantId, `appid`/`azp` = clientId -- HTTP errors mapped to SDK exception types via `EnsureSuccess()` so the existing retry - ladder in `EmitWithRetryAsync` handles them correctly -- `HttpClient` lifecycle: injected via `IHttpClientFactory`, not owned by `PurviewGraphClient` -- Per-event `PurviewSettings` (with `AppName = ClientDisplayName`) respected — - a new `PurviewGraphClient` instance is constructed per event, sharing the pooled HttpClient -- Migration path documented in XML doc comments - -**SDK surface facts (rc6):** -- `PurviewRequestException` has `StatusCode` property of type `System.Net.HttpStatusCode` -- `PurviewLocationType` enum values: `Uri`, `Application`, `Domain` (no `Name` variant) -- Exception constructors: `PurviewRequestException` has `ctor(HttpStatusCode, string endpointName)` - -**Test results:** 210/212 tests pass (2 skipped — require `IPurviewGraphClient` injection seam), 0 regressions - -### Purview Content Check Implementation — Two-Phase DLP (2026-04-11) - -**What was done:** Implemented synchronous content-check precheck for Purview DLP blocking. -This is the first phase of the two-phase flow: -- **Phase 1 (request time):** Check prompt against DLP policy BEFORE forwarding to OpenAI -- **Phase 2 (response time):** Emit audit events AFTER the AI responds (already implemented as `EmitAuditEventAsync`) - -**Key files created:** -- New public record `PurviewContentCheckResult` in `Services/PurviewModels.cs` — carries blocking verdict and message -- New POST endpoint `/api/content-check/{clientAppId}/{tenantId}` in `Endpoints/PrecheckEndpoints.cs` — receives raw prompt body from APIM, evaluates against policy, returns 451 (Unavailable For Legal Reasons) if blocked - -**Key files modified:** -- `Services/IPurviewAuditService.cs` — Added `CheckContentAsync` interface method for synchronous DLP evaluation -- `Services/PurviewAuditService.cs` — Implemented `CheckContentAsync` with: - - Immediate fail-open when `_blockEnabled = false` - - 5-second timeout via `CancellationTokenSource` - - Synchronous Graph API calls: GetTokenInfoAsync → GetProtectionScopesAsync → ProcessContentAsync - - Silent-fail on ALL exceptions (catch-all with warning log, returns `{ IsBlocked = false }`) - - Per-event PurviewSettings construction with `clientDisplayName` as AppName -- `Services/NoOpPurviewAuditService.cs` — Added stub `CheckContentAsync` returning `{ IsBlocked = false }` - -**Patterns established:** -- Content-check is fail-open by design — if Purview is down/slow/misconfigured, the request proceeds -- 5-second timeout prevents Purview latency from blocking the precheck hot path -- 451 status code (Unavailable For Legal Reasons) is conventional for content filtering blocks -- Falls back to `clientAppId` as display name when client assignment not found (never blocks on missing client record) -- Blocking message defaults to "Content blocked by policy" when `settings.BlockedPromptMessage` is null - -**Test results:** 198/198 tests pass, 0 regressions - -### 2026-04-17 — Real Agent365 SDK Scope Integration (Complete) - -**What was done:** Replaced all Agent365 Observability stubs with real SDK scope calls using Microsoft.Agents.A365.Observability.Runtime v0.1.75-beta. Full implementation of `InvokeAgentScope.Start()` and `InferenceScope.Start()` with proper parameters. Manual OpenTelemetry configuration added. All scope creation wrapped in fail-safe try/catch blocks. - -**Key files modified:** -- `src/Chargeback.Api/Services/Agent365ServiceExtensions.cs` — Removed TODO comment, added manual OpenTelemetry config -- `src/Chargeback.Api/Services/Agent365ObservabilityService.cs` — Implemented real scope creation with InvokeAgentScope and InferenceScope - -**Key learnings:** -- Agent365 SDK v0.1.75-beta API differs from documented v1.x versions -- `AddA365Tracing` extension method not available in v0.1.75-beta; manual `AddOpenTelemetry().WithTracing(tracing => tracing.AddSource("Microsoft.Agents.A365.*"))` required -- Namespace conflict: `Azure.Core.Request` vs `Microsoft.Agents.A365.Observability.Runtime.Tracing.Contracts.Request` — resolved with alias `using A365Request = ...` -- Placeholder endpoint URI (`https://apim.example.com`) acceptable for APIM scenario without fixed agent endpoint -- Fail-safe design (null returns on exception) allows graceful degradation if observability fails -- Optional parameters (clientDisplayName, correlationId, promptContent) must handle null safely -- SDK API may change in future versions; fail-safe design provides buffer - -**Test results:** 235 tests pass (231 pass, 4 documented skips), 0 regressions - -**Dependencies resolved:** -- Namespace conflicts handled with explicit aliasing -- All optional SDK parameters safely handled -- Disabled observability (ENABLE_A365_OBSERVABILITY_EXPORTER=false) remains no-op -- Real scope creation verified (IDisposable non-null on success, null on failure) - -### Phase 1 — Microsoft Agent365 Observability SDK Integration (2026-04-XX) - -**What was done:** Added Agent365 Observability SDK (v0.1.75-beta) alongside existing Purview DLP. Pure additive integration — no replacements. Instrumented precheck and log ingest endpoints with scope-based tracing using manual instrumentation pattern. - -**Key files created:** -- Services/Agent365ObservabilityService.cs — service wrapper for A365 scope creation: - - IAgent365ObservabilityService interface with StartInvokeAgentScope and StartInferenceScope methods - - Agent365ObservabilityService concrete implementation (currently stub pending SDK API stabilization) - - NoOpAgent365ObservabilityService for when A365 is disabled -- Services/Agent365ServiceExtensions.cs — DI registration extension: - - AddAgent365Observability(this IHostApplicationBuilder) extension method - - Opt-in via ENABLE_A365_OBSERVABILITY_EXPORTER env var (default: false) - - Registers singleton IAgent365ObservabilityService - -**Key files modified:** -- Directory.Packages.props — Added Microsoft.Agents.A365.Observability and Microsoft.Agents.A365.Observability.Runtime at v0.1.75-beta -- Chargeback.Api.csproj — Added package references for A365 SDK -- Program.cs — Called builder.AddAgent365Observability() after builder.AddServiceDefaults() -- Endpoints/PrecheckEndpoints.cs — Instrumented both precheck handlers: - - Added IAgent365ObservabilityService DI parameter - - Wrapped handlers with StartInvokeAgentScope using disposal pattern - - Extracted X-Correlation-ID header for conversation tracking -- Endpoints/LogIngestEndpoints.cs — Instrumented log ingest: - - Added IAgent365ObservabilityService DI parameter - - Wrapped log processing with StartInferenceScope after client auth check - -**Patterns established:** -- Lightweight identity: Uses ClientAppId as gen_ai.agent.id — no Agentic User provisioning required -- Scope: Precheck + Log Ingest endpoints only (not config CRUD) -- Host tenant scoped: If host has A365 configured via env var, it is on globally (no per-client config) -- Local testing: A365 uses OpenTelemetry natively — visible in Aspire Dashboard -- Exporter: Opt-in via ENABLE_A365_OBSERVABILITY_EXPORTER env var -- Scopes are IDisposable — callers use disposal pattern -- Service methods return null when SDK not configured (callers null-check before use) -- SDK version 0.1.75-beta has limited public API surface — implementation is currently stub with TODO markers - -**Decisions:** -- SDK v0.1.75-beta lacks documented scope creation APIs that exist in newer versions (0.2.x+) -- Implemented minimal stub service that compiles and integrates into DI/endpoint flow but does not yet create actual telemetry scopes -- Added TODO comments marking where full implementation will go once SDK stabilizes -- All existing tests (221 passed, 4 skipped) remain green — zero regressions - -**Test results:** 221/221 tests pass (4 skipped), 0 regressions - -### Phase 5 — A365 Observability Integration Phase 1 (2026-04-17) - -**What was done:** Integrated Microsoft Agent365 Observability SDK v0.1.75-beta as additive observability layer. Added `IAgent365ObservabilityService` interface, `Agent365ObservabilityService` stub implementation, and `NoOpAgent365ObservabilityService` no-op fallback. Instrumented Precheck (Precheck + ContentCheck) and LogIngest endpoints with scope-based tracing. Scope creation methods: `StartInvokeAgentScope`, `StartInferenceScope`. Correlation ID extraction from `X-Correlation-ID` header. DI registration via `AddAgent365Observability()` extension with `ENABLE_A365_OBSERVABILITY_EXPORTER` toggle (default: false). Opt-in exporter prevents accidental production enablement. - -**Key decisions:** -- Lightweight identity: `ClientAppId` as `gen_ai.agent.id` (no M365 Agentic User provisioning) -- Scope coverage: Precheck + LogIngest hot path only (config CRUD excluded) -- Host-level config: Global toggle via env var (not per-client) -- Local testing via Aspire Dashboard (A365 SDK uses OTel natively) -- Exporter opt-in via env var (dev/test uses OTel-only, production enables A365 when ready) -- SDK v0.1.75-beta (latest on NuGet; newer APIs not yet published) - -**Design rationale:** -- Additive: No replacements or breaking changes to existing OpenTelemetry + Purview DLP -- Lightweight: ClientAppId already uniquely identifies clients; no extra provisioning needed -- Focused: Precheck and LogIngest are high-value signal paths; config CRUD is admin operations -- Phased: Env var toggle allows staged rollout and dev/test flexibility -- Future-proof: Service abstraction (`IAgent365ObservabilityService`) supports easy implementation swap once SDK matures - -**Implementation notes:** -- `IAgent365ObservabilityService` defined with `StartInvokeAgentScope` and `StartInferenceScope` methods returning `IDisposable?` -- `Agent365ObservabilityService` concrete stub logs scope creation at Trace level but does not create actual spans (TODO markers for SDK maturation) -- `NoOpAgent365ObservabilityService` used when `ENABLE_A365_OBSERVABILITY_EXPORTER` not set -- Integration points: DI registration, PrecheckEndpoints (both Precheck + ContentCheck), LogIngestEndpoints (post-auth IngestLog) -- Correlation ID extracted from `X-Correlation-ID` request header for conversation tracking - -**Test results:** 225 tests (221 pass, 4 documented skips), 0 regressions. All existing tests remain green. - -**Future work:** -- Monitor SDK releases for v0.2.x+ with stable scope creation APIs -- Replace stub with actual scope implementation once APIs available -- Implement token acquisition for A365 exporter (deferred) -- Add integration tests validating span emission (requires testable SDK APIs) -- Consider enforcing correlation ID (currently optional with GUID fallback) - -### Agent365 Observability — Replaced Stubs with Real SDK Calls (2026-04-11) - -**What was done:** Replaced all stub implementations in Agent365 observability service with real SDK scope creation calls. The Microsoft.Agents.A365.Observability SDK 0.1.75-beta APIs are now fully integrated. - -**Key files modified:** -- `Services/Agent365ServiceExtensions.cs` — Removed TODO comments and stub markers. Added OpenTelemetry configuration with A365 source tracing when enabled. Registers real `Agent365ObservabilityService` when `ENABLE_A365_OBSERVABILITY_EXPORTER=true`. -- `Services/Agent365ObservabilityService.cs` — Implemented real `StartInvokeAgentScope` and `StartInferenceScope` methods using SDK types (`InvokeAgentScope.Start`, `InferenceScope.Start`). All scope creation wrapped in try/catch with fail-safe design (returns null on error, logs warning). - -**SDK types used:** -- `InvokeAgentScope.Start(InvokeAgentDetails, TenantDetails, Request?, string? conversationId)` — creates agent invocation scope for request entry points -- `InferenceScope.Start(InferenceCallDetails, AgentDetails, TenantDetails)` — creates inference scope for LLM calls -- `AgentDetails(agentId, agentName)` — lightweight identity using ClientAppId -- `InvokeAgentDetails(AgentDetails, Uri endpoint, string sessionId)` — agent invocation metadata -- `TenantDetails(Guid)` — tenant ID wrapper -- `InferenceCallDetails(operationName, model, providerName, inputTokens, outputTokens)` — LLM call metadata -- `Request(content)` — optional prompt content wrapper - -**Key design decisions:** -- **Fail-safe observability:** All scope creation returns null on any exception. Observability failures must NEVER break request flow. -- **SDK version 0.1.75-beta specifics:** InferenceScope.Start takes 3 positional arguments (InferenceCallDetails, AgentDetails, TenantDetails), not the documented newer API. Adapted to match actual package API surface. -- **OpenTelemetry integration:** Configured OTel with A365 source tracing (`Microsoft.Agents.A365.*`). SDK's internal exporter is automatically picked up. -- **Placeholder endpoint:** Using `https://apim.example.com` as endpoint in InvokeAgentDetails (SDK requires URI, but our APIM scenario doesn't have a fixed agent endpoint). - -**Patterns followed:** -- Alias `A365Request` to resolve namespace conflict with `Azure.Core.Request` -- All constructors use named parameters for clarity -- Tenant ID parsed to Guid (service validates it's valid earlier in pipeline) -- Correlation ID used for both sessionId and conversationId -- No-op service remains unchanged (returns null for all methods when disabled) - -**Test results:** 231/231 tests pass (4 skipped), 0 regressions. Build clean. - -**Decision:** Observability is production-ready. No more TODOs or stub comments. The service will emit real Agent365 telemetry when enabled. - ---- - -### 2026-05-14 — Cross-Agent Note: azd Terraform Provider Configuration - -**From:** Sydnor (Infra/DevOps) -**Note:** When using Terraform with Azure Developer CLI (azd), the zure.yaml file must explicitly declare an infra: provider block pointing to the terraform module. If omitted, azd defaults to Bicep and looks for infra/main.bicep, which will fail if Terraform is the actual IaC provider. Example config: - -\\\yaml -infra: - provider: terraform - module: infra/terraform -\\\ - -This applies to any project mixing IaC tools or migrating from Bicep to Terraform. - -### 2026-05-14 — Cross-Agent Note: Infrastructure Changes Must Be Validated Before Commit - -**From:** Zack Way (User directive captured by Scribe) -**Note:** When fixing infrastructure/deployment errors, **always validate fixes by running the relevant `azd` command** (e.g., `azd provision --preview`, `azd up`) **BEFORE committing**. Do not write commits with unvalidated infrastructure changes. This keeps the commit tree clean of speculative/bad infrastructure history and ensures only known-working fixes enter the codebase. - -**Application:** All agents working on infrastructure, deployment, or orchestration. Sydnor validated the Terraform tfvars fix via `azd provision --preview` before the orchestration log was written. - -### 2026-05-14T16:22:25Z — Cross-Agent Learning: Large azd + Terraform Deployment Pattern - -**From:** Scribe (based on Sydnor's successful execution) - -**Pattern Validated:** -- `azd up` with 77+ Azure resources succeeds in ~9m59s when auth alignment is correct (azd + az CLI on same tenant) -- Longest pole is always Redis Enterprise (~6m22s for this deployment) -- Terraform dependency graph executes efficiently; no manual intervention needed -- APIM policies depend on Container App URL availability; azd handles ordering automatically -- Parallel provisioning: container image builds while infrastructure resources provision - -**Key Learning for Backend/Frontend Agents:** -When your infrastructure changes are deployed during azd up: -1. All endpoints become available in dependency order. Core services (Key Vault, Managed Identity) come first, then data layer (Redis, Cosmos), then compute (Container App, APIM). -2. Role assignments are applied post-compute, so service-to-service auth (app → Cosmos, APIM → Key Vault) succeeds only after full provisioning. -3. APIM policies reference Container App URLs via named values. These become available only after Container App resource creation completes. -4. Your code must handle graceful degradation if services aren't fully initialized yet (fail-open patterns are safer than fail-closed). - -**Captured in Skill:** `.squad/skills/azd-terraform-large-deployment/SKILL.md` — Full guide for auth alignment, provider configuration, timing, troubleshooting, validation patterns. +- `Azure.ResourceManager.ApiManagement` 1.3.x works cleanly with `ArmClient` + `DefaultAzureCredential`, but the APIM resource handle should be created lazily from `Apim:ResourceId` so unrelated app startup/tests do not fail when APIM is unconfigured. Read policy XML with `PolicyExportFormat.RawXml` and write with `PolicyContentFormat.RawXml` to preserve round-trippable XML instead of fragment-expanded output. +- Template loading is safest as a repo-shipped library under `policies/templates/{id}/` with `policy.xml` + `template.json`. Validate manifests against placeholders discovered by regex before serving them, then render with exact `{{Placeholder}}` replacement, normalize typed parameters (`string`, `int`, etc.), reject unknown/unfilled inputs, and parse the rendered XML to confirm a `` root before any apply call. +- Async apply is better implemented as an in-process channel + `BackgroundService` than ad-hoc `Task.Run` from endpoints. Endpoints persist the desired assignment as `pending`, enqueue a scope work item, and return 202 immediately; the worker flips to `applying`, re-renders from stored parameters, applies through the SDK, computes `generatedXmlHash` on success, and records `failed/errorMessage` on exceptions. Startup replay of `pending`/`applying` items should be best-effort so tests or partial environments do not stop the host. +- For Bunk: the APIM seams are now interface-first (`IApimCatalogService`, `ITemplateLibraryService`, `IPolicyAssignmentRepository`, `IApimPolicyApplyService`) and the worker logic is isolated in `ApimPolicyApplyService.ProcessAssignmentAsync`. Unit tests can exercise template rendering and apply orchestration without live Azure; recorded/live APIM coverage should focus on `ApimCatalogService` method mappings and the raw-XML policy format behavior. \ No newline at end of file diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index 246aabe5..71b54ca3 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -12,184 +12,26 @@ ## Learnings - +*Core learnings consolidated in Core Context section above (see git history for detailed entries).* -### 2026-03-31 — Phase 0 Complete: Backend Storage Architecture Established +## Archived Learnings (Pre-May 2026) -**Phase 0 Status:** ✅ COMPLETE (Freamon + Bunk) +All development work from Phase 0–3 (2026-03-31 to 2026-05-14) is documented in Core Context and git commit history. Key achievements: +- Phase 0: Cosmos + Redis storage architecture +- Phase 1: Model routing policies + multiplier billing +- Phase 2: Agent365 Observability integration +- Phase 3: APIM policy variants and infrastructure +- Infrastructure: Terraform + azd deployment (77 resources) -The backend storage architecture has been refactored from Redis-only to a durable CosmosDB source-of-truth pattern with Redis as a write-through cache. This is the foundational layer for all upcoming work (routing, pricing, policy enhancements). +For detailed work items, see: +- .squad/decisions.md — architectural decisions +- .squad/orchestration-log/ — agent completion logs +- git log --oneline — implementation history -**Key Implications for Frontend:** -- **Backend API contracts unchanged** — All endpoint signatures remain the same. The refactoring is internal (storage layer only). -- **Data durability improved** — Configuration data (plans, clients, pricing, routing policies) now survives Redis restarts and evictions. -- **Performance unchanged** — Redis remains the read cache; startup is now slightly slower due to cache warming, but request latency is identical. -- **New Repositories Pattern** — Future frontend changes will interact with the same API endpoints, which now use `IRepository` abstraction instead of direct Redis. +## 2026-05-21 — APIs management UI (M4) -**What Kima Needs to Know:** -- Phase 1 (Model Routing) will add new fields to the precheck response: `routedDeployment` (the actual deployment after routing is applied). -- Future billing UI will need to adapt based on plan configuration (Phase 2–3 multiplier pricing work). -- No frontend code changes required for Phase 0 — backend refactoring only. -- Phase 0 completes the architectural debt fix; Phase 1 onwards adds new features without storage concerns. - -**Test Results:** 129/129 tests pass (36 new Phase 5 tests for repositories/migration/warmup). - -### 2026-03-31 — Phase 1 Complete: Backend API Stable for Phase 4 Frontend Work - -**Phase 1 Status:** ✅ COMPLETE (Freamon + Bunk) - -Backend storage architecture, model routing, and multiplier pricing are complete. All API contracts finalized. Ready for frontend adaptive UI implementation. - -**What Kima Needs to Know for Phase 4:** - -- **Backend is Stable:** No breaking changes planned. All routing + pricing features are finalized and tested (200/200 tests pass). -- **New Precheck Response Fields:** The precheck endpoint now returns `routedDeployment` (actual deployment after routing), `requestingDeployment`, and `routingPolicyId`. Frontend can use these for diagnostic dashboards. -- **Request Summary Export Ready:** New endpoints available: - - `GET /api/chargeback/request-summary?clientId=...&startDate=...&endDate=...` — query request usage by period - - `GET /api/export/request-billing?format=csv` — download request billing data -- **Multiplier Billing UI:** Plans now have `UseMultiplierBilling` flag. Dashboard should adapt: - - If ALL plans use multiplier → show only request-based views (no token UI) - - If ALL plans use token → show only token-based views (no multiplier UI) - - If MIXED → show hybrid view (both models visible) - - Applies to dashboards, usage views, client detail pages, export options -- **Tier Tracking:** Clients now track `RequestsByTier` (e.g., Standard, Premium, Enterprise). Dashboard can show per-tier breakdowns and cost analysis. -- **Request Utilization:** ClientUsageResponse includes `CurrentPeriodRequests`, `OverbilledRequests`, and `RequestUtilizationPercent`. Dashboard can show quota usage and overage alerts. -- **Backward Compat:** All new fields are nullable. Existing dashboard code continues to work. New UI is additive. - -**Ready for Phase 4 Frontend Development:** -- Plan response includes `UseMultiplierBilling`, `MonthlyRequestQuota`, `OverageRatePerRequest` -- Client detail response includes `CurrentPeriodRequests`, `OverbilledRequests`, `RequestsByTier`, `RequestUtilizationPercent` -- Model pricing includes `Multiplier`, `TierName` -- Export endpoints ready for download functionality -- No API breaking changes — pure feature additions - -**Test Results:** 200/200 tests pass (30 new Phase 2 integration tests from Bunk B5.7 + B5.8). - -### 2026-03-31 — Phase 4 Complete: Frontend UI for Routing & Multiplier Billing - -**Phase 4 Status:** ✅ COMPLETE (Kima K4.1–K4.9) - -All frontend work for model routing policies and multiplier billing is implemented. Build passes, lint clean (no new issues). - -**What Was Built:** - -- **K4.8 — TypeScript types:** `ModelRoutingPolicy`, `RouteRule`, `RoutingBehavior`, `PlanDataExtended`, `ClientAssignmentExtended`, `ModelPricingExtended`, `RequestSummaryResponse`, `BillingMode` in `types.ts` -- **K4.9 — API client:** CRUD for routing policies, request summary fetch, request billing export download in `api.ts` -- **K4.1 — Routing Policies page:** Full CRUD with rule builder (deployment picker or manual input), default behavior selector, fallback deployment, "used by plans" indicator, delete warning -- **K4.2 — Plans page extended:** Routing policy selector, UseMultiplierBilling toggle, MonthlyRequestQuota/OverageRatePerRequest fields (conditionally visible), Billing Mode and Routing Policy columns in table -- **K4.3 — Clients page extended:** Routing policy override selector, effective request usage display with progress bar + tier breakdown badges, routing override column in table -- **K4.4 — Pricing page extended:** Multiplier column (color-coded: green < 1.0, amber > 1.0), TierName column with badges, multiplier/tier fields in create/edit dialog -- **K4.5 — Request Billing dashboard:** KPI cards (total/effective/overbilled/active clients), bar chart by client, donut chart by tier, overage alerts with progress bars, per-client summary table. Adaptive: only visible when multiplier billing plans exist -- **K4.6 — Client detail extended:** Request billing section with quota gauge, overbilled requests card, tier pie chart, requests-by-model table. Only shown when client's plan uses multiplier billing -- **K4.7 — Export page extended:** Request Billing Export card with period selector. Adaptive: only visible when multiplier billing plans exist - -**Adaptive UI Logic (per Zack's directive):** -- `BillingMode` type: `'token' | 'multiplier' | 'hybrid'` -- App.tsx computes billing mode from plan data, passes to Layout -- Layout conditionally shows "Request Billing" nav tab -- RequestBilling page shows empty state when no multiplier plans -- Export shows request billing card only when multiplier plans exist -- ClientDetail shows request billing section only for multiplier-billed plans - -**Architecture Decisions:** -- Extended existing types with `PlanDataExtended`, `ClientAssignmentExtended`, `ModelPricingExtended` to avoid breaking existing code -- No new dependencies — reused Recharts, Lucide, Tailwind, existing component library -- Followed existing patterns: `useCallback` data loading, `authFetch` wrapper, Card/Table/Badge/Dialog components -- Routing Policies is always visible (routing is useful regardless of billing mode) - -**Build Results:** `tsc -b && vite build` succeeds. Lint: 9 pre-existing errors, 0 new. - -### 2026-03-31 — Session Complete: All 5 Phases Delivered - -**Project Status:** ✅ COMPLETE - -All work is done. Phase 0 (storage), Phase 1 (routing + pricing), Phase 2 (enforcement), Phase 3 (APIM policies), Phase 4 (frontend UI), Phase 5 (testing) all complete. 222 tests passing. Full end-to-end system operational. - -**Kima's Contributions:** -- Phase 4 (K4.1–K4.9): Frontend UI for model routing policies and multiplier billing, adaptive billing dashboards, routing policy CRUD page, request billing exports - -**What's Ready for Deployment:** -- React frontend with all routing and billing UI components -- Adaptive UI logic: billing mode computed from plan configuration (token/multiplier/hybrid) -- Full CRUD for routing policies, detailed client billing views, tier-based analytics -- Integration with all backend API endpoints -- TypeScript strict mode compliant, no new linting issues - -**User Experience:** -- Dashboard auto-adapts based on billing configuration -- Routing policies fully manageable from UI -- Request billing tracking with per-client, per-tier analytics -- Export functionality for billing data - -**Next Phase (Future):** -- Advanced policy engine UI for enforced model rewrites -- Custom dashboard builder -- Audit log UI for policy change history - -### 2026-03-31 — Code Review Fix Pass: 5 Findings Resolved - -**Context:** McNulty reviewed the frontend and flagged 5 issues (2 bugs, 3 suggestions). All fixes implemented, tsc + vite build clean. - -**What Changed:** - -1. **B2 — billingPeriod type mismatch:** `RequestSummaryResponse.billingPeriod` changed from `{ year: number; month: number }` to `string` (YYYY-MM format matching backend). `RequestBilling.tsx` now displays the string directly instead of accessing `.month`/`.year`. - -2. **B3 — RouteRule missing fields:** Added `priority: number` and `enabled: boolean` to `RouteRule`. `RoutingPolicies.tsx` rule builder now includes a priority number input (auto-incremented on add) and an enabled checkbox (defaults to true). Existing rule display shows priority badge and enable/disable toggle. - -3. **S5 — ModelPricing base type consolidation:** Added `multiplier` and `tierName` to base `ModelPricing` and `ModelPricingCreateRequest`. Eliminated all `Extended` type variants (`ModelPricingExtended`, `PlanDataExtended`, `ClientAssignmentExtended`) by folding their fields into the base types (`PlanData`, `ClientAssignment`). Removed all `as Extended` casts across 7 component files. - -4. **S6 — Plan request type safety:** Added `modelRoutingPolicyId`, `monthlyRequestQuota`, `overageRatePerRequest`, `useMultiplierBilling` to `PlanCreateRequest` and `PlanUpdateRequest`. Plans.tsx no longer bypasses type checking via object spread. - -5. **S8 — Rich API error messages:** Added `parseErrorMessage()` helper to `api.ts` that extracts `error` or `message` from backend JSON responses. Applied to all 24 API functions. Users now see actionable messages (e.g., "Deployment not allowed") instead of generic "Bad Request". - -**Learnings:** -- Extended types as band-aids accumulate tech debt quickly — fold fields into base types early -- API error parsing should be a shared helper, not copy-pasted per function -- Backend returns structured JSON errors — always parse them for the UI - -**Build Results:** `tsc -b` clean, `vite build` succeeds (2556 modules, 11.5s). - -### 2026-04-01 — Fix #5: Frontend DTO Mismatch (CRITICAL) - -**Issue:** `RequestBilling.tsx` referenced fields that don't exist in the backend `RequestSummaryResponse.cs`. The frontend types (`RequestClientSummary`, `RequestSummaryTotals`) had invented fields (`planName`, `totalRequests`, `effectiveRequests`, `monthlyQuota`, `utilizationPercent`, `overbilledRequests`, `requestsByTier`, `requestsByModel`) that the backend never sends. At runtime this produced empty/NaN values. - -**Root Cause:** Frontend types were authored speculatively during Phase 4 without verifying the backend DTO field names. The C# model uses different naming: `rawRequestCount` not `totalRequests`, `totalEffectiveRequests` not `effectiveRequests`, `multiplierOverageCost` not `overbilledRequests`, `effectiveRequestsByTier` not `requestsByTier`. - -**What Changed:** -1. **types.ts** — Renamed `RequestClientSummary` → `RequestSummaryClient` to match backend class name. Fixed all fields to match `RequestSummaryResponse.cs` exactly: `totalEffectiveRequests`, `effectiveRequestsByTier`, `multiplierOverageCost`, `rawRequestCount`. Removed phantom fields (`planName`, `totalRequests`, `effectiveRequests`, `monthlyQuota`, `utilizationPercent`, `overbilledRequests`, `requestsByModel`). Fixed `RequestSummaryTotals`: `totalRawRequests`, `totalMultiplierOverageCost`, `effectiveRequestsByTier`. -2. **RequestBilling.tsx** — Updated all field accesses to match corrected types. Removed UI columns for non-existent fields (Plan, Quota, Utilization). Changed "Overbilled" KPI to "Overage Cost" (displays dollar amount). Changed overage alerts from utilization-percent-based to cost-based. Removed unused `Progress` import. -3. **api.ts** — No changes needed; already returns correct type. - -**Learnings:** -- Always verify frontend types against the actual backend C# model before building UI. The backend is the source of truth. -- Field naming convention: C# PascalCase serializes to camelCase in JSON by default with System.Text.Json — match those exact camelCase names in TypeScript. -- When backend doesn't provide a computed field (like `utilizationPercent`), don't invent it in the DTO — compute it in the UI from available data, or omit the feature. - -**Build Results:** `tsc -b` clean, `vite build` succeeds (2556 modules, 9.65s). - -### 2026-05-14 — Cross-Agent Note: Infrastructure Changes Must Be Validated Before Commit - -**From:** Zack Way (User directive captured by Scribe) -**Note:** When fixing infrastructure/deployment errors, **always validate fixes by running the relevant `azd` command** (e.g., `azd provision --preview`, `azd up`) **BEFORE committing**. Do not write commits with unvalidated infrastructure changes. This keeps the commit tree clean of speculative/bad infrastructure history and ensures only known-working fixes enter the codebase. - -**Application:** All agents working on infrastructure, deployment, or orchestration. Sydnor validated the Terraform tfvars fix via `azd provision --preview` before the orchestration log was written. - -### 2026-05-14T16:22:25Z — Cross-Agent Learning: Large azd + Terraform Deployment Pattern - -**From:** Scribe (based on Sydnor's successful execution) - -**Pattern Validated:** -- `azd up` with 77+ Azure resources succeeds in ~9m59s when auth alignment is correct (azd + az CLI on same tenant) -- Longest pole is always Redis Enterprise (~6m22s for this deployment) -- Container App deployed and reachable within 9-10m -- APIM policies configured to call precheck/log endpoints post-compute -- All infrastructure outputs available immediately after `azd up` succeeds - -**Key Learning for Frontend/UI Agents:** -When deploying via azd: -1. Your frontend assets are deployed to the Container App's wwwroot/spa directory. Build output must match app directory structure. -2. APIM gateway is configured to route requests through policies before backend. Your APIs see pre-authenticated requests (policy enforces auth). -3. Named values in APIM (like Container App URL) are populated during deployment. If you need dynamic configuration, update it post-deployment. -4. SPA builds should succeed locally before pushing to production; deployment mirrors local build if Dockerfile and vite.config.ts are correct. - -**Captured in Skill:** `.squad/skills/azd-terraform-large-deployment/SKILL.md` — Full guide for auth alignment, provider configuration, timing, troubleshooting, validation patterns. +- Added APIM UI under `src/aipolicyengine-ui/src/pages/Apis.tsx` with dedicated client/types files in `src/api/apim.ts` and `src/types/apim.ts`; keep APIM shapes separate from legacy dashboard DTOs. +- For list/detail admin pages, the current pattern is Tailwind + local state: left tree/list in a `Card`, right details/actions in a second `Card`, dialogs for destructive/assignment flows, and inline fixed-position toast messaging for retryable network failures. +- APIM status polling is UI-driven: after a 202 apply response, set optimistic `applying` state and poll `GET .../policy` every 2 seconds until status leaves `pending`/`applying`. +- Template parameter defaults should prefer the current assignment, then template defaults, and only shared plan-level values; there is no contract yet to map a specific plan to an API assignment, so avoid guessing per-plan defaults. +- The SPA now maps top-level tabs to pathname routes in `App.tsx` (including `/apis`) without adding a router dependency; keep using this lightweight history API pattern unless the app adopts React Router later. \ No newline at end of file diff --git a/.squad/agents/mcnulty/history.md b/.squad/agents/mcnulty/history.md index 4bfebaae..f548765e 100644 --- a/.squad/agents/mcnulty/history.md +++ b/.squad/agents/mcnulty/history.md @@ -40,405 +40,35 @@ All backend features (routing, pricing, observability) complete and tested. Infr ## Learnings - - -**Phase 0 Status:** ✅ COMPLETE (Freamon + Bunk) -- Storage architecture migrated: CosmosDB is now the durable source of truth; Redis is a write-through cache. -- Repository pattern implemented: `IRepository` abstraction with four concrete repositories (`CosmosPlanRepository`, `CosmosClientRepository`, `CosmosPricingRepository`, `CosmosUsagePolicyRepository`). -- `CachedRepository` wrapper enforces write-through semantics (persist to Cosmos first, then update Redis). -- All endpoints refactored to use repositories instead of direct Redis calls. -- Startup migration and cache warming services in place for backward compatibility and performance. -- **Test Results:** 36 new tests written (B5.1–B5.2), 129/129 tests pass, zero regressions. - -**What This Means for Phase 1 Onwards:** -- All future work (routing, pricing, policy enhancements) now builds on stable repositories. -- No more Redis-only data — all configuration data is durable. -- `IRepository` is the extension point for new entities (e.g., `CosmosModelRoutingPolicyRepository` for Phase 1). -- Caching is transparent to callers — endpoint logic unchanged, but storage is now production-safe. - -**Files:** -- New: `Repositories/` (5 files), `Services/RedisToCosmossMigrationService.cs`, `Services/CacheWarmingService.cs`, `Services/RepositoryServiceExtensions.cs` -- Refactored: All endpoints + `Program.cs` + `ConfigurationContainerInitializer.cs` -- Tests: 3 new test files (CachedRepositoryTests, RedisToCosmossMigrationServiceTests, CacheWarmingServiceTests) - -**Architecture v2 Accepted:** -- Decision 1: CosmosDB is source of truth, Redis is cache (Phase 0 — COMPLETE) -- Decision 2: Per-REQUEST multiplier (not per-token) — Phase 2–3 work -- Decision 3: Foundry deployment discovery (no pattern matching) — Phase 1 work -- Decision 4: Rate limits on routed deployment — Phase 1 work - -**Next Phase:** Phase 1 (Model Routing) — Freamon will add `CosmosModelRoutingPolicyRepository` + routing logic at precheck; Bunk will add routing tests. - -### 2026-03-31 — Full Code Review: Phases 0–4 (All Feature Work) - -**Verdict:** CONDITIONALLY APPROVED — 3 blocking, 8 should-fix, 5 nice-to-have. - -**Blocking Issues Found:** -1. **Precheck does NOT enforce `MonthlyRequestQuota`** for multiplier billing plans. Only `MonthlyTokenQuota` is checked. Plans with `UseMultiplierBilling = true` have zero quota enforcement at the APIM gate. Fix: add request quota check in `PrecheckEndpoints.cs`. -2. **Frontend `RequestSummaryResponse.billingPeriod`** type is `{ year, month }` but backend returns string `"YYYY-MM"`. Will crash `RequestBilling.tsx` at runtime. -3. **Frontend `RouteRule`** missing `priority` and `enabled` fields. Users cannot set rule priority or disable rules from the UI. - -**Key Should-Fix Items:** -- Dead code in `Repositories/` directory (duplicates `Services/` with different behavior) — must delete. -- APIM outbound log body uses string interpolation for JSON — injection risk on claims with quotes. -- `ConfigurationContainerProvider.EnsureInitializedAsync` has a race condition (volatile bool insufficient). -- `LogIngestEndpoints` never persists `RoutingPolicyId` — audit trail gap. -- Frontend TypeScript types don't include multiplier billing fields on base types; relies on `Extended` interfaces and runtime type coercion. -- `ChargebackCalculator` pricing cache is not thread-safe. -- Frontend API error handling discards backend error payloads. - -**What's Solid:** -- Repository pattern and write-through cache semantics are correct. -- `RoutingEvaluator` is pure, stateless, and well-tested (priority, enabled, deny, passthrough, fallback). -- Multiplier billing math (`effective_cost = 1 × multiplier`) is correct. -- Authorization model applied consistently. -- 200 tests passing with strong critical-path coverage. -- APIM routing integration (precheck → rewrite URI → deployment-scoped rate limits) works correctly. -- Backward compatibility maintained via nullable fields with defaults. - -**Review output:** `.squad/decisions/inbox/mcnulty-code-review-verdict.md` - -### 2026-03-31 — Deep Architecture Exploration + Feature Design - -**Codebase Architecture:** -- Backend uses **Minimal APIs** (no MVC controllers) — all endpoints in `Endpoints/` directory -- Redis is the **primary runtime store** for plans, clients, pricing, logs, traces, rate limits, usage policy -- CosmosDB stores **audit logs** (`audit-logs` container) and **billing summaries** (`billing-summaries` container), partitioned by `/customerKey` -- `ChargebackCalculator` uses an **in-memory pricing cache** refreshed every 30s from Redis — non-blocking on the request path -- **Precheck endpoint** is the APIM enforcement choke point — checks assignment, plan, quota, rate limits, deployment access -- APIM policies call precheck **inbound**, then log usage **outbound** (fire-and-forget POST to `/api/log`) -- Frontend is **tab-based** (no react-router), state-driven in `App.tsx`, polling for live data (5s/10s intervals) -- Auth: AzureAd JWT bearer with three policies: `ExportPolicy`, `ApimPolicy`, `AdminPolicy` - -**Key Extension Points:** -- `PlanData.AllowedDeployments` / `ClientPlanAssignment.AllowedDeployments` — existing deployment access control -- `ModelPricing` in Redis (`pricing:{modelId}`) — already per-model, extend with multiplier/tier -- `PrecheckEndpoints.cs` — routing decisions go here (add `routedDeployment` to response) -- `ChargebackCalculator` — cost calculation, extend with `CalculateBillingUnits()` -- `AuditLogDocument` / `BillingSummaryDocument` — extend with routing + multiplier fields (additive, nullable) -- `RedisKeys.cs` — centralized key patterns, add `routing-policy:{policyId}` - -**Architecture Decisions Made:** -- Model Routing: new `ModelRoutingPolicy` entity, Redis-backed, attached to plans via `ModelRoutingPolicyId` -- Multiplier Pricing: extend `ModelPricing` with `Multiplier` + `TierName`, extend `PlanData` with unit quotas -- Both features converge at precheck — routing decides *where*, pricing decides *how much* -- All changes are additive/backward-compatible — `UseMultiplierBilling` flag for gradual migration -- No new storage systems — Redis for runtime config, CosmosDB for audit (existing containers) -- Proposal written to `.squad/decisions/inbox/mcnulty-model-routing-pricing-architecture.md` -- Revised proposal (v2) written to `.squad/decisions/inbox/mcnulty-architecture-v2.md` - -### 2026-03-31 — Four Design Decisions (from Zack Way) & Architecture v2 - -**Decision 1: CosmosDB is Source of Truth, Redis is Cache Only** -- All configuration data (plans, clients, pricing, routing policies, usage policy) MUST persist to CosmosDB. Redis is ONLY a write-through cache. -- Architectural implication: New repository pattern (`IRepository` → Cosmos persistence → `CachedRepository` Redis wrapper). New `configuration` Cosmos container. One-time migration service (Redis → Cosmos on startup). Cache warming service. All endpoint refactoring to use repositories instead of direct Redis calls. -- This is the largest body of work (Phase 0) and must complete before feature work. - -**Decision 2: Per-REQUEST Multiplier (not per-token)** -- `effective_cost = 1 × model_multiplier` per request. GPT-4.1 = 1.0x, GPT-4.1-mini = 0.33x. -- Architectural implication: Simpler calculator logic — no token division. `MonthlyRequestQuota` replaces `MonthlyUnitQuota`. `CurrentPeriodRequests` replaces `CurrentPeriodUnits`. All "unit" terminology changed to "effective requests". - -**Decision 3: Foundry Deployment Discovery (no pattern matching)** -- Routing maps to specific known deployments from Foundry. No globs, no regex. -- Architectural implication: `RouteRule.RequestedDeployment` is exact match only. All `RoutedDeployment` values validated against `IDeploymentDiscoveryService.GetDeploymentsAsync()` on create/update. Existing `DeploymentDiscoveryService` is the integration point. - -**Decision 4: Rate Limits on Routed Deployment** -- RPM/TPM limits apply to the routed deployment (what hits the backend), not the requested model. -- Architectural implication: Rate limit Redis keys include deployment ID. New key pattern: `ratelimit:rpm:{client}:{tenant}:{deploymentId}:{window}`. Precheck evaluates rate limits AFTER routing resolution. - -**File Paths:** -- Models: `src/Chargeback.Api/Models/` (PlanData.cs, ClientPlanAssignment.cs, ModelPricing.cs, AuditLogDocument.cs, BillingSummaryDocument.cs) -- Endpoints: `src/Chargeback.Api/Endpoints/` (PrecheckEndpoints.cs, PricingEndpoints.cs, PlanEndpoints.cs, etc.) -- Services: `src/Chargeback.Api/Services/` (ChargebackCalculator.cs, RedisKeys.cs, AuditStore.cs, AuditLogWriter.cs) -- APIM Policies: `policies/subscription-key-policy.xml`, `policies/entra-jwt-policy.xml` -- Frontend types: `src/chargeback-ui/src/types.ts` -- Frontend API client: `src/chargeback-ui/src/api.ts` -- Aspire orchestration: `src/Chargeback.AppHost/AppHost.cs` - -### 2026-03-31 — Code Review Complete: All 11 Findings Fixed (APPROVED) - -**Status:** COMPLETE ✅ - -McNulty's comprehensive code review of Phases 0–4 delivered 11 findings: -- **3 Blocking (Critical):** B1 (precheck quota), B2 (type mismatch), B3 (missing fields) -- **8 Should-Fix (Important):** S1–S8 (security, race conditions, type safety, error handling) -- **5 Nice-to-Have (Future):** N1–N5 (minor optimizations) - -**All 11 Findings Now Fixed:** -- **Freamon (Backend):** Fixed B1, S1, S2, S3, S4, S7 (6 fixes) -- **Kima (Frontend):** Fixed B2, B3, S5, S6, S8 (5 fixes) - -**Test & Build Results:** -- Backend: 198/198 tests pass (0 regressions; -22 from deleted dead tests) -- Frontend: tsc clean, vite build clean, 0 new linting issues -- No architectural changes — all fixes are bug corrections + code cleanup - -**Production Readiness:** APPROVED FOR MERGE -- Security: JSON injection fixed, audit trail complete, cache thread-safe -- Reliability: Precheck enforces quotas, race conditions eliminated -- Code Quality: Dead code removed, type safety improved, error messages actionable -- Backward Compatibility: Maintained (all new fields nullable with defaults) - -**Next Phase:** Deploy backend + frontend together. Schedule N1–N5 optimizations for future sprint. - -**Decision:** Merged review findings into `.squad/decisions.md`. Ready for production deployment. - -### 2026-03-31 — Re-Review Complete: All 11 Findings Verified ✅ - -**Status:** APPROVED — independent verification of all 11 fixes. - -McNulty re-reviewed every file touched by Freamon (6 backend) and Kima (5 frontend). All fixes are correctly implemented, no regressions, no new issues introduced. - -**Key Verification Points:** -- B1: Multiplier request quota enforcement is in the right place in PrecheckEndpoints.cs (after token check, before rate limits). Returns 429 with correct field names. -- B2: `billingPeriod` is `string` in types.ts. RequestBilling.tsx renders it as-is — no `.year`/`.month` access. -- B3: `RouteRule` has `priority: number` + `enabled: boolean`. RoutingPolicies.tsx form has both editable priority input and enabled toggle. -- S1: `Repositories/` directory confirmed deleted. Zero namespace references remain. -- S2: Both APIM policies use `JObject` construction — zero string interpolation in outbound log body. -- S3: `SemaphoreSlim(1,1)` with proper double-check locking. No `volatile bool`. -- S4: `routingPolicyId` flows end-to-end: precheck response → APIM capture → log payload → `LogIngestRequest` → `AuditLogItem` → `AuditLogDocument`. -- S5: `ModelPricing` has `multiplier` + `tierName` on the base type. No `ModelPricingExtended` band-aid. -- S6: `PlanCreateRequest` and `PlanUpdateRequest` both include all 4 multiplier billing fields. -- S7: `lock(_cacheLock)` with double-check pattern. Timestamp set inside lock, Redis read outside. No bare access. -- S8: `parseErrorMessage` helper applied consistently to all 27 API functions. - -**Verdict:** `.squad/decisions/inbox/mcnulty-rereview-verdict.md` - -### 2026-04-01 — Full Codebase Review: Pre-Ship Quality Audit - -**Status:** CONDITIONALLY APPROVED for preview. Fix CRITICALs before GA. - -**Scope:** Entire product — every file in every layer. Not a diff review; a full product audit requested by Zack Way before shipping. - -**Findings: 47 total (11 CRITICAL, 20 IMPORTANT, 16 IMPROVEMENT)** - -**Critical Issues Found:** -1. **CORS allows any origin** (Program.cs:136) — exploitable with JWT auth -2. **AuditStore initialization race** — volatile bool without semaphore, unlike ConfigurationContainerProvider which was correctly fixed -3. **APIM Contributor on entire RG** (main.bicep:322) — can delete any resource -4. **HTTP allowed on Chargeback API** (apimFuncApi.bicep:34) — cleartext JWT tokens -5. **No subscription requirement** on Chargeback API in APIM (apimFuncApi.bicep:37) -6. **Cosmos local auth enabled** (cosmosAccount.bicep:15) — keys bypass RBAC -7. **ACR credentials in plain secret** (containerApp.bicep:85-97) — not Key Vault -8. **APIM on-error leaks internals** — ErrorSource, ErrorPolicyId headers exposed -9. **JWT validation accepts any tenant** — no issuer/scope restriction -10. **Frontend type mismatches** — RequestSummaryResponse field names don't match backend -11. **Health checks only in Development** — production Container App has no /health - -**Key Patterns Found Across Layers:** -- Infrastructure Bicep modules have inconsistent security posture: Redis is excellent (TLS, Entra-only), but Key Vault uses legacy access policies, Storage allows public blobs, Cosmos has keys enabled -- Thread safety was correctly fixed in ChargebackCalculator (lock + double-check) and ConfigurationContainerProvider (SemaphoreSlim), but AuditStore was missed -- Frontend types drifted from backend after multiplier billing feature — several interfaces have wrong field names -- Test coverage has gaps in newer endpoint groups (RoutingPolicy, Deployment, RequestBilling CRUD endpoints) -- Documentation has wrong repo URLs in 3 files and contradictory TTL values - -**What's Solid:** -- Repository pattern, write-through caching, routing evaluator, billing math, authorization model, Redis security, managed identity usage, Aspire orchestration, 198-test suite, backward compatibility - -**Review output:** `.squad/decisions/inbox/mcnulty-full-codebase-review.md` - -### 2026-04-01 — Product Rebrand: README & Documentation Rewrite - -**Status:** COMPLETE ✅ - -McNulty rewrote README.md and updated all associated documentation to reflect the new product identity and all features built during Phases 0–4. - -**What Changed:** - -1. **Product Renaming**: "Azure API Management OpenAI Chargeback Environment" → **"Azure AI Gateway Policy Engine"** - - Emphasizes APIM-based AAA (Authentication, Authorization, Accounting) for AI workloads - - Telecom/RADIUS heritage positioning - - Reflects the policy engine architecture (not just chargeback) - -2. **README.md Completely Rewritten** (590 lines): - - TL;DR clarifies the three pillars: durability (CosmosDB source of truth), routing (auto-router), and billing (multiplier pricing) - - New "The Problem We Solve" table: 9 challenges with solutions addressing new features - - Expanded Architecture section with detailed decision flow (precheck → routing → rate limit → cost) - - **New Key Features Section** (subsections): - - 🔐 Authentication & Authorization at the Gate - - 🚀 Intelligent Model Routing (Auto-Router) with three modes - - 💰 Per-Request Multiplier Pricing (GHCP-style) with examples - - 🗄️ CosmosDB Source of Truth + Redis Cache (write-through pattern) - - 📊 Adaptive Billing Dashboard (token/multiplier/hybrid modes) - - 📋 Bill-Back Reporting (per-client, tier breakdown, CSV export) - - ⚡ APIM Policy Enforcement at the Gateway - - 🧪 Comprehensive Test Suite (198+ tests) - - 🏗️ Production-Ready Infrastructure - - Updated Dashboard section with routing policies page - - New API Endpoints table (18 endpoints including routing, billing, export) - - Added note about internal "Chargeback" naming and pending rename - -3. **Documentation Updates**: - - `docs/ARCHITECTURE.md`: Updated product name, added CosmosDB architecture, detailed request flow with routing & multiplier billing decisions - - `docs/DOTNET_DEPLOYMENT_GUIDE.md`: .NET 9 (was 10), product name updated - - `docs/FAQ.md`: Completely rewritten (7 sections, 30+ Q&A covering all new features): - - Multiplier pricing examples - - Auto-router behavior vs. enforced rewriting - - CosmosDB durability guarantees - - Multi-tenant scenarios - - Hybrid billing mode - - Deployment options (Bicep vs. Terraform) - - Troubleshooting common issues - - `docs/USAGE_EXAMPLES.md`: Updated product name, examples remain valid - -4. **TL;DR Messaging**: - - Before: "Usage tracking and chargeback through APIM" - - After: "APIM-based AAA for AI workloads" with focus on durability, routing, and adaptive billing - -**Files Modified**: -- README.md (major rewrite, 590 lines) -- docs/ARCHITECTURE.md (product name, CosmosDB architecture, detailed flow) -- docs/DOTNET_DEPLOYMENT_GUIDE.md (.NET 9, product name) -- docs/FAQ.md (comprehensive rewrite, 7 sections) -- docs/USAGE_EXAMPLES.md (product name) - -**Key Messaging Retained**: -- Multi-tenant customer model (clientAppId:tenantId) -- WebSocket real-time dashboard -- 198+ test suite -- Bicep/Terraform dual IaC paths -- Aspire orchestration -- CosmosDB audit trail (36 months) -- Purview integration (optional) - -**What's New in Documentation**: -- Explicit CosmosDB source-of-truth architecture (vs. Redis-only perception in old docs) -- Three routing modes explained (per-account, enforced, QoS-based) -- Multiplier pricing with concrete examples (1.0x baseline, 0.33x tier) -- Hybrid billing mode support (token + multiplier mixed plans) -- Bill-back reporting details (per-client effective requests, tier breakdown) -- Adaptive dashboard UI behavior -- Auto-router decision flow (not enforced rewriting) -- Deployment discovery integration (Foundry) -- Comprehensive FAQ with troubleshooting - -**Why This Matters**: -- Customers now understand the product's core value: durable policy engine for AI consumption (not just cost tracking) -- Clear distinction between authentication (at gate), authorization (plan + deployment checks), and accounting (billing) -- Documentation reflects actual architecture (CosmosDB + Redis) and not implied Redis-only storage -- New features (multiplier billing, routing, adaptive UI) are front-and-center -- Legacy "Chargeback" naming acknowledged with pending rename roadmap - -### 2026-04-01 — Deep Research: Agent365 SDK Integration Architecture - -**Status:** PROPOSAL — awaiting Zack Way's review - -**Context:** Zack discovered the official **Microsoft Agent 365 SDK** for enterprise observability and identity. Directive: "Each APIM client = an Agent 365 agent. All calls to Foundry endpoints get pushed as observability data through the A365 SDK." - -**Key Findings:** - -1. **What Agent365 SDK Is:** - - Enterprise-grade extensions for AI agents: Entra-backed identity, OpenTelemetry observability, governed MCP tool access, agent blueprints - - **Not** a framework — enhances existing agents built on any SDK (Agent Framework, Semantic Kernel, OpenAI, LangChain, Copilot Studio) - - NuGet packages: `Microsoft.Agents.A365.Observability*`, `Microsoft.Agents.A365.Runtime*`, `Microsoft.Agents.A365.Tooling` - - Current version: `0.2.152-beta` (Frontier preview program) - -2. **Relationship to Existing Purview Integration:** - - **CRITICAL FINDING:** A365 Observability SDK is **COMPLEMENTARY, NOT REPLACEMENT** to our `Microsoft.Agents.AI.Purview` DLP integration - - **Two separate concerns:** - - `Microsoft.Agents.AI.Purview` = Real-time DLP policy **enforcement** (`processContent` endpoint returns block/allow decision at request time) - - `Microsoft.Agents.A365.Observability` = Structured **telemetry export** (OpenTelemetry spans sent to M365 admin center / Purview compliance dashboards for audit/visibility) - - **Analogy:** Purview DLP = TSA checkpoint (blocks contraband), A365 Observability = airport security cameras (records everything) - - **Both are needed:** Precheck calls `CheckContentAsync` (Purview SDK) to block sensitive prompts, log ingest emits `ExecuteInference` spans (A365 SDK) for audit trail - -3. **A365 Observability Data Model:** - - **Four operation types:** `InvokeAgent` (session start), `ExecuteInference` (LLM call), `ExecuteTool` (function call), `OutputMessages` (response) - - **BaseData structure:** All DTOs inherit `Name`, `Attributes` (OTel tags), `StartTime`, `EndTime`, `SpanId`, `ParentSpanId`, `TraceId`, `Duration` - - **Key attributes:** `gen_ai.agent.id`, `gen_ai.agent.name`, `microsoft.tenant.id`, `user.id`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.tool.name`, `threat.diagnostics.summary` (DLP action) - - **Export:** Batched spans (512/batch, 5s delay) to Agent365 backend, requires token resolver, toggle via `ENABLE_A365_OBSERVABILITY_EXPORTER=true` - -4. **Agent Identity Model:** - - **Current:** Each client is a `ClientPlanAssignment` with `ClientAppId` (Entra service principal, `idtyp=app`) - - **Agent365:** Agents get Agentic User identities (`idtyp=user` tokens) with mailbox, M365 license, org chart presence - - **Setup:** `a365 setup` CLI creates Agent Blueprint → Agentic App Instance → Agentic User - - **Integration question:** Three options: - - **Option A:** Manual `a365 setup` per-client, store `Agent365UserId` in CosmosDB (simple, not scalable) - - **Option B:** Programmatic provisioning via Graph API on client creation (requires API research) - - **Option C:** Lightweight — use `ClientAppId` as `gen_ai.agent.id` attribute, skip Agentic User provisioning (fastest, no governance benefits) - -5. **Integration Architecture (Recommended):** - - **Phase 1 (Additive A365 Observability):** Add SDK alongside existing Purview DLP, zero breaking changes - - Add packages: `Microsoft.Agents.A365.Observability`, `Microsoft.Agents.A365.Observability.Runtime`, `Microsoft.Agents.A365.Runtime` - - Register `AddA365Tracing` in `Program.cs`, implement `IAgent365TokenResolver` - - Wrap `PrecheckEndpoints` with `InvokeAgent` scope (session metadata) - - Wrap `LogIngestEndpoints` with `ExecuteInference` scope (model, tokens, latency, routing decision) - - Use `BaggageBuilder` for context propagation (`tenant.id`, `agent.id`, `conversation.id`) - - Set `threat.diagnostics.summary` attribute when `CheckContentAsync` blocks request - - Make exporter opt-in via config flag (Frontier preview requirement) - - **Phase 2 (Agent Identity Provisioning):** Deferred pending Zack's strategy choice (Options A/B/C) - - **Phase 3 (Deprecate Custom PurviewGraphClient):** Never — A365 Observability doesn't expose DLP enforcement APIs - -6. **What We Keep vs. Add:** - - **✅ KEEP:** All existing Purview DLP integration (`Microsoft.Agents.AI.Purview`, `PurviewGraphClient`, `PurviewAuditService`, `CheckContentAsync`, `EmitAuditEventAsync`) - - **➕ ADD:** A365 SDK packages, OTel scopes, baggage builder, token resolver, DI registration - - **🔄 INTEGRATE:** A365 spans include DLP action attributes, Purview `contentActivities` continue to flow (separate channel) - -7. **Open Questions for Zack:** - - **Q1:** Agent identity strategy — full Agentic User provisioning (mailbox, license, blueprint) or lightweight mapping (`ClientAppId` as `agent.id`)? - - **Q2:** Scope of integration — just precheck + log ingest, or also config CRUD endpoints? - - **Q3:** Foundry endpoint filtering — all requests or only Foundry deployments? - - **Q4:** Tenant/subscription requirements — Frontier preview access needed for all clients? - - **Q5:** Testing strategy — how to test without full Frontier-enabled tenant? - -8. **Risk Assessment:** - - **High:** SDK is beta (0.2.152), API may change; Frontier preview access required - - **Medium:** Token resolver complexity (per-tenant tokens), performance overhead (OTel spans), dual observability channels - - **Low:** Package dependency bloat (~2MB), zero breaking changes to existing Purview - -**Deliverables:** -- Architecture plan: `.squad/decisions/inbox/mcnulty-agent365-architecture.md` (30KB, 10 sections) -- Covers: SDK purpose, Purview relationship, identity model, integration architecture, migration phases, open questions, risk assessment, file changes, recommendations - -**Recommendation:** -- **Go forward with Phase 1** (additive A365 Observability, lightweight identity mapping) -- **Defer Phase 2** pending Zack's decision on identity strategy -- **Total effort:** 4-5 days (2-3 backend, 1 tests, 0.5 docs, 1 staging verification) - -**Key Insight for Future Work:** -- Microsoft is building a dual observability model for AI agents: **DLP enforcement** (real-time block/allow) and **telemetry export** (audit trail for compliance dashboards) -- Our middleware architecture naturally aligns: precheck = enforcement gate, log ingest = telemetry sink -- A365 SDK gives us M365 admin center visibility without replacing our existing Purview DLP blocking logic -- Agent identity is the open question — full Entra Agentic Users vs. service principal-based observability -- SDK is in beta but stable enough for integration (0.2.152-beta, 5 packages, 13K total downloads) - ---- - -### 2026-05-14 — Cross-Agent Note: azd Terraform Provider Configuration - -**From:** Sydnor (Infra/DevOps) -**Note:** When using Terraform with Azure Developer CLI (azd), the zure.yaml file must explicitly declare an infra: provider block pointing to the terraform module. If omitted, azd defaults to Bicep and looks for infra/main.bicep, which will fail if Terraform is the actual IaC provider. Example config: - -\\\yaml -infra: - provider: terraform - module: infra/terraform -\\\ - -This applies to any project mixing IaC tools or migrating from Bicep to Terraform. - -### 2026-05-14 — Cross-Agent Note: Infrastructure Changes Must Be Validated Before Commit - -**From:** Zack Way (User directive captured by Scribe) -**Note:** When fixing infrastructure/deployment errors, **always validate fixes by running the relevant `azd` command** (e.g., `azd provision --preview`, `azd up`) **BEFORE committing**. Do not write commits with unvalidated infrastructure changes. This keeps the commit tree clean of speculative/bad infrastructure history and ensures only known-working fixes enter the codebase. - -**Application:** All agents working on infrastructure, deployment, or orchestration. Sydnor validated the Terraform tfvars fix via `azd provision --preview` before the orchestration log was written. - -### 2026-05-14T16:22:25Z — Cross-Agent Learning: Large azd + Terraform Deployment Pattern - -**From:** Scribe (based on Sydnor's successful execution) - -**Pattern Validated:** -- `azd up` with 77+ Azure resources succeeds in ~9m59s when auth alignment is correct (azd + az CLI on same tenant) -- Longest pole is always Redis Enterprise (~6m22s for this deployment) -- Terraform dependency graph executes efficiently; no manual intervention needed -- APIM policies depend on Container App URL availability; azd handles ordering automatically -- Parallel provisioning: container image builds while infrastructure resources provision - -**Key Learning for All Agents:** -When architecting infrastructure changes or debugging deployment issues: -1. **Auth alignment first:** Ensure azd and az CLI are logged into the same tenant. Cross-tenant token requests hit Conditional Access Policies. -2. **Pre-deployment validation:** Run `azd provision --preview` before `azd up`. Catches errors without full deployment. -3. **Resource time budgets:** Redis 6-7m, APIM 3-5m, Cosmos 2-3m, Container App 2-3m. Plan SLAs accordingly. -4. **Terraform state:** azd manages remote state automatically. No explicit backend config needed. - -**Captured in Skill:** `.squad/skills/azd-terraform-large-deployment/SKILL.md` — Full guide for auth alignment, provider configuration, timing, troubleshooting, validation patterns. +**2026-05-16 — Non-AI API Limits Architecture:** +- Chose flat fields (`NonAiRequestsPerMinute`, `NonAiMonthlyRequestQuota`) over sub-object — consistency with existing schema pattern. +- Chose dedicated `/api/precheck-rest` endpoint over extending existing precheck — separation of concerns, avoids polluting AI hot path. +- Chose Redis counters (same pattern as AI RPM) over APIM built-in `rate-limit-by-key` — dashboard visibility is non-negotiable for this engine. +- Monthly counter lives on `ClientPlanAssignment.NonAiCurrentPeriodRequests`, same Cosmos+Redis pattern as token usage. +- No schema migration needed — CosmosDB is schema-less, defaults to 0 (unlimited = no enforcement for existing plans). +- Spec delivered to `.squad/decisions/inbox/mcnulty-non-ai-api-limits-architecture.md`. + +**2026-05-16 — APIM Policy Management Architecture:** +- Chose **Tier B (template apply)** — users pick templates + fill params, engine renders XML and pushes to APIM. No raw XML editor (too risky for v1). Drift detection deferred to M6. +- Chose **`Azure.ResourceManager.ApiManagement` SDK** over ARM REST or Terraform — idiomatic .NET, strongly typed, DefaultAzureCredential, no preview risk. +- **Reshapes non-AI architecture:** Sydnor's `entra-jwt-rest-policy.xml` ships as-is, then immediately becomes the seed for the `entra-jwt-rest` template. Precheck-rest endpoint stays as an alternative enforcement mode but APIM-native `rate-limit-by-key` is the default in the template. +- Plans page sets plan-level default limits; new APIM Management page assigns templates per-API with those defaults pre-populated as parameter values. +- Custom RBAC role (narrow: apis/read + policies/read+write) instead of broad `API Management Service Contributor`. +- Storage: existing `configuration` container, new `policy-assignment` partition key document type. +- Spec delivered to `.squad/decisions/inbox/mcnulty-apim-management-architecture.md`. + +*Core learnings consolidated in Core Context section above (see git history for detailed entries).* + +## Archived Learnings (Pre-May 2026) + +All development work from Phase 0–3 (2026-03-31 to 2026-05-14) is documented in Core Context and git commit history. Key achievements: +- Phase 0: Cosmos + Redis storage architecture +- Phase 1: Model routing policies + multiplier billing +- Phase 2: Agent365 Observability integration +- Phase 3: APIM policy variants and infrastructure +- Infrastructure: Terraform + azd deployment (77 resources) + +For detailed work items, see: +- .squad/decisions.md — architectural decisions +- .squad/orchestration-log/ — agent completion logs +- git log --oneline — implementation history \ No newline at end of file diff --git a/.squad/agents/sydnor/history.md b/.squad/agents/sydnor/history.md index ce37d262..d72b468b 100644 --- a/.squad/agents/sydnor/history.md +++ b/.squad/agents/sydnor/history.md @@ -2,616 +2,274 @@ - **Owner:** Zack Way - **Project:** AI Policy Engine — APIM Policy Engine management UI for AI workloads, implementing AAA (Authentication, Authorization, Accounting) for API management. Built for teams who need bill-back reporting, runover tracking, token utilization, and audit capabilities. Telecom/RADIUS heritage. -- **Stack:** .NET 9 API (Chargeback.Api) with Aspire orchestration (Chargeback.AppHost), React frontend (chargeback-ui), Azure Managed Redis (caching), CosmosDB (long-term trace/audit storage), Azure API Management (policy enforcement), Bicep (infrastructure) +- **Stack:** .NET 9 API (Chargeback.Api) with Aspire orchestration (Chargeback.AppHost), React frontend (chargeback-ui), Azure Managed Redis (caching), CosmosDB (long-term trace/audit storage), Azure API Management (policy enforcement), Terraform (infrastructure) - **Created:** 2026-03-31 ## Key Files -- `infra/` — Azure Bicep templates (my primary workspace) -- `policies/` — APIM policy definitions +- `infra/terraform/` — Terraform modules (core workspace) +- `policies/` — APIM policy definitions (base + DLP variants) - `src/Chargeback.AppHost/` — Aspire orchestration -- `src/Dockerfile` — Container build -- `scripts/` — Deployment and utility scripts +- `src/aipolicyengine-ui/` — React frontend (SPA) +- `scripts/` — Deployment and postprovision utilities ## Core Context -**Project Phases Completed (2026-03-31 to 2026-04-11):** +**Project Status: Phase 3 + Infrastructure Complete** -Phase 0 (Storage): CosmosDB established as durable source-of-truth; Redis as write-through cache. Configuration container created for plans, clients, pricing, usage policies. No infrastructure changes needed — existing Cosmos + Redis sufficient. +All backend phases complete and tested (235+ tests passing): +- **Phase 0:** CosmosDB source-of-truth + Redis cache architecture +- **Phase 1:** Model routing with 7 deployment endpoints + multiplier pricing +- **Phase 2:** Agent365 Observability SDK integration + Purview DLP policy variants +- **Phase 3:** APIM auto-router policies for subscription-key and entra-jwt auth -Phase 1 (Model Routing): Backend routing + multiplier pricing complete. 7 routing endpoints ready (F2.1–F2.7). Precheck response includes routedDeployment, requestingDeployment, routingPolicyId. Rate limiting now deployment-scoped. All API contracts stable. Infrastructure unchanged. +**Infrastructure: Terraform + azd Complete** +- 77 Azure resources provisioned via `azd up` (9m59s runtime) +- Terraform provider configured in `azure.yaml` with variable substitution template (`main.tfvars.json`) +- Authentication aligned: azd + az CLI on same tenant (99e1e9a1-3a8f-4088-ad5d-60be65ecc59a) +- All services operational: Container App API, APIM gateway, Cosmos DB, Redis Enterprise, Key Vault, Log Analytics +- App IDs registered via Terraform (api_app_id: d5bd33f4-09b1-4602-af88-29c5ec7728e0) -Phase 2 (Backend Implementation): Agent365 Observability SDK integrated alongside Purview DLP. Phase 1 implemented with real scope calls (InvokeAgentScope, InferenceScope). APIM DLP policy variants created (precheck-only vs. content-check). All 235 tests passing. +**Current Issue (2026-05-14): AADSTS500113 — Reply URL Mismatch** +- **Problem:** UI auth fails because redirect URIs were registered on legacy app, not Terraform-managed app +- **Fix Applied:** Updated `postprovision.ps1` and `postprovision.sh` to register redirect URIs on correct app (api_app_id) +- **Status:** Redirect URI verified on correct app; awaiting user login confirmation -Phase 3 (APIM Router): Auto-router policies deployed for both subscription-key and entra-jwt auth types. Policies extract routedDeployment from precheck response and rewrite backend URL. Logging extended with routing metadata. -**Current Work (2026-05-14):** +## Active Learnings -PR #29 (SPA publish + Terraform migration) in review. Terraform configuration aligned with azd provider. Infrastructure now deployable via `azd up`. Full infrastructure validation completed — 77 resources provisioned in 9m59s, all services operational. +### 2026-05-14 — Redirect URI Registration: Terraform-Managed App vs Legacy App -## Learnings +**Context:** User (Zack) reported AADSTS500113 error when logging into the dashboard: "No reply address is registered for the application." - - -### 2026-05-14 — SPA Publish + Terraform Migration Complete - -**Phase 0 Status:** ✅ COMPLETE (Freamon + Bunk) - -The backend storage architecture has been refactored from Redis-only to a durable CosmosDB source-of-truth pattern with Redis as a write-through cache. Infrastructure implications: - -**What Sydnor Needs to Know:** -- **New CosmosDB Container:** `configuration` container added to `ConfigurationContainerInitializer`. Stores plans, clients, pricing, usage policies, and future routing policies. Partitioned by `/id` (document ID). -- **Redis Remains Caching Layer:** All reads still go through Redis first. Write-through cache means writes hit Cosmos first, then Redis is updated. -- **Startup Services:** New services on startup: `RedisToCosmossMigrationService` (one-time migration of existing Redis data) and `CacheWarmingService` (populate Redis from Cosmos). Both are idempotent. -- **No New Azure Resources:** Existing Cosmos + Redis still sufficient. No changes to Bicep or Azure resource provisioning needed. -- **Container Initialization:** `ConfigurationContainerInitializer.cs` now creates `configuration` container with proper schema initialization. -- **Deployment Impact:** Minimal. Startup is slightly slower due to cache warming, but no downtime required. Redis data is automatically migrated on first startup. - -**For Phase 1 Onwards:** -- Model Routing will add `routing-policies` to the `configuration` container. -- Multiplier Pricing will extend existing `pricing` and `plans` documents with new fields. -- All future configuration entities will use the same repository pattern + caching layer. - -**Test Results:** 129/129 tests pass (36 new Phase 5 tests for repositories/migration/warmup). - -### 2026-03-31 — Phase 1 Complete: Model Routing Architecture Ready for Phase 3 - -**Phase 1 Status:** ✅ COMPLETE (Freamon + Bunk) - -All model routing and per-request multiplier pricing features are complete and tested. Backend API contracts finalized. Infrastructure requirements unchanged (uses existing CosmosDB + Redis). - -**What Sydnor Needs to Know for Phase 3:** - -- **Backend is Ready:** All 7 routing enforcement endpoints ready (F2.1–F2.7). No more breaking changes — API contracts stable. -- **Precheck Response Extended:** New fields available: `routedDeployment`, `requestedDeployment`, `routingPolicyId`. APIM policies can use these for access control decisions. -- **Rate Limiting by Deployment:** Rate limit checks now deployment-scoped. The routed deployment is the one that gets rate-limited, not the originally requested model. -- **Multiplier Billing Fields:** Audit trail includes pricing data: `Multiplier`, `EffectiveRequestCost`, `TierName`. APIM policies can log these for chargeback. -- **Deployment Discovery:** All routing evaluations validate against Foundry deployments. Empty Foundry = strict validation failure (no phantom references). -- **No New Azure Resources:** Phase 3 uses existing resources. APIM policies are stateless — they call precheck and log ingest endpoints. -- **Backward Compat:** All new fields are nullable. Existing clients continue to work without changes. - -**Ready for Phase 3 Deployment:** -- Deploy Chargeback.Api with Phase 2 enforcement active -- Configure APIM policies to call precheck endpoint for authentication/authorization -- APIM policies log routing + pricing metadata via log ingest endpoint -- No schema migrations needed — CosmosDB containers already configured - -**Test Results:** 200/200 tests pass (30 new Phase 2 integration tests from Bunk B5.7 + B5.8). - -### 2026-05-14 — PR #29: SPA Publish + Terraform Migration (In Review) - -**Branch:** `fix/spa-publish-and-terraform-migration` (seiggy fork remote) - -**Changes:** -1. **SPA Publish Fix:** `src/chargeback-ui/` production build output correctly maps to `wwwroot/spa/`. Build pipeline verified. -2. **Cosmos Firewall:** Bicep now exposes ports 10250 + 10255 for Cosmos connection; firewall rule added for host. -3. **Bicep Scaffolding Removal:** Empty/unused Bicep modules removed; repository prepped for Terraform migration. - -**Cross-Fork Pattern:** -- Sydnor pushed to seiggy remote (personal fork); gh CLI auth blocked by SAML wall -- Coordinator used GitHub MCP to create PR against main repo -- PR #29 now open at https://github.com/Azure-Samples/ai-policy-engine/pull/29 -- Awaiting review/merge on main - -**Status:** ✅ PR opened successfully; work complete, deployment pending. - -### 2026-03-31 — Phase 3 Complete: APIM Auto-Router Policies (S3.1–S3.3) - -**Phase 3 Status:** ✅ COMPLETE (Sydnor) - -Both APIM policy files updated with auto-router support. Identical routing logic in both policies. - -**What Changed:** - -- **Inbound — Auto-Router Logic (after precheck, before backend):** - - Extracts `routedDeployment` from precheck 200 response using `Body.As(preserveContent: true)` - - If `routedDeployment` is non-empty AND differs from the client's `deploymentId`: saves original as `originalDeploymentId`, updates `deploymentId`, rewrites URL path via `` - - If `routedDeployment` is null/empty or matches requested deployment: no-op, existing behavior preserved - - Comment block explains auto-router semantics: no forced downgrades, pass-through for explicit deployments +**Root Cause:** +- The repository has TWO app registrations for the API: + 1. **Terraform-managed** (`api_app_id` = d5bd33f4-09b1-4602-af88-29c5ec7728e0) — "AI Policy API" + 2. **Legacy/manual** (`CONTAINER_APP_CLIENT_ID` = 625db56c-f5cc-4ee5-954d-6775c709055e) — "AI Policy Engine API (ai-policy-engine-k8m2)" +- The UI (src/aipolicyengine-ui) is configured via .env.production.local to use the Terraform-managed app (`VITE_AZURE_CLIENT_ID=d5bd33f4...`) +- But the postprovision scripts were setting redirect URIs on the LEGACY app (CONTAINER_APP_CLIENT_ID), not the Terraform-managed app +- Result: UI tries to auth against the Terraform-managed app, which has no redirect URIs → AADSTS500113 + +**Fix Applied:** +- Updated both `scripts/postprovision.ps1` and `scripts/postprovision.sh` to use `api_app_id` (Terraform-managed app) instead of `CONTAINER_APP_CLIENT_ID` +- Also fixed error code reference: AADSTS50011 → AADSTS500113 (correct error for "no reply address") +- The postprovision scripts query the ACTUAL Container App FQDN from Azure (not from Terraform state) and register it as a SPA redirect URI +- This is idempotent — re-running doesn't duplicate URIs -- **Outbound — Extended Log Payload:** - - Added `requestedDeploymentId` (original client ask) and `routedDeployment` (precheck recommendation) to the fire-and-forget `/api/log` POST - - `requestedDeploymentId` = `originalDeploymentId` if routing happened, else `deploymentId` - - `routedDeployment` = precheck value or empty string +**Verification:** +- Ran `azd hooks run postprovision` successfully +- Confirmed redirect URI `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io` is now registered on the Terraform-managed API app (d5bd33f4...) +- The UI's MSAL config uses `redirectUri: window.location.origin`, so when the UI is served from the Container App, the redirect flow will work -**Design Decisions:** -- `preserveContent: true` on response body read ensures the body stream isn't consumed before backend routing -- `&&` used in XML condition (proper XML entity encoding for `&&`) -- URL rewrite uses `path.Replace()` — safe no-op when URL doesn't contain `/deployments/{id}/` (e.g., Responses API body-based model) -- `originalDeploymentId` only set inside routing `` block — log payload checks `ContainsKey` to handle both routed and non-routed paths +**Key Learning:** +- When troubleshooting AADSTS errors, always verify WHICH app registration the client is actually using (check VITE_AZURE_CLIENT_ID, not assumptions) +- The postprovision script pattern: query live Azure resources (not Terraform state) because Terraform state can be out of sync +- Use `az ad app show --id --query "{displayName:displayName,spa:spa.redirectUris}"` to quickly verify redirect URI registration **Files Modified:** -- `policies/subscription-key-policy.xml` — +43 lines (S3.1 + S3.3) -- `policies/entra-jwt-policy.xml` — +43 lines (S3.2 + S3.3) +- scripts/postprovision.ps1 — Changed `CONTAINER_APP_CLIENT_ID` → `api_app_id` +- scripts/postprovision.sh — Changed `CONTAINER_APP_CLIENT_ID` → `api_app_id` -### 2026-03-31 — Session Complete: All 5 Phases Delivered +**Status:** ✅ FIXED. Awaiting user confirmation that login now works. -**Project Status:** ✅ COMPLETE +### 2026-05-14 — UI-to-API URL Wiring: Same-Origin Pattern for Container Apps -All work is done. Phase 3 (APIM auto-router policies) is complete, Phase 4 (Frontend) is complete, Phase 5 (testing + validation) is complete. 222 tests passing. Backend routing and multiplier pricing features fully operational. APIM policy layer ready for production deployment. +**Context:** User (Zack) reported that the dashboard was timing out on all API calls. Container logs showed the API was running cleanly but receiving NO inbound requests at all. -**Sydnor's Contributions:** -- Phase 3 (S3.1–S3.3): APIM auto-router policy implementation, request logging extended with routing metadata - -**What's Ready for Deployment:** -- Backend API (Chargeback.Api) with all routing/pricing/enforcement endpoints -- APIM policies (subscription-key, entra-jwt) with auto-router logic -- Frontend UI (React) with adaptive billing dashboards and routing policy management -- CosmosDB configured with configuration containers -- 222 integration + unit tests, all passing -- Performance validated: routing sub-microsecond, precheck <5ms p99 - -**Next Phase (Future):** -- Policy engine for enforced model rewrites -- Health check integration for fallback routing -- Load-based routing for PTU optimization - -### 2026-04-01 — Infrastructure Hardening: 5 Validated Findings Fixed - -**Findings Fixed:** #6, #8, #9, #10, #17 - -**#6 — APIM Least-Privilege Roles (CRITICAL):** -- **Removed** Contributor role assignment on entire RG for APIM — over-privileged and unnecessary. -- **Fixed wrong GUID**: Key Vault Secrets User assignment was using `7f951dda` (AcrPull!) instead of `4633458b` (Key Vault Secrets User). Bug in original Bicep. -- **Upgraded** OpenAI role from `Cognitive Services User` to `Cognitive Services OpenAI User` — narrower scope, least-privilege. -- APIM now has exactly 2 roles: Key Vault Secrets User + Cognitive Services OpenAI User. -- APIM→Container App calls use Entra ID token acquisition, not Azure RBAC — no role needed. - -**#8 — Cosmos Keys Disabled (CRITICAL):** -- Set `disableLocalAuth: true` on Cosmos account. Managed identity only (DefaultAzureCredential already in use). -- Connection strings with keys can no longer authenticate. All access via Entra ID. - -**#9 — ACR Managed Identity Pull (CRITICAL):** -- Replaced admin username/password ACR pull with `identity: 'system'` on Container App registry config. -- Removed `acrUsername` and `acrPassword` params from Bicep + deployment scripts. -- Added `acrName` param + conditional AcrPull role assignment for Container App managed identity. -- Updated `parameter.json`, `parameter.sample.json`, `setup-azure.ps1`, `setup-azure.sh`. - -**#10 — Health Checks Unconditional (IMPORTANT):** -- Removed `if (app.Environment.IsDevelopment())` gate from `MapDefaultEndpoints()` in ServiceDefaults. -- Health endpoints `/health` and `/alive` now always registered with `.AllowAnonymous()`. -- Required for container orchestration liveness/readiness probes in production. - -**#17 — Streaming Parser Hardened (IMPORTANT):** -- Changed chunk filter from `l.Contains("{")` to `l.Contains("\"usage\"")` in both APIM policies. -- Now only parses SSE chunks that contain the `"usage"` field, not arbitrary JSON (error responses, etc.). -- Applied identically to `subscription-key-policy.xml` and `entra-jwt-policy.xml`. - -**Verification:** 198/198 tests pass. Build clean. Zero regressions. - -**Key Learnings:** -- The original Key Vault role assignment for APIM was silently using the AcrPull GUID (`7f951dda`). Always cross-check role GUIDs against `roles.json` — don't trust inline comments. -- `Cognitive Services User` is broader than `Cognitive Services OpenAI User` — for APIM calling only OpenAI, the narrower role suffices. -- Container Apps support `identity: 'system'` in registry config — no need for admin creds or secretRef. -- Aspire ServiceDefaults template gates health checks behind `IsDevelopment()` by default — must override for production container deployments. - -### 2026-04-01 — AADSTS50011 Redirect URI Mismatch Fixed +**Root Cause:** +- The UI was trying to call a stale API URL: `https://ai-policy-engine-k8m2-ca.ambitioussky-30956417.eastus2.azurecontainerapps.io` +- This URL no longer exists (DNS resolves but connection hangs — stale/decommissioned CA) +- The current, working Container App is: `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io` +- The `.env.production.local` file (which sets `VITE_API_URL` at build time) had the old URL hardcoded +- The `preprovision.ps1` script (called by `azd hooks run preprovision`) was NOT writing `VITE_API_URL` at all — only auth variables +- Result: Every UI rebuild used whatever stale value was in `.env.production.local` + +**Why This Happened:** +- The Container App FQDN is NOT known at preprovision time (it's assigned during `azd provision`) +- The UI is built inside `dotnet publish`, which runs AFTER provisioning, but BEFORE we know the final CA URL +- There's a separate `deploy-container.ps1` script that writes `VITE_API_URL` with the correct FQDN, but it's NOT part of the `azd` workflow hooks — it was a manual script from an earlier iteration + +**The Fix — Same-Origin Pattern:** +- Since the UI is served FROM the same Container App as the API (React build goes into `wwwroot` via vite.config.ts), we don't need an absolute URL +- Set `VITE_API_URL=` (empty string) in `.env.production.local` +- When `API_BASE = import.meta.env.VITE_API_URL || ""` evaluates to `""`, the UI makes relative API calls to the same origin it's served from +- This is CORRECT because: + - Browser serves UI from: `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io/` + - UI makes API call to: `/api/clients` (relative) + - Browser resolves to: `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io/api/clients` (same origin) +- No hardcoded URLs, no stale references, always correct + +**Changes Applied:** +1. Updated `scripts/preprovision.ps1` to write `VITE_API_URL=` (empty string) in `.env.production.local` +2. Updated `scripts/preprovision.sh` to write `VITE_API_URL=` (empty string) in `.env.production.local` +3. Ran `azd hooks run preprovision` to regenerate `.env.production.local` with correct config +4. Ran `azd deploy` to rebuild + redeploy the UI with the new config (37s deployment) -**Bug:** After Bicep deployment, the React SPA failed to authenticate — Entra ID returned `AADSTS50011` because the redirect URI didn't match any configured URIs on the app registration. +**Verification:** +- `curl https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io/api/clients` → **HTTP 401 Unauthorized** (correct — API is reachable, auth required) +- `curl https://ai-policy-engine-k8m2-ca.ambitioussky-30956417.eastus2.azurecontainerapps.io/api/clients` → connection hangs (confirms old URL is dead) +- UI is now deployed with relative API URLs — should work immediately -**Root Cause (two issues):** +**Key Learning — UI-to-API URL Wiring Pattern for Container Apps:** +- **When UI and API are in the same Container App:** Use relative URLs (`VITE_API_URL=` or `VITE_API_URL=window.location.origin`). Never hardcode FQDNs. +- **When UI and API are separate Container Apps:** Must inject the API URL as a Container App environment variable at runtime, OR use Terraform output to write it to a config file AFTER provisioning (postprovision hook). +- **For azd + Container Apps:** The CA FQDN is only available AFTER `azd provision`, so preprovision scripts cannot know it. If you need it at build time, use postprovision + re-trigger build/deploy. +- **Best practice:** Prefer same-origin relative URLs when possible — eliminates CORS, simplifies deployment, prevents stale URL issues. -1. **Wrong app registration (PowerShell only):** `setup-azure.ps1` Phase 8 set SPA redirect URIs on `$client1ObjId` (Chargeback Sample Client) but NOT on `$apiObjId` (Chargeback API). The frontend's MSAL config uses the API app's client ID (`VITE_AZURE_CLIENT_ID=$apiAppId`), so the redirect must be on the API app. The bash version (`setup-azure.sh`) already correctly targeted both apps — the PowerShell script was missing the API app. +**Files Modified:** +- `scripts/preprovision.ps1` — Added `VITE_API_URL=` to SPA env file generation +- `scripts/preprovision.sh` — Added `VITE_API_URL=` to SPA env file generation -2. **Trailing slash mismatch (both scripts):** Both `setup-azure.ps1` and `setup-azure.sh` registered URIs with trailing slashes (`https://host/`) but the MSAL SPA sends `window.location.origin` which returns `https://host` without a trailing slash. Entra ID performs exact matching on SPA redirect URIs. +**Status:** ✅ FIXED. Deployed. Zack should now be able to access the dashboard and see API responses. -**Fix:** -- `setup-azure.ps1` Phase 8: Added API app redirect URI configuration (Graph PATCH on `$apiObjId`) before the client app 1 configuration. Removed trailing slashes. -- `setup-azure.sh` Phase 8: Removed trailing slashes from redirect URIs. -- `deploy-container.ps1`: Already correct — no changes needed (targets API app, no trailing slashes). -**Key Learnings:** -- MSAL SPA `redirectUri: window.location.origin` always returns URLs without trailing slashes. Entra ID SPA redirect URI matching is exact — `https://host/` ≠ `https://host`. -- When the frontend uses `VITE_AZURE_CLIENT_ID` = API app ID, the SPA redirect URI must be registered on that API app registration, not just on client apps. -- Always cross-check PowerShell and bash versions of deployment scripts — they can drift independently. +### 2026-05-14 — Infra Fixes Committed + Pushed (3156888d) -### 2026-04-01 — DLP Policy Variants: Content-Check Optional Enforcement +**Coordinator shipped both URL-wiring and redirect-URI fixes per the manifest.** -**Task:** Created 2 DLP-enabled APIM policy variants to support Purview DLP content blocking. +- Commit **3156888d**: Infra fixes including main.tfvars.json template, preprovision scripts (VITE_API_URL empty string pattern), and postprovision scripts (Terraform-managed app redirect URI registration) +- Validated via zd up before committing — all 77 Azure resources provisioned (9m59s) +- User (Zack) plans teardown + re-deploy from scratch as final validation +- Both Sydnor decisions merged into `.squad/decisions.md` (deduplication complete) -**Background:** -- New endpoint available: `POST /api/content-check/{clientAppId}/{tenantId}` performs Purview DLP blocking. -- Not all customers need DLP — most just need precheck auth/authorization. -- Zack's directive: offer 2 policy variants per auth type (with and without content-check). +This session resolves AADSTS500113 (no reply address) and timeout/stale URL issues. Dashboard now operational with same-origin API URL wiring and correct app registration for Entra auth. -**Files Created:** -- `policies/subscription-key-policy-dlp.xml` — Subscription key auth + DLP content-check -- `policies/entra-jwt-policy-dlp.xml` — Entra JWT auth + DLP content-check +### 2026-05-15 — Fresh Deploy Gotcha: Postprovision Variable Name Mismatch -**How DLP Variants Work:** -1. **After precheck succeeds (200)** and **before backend forwarding**, the DLP variant calls: - ``` - POST /api/content-check/{clientAppId}/{tenantId} - ``` -2. The content-check receives the same `requestBody` that was captured at the top of the inbound section. -3. **If HTTP 451 returned:** Request is blocked with an Azure OpenAI-style content_filter error response. -4. **If any other status or failure:** Request proceeds (fail-open strategy via `ignore-error="true"`). -5. **Timeout:** 10 seconds, same as precheck. - -**Fail-Open Strategy:** -- Uses `ignore-error="true"` on the `send-request` element. -- Transient content-check failures (timeouts, 500s, network issues) DON'T block valid requests. -- Only explicit HTTP 451 response blocks the request. - -**Error Response Format:** -- HTTP 451 (Unavailable For Legal Reasons) -- JSON body mimics Azure OpenAI content filter error: - ```json - { - "error": { - "message": "Content blocked by policy", - "type": "content_filter", - "code": "content_blocked" - } - } - ``` - -**Base Policies Unchanged:** -- `policies/subscription-key-policy.xml` — No content-check, precheck only (baseline) -- `policies/entra-jwt-policy.xml` — No content-check, precheck only (baseline) - -**DLP Variants Header Comment:** -```xml - -``` - -**Outbound Section:** -- Both base and DLP policies already have the log-ingest fire-and-forget call in the outbound section. -- No changes needed — logging is identical for both variants. - -**Policy Selection Guide:** -- **Use base policies** (no `-dlp` suffix) for customers WITHOUT DLP requirements → faster, simpler -- **Use DLP policies** (`-dlp` suffix) for customers WITH Purview DLP policies → adds content-check gate - -**Key Design Decisions:** -- **Placement:** Content-check occurs after routing but before backend call, so routed deployment is already determined. -- **Variable Reuse:** Uses existing `requestBody`, `containerAppBaseUrl`, `msi-access-token`, `clientAppId`, `tenantId` variables. -- **No New Variables:** Content-check response is stored in `contentCheckResponse` variable, checked only for HTTP 451. -- **C# Expression Safety:** Uses `context.Variables.ContainsKey("contentCheckResponse")` null-check before accessing response. -- **XML Entity Encoding:** `&&` for `&&` in condition expressions (proper XML syntax). - -**Backward Compatibility:** -- Existing deployments using base policies continue unchanged. -- DLP variants are opt-in — customers explicitly choose the DLP policy when they need content enforcement. - -**Next Steps:** -- Deploy DLP policies to APIM for customers requiring Purview DLP. -- Document policy selection criteria in deployment guides. -- Test fail-open behavior under content-check service outages. - -### 2026-04-17 — APIM DLP Policy Variants: Opt-In Fail-Open Content-Check (Complete) - -**Task:** Created 2 DLP-enabled APIM policy variants (`-dlp` suffix) for optional Purview DLP content-check enforcement. Customers without DLP requirements use base policies; customers with DLP use DLP-suffix variants. - -**Files Created:** -- `policies/subscription-key-policy-dlp.xml` — Subscription key auth + DLP content-check -- `policies/entra-jwt-policy-dlp.xml` — Entra JWT auth + DLP content-check - -**Files Unchanged:** -- `policies/subscription-key-policy.xml` — Base policy remains identical (precheck only) -- `policies/entra-jwt-policy.xml` — Base policy remains identical (precheck only) - -**Content-Check Pipeline:** -1. **Placement:** After precheck succeeds (HTTP 200) + routing logic, before backend forwarding -2. **Request:** POST /api/content-check/{clientAppId}/{tenantId} with original requestBody -3. **Response Handling:** - - HTTP 451 → Block request with Azure OpenAI-style content_filter error response - - Any other status/failure → Proceed (fail-open) -4. **Timeout:** 10 seconds (matches precheck timeout) -5. **Authentication:** APIM managed identity token (consistent with precheck) - -**Fail-Open Strategy:** -- Uses `ignore-error="true"` on send-request element -- Transient failures (timeouts, 500s, network issues) DON'T block valid requests -- Only explicit HTTP 451 response blocks requests -- Prioritizes availability: content-check outages don't create request failures - -**Error Response Format (HTTP 451):** -```json -{ - "error": { - "message": "Content blocked by policy", - "type": "content_filter", - "code": "content_blocked" - } -} -``` -Mimics Azure OpenAI content_filter format for client compatibility. - -**Policy Selection Guide:** -| Customer Need | Policy | -|--------------|--------| -| Auth + quota only | Base (no `-dlp`) | -| Auth + quota + Purview DLP | DLP (`-dlp`) | - -**Key Design Decisions:** -- **Placement:** Content-check after routing ensures routed deployment is determined before DLP evaluation -- **Variable reuse:** Uses existing `requestBody`, `containerAppBaseUrl`, `msi-access-token`, `clientAppId`, `tenantId` -- **Response handling:** `contentCheckResponse` variable checked only for HTTP 451 status -- **Null safety:** `context.Variables.ContainsKey("contentCheckResponse")` guards all response access -- **XML encoding:** `&&` for `&&` in condition expressions (proper XML) - -**Backward Compatibility:** -- Existing deployments using base policies unaffected -- DLP variants are opt-in — customers explicitly choose when needed -- No breaking changes to base policies - -**Trade-offs:** -- 4 policies to maintain (base + DLP × 2 auth types) -- Future changes must be applied consistently across variants to prevent policy drift -- Fail-open trade-off: transient outages allow unfiltered content (by design for availability) - -**Testing & Monitoring:** -- Both policies syntactically valid XML -- HTTP 451 handling verified for blocking requests -- Fail-open strategy confirmed for transient failures -- Error response format matches Azure OpenAI conventions -- Backward compatibility verified (base policies unchanged) - -**Next steps:** -- Deploy DLP policies to APIM for customers requiring DLP enforcement -- Monitor HTTP 451 response rates for DLP policy effectiveness -- Track content-check endpoint latency and failure rates -- Document policy selection criteria in deployment guides -- Test fail-open behavior under content-check service outages - -### 2026-05-14 — azd Terraform Integration: main.tfvars.json Template Required - -**Task:** Fix `azd provision` failure after adding `infra:` provider to azure.yaml. Error was "file not found" on `infra/terraform/main.tfvars.json`. +**Context:** Zack ran a fresh `azd up` from scratch to validate prior fixes. Portal loaded but login failed. **Root Cause:** -azd's Terraform provider requires a `main.tfvars.json` template file alongside `main.tf` for environment variable substitution. This is documented at https://learn.microsoft.com/azure/developer/azure-developer-cli/use-terraform-for-azd but was missing from our Terraform module. - -**Solution:** -Created `infra/terraform/main.tfvars.json` with azd env var substitution mappings: -- `subscription_id` → `${AZURE_SUBSCRIPTION_ID}` (required variable) -- `location` → `${AZURE_LOCATION}` (overrides default "eastus2") -- `workload_name` → `${AZURE_ENV_NAME}` (uses azd environment name as resource prefix) +- Postprovision script queried `AZURE_RESOURCE_GROUP` from azd env, but Terraform outputs `resource_group_name` +- Variable name mismatch caused postprovision to skip redirect URI registration +- Result: UI rendered correctly (using Terraform-managed API app `4eda37fc`), but SPA redirect URI was missing → AADSTS500113 on login + +**Investigation:** +1. Checked Container App — healthy, ingress external, targetPort 8080, 1 active replica +2. Verified API endpoint — HTTP 401 (auth required, correct) +3. Verified UI homepage — HTTP 200, loads cleanly +4. Examined deployed UI bundle — confirmed using Terraform-managed app ID (`4eda37fc`), NOT the preprovision-created app (`224ba04f`) +5. Checked API app registration — NO redirect URIs (postprovision skipped due to missing resource group) +6. Checked postprovision logs — "Skipping: AZURE_RESOURCE_GROUP not set" + +**Fix Applied:** +- Updated `scripts/postprovision.ps1` line 11: + - Before: `Select-String "^AZURE_RESOURCE_GROUP="` + - After: `Select-String "^resource_group_name="` +- Also updated line 12: + - Before: `Select-String "^COSMOS_ENDPOINT="` + - After: `Select-String "^cosmos_endpoint="` +- Ran `azd hooks run postprovision` → successfully registered redirect URI on API app +- Verified: `az ad app show --id 4eda37fc...` now shows redirect URI in `spa.redirectUris` -**Validation:** -Ran `azd provision --preview` to verify fix: -- ✅ Terraform initialized all modules successfully -- ✅ Variables substituted correctly (e.g., workload_name = "ai-policy-engine-k8m2") -- ✅ Terraform plan generated successfully -- ⚠️ Command failed on AzureAD authentication (conditional access policy block) — this is a DIFFERENT error proving the tfvars file is working. The original "file not found" error is resolved. +**Verification:** +- ✅ `curl /api/clients` → HTTP 401 (correct) +- ✅ `curl /` → HTTP 200, title present +- ✅ Redirect URI registered on correct Terraform-managed app +- ✅ Container App healthy, Redis + Cosmos connected **Key Learning:** -azd Terraform provider uses `${VAR}` syntax for environment variable substitution in `main.tfvars.json`. This template file is mandatory when using azd with Terraform — without it, `azd provision` fails immediately at the parameter file creation step. - -**Future Guidance:** -1. Always validate infra fixes by running `azd provision --preview` before committing (per Zack's directive) -2. When adding new required Terraform variables, add corresponding entries to `main.tfvars.json` -3. Optional variables with defaults don't need tfvars entries unless overriding with azd env vars -4. Never commit unvalidated infrastructure code — prevents broken deployments +- Terraform output variable names (snake_case like `resource_group_name`) != azd built-in env vars (SCREAMING_CASE like `AZURE_RESOURCE_GROUP`) +- Postprovision scripts that read Terraform outputs must use Terraform's naming convention, not azd's +- This gotcha only surfaces on fresh deploys — update-style deploys often have both variable styles cached in azd env from prior runs **Files Modified:** -- **Created:** `infra/terraform/main.tfvars.json` — azd variable substitution template -- **Decision:** Documented in `.squad/decisions/inbox/sydnor-main-tfvars-template.md` +- `scripts/postprovision.ps1` — Fixed resource group + cosmos endpoint variable names (lines 11-12) -**Next Steps:** -- Resolve AzureAD conditional access policy issue (Azure tenant security, not infra code) -- After authentication resolved, complete `azd up` end-to-end validation -- Commit main.tfvars.json once full deployment succeeds +**Status:** ✅ FIXED. Portal working. Login will succeed once user tests in browser. -### 2026-04-17 — Cross-Fork PR: SPA Publish + Cosmos Firewall + Bicep Migration +### 2026-05-15 — Authorization 403: AIPolicy.Admin App Role Assignment Missing -**Task:** Ship three critical fixes as a single PR from fork to mainline: -1. Fix dotnet publish stale static web asset error (MSBuild target ordering + vite config) -2. Fix Cosmos DB public network access drift (explicit Terraform firewall rules) -3. Remove Bicep infrastructure; standardize on Terraform +**Context:** User (Zack) successfully logged into the portal (redirect URI fix worked) but received HTTP 403 (Forbidden) when navigating to the routing policies feature in the dashboard. This is an **authorization** issue, not authentication (401) or connectivity. -**Branch:** `fix/spa-publish-and-terraform-migration` (pushed to seiggy fork) -**Commit:** `fa9f36c0` — 54 files changed, 1,743 insertions(+), 8,409 deletions(-) +**Root Cause:** +- The `/api/routing-policies` endpoints require the `AdminPolicy` authorization policy (defined in `Program.cs` lines 134-135) +- `AdminPolicy` maps to the `AIPolicy.Admin` **app role** (defined in `infra/terraform/modules/identity/main.tf` lines 99-104) +- The Terraform identity module defines 3 app roles on the API app: `AIPolicy.Export`, `AIPolicy.Admin`, and `AIPolicy.Apim` +- **Only the Sample Client service principal** (client1) gets the Admin role assigned in Terraform (lines 185-189) +- **NO user** is assigned to the Admin role by default +- The postprovision script registers redirect URIs and configures Cosmos RBAC, but does NOT assign app roles to users +- Result: User logs in successfully (authentication works), but their token has NO app roles → API returns HTTP 403 on all `AdminPolicy` endpoints + +**Investigation:** +1. Examined `src/AIPolicyEngine.Api/Endpoints/RoutingPolicyEndpoints.cs` — all endpoints use `.RequireAuthorization("AdminPolicy")` +2. Checked `src/AIPolicyEngine.Api/Program.cs` lines 134-135 — `AdminPolicy` requires role `AIPolicy.Admin` +3. Checked `infra/terraform/modules/identity/main.tf` — Admin role is defined on the API app (lines 99-104), but only assigned to client1 SP (lines 185-189) +4. Checked both postprovision scripts — no app role assignment logic present +5. Verified user's token has no app roles by checking Graph API appRoleAssignedTo collection — empty for the user + +**Fix Applied:** +1. **Immediate Fix (Manual):** Assigned the deploying user (Zack) to the `AIPolicy.Admin` app role via Graph API: + ```powershell + az rest --method POST --uri "https://graph.microsoft.com/v1.0/servicePrincipals//appRoleAssignedTo" --body {...} + ``` + - API App ID: `4eda37fc-969c-4262-8569-ddcd68aa0370` + - User Object ID: `a3abf468-f2d9-4d54-ae82-8c5def32fb91` + - App Role ID (AIPolicy.Admin): `f5577e0a-a521-af8c-60af-1d392ff85913` + - ✅ Role assignment succeeded + +2. **Automation Fix (Postprovision):** Updated both `scripts/postprovision.ps1` and `scripts/postprovision.sh` to auto-assign the deploying user to the `AIPolicy.Admin` app role during `azd up` / `azd hooks run postprovision`: + - Query the current signed-in user's object ID via `az ad signed-in-user show` + - Query the API app's service principal object ID and Admin app role ID + - Check if the user already has the role (idempotent check) + - If not assigned, create the appRoleAssignment via Graph API + - Pattern matches existing redirect URI registration: idempotent, fail-safe, uses Graph API via `az rest` -**Cross-Fork PR Pattern:** -- Branch created off `main`, pushed to fork remote: `git push -u seiggy ` -- PR opened from `seiggy:` → `Azure-Samples/ai-policy-engine:main` -- Requires `gh pr create --repo --head :` -- SAML authorization required for Microsoft Open Source org access +**Verification:** +- ✅ Manual role assignment succeeded via Graph API +- ✅ Postprovision scripts updated with idempotent app role assignment logic +- ⚠️ **User must log out and log back in** to receive a fresh token with the `AIPolicy.Admin` role claim (tokens are cached by MSAL and issued before the role assignment) + +**Key Learning — App Role Assignment Pattern for Fresh Deploys:** +- **App roles are NOT scopes** — scopes are delegated permissions (OAuth2 flow), app roles are identity attributes (claim in the token) +- **Terraform can assign app roles to service principals** (Application member type), but assigning roles to users requires knowing the user's object ID at apply time (not practical for team deploys) +- **Postprovision is the correct layer** for user-based RBAC assignments because: + - The deploying user is authenticated and available via `az ad signed-in-user show` + - The API app ID is in azd env (Terraform output) + - Assignment is idempotent (check before assign) + - Fail-safe pattern: script continues if assignment fails (user can assign manually later) +- **Token refresh requirement:** App role assignments do NOT affect existing tokens. Users must log out or wait for token expiry (typically 1 hour) to receive the new role in their claims. +- **403 vs 401 diagnostics:** + - **401 Unauthorized** = authentication failed (no token, expired token, wrong audience, missing redirect URI) + - **403 Forbidden** = authenticated BUT not authorized (token is valid but missing required scope/role/claim) + - For 403s, always check: (1) which policy/role the endpoint requires, (2) what claims/roles are in the user's token, (3) app role assignments on the API app's service principal -**GOTCHA:** gh CLI authenticated but needs SAML authorization for cross-org PRs. Must authorize via web browser: -``` -https://github.com/enterprises/microsoftopensource/sso?authorization_request=... -``` +**Files Modified:** +- `scripts/postprovision.ps1` — Added AIPolicy.Admin app role assignment for deploying user (lines 133+) +- `scripts/postprovision.sh` — Added AIPolicy.Admin app role assignment for deploying user (lines 158+) -**Alternative:** Create PR manually via GitHub UI if SAML auth blocks CLI: -1. Navigate to fork: https://github.com/seiggy/ai-policy-engine -2. GitHub auto-detects pushed branch and shows "Compare & pull request" banner -3. Click banner → select base repo `Azure-Samples/ai-policy-engine:main` -4. Fill in title/body → Create pull request +**Status:** ✅ FIXED. Zack's user now has Admin role. Must log out + log back in to get fresh token. Future deploys will auto-assign. -**Commit Message Strategy:** -- Used temp file + `git commit -F` for long multi-section commit message -- Structured sections: Problem, Root Cause, Solution, Files Changed -- Co-authored-by trailer at end for Copilot attribution +### 2026-05-21 — APIM Management RBAC + Runtime Wiring -**Key Learnings:** -- Cross-fork PRs to Microsoft Open Source org require SAML authorization -- Large commits benefit from temp file commit messages (avoid shell quoting hell) -- When shipping multiple independent fixes together, use clear section headers in commit message -- Always clean up temp files (commit-msg.txt, pr-body.txt) after use +**Context:** McNulty required the Container App managed identity to manage APIM API/operation policies without granting broad APIM contributor rights. -**Files Changed:** -- **SPA Fix:** src/AIPolicyEngine.Api/AIPolicyEngine.Api.csproj, src/aipolicyengine-ui/vite.config.ts, src/AIPolicyEngine.Api/.gitignore -- **Cosmos Firewall:** infra/terraform/modules/data/main.tf -- **Bicep Removal:** Deleted infra/bicep/* (23 files), scripts/setup-azure.{ps1,sh} -- **Documentation:** README.md, CONTRIBUTING.md, docs/*, policies/*.md, .squad/* +**Implementation Pattern:** +- Add a custom `azurerm_role_definition` scoped directly to the APIM service resource, with `assignable_scopes = [azurerm_api_management.this.id]` and only the eight policy-management actions McNulty approved. +- Bind it with `azurerm_role_assignment` using `role_definition_id = azurerm_role_definition..role_definition_resource_id` and `principal_id = module.compute.container_app_principal_id`. +- Output `apim_resource_id` from the gateway module for downstream consumers. -**PR Status:** Branch pushed, awaiting SAML authorization to complete PR creation via gh CLI. Alternative: manual PR creation via GitHub UI. +**Pitfall / Gotcha:** +- Directly feeding `module.gateway.apim_resource_id` back into the compute module creates a compute↔gateway cycle, because gateway already depends on the Container App FQDN/identity. +- The safe pattern is to derive the APIM resource ID deterministically in root (`/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ApiManagement/service/{name}`), pass that into compute for `APIM_RESOURCE_ID`, and still expose the canonical gateway-module output for scripts/consumers. +- Custom role `scope` and `assignable_scopes` must both be the APIM resource ID; widening either to the resource group/subscription silently broadens where the role can be assigned. +**Verification Reminder:** +- Re-run `terraform fmt` + `terraform validate` after adding custom roles. Validate catches the `role_definition_resource_id` vs GUID/resource ID distinction on assignments. -### 2026-05-14 — azd Terraform Provider Configuration +### 2026-05-21 — Non-AI REST APIM Policy Draft: Native APIM Default + Commented Precheck Alternative -**Issue:** Zack ran `azd up` and got "main.bicep missing" error. azd was defaulting to Bicep provider because `azure.yaml` had no `infra:` section declared. +**Context:** Zack asked for a draft Entra JWT APIM policy for non-AI REST APIs that can enforce requests-per-minute and monthly request quotas, while McNulty finalizes the backend enforcement model in parallel. -**Root Cause:** Without an explicit `infra:` provider declaration, azd CLI defaults to Bicep and looks for `infra/main.bicep`. The repo has Terraform modules in `infra/terraform/` but azure.yaml didn't declare the Terraform provider. +**Policy Structure Chosen:** +- Created `policies/entra-jwt-rest-policy.xml` as a sibling to the existing Entra JWT AI policy +- Default path uses native APIM `rate-limit-by-key` + `quota-by-key` keyed by `customerKey = {clientAppId}:{tenantId}` +- Outbound always logs to `POST /api/log-rest` using APIM managed identity and a fire-and-forget `send-one-way-request` +- Included a commented `ALTERNATIVE` block showing how to switch to `GET /api/precheck-rest/{clientAppId}` if the team chooses centralized policy-engine enforcement -**Fix:** Added `infra:` section to `azure.yaml` after `metadata:`: -```yaml -infra: - provider: terraform - path: infra/terraform - module: main -``` +**APIM Gotchas Discovered:** +- `quota-by-key` is a **fixed-window** quota, not a true calendar-month or billing-period counter. Using `renewal-period="2592000"` gives a 30-day window only. +- `quota-by-key` does **not** allow runtime policy expressions for `calls` or `renewal-period`, so a request-time `send-request` config lookup cannot directly drive native quota values. If limits come from the policy engine, deployment automation must render/import the policy with concrete values. +- Native APIM counters live inside APIM, not Redis. The policy engine can log derived usage via `/api/log-rest`, but it will not own the authoritative real-time counter state in this model. +- Native APIM status codes differ by policy: `rate-limit-by-key` returns 429, while `quota-by-key` returns 403 when the quota is exhausted. -**Verification:** -- Entry point: `infra/terraform/main.tf` exists and orchestrates all modules -- Provider config: `providers.tf` declares azurerm, azuread, azapi, random providers (all versions locked) -- Backend: azd uses remote state by default; no explicit backend needed in providers.tf -- Variables: `subscription_id` required, `location`/`workload_name`/`container_image`/`secondary_tenant_id` have defaults - -**Learning:** azd requires explicit `infra:` section to use Terraform. Without it, azd assumes Bicep. Always declare the provider in multi-IaC repos. - -**Status:** ✅ Fixed. `azd up` should now execute Terraform instead of looking for Bicep files. - -### 2026-05-14 — Auth Alignment: azd vs az Tenant Cross-Tenant Fix (COMPLETE) - -**Problem:** `azd provision --preview` hit AADSTS530084 (Conditional Access Policy / token protection violation) during Terraform provider initialization. - -**Root Cause:** Cross-tenant auth mismatch between azd and az CLI: -- `azd` was logged in as `admin@MngEnvMCAP176415.onmicrosoft.com` in tenant `99e1e9a1-3a8f-4088-ad5d-60be65ecc59a` -- `az` CLI was logged into a DIFFERENT tenant (Microsoft corporate tenant) -- Terraform `azurerm` and `azuread` providers default to az CLI auth, NOT azd auth -- When Terraform tried to create resources in the target subscription (which belongs to tenant `99e1e9a1-3a8f-4088-ad5d-60be65ecc59a`), it used az CLI credentials from the wrong tenant → cross-tenant token request hit corp-tenant Conditional Access Policy → AADSTS530084 - -**Fix:** -1. Zack ran `az login` to authenticate az CLI to the same tenant as azd (`99e1e9a1-3a8f-4088-ad5d-60be65ecc59a`) -2. Verified auth alignment: - - `az account show`: tenantId = `99e1e9a1-3a8f-4088-ad5d-60be65ecc59a`, subscription = `00c828c0-d681-4c8c-b7de-b8d72887c19e` - - `azd auth login --check-status`: logged in as `admin@MngEnvMCAP176415.onmicrosoft.com` - - ENTRA_ID_TENANT_ID in `.azure/ai-policy-engine-k8m2/.env`: `99e1e9a1-3a8f-4088-ad5d-60be65ecc59a` -3. All three match → auth aligned - -**Validation:** Ran `azd provision --preview` → **SUCCESS** -- Terraform plan generated successfully -- 77 resources to add, 0 to change, 0 to destroy -- Plan saved to: `.azure/ai-policy-engine-k8m2/infra/terraform/main.tfplan` -- No auth errors, no Conditional Access Policy violations -- Exit code 0 - -**Key Learning:** When using azd + Terraform, **both azd AND az CLI must be logged into the same tenant**. The azd CLI handles azd environment management, but Terraform providers use az CLI credentials by default. Cross-tenant auth causes CAP violations even if both identities have access to the target subscription. - -**Diagnostic Pattern:** -1. Check azd identity: `azd auth login --check-status` -2. Check az CLI identity: `az account show` → look at `tenantId` field -3. Check azd env tenant: Read `ENTRA_ID_TENANT_ID` from `.azure/{env-name}/.env` -4. All three must match. If they don't, run `az login` to align az CLI with azd's tenant. - -**Status:** ✅ Fixed. Auth aligned. Provision preview succeeds. Ready for `azd up` when Zack approves. - -### 2026-05-14T12:19:44Z — azd up Deployment Success (77 Resources + App Deploy) - -**Context:** Following successful zd provision --preview (77 resources planned, auth aligned on tenant 99e1e9a1-3a8f-4088-ad5d-60be65ecc59a), Zack approved running zd up for production deployment. - -**Execution:** -- Command: zd up --no-prompt -- Started: Async mode with periodic polling -- Duration: **9 minutes 59 seconds total** - - Provisioning: 9 minutes 8 seconds - - Deploying: 50 seconds - -**Outcome:** ✅ **SUCCESS** — Apply complete! 77 resources added, 0 changed, 0 destroyed. - -**Key Endpoints:** -- **Container App (API):** https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io/ -- **APIM Gateway:** https://ai-policy-engine-k8m2-apim.azure-api.net -- **Cosmos DB:** https://ai-policy-engine-k8m2-cosmos.documents.azure.com:443/ -- **Redis:** ai-policy-engine-k8m2-redis.eastus2.redis.azure.net -- **Key Vault:** ai-policy-engine-k8m2-kv - -**Resource Outputs (Terraform):** -- api_app_id: d5bd33f4-09b1-4602-af88-29c5ec7728e0 -- gateway_app_id: 32807fac-8694-4562-934b-3666b85f2584 -- client1_app_id: 162f014a-247e-4246-8943-51b9bee6dbae -- client2_app_id: bf3788c0-012b-4f9b-90b5-5601c0b5acab -- tenant_id: 99e1e9a1-3a8f-4088-ad5d-60be65ecc59a -- secondary_tenant_id: 6fc02161-9180-447f-b888-969c2c6c1428 -- resource_group_name: ai-policy-engine-k8m2-rg - -**Deployment Phases:** -1. **Terraform Init:** Module upgrades (identity, data, gateway, monitoring, compute, ai_services), provider initialization (local, azurerm, azapi, random, time) -2. **Terraform Apply:** 77 resources provisioned in dependency order. Longest pole: Redis Enterprise cluster (6m22s), followed by APIM service -3. **Container Publishing:** API image built, pushed to ACR (crh75aielsaei6q.azurecr.io), deployed to Container App -4. **Role Assignments:** APIM managed identity granted access to Key Vault, Cognitive Services, Container App; Container App identity granted access to Cosmos, Redis, AI Services - -**Gotchas:** -- **Redis Enterprise Creation:** Longest resource creation at 6m22s. Budget 7-10 minutes for Redis in future deployments. -- **APIM Policy Timing:** Policies applied AFTER Container App URL is available (named value dependency). No timing issues observed. -- **Parallel Provisioning:** azd overlapped packaging with provisioning. API image was ready before Container App resource creation completed. Efficient pattern. - -**Files Remain Uncommitted (per Zack's directive):** -- azure.yaml (azd Terraform provider config) -- infra/terraform/main.tfvars.json (azd variable template) - -**Status:** ✅ **Deployment validated. Infrastructure live. Awaiting commit approval from Zack.** - -### 2026-05-14T16:22:25Z — azd up Execution Complete (77 Resources, 9m59s) - -**Context:** Zack approved running `azd up` after successful provision preview validation. Auth alignment complete (azd + az CLI both on tenant 99e1e9a1-3a8f-4088-ad5d-60be65ecc59a). - -**Execution:** -- Command: `azd up --no-prompt` -- Mode: Async with periodic polling (120s initial, 300s poll interval) -- Started: 2026-05-14T16:12:26Z -- Completed: 2026-05-14T16:22:25Z - -**Timing Breakdown:** -- Terraform init + plan: 2-3m -- Terraform apply (provisioning): 9m8s -- Container image build/push: Overlapped with provisioning -- Application deployment: 51s -- **Total: 9m59s** - -**Resource Provisioning Details:** -1. Foundation (0-2m): Resource Group, Key Vault, Log Analytics, App Insights, Managed Identity -2. Data Layer (2-9m): Redis Enterprise (longest pole: 6m22s), Cosmos DB, Storage -3. Compute Layer (7-9m): Container Apps Environment, Aspire Dashboard, APIM Service -4. AI Services (4-6m): Cognitive Services, AI Services deployment -5. Gateway Layer (9-10m): APIM APIs, Operations, Policies (depends on Container App URL) -6. Access Control (9-10m): Role Assignments, Redis policies (parallel execution) - -**Output Summary:** -- 77 resources added, 0 changed, 0 destroyed -- Exit code: 0 (success) -- Terraform plan file: `.azure/ai-policy-engine-k8m2/infra/terraform/main.tfplan` - -**Service Endpoints Deployed:** -- **Container App API:** https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io/ -- **APIM Gateway:** https://ai-policy-engine-k8m2-apim.azure-api.net -- **Cosmos DB:** https://ai-policy-engine-k8m2-cosmos.documents.azure.com:443/ -- **Redis:** ai-policy-engine-k8m2-redis.eastus2.redis.azure.net -- **Key Vault:** ai-policy-engine-k8m2-kv -- **Log Analytics:** ai-policy-engine-k8m2-la - -**Terraform Outputs:** -- api_app_id: d5bd33f4-09b1-4602-af88-29c5ec7728e0 -- gateway_app_id: 32807fac-8694-4562-934b-3666b85f2584 -- client1_app_id: 162f014a-247e-4246-8943-51b9bee6dbae -- client2_app_id: bf3788c0-012b-4f9b-90b5-5601c0b5acab -- tenant_id: 99e1e9a1-3a8f-4088-ad5d-60be65ecc59a -- secondary_tenant_id: 6fc02161-9180-447f-b888-969c2c6c1428 -- resource_group_name: ai-policy-engine-k8m2-rg - -**Health Validation:** -- API `/health` endpoint: 200 OK -- APIM gateway reachable -- Cosmos + Redis connectivity verified -- All role assignments cascade-applied successfully - -**Key Insights:** -- Redis Enterprise provisioning is indeed the longest pole (6m22s). Budget 7-10 minutes for future large deployments. -- azd overlaps container image build with infrastructure provisioning. Image was ready before Container App resource finished creating. -- Terraform dependency graph executed efficiently. No resource conflicts or timing issues. -- APIM policies correctly depend on Container App URL availability. No manual intervention needed. - -**Files Awaiting Commit Approval:** -- `azure.yaml` (Terraform provider config) -- `infra/terraform/main.tfvars.json` (azd variable template) -- Per Zack's directive: don't commit until deployment is validated (now complete). - -**SKILL.md Created:** -- `.squad/skills/azd-terraform-large-deployment/SKILL.md` — Comprehensive guide for large Terraform deployments with azd. Covers auth alignment, provider configuration, timing expectations, resource ordering, troubleshooting, and validation patterns. - -**Status:** ✅ **Deployment complete. Infrastructure validated and live. Ready for commit approval.** +**Coordination Note:** McNulty's current inbox proposal prefers `/api/precheck-rest` as the primary enforcement model. The draft policy still shipped with the requested native-APIM default and a switchable commented alternative so the coordinator can align the final direction later. diff --git a/.squad/decisions.md b/.squad/decisions.md index e8620f3c..bb029057 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -1,5 +1,15 @@ # Squad Decisions +## Recent Sessions + +### 2026-05-21 — APIM Policy Management inbox merge +- **Goals:** capture the accepted APIM Policy Management architecture; record that prior non-AI limits work is paused. +- **Agents:** McNulty defined the accepted APIM management architecture and the earlier paused non-AI limits shape; Sydnor produced the parked non-AI REST APIM policy draft and related APIM policy/template inputs; Freamon and Kima are carrying backend/UI APIM management work on `feature/apim-policy-management`; Bunk drafted non-AI test coverage and has separate test changes pending merge; Scribe merged the inbox and archived source notes. +- **Files:** `git status --short` showed 35 changed paths at merge time (`27` modified, `8` untracked). +- **Tests baseline:** 219 passing / 4 skipped; Bunk's new tests are still pending merge and are not recorded here as landed. +- **Branch:** `feature/apim-policy-management` +- **Open items:** PR not yet opened; Bunk test changes pending merge. + ## Active Decisions ### 2026-04-17T15:52:16Z: User directive — Agent365 SDK integration @@ -131,6 +141,87 @@ infra: **Impact:** `azd provision` now correctly reads Terraform variables from the template. File is intentionally uncommitted per Zack's directive to validate infra fixes before committing. **Validation:** Ran `azd provision --preview` — "file not found" error resolved; Terraform plan succeeds. +### 2026-05-14T15:54:01Z: Redirect URI Registration in Postprovision Hook +**By:** Sydnor (Infra/DevOps) +**Status:** Implemented +**What:** The postprovision hook (scripts/postprovision.ps1 and scripts/postprovision.sh) must register redirect URIs on the Terraform-managed API app (api_app_id), query the actual Container App FQDN from Azure (not Terraform state), and be idempotent by not duplicating URIs. +**Why:** Fixes AADSTS500113 error when logging into dashboard. UI was configured to use Terraform-managed app, but postprovision was setting URIs on legacy app. Pattern queries live Azure resources, handles SPA flow correctly, and remains idempotent across re-runs. +**Implementation:** PowerShell pattern uses Graph API (GET spa, PATCH spa.redirectUris); Bash equivalent provided. Registers Container App FQDN as SPA redirect URI (MSAL.js SPA flow requirement). +**Impact:** Dashboard login now works. Redirect URI verified on correct app (d5bd33f4-09b1-4602-af88-29c5ec7728e0). + +### 2026-05-14T15:54:02Z: UI-to-API URL Wiring: Same-Origin Pattern for Container Apps +**By:** Sydnor (Infra/DevOps) +**Status:** Implemented +**What:** For React SPAs served FROM the same Container App as their API, use relative URLs by setting VITE_API_URL= (empty string) in the preprovision script. This works because the UI is built into the API's wwwroot folder, and both are served from the same FQDN. +**Why:** Eliminates hardcoded URLs (which go stale when Container App FQDN changes), avoids CORS issues, removes timing problems with Container App FQDN not being known until after azd provision. +**Implementation:** Updated scripts/preprovision.ps1 and scripts/preprovision.sh to write VITE_API_URL= (empty string). UI's api.ts reads: const API_BASE = import.meta.env.VITE_API_URL || ""; (defaults to empty, uses relative URLs). +**Trade-off:** Only works when UI and API are in the SAME Container App. If separate CAs, use runtime environment variable injection or postprovision script to query API CA FQDN and rewrite config. +**Impact:** Dashboard now makes relative API calls to same origin. Timeout issues resolved. No stale URL references. + +### 2026-05-15T16:45:18Z: Postprovision Scripts Must Use Terraform Output Variable Names +**By:** Sydnor (Infra/DevOps) +**Status:** Implemented +**What:** Postprovision scripts that read Terraform outputs via `azd env get-values` must use Terraform's exact output variable names (snake_case like `resource_group_name`, `cosmos_endpoint`), NOT azd's built-in environment variable names (SCREAMING_CASE like `AZURE_RESOURCE_GROUP`, `COSMOS_ENDPOINT`). +**Why:** On a fresh `azd up`, Terraform writes its outputs to the azd environment using the exact names defined in `outputs.tf`. If postprovision scripts query different variable names, they silently fail to find the values and skip critical configuration steps (like redirect URI registration). This gotcha doesn't surface on update-style deploys because the azd environment often has BOTH variable naming conventions cached from prior manual runs or legacy scripts. +**Implementation:** Updated `scripts/postprovision.ps1` lines 11–12 to use `resource_group_name` and `cosmos_endpoint` instead of the legacy `AZURE_RESOURCE_GROUP` and `COSMOS_ENDPOINT`. Script now correctly resolves resource group and proceeds with Cosmos RBAC assignment and SPA redirect URI registration. +**Impact:** Fresh `azd up` now completes end-to-end without manual postprovision fixes. Redirect URI registration and Cosmos RBAC assignment work on first deploy. +**Trade-offs:** None — this is a strict bug fix. +**Verification:** Tested via fresh `azd up` → `azd hooks run postprovision`: resource group resolved correctly, Cosmos RBAC assigned successfully, redirect URI registered on Terraform-managed API app. + +### 2026-05-15T16:52:32Z: Auto-Assign AIPolicy.Admin App Role in Postprovision Hook +**By:** Sydnor (Infra/DevOps) +**Status:** Implemented +**What:** Postprovision scripts now auto-assign the deploying user to the `AIPolicy.Admin` app role so the portal is immediately usable after deployment without manual role assignment. Updated `scripts/postprovision.ps1` and `scripts/postprovision.sh` to query the current signed-in user, check if role is already assigned (idempotent), and create the appRoleAssignment via Graph API if needed. +**Why:** HTTP 403 errors on routing-policies endpoints after fresh `azd up` — Terraform only assigns app roles to service principals, not human users. Without the role, the deploying user is authenticated but not authorized. Postprovision can access the deploying user via `az ad signed-in-user show`. +**How:** (1) Query signed-in user object ID, (2) Query API app service principal and AIPolicy.Admin role ID, (3) Check existing assignments (idempotent), (4) Create appRoleAssignment via `az rest` POST, (5) Emit warning to logout/login for token refresh. +**Trade-offs:** Pro: Portal works immediately; idempotent; matches existing patterns. Con: Token refresh required (user must logout/login); only assigns deploying user (teammates need separate assignment or manual assignment). +**Key Learning:** App role assignments don't retroactively affect existing tokens — users must obtain fresh token via logout/login or token expiry. +**Validation:** Manual role assignment via Graph API succeeded; postprovision scripts updated with idempotent assignment logic; awaiting Zack's logout/login validation. + +## 2026-05-16 — Non-AI API Usage Limits Architecture +**Owner:** McNulty +**Status:** paused +**Source:** `.squad/decisions/inbox/mcnulty-non-ai-api-limits-architecture.md` (will be archived after merge) + +This proposal extended plan management with additive non-AI request limits: flat `NonAiRequestsPerMinute` and `NonAiMonthlyRequestQuota` fields on plan models, `NonAiCurrentPeriodRequests` on client assignments, and dedicated `/api/precheck-rest/{clientAppId}/{tenantId}` plus `/api/log-rest` endpoints for centralized enforcement and accounting. The rationale was to keep non-AI traffic out of the AI precheck hot path while preserving dashboard visibility through the policy engine's Redis/Cosmos usage model. + +Implementation is paused per Zack's instruction. The parked XML/template artifacts now live under `.squad/files/non-ai-paused/`, and any future resume should reconcile this earlier endpoint-first design with the accepted APIM Policy Management architecture that now treats non-AI XML as template seed material instead of a standalone rollout. + +**Related:** [APIM Policy Management Architecture](#2026-05-16--apim-policy-management-architecture), `.squad/files/non-ai-paused/`, [Non-AI APIM Policy Contract](#2026-05-21--non-ai-apim-policy-contract), [Non-AI API Limits Test Coverage Strategy](#2026-05-21--non-ai-api-limits-test-coverage-strategy) + +## 2026-05-16 — APIM Policy Management Architecture +**Owner:** McNulty +**Status:** accepted +**Source:** `.squad/decisions/inbox/mcnulty-apim-management-architecture.md` (will be archived after merge) + +The session's headline decision is Tier B APIM policy management: admins choose from shipped templates, fill validated parameters, and let the engine render and apply XML through APIM management APIs. Raw XML editing and drift management are deferred. A new APIM assignment document type lives in the existing Cosmos `configuration` container, and the UI/API surface centers on catalog, assignment, apply, and clear flows for APIs and operations. + +Runtime control goes through `Azure.ResourceManager.ApiManagement` using the Container App managed identity and a narrow custom APIM policy role, with `APIM_RESOURCE_ID` as the service locator. Existing policy XML files become seed templates under `policies/templates/`; this includes the parked non-AI REST policy when that work resumes. Multi-APIM support and drift detection remain open follow-ons, not launch requirements. + +**Related:** [Non-AI API Usage Limits Architecture](#2026-05-16--non-ai-api-usage-limits-architecture), `policies/templates/`, `infra/terraform/modules/gateway/`, `src/AIPolicyEngine.Api/Endpoints/ApimManagementEndpoints.cs` + +## 2026-05-21 — Non-AI APIM Policy Contract +**Owner:** Sydnor +**Status:** paused +**Source:** `.squad/decisions/inbox/sydnor-non-ai-apim-policy.md` (will be archived after merge) + +Sydnor drafted a non-AI REST APIM policy around Entra JWT validation, native APIM `rate-limit-by-key` and `quota-by-key` enforcement, backend routing through `{{NonAiBackendUrl}}`, and fire-and-forget accounting to `/api/log-rest`. The XML also carries a commented `/api/precheck-rest` alternative using APIM managed identity so the team can pivot back to centralized enforcement without redesigning the full contract. + +The draft also captured APIM constraints that matter if this path returns: quota windows are fixed 30-day periods rather than billing months, quota values are deployment-time constants, native counters stay in APIM rather than Redis, and quota/rate policies block with different status codes. Because the non-AI work is paused, this draft is now reference material for the parked files and for later templateization under APIM Policy Management. + +**Related:** [Non-AI API Usage Limits Architecture](#2026-05-16--non-ai-api-usage-limits-architecture), [APIM Policy Management Architecture](#2026-05-16--apim-policy-management-architecture), `.squad/files/non-ai-paused/`, `.squad/files/non-ai-paused/entra-jwt-rest-policy.xml` + +## 2026-05-21 — Non-AI API Limits Test Coverage Strategy +**Owner:** Bunk +**Status:** paused +**Source:** `.squad/decisions/inbox/bunk-non-ai-test-plan.md` (will be archived after merge) + +Bunk proposed a three-layer test strategy for the non-AI limits feature: endpoint coverage in `src/AIPolicyEngine.Tests/EndpointTests.cs`, deeper integration tests for counter isolation and rollover semantics, and NBomber load coverage for high-throughput non-AI precheck traffic across multiple customers. The plan intentionally mirrors the repository's existing split between endpoint, integration, and load-test suites. + +This remains a supporting reference only. It depends on final endpoint/schema decisions, settled `0 = unlimited` and rejected-request semantics, and preferably a clock seam for rollover tests; Bunk's separate implementation work is still pending merge and is not recorded here as landed. + +**Related:** [Non-AI API Usage Limits Architecture](#2026-05-16--non-ai-api-usage-limits-architecture), [Non-AI APIM Policy Contract](#2026-05-21--non-ai-apim-policy-contract), `src/AIPolicyEngine.Tests/EndpointTests.cs`, `src/AIPolicyEngine.LoadTest/Program.cs` + ## Governance - All meaningful changes require team consensus diff --git a/infra/terraform/main.tf b/infra/terraform/main.tf index db8c3297..9ab814f1 100644 --- a/infra/terraform/main.tf +++ b/infra/terraform/main.tf @@ -7,8 +7,9 @@ # --------------------------------------------------------------------------- locals { - workload_token = lower(replace(var.workload_name, "/[^a-z0-9]/", "")) - name_prefix = var.workload_name + workload_token = lower(replace(var.workload_name, "/[^a-z0-9]/", "")) + name_prefix = var.workload_name + apim_resource_id = "/subscriptions/${var.subscription_id}/resourceGroups/${azurerm_resource_group.this.name}/providers/Microsoft.ApiManagement/service/${local.name_prefix}-apim" tags = { workload = var.workload_name managed_by = "terraform" @@ -96,23 +97,24 @@ module "identity" { module "compute" { source = "./modules/compute" - name_prefix = local.name_prefix - location = azurerm_resource_group.this.location - resource_group_name = azurerm_resource_group.this.name - subscription_id = var.subscription_id - log_analytics_workspace_id = module.monitoring.log_analytics_workspace_id + name_prefix = local.name_prefix + location = azurerm_resource_group.this.location + resource_group_name = azurerm_resource_group.this.name + subscription_id = var.subscription_id + log_analytics_workspace_id = module.monitoring.log_analytics_workspace_id log_analytics_workspace_customer_id = module.monitoring.log_analytics_workspace_customer_id - log_analytics_workspace_shared_key = module.monitoring.log_analytics_workspace_shared_key - redis_hostname = module.data.redis_hostname - redis_port = module.data.redis_port - cosmos_endpoint = module.data.cosmos_endpoint - ai_service_endpoint = module.ai_services.endpoint - app_insights_connection_string = module.monitoring.app_insights_connection_string - purview_client_app_id = var.purview_client_app_id - entra_tenant_id = module.identity.tenant_id - api_app_id = module.identity.api_app_id - container_image = var.container_image - tags = local.tags + log_analytics_workspace_shared_key = module.monitoring.log_analytics_workspace_shared_key + redis_hostname = module.data.redis_hostname + redis_port = module.data.redis_port + cosmos_endpoint = module.data.cosmos_endpoint + ai_service_endpoint = module.ai_services.endpoint + apim_resource_id = local.apim_resource_id + app_insights_connection_string = module.monitoring.app_insights_connection_string + purview_client_app_id = var.purview_client_app_id + entra_tenant_id = module.identity.tenant_id + api_app_id = module.identity.api_app_id + container_image = var.container_image + tags = local.tags } # --------------------------------------------------------------------------- @@ -122,26 +124,27 @@ module "compute" { module "gateway" { source = "./modules/gateway" - name_prefix = local.name_prefix - location = azurerm_resource_group.this.location - resource_group_name = azurerm_resource_group.this.name - sku = var.apim_sku - publisher_email = var.apim_publisher_email - publisher_name = var.apim_publisher_name - api_spec_url = var.openai_api_spec_url - ai_service_endpoint = module.ai_services.endpoint - ai_service_id = module.ai_services.id - container_app_fqdn = module.compute.container_app_fqdn - container_app_id = module.compute.container_app_id - api_app_id = module.identity.api_app_id - gateway_app_id = module.identity.gateway_app_id - tenant_id = module.identity.tenant_id - key_vault_id = module.compute.key_vault_id - enable_jwt = var.enable_jwt - enable_keys = var.enable_keys + name_prefix = local.name_prefix + location = azurerm_resource_group.this.location + resource_group_name = azurerm_resource_group.this.name + sku = var.apim_sku + publisher_email = var.apim_publisher_email + publisher_name = var.apim_publisher_name + api_spec_url = var.openai_api_spec_url + ai_service_endpoint = module.ai_services.endpoint + ai_service_id = module.ai_services.id + container_app_fqdn = module.compute.container_app_fqdn + container_app_id = module.compute.container_app_id + container_app_principal_id = module.compute.container_app_principal_id + api_app_id = module.identity.api_app_id + gateway_app_id = module.identity.gateway_app_id + tenant_id = module.identity.tenant_id + key_vault_id = module.compute.key_vault_id + enable_jwt = var.enable_jwt + enable_keys = var.enable_keys jwt_policy_xml_path = "${path.module}/../../policies/entra-jwt-policy.xml" subscription_key_policy_xml_path = "${path.module}/../../policies/subscription-key-policy.xml" - tags = local.tags + tags = local.tags } # ============================================================================= diff --git a/infra/terraform/modules/compute/main.tf b/infra/terraform/modules/compute/main.tf index b4804150..e662a3d4 100644 --- a/infra/terraform/modules/compute/main.tf +++ b/infra/terraform/modules/compute/main.tf @@ -222,6 +222,11 @@ resource "azurerm_container_app" "this" { value = var.cosmos_endpoint } + env { + name = "APIM_RESOURCE_ID" + value = var.apim_resource_id + } + env { name = "AZURE_AI_ENDPOINT" value = var.ai_service_endpoint diff --git a/infra/terraform/modules/compute/variables.tf b/infra/terraform/modules/compute/variables.tf index 2ecd887a..833fb9d4 100644 --- a/infra/terraform/modules/compute/variables.tf +++ b/infra/terraform/modules/compute/variables.tf @@ -54,6 +54,11 @@ variable "ai_service_endpoint" { type = string } +variable "apim_resource_id" { + description = "Resource ID of the API Management service managed by the engine." + type = string +} + variable "app_insights_connection_string" { description = "Application Insights connection string." type = string diff --git a/infra/terraform/modules/gateway/main.tf b/infra/terraform/modules/gateway/main.tf index af1777ed..a32b64f7 100644 --- a/infra/terraform/modules/gateway/main.tf +++ b/infra/terraform/modules/gateway/main.tf @@ -174,7 +174,7 @@ resource "azurerm_api_management_api" "openai_key_based" { subscription_key_parameter_names { header = "api-key" - query = "api-key" + query = "api-key" } dynamic "import" { @@ -225,6 +225,37 @@ resource "azapi_resource" "openai_key_based_policy" { # Role Assignments # ============================================================================= +resource "azurerm_role_definition" "container_app_apim_manager" { + name = "AI Policy Engine APIM Manager" + scope = azurerm_api_management.this.id + description = "Least-privilege APIM policy management role for the AI Policy Engine Container App managed identity." + + permissions { + actions = [ + "Microsoft.ApiManagement/service/apis/read", + "Microsoft.ApiManagement/service/apis/operations/read", + "Microsoft.ApiManagement/service/apis/policies/read", + "Microsoft.ApiManagement/service/apis/policies/write", + "Microsoft.ApiManagement/service/apis/policies/delete", + "Microsoft.ApiManagement/service/apis/operations/policies/read", + "Microsoft.ApiManagement/service/apis/operations/policies/write", + "Microsoft.ApiManagement/service/apis/operations/policies/delete", + ] + } + + assignable_scopes = [ + azurerm_api_management.this.id, + ] +} + +resource "azurerm_role_assignment" "container_app_apim_manager" { + count = var.container_app_principal_id != "" ? 1 : 0 + + scope = azurerm_api_management.this.id + role_definition_id = azurerm_role_definition.container_app_apim_manager.role_definition_resource_id + principal_id = var.container_app_principal_id +} + # APIM identity → Cognitive Services User on AI Services resource "azurerm_role_assignment" "apim_cognitive_services_user" { scope = var.ai_service_id diff --git a/infra/terraform/modules/gateway/outputs.tf b/infra/terraform/modules/gateway/outputs.tf index d18b83c5..2ea13157 100644 --- a/infra/terraform/modules/gateway/outputs.tf +++ b/infra/terraform/modules/gateway/outputs.tf @@ -8,6 +8,11 @@ output "apim_gateway_url" { value = azurerm_api_management.this.gateway_url } +output "apim_resource_id" { + description = "Resource ID of the API Management instance." + value = azurerm_api_management.this.id +} + output "apim_principal_id" { description = "Principal ID of the API Management managed identity." value = azurerm_api_management.this.identity[0].principal_id diff --git a/infra/terraform/modules/gateway/variables.tf b/infra/terraform/modules/gateway/variables.tf index e9b88711..0f0f2b28 100644 --- a/infra/terraform/modules/gateway/variables.tf +++ b/infra/terraform/modules/gateway/variables.tf @@ -56,6 +56,12 @@ variable "container_app_id" { default = "" } +variable "container_app_principal_id" { + description = "Principal ID of the Container App managed identity for APIM role assignment. Empty string to skip." + type = string + default = "" +} + variable "api_app_id" { description = "Application (client) ID of the AI Policy API app registration." type = string diff --git a/infra/terraform/modules/identity/main.tf b/infra/terraform/modules/identity/main.tf index 549ffb27..a6017b52 100644 --- a/infra/terraform/modules/identity/main.tf +++ b/infra/terraform/modules/identity/main.tf @@ -173,9 +173,9 @@ resource "azuread_service_principal" "client1" { } resource "azuread_application_password" "client1" { - application_id = azuread_application.client1.id - display_name = "terraform" - end_date = timeadd(timestamp(), "8760h") + application_id = azuread_application.client1.id + display_name = "terraform" + end_date = timeadd(timestamp(), "8760h") lifecycle { ignore_changes = [end_date] @@ -224,9 +224,9 @@ resource "azuread_service_principal" "client2" { } resource "azuread_application_password" "client2" { - application_id = azuread_application.client2.id - display_name = "terraform" - end_date = timeadd(timestamp(), "8760h") + application_id = azuread_application.client2.id + display_name = "terraform" + end_date = timeadd(timestamp(), "8760h") lifecycle { ignore_changes = [end_date] diff --git a/infra/terraform/outputs.tf b/infra/terraform/outputs.tf index 3efe245b..4c851050 100644 --- a/infra/terraform/outputs.tf +++ b/infra/terraform/outputs.tf @@ -17,6 +17,11 @@ output "apim_gateway_url" { value = module.gateway.apim_gateway_url } +output "apim_resource_id" { + description = "Resource ID of the API Management instance." + value = module.gateway.apim_resource_id +} + output "key_vault_name" { description = "Name of the Key Vault instance." value = module.compute.key_vault_name diff --git a/infra/terraform/providers.tf b/infra/terraform/providers.tf index 35350f2a..154f4b43 100644 --- a/infra/terraform/providers.tf +++ b/infra/terraform/providers.tf @@ -30,8 +30,8 @@ provider "azurerm" { purge_soft_delete_on_destroy = false } } - subscription_id = var.subscription_id - storage_use_azuread = true + subscription_id = var.subscription_id + storage_use_azuread = true } provider "azuread" {} diff --git a/policies/templates/entra-jwt-ai-dlp/policy.xml b/policies/templates/entra-jwt-ai-dlp/policy.xml new file mode 100644 index 00000000..f39b770b --- /dev/null +++ b/policies/templates/entra-jwt-ai-dlp/policy.xml @@ -0,0 +1,311 @@ + + + + + + + + + + + {{ExpectedAudience}} + + + + + + + + + + + = 0) { + var rest = path.Substring(idx + marker.Length); + var slash = rest.IndexOf('/'); + if (slash >= 0) { return rest.Substring(0, slash); } + return rest; + } + try { + var body = context.Variables.GetValueOrDefault("requestBody"); + if (!string.IsNullOrEmpty(body)) { + var json = Newtonsoft.Json.Linq.JObject.Parse(body); + var model = json["model"]?.ToString(); + if (!string.IsNullOrEmpty(model)) { return model; } + } + } catch { } + var segments = path.Split('/'); + if (segments.Length > 1) { return segments[segments.Length - 1]; } + return "unknown"; + }" /> + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + GET + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + + + + + application/json + @{ + var resp = context.Variables["precheckResponse"] as IResponse; + return resp != null ? resp.Body.As() : "{\"error\":\"Pre-check failed — could not reach authorization service\"}"; + } + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + {"error":"Pre-authorization check failed"} + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/content-check/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"]) + POST + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + application/json + + @((string)context.Variables["requestBody"]) + + + + + + + + application/json + + @{ + var resp = ((IResponse)context.Variables["contentCheckResponse"]).Body.As(); + return new JObject( + new JProperty("error", new JObject( + new JProperty("message", resp?["message"]?.ToString() ?? "Content blocked by policy"), + new JProperty("type", "content_filter"), + new JProperty("code", "content_blocked") + )) + ).ToString(); + } + + + + + + (preserveContent: true)["routedDeployment"]?.ToString())" /> + (preserveContent: true)["routingPolicyId"]?.ToString())" /> + + + + + + + + + + + + + + + + + + @{ + var rawBody = context.Variables.GetValueOrDefault("requestBody"); + var requestBody = Newtonsoft.Json.Linq.JObject.Parse(rawBody); + if (requestBody["stream"] != null && (bool)requestBody["stream"] == true) { + requestBody["stream_options"] = Newtonsoft.Json.Linq.JObject.Parse(@"{""include_usage"":true}"); + } + return requestBody.ToString(); + } + + + + + + + + + + + l.Trim().StartsWith("data:") && l.Contains("\"usage\"") && !l.Contains("[DONE]")) + .LastOrDefault(); + if (chunkLine != null) { + int index = chunkLine.IndexOf('{'); + string jsonPart = chunkLine.Substring(index); + return Newtonsoft.Json.Linq.JObject.Parse(jsonPart); + } + return null; + } else { + return Newtonsoft.Json.Linq.JObject.Parse(txt); + } + }" /> + + + + + {{ContainerAppUrl}}/api/log + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var requestBodyRaw = context.Variables.GetValueOrDefault("requestBody"); + var parsedResponseString = context.Variables.GetValueOrDefault("parsedResponseString"); + var deploymentId = context.Variables.GetValueOrDefault("deploymentId"); + var requestedDeploymentId = context.Variables.ContainsKey("originalDeploymentId") + ? context.Variables.GetValueOrDefault("originalDeploymentId") + : deploymentId; + Newtonsoft.Json.Linq.JToken requestBody = null; + if (!string.IsNullOrEmpty(requestBodyRaw)) { + try { + requestBody = Newtonsoft.Json.Linq.JToken.Parse(requestBodyRaw); + } catch { + requestBody = Newtonsoft.Json.Linq.JValue.CreateString(requestBodyRaw); + } + } + Newtonsoft.Json.Linq.JToken responseBody = null; + if (!string.IsNullOrEmpty(parsedResponseString)) { + try { + responseBody = Newtonsoft.Json.Linq.JToken.Parse(parsedResponseString); + } catch { + responseBody = Newtonsoft.Json.Linq.JValue.CreateString(parsedResponseString); + } + } + var payload = new JObject(); + payload.Add(new JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new JProperty("audience", (string)context.Variables.GetValueOrDefault("audience", ""))); + payload.Add(new JProperty("requestBody", requestBody)); + payload.Add(new JProperty("responseBody", responseBody)); + payload.Add(new JProperty("deploymentId", deploymentId ?? "")); + payload.Add(new JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); + payload.Add(new JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); + payload.Add(new JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + + @(context.LastError.Source) + + + @(context.LastError.Reason) + + + @(context.LastError.Message) + + + @(context.LastError.Scope) + + + @(context.LastError.Section) + + + @(context.LastError.Path) + + + @(context.LastError.PolicyId) + + + @(context.Response.StatusCode.ToString()) + + + diff --git a/policies/templates/entra-jwt-ai-dlp/template.json b/policies/templates/entra-jwt-ai-dlp/template.json new file mode 100644 index 00000000..5fcab9cb --- /dev/null +++ b/policies/templates/entra-jwt-ai-dlp/template.json @@ -0,0 +1,26 @@ +{ + "id": "entra-jwt-ai-dlp", + "displayName": "Entra JWT — AI + DLP", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "ExpectedAudience", + "type": "string", + "required": true, + "description": "Expected aud claim in incoming Entra access tokens." + }, + { + "name": "ContainerAppAudience", + "type": "string", + "required": true, + "description": "Audience/resource APIM uses when acquiring a managed identity token for the AI Policy Engine." + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "Base URL for the AI Policy Engine used by precheck, content-check, and log callbacks." + } + ] +} diff --git a/policies/templates/entra-jwt-ai/policy.xml b/policies/templates/entra-jwt-ai/policy.xml new file mode 100644 index 00000000..1f738a65 --- /dev/null +++ b/policies/templates/entra-jwt-ai/policy.xml @@ -0,0 +1,268 @@ + + + + + + + + + + {{ExpectedAudience}} + + + + + + + + + + + = 0) { + var rest = path.Substring(idx + marker.Length); + var slash = rest.IndexOf('/'); + if (slash >= 0) { return rest.Substring(0, slash); } + return rest; + } + try { + var body = context.Variables.GetValueOrDefault("requestBody"); + if (!string.IsNullOrEmpty(body)) { + var json = Newtonsoft.Json.Linq.JObject.Parse(body); + var model = json["model"]?.ToString(); + if (!string.IsNullOrEmpty(model)) { return model; } + } + } catch { } + var segments = path.Split('/'); + if (segments.Length > 1) { return segments[segments.Length - 1]; } + return "unknown"; + }" /> + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + GET + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + + + + + application/json + @{ + var resp = context.Variables["precheckResponse"] as IResponse; + return resp != null ? resp.Body.As() : "{\"error\":\"Pre-check failed — could not reach authorization service\"}"; + } + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + {"error":"Pre-authorization check failed"} + + + + + + (preserveContent: true)["routedDeployment"]?.ToString())" /> + (preserveContent: true)["routingPolicyId"]?.ToString())" /> + + + + + + + + + + + + + + + + + + @{ + var rawBody = context.Variables.GetValueOrDefault("requestBody"); + var requestBody = Newtonsoft.Json.Linq.JObject.Parse(rawBody); + if (requestBody["stream"] != null && (bool)requestBody["stream"] == true) { + requestBody["stream_options"] = Newtonsoft.Json.Linq.JObject.Parse(@"{""include_usage"":true}"); + } + return requestBody.ToString(); + } + + + + + + + + + + + l.Trim().StartsWith("data:") && l.Contains("\"usage\"") && !l.Contains("[DONE]")) + .LastOrDefault(); + if (chunkLine != null) { + int index = chunkLine.IndexOf('{'); + string jsonPart = chunkLine.Substring(index); + return Newtonsoft.Json.Linq.JObject.Parse(jsonPart); + } + return null; + } else { + return Newtonsoft.Json.Linq.JObject.Parse(txt); + } + }" /> + + + + + {{ContainerAppUrl}}/api/log + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var requestBodyRaw = context.Variables.GetValueOrDefault("requestBody"); + var parsedResponseString = context.Variables.GetValueOrDefault("parsedResponseString"); + var deploymentId = context.Variables.GetValueOrDefault("deploymentId"); + var requestedDeploymentId = context.Variables.ContainsKey("originalDeploymentId") + ? context.Variables.GetValueOrDefault("originalDeploymentId") + : deploymentId; + object requestBodyValue = null; + if (!string.IsNullOrEmpty(requestBodyRaw)) { + try { + requestBodyValue = Newtonsoft.Json.Linq.JToken.Parse(requestBodyRaw); + } catch { + requestBodyValue = requestBodyRaw; + } + } + object responseBodyValue = null; + if (!string.IsNullOrEmpty(parsedResponseString)) { + try { + responseBodyValue = Newtonsoft.Json.Linq.JToken.Parse(parsedResponseString); + } catch { + responseBodyValue = parsedResponseString; + } + } + var payload = new Newtonsoft.Json.Linq.JObject(); + payload.Add(new Newtonsoft.Json.Linq.JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("audience", (string)context.Variables.GetValueOrDefault("audience", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestBody", requestBodyValue)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("responseBody", responseBodyValue)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("deploymentId", deploymentId ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + + @(context.LastError.Source) + + + @(context.LastError.Reason) + + + @(context.LastError.Message) + + + @(context.LastError.Scope) + + + @(context.LastError.Section) + + + @(context.LastError.Path) + + + @(context.LastError.PolicyId) + + + @(context.Response.StatusCode.ToString()) + + + diff --git a/policies/templates/entra-jwt-ai/template.json b/policies/templates/entra-jwt-ai/template.json new file mode 100644 index 00000000..b300ffb6 --- /dev/null +++ b/policies/templates/entra-jwt-ai/template.json @@ -0,0 +1,26 @@ +{ + "id": "entra-jwt-ai", + "displayName": "Entra JWT — AI", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "ExpectedAudience", + "type": "string", + "required": true, + "description": "Expected aud claim in incoming Entra access tokens." + }, + { + "name": "ContainerAppAudience", + "type": "string", + "required": true, + "description": "Audience/resource APIM uses when acquiring a managed identity token for the AI Policy Engine." + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "Base URL for the AI Policy Engine used by precheck and log callbacks." + } + ] +} diff --git a/policies/templates/entra-jwt-rest/policy.xml b/policies/templates/entra-jwt-rest/policy.xml new file mode 100644 index 00000000..40334a83 --- /dev/null +++ b/policies/templates/entra-jwt-rest/policy.xml @@ -0,0 +1,124 @@ + + + + + + + + + + {{ExpectedAudience}} + + + + + + + + + + + + + + + + + + + + + + + + {{ContainerAppUrl}}/api/log-rest + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var latencyMs = (long)(DateTime.UtcNow - context.Timestamp).TotalMilliseconds; + var payload = new Newtonsoft.Json.Linq.JObject(); + payload.Add(new Newtonsoft.Json.Linq.JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("customerKey", (string)context.Variables.GetValueOrDefault("customerKey", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestPath", context.Request.Url.Path)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("statusCode", context.Response.StatusCode)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("latencyMs", latencyMs)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + diff --git a/policies/templates/entra-jwt-rest/template.json b/policies/templates/entra-jwt-rest/template.json new file mode 100644 index 00000000..c7be9c69 --- /dev/null +++ b/policies/templates/entra-jwt-rest/template.json @@ -0,0 +1,53 @@ +{ + "id": "entra-jwt-rest", + "displayName": "Entra JWT — Non-AI REST (Rate Limit + Quota)", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "EntraTenantId", + "type": "string", + "required": false, + "default": "common", + "description": "Reserved Entra tenant named value reference kept in the template comments for parity with the policy family." + }, + { + "name": "ExpectedAudience", + "type": "string", + "required": true, + "description": "Expected aud claim in incoming Entra access tokens." + }, + { + "name": "NonAiRequestsPerMinute", + "type": "int", + "required": true, + "default": 60, + "description": "Native APIM rate-limit-by-key calls-per-minute value." + }, + { + "name": "NonAiMonthlyRequestQuota", + "type": "int", + "required": true, + "default": 10000, + "description": "Native APIM quota-by-key monthly request cap." + }, + { + "name": "NonAiBackendUrl", + "type": "string", + "required": true, + "description": "Target REST backend host/FQDN routed through APIM." + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "Base URL for the AI Policy Engine used by log callbacks and the commented precheck-rest alternative." + }, + { + "name": "ContainerAppAudience", + "type": "string", + "required": true, + "description": "Audience/resource APIM uses when acquiring a managed identity token for the AI Policy Engine." + } + ] +} diff --git a/policies/templates/subscription-key-ai-dlp/policy.xml b/policies/templates/subscription-key-ai-dlp/policy.xml new file mode 100644 index 00000000..6b294826 --- /dev/null +++ b/policies/templates/subscription-key-ai-dlp/policy.xml @@ -0,0 +1,298 @@ + + + + + + + + + + + + + = 0) { + var rest = path.Substring(idx + marker.Length); + var slash = rest.IndexOf('/'); + if (slash >= 0) { return rest.Substring(0, slash); } + return rest; + } + try { + var body = context.Variables.GetValueOrDefault("requestBody"); + if (!string.IsNullOrEmpty(body)) { + var json = Newtonsoft.Json.Linq.JObject.Parse(body); + var model = json["model"]?.ToString(); + if (!string.IsNullOrEmpty(model)) { return model; } + } + } catch { } + var segments = path.Split('/'); + if (segments.Length > 1) { return segments[segments.Length - 1]; } + return "unknown"; + }" /> + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + GET + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + + + + + application/json + @{ + var resp = context.Variables["precheckResponse"] as IResponse; + return resp != null ? resp.Body.As() : "{\"error\":\"Pre-check failed — could not reach authorization service\"}"; + } + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + {"error":"Pre-authorization check failed"} + + + + + + (preserveContent: true)["routedDeployment"]?.ToString())" /> + (preserveContent: true)["routingPolicyId"]?.ToString())" /> + + + + + + + + + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/content-check/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"]) + POST + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + application/json + + @((string)context.Variables["requestBody"]) + + + + + + + + application/json + + @{ + var resp = ((IResponse)context.Variables["contentCheckResponse"]).Body.As(); + return new JObject( + new JProperty("error", new JObject( + new JProperty("message", resp?["message"]?.ToString() ?? "Content blocked by policy"), + new JProperty("type", "content_filter"), + new JProperty("code", "content_blocked") + )) + ).ToString(); + } + + + + + + + + + + @{ + var rawBody = context.Variables.GetValueOrDefault("requestBody"); + var requestBody = Newtonsoft.Json.Linq.JObject.Parse(rawBody); + if (requestBody["stream"] != null && (bool)requestBody["stream"] == true) { + requestBody["stream_options"] = Newtonsoft.Json.Linq.JObject.Parse(@"{""include_usage"":true}"); + } + return requestBody.ToString(); + } + + + + + + + + + + + l.Trim().StartsWith("data:") && l.Contains("\"usage\"") && !l.Contains("[DONE]")) + .LastOrDefault(); + if (chunkLine != null) { + int index = chunkLine.IndexOf('{'); + string jsonPart = chunkLine.Substring(index); + return Newtonsoft.Json.Linq.JObject.Parse(jsonPart); + } + return null; + } else { + return Newtonsoft.Json.Linq.JObject.Parse(txt); + } + }" /> + + + + + {{ContainerAppUrl}}/api/log + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var requestBodyRaw = context.Variables.GetValueOrDefault("requestBody"); + var parsedResponseString = context.Variables.GetValueOrDefault("parsedResponseString"); + var deploymentId = context.Variables.GetValueOrDefault("deploymentId"); + var requestedDeploymentId = context.Variables.ContainsKey("originalDeploymentId") + ? context.Variables.GetValueOrDefault("originalDeploymentId") + : deploymentId; + Newtonsoft.Json.Linq.JToken requestBody = null; + if (!string.IsNullOrEmpty(requestBodyRaw)) { + try { + requestBody = Newtonsoft.Json.Linq.JToken.Parse(requestBodyRaw); + } catch { + requestBody = Newtonsoft.Json.Linq.JValue.CreateString(requestBodyRaw); + } + } + Newtonsoft.Json.Linq.JToken responseBody = null; + if (!string.IsNullOrEmpty(parsedResponseString)) { + try { + responseBody = Newtonsoft.Json.Linq.JToken.Parse(parsedResponseString); + } catch { + responseBody = Newtonsoft.Json.Linq.JValue.CreateString(parsedResponseString); + } + } + var payload = new JObject(); + payload.Add(new JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new JProperty("audience", (string)context.Variables.GetValueOrDefault("audience", ""))); + payload.Add(new JProperty("requestBody", requestBody)); + payload.Add(new JProperty("responseBody", responseBody)); + payload.Add(new JProperty("deploymentId", deploymentId ?? "")); + payload.Add(new JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); + payload.Add(new JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); + payload.Add(new JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + + @(context.LastError.Source) + + + @(context.LastError.Reason) + + + @(context.LastError.Message) + + + @(context.LastError.Scope) + + + @(context.LastError.Section) + + + @(context.LastError.Path) + + + @(context.LastError.PolicyId) + + + @(context.Response.StatusCode.ToString()) + + + diff --git a/policies/templates/subscription-key-ai-dlp/template.json b/policies/templates/subscription-key-ai-dlp/template.json new file mode 100644 index 00000000..b11ad9a1 --- /dev/null +++ b/policies/templates/subscription-key-ai-dlp/template.json @@ -0,0 +1,20 @@ +{ + "id": "subscription-key-ai-dlp", + "displayName": "Subscription Key — AI + DLP", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "ContainerAppAudience", + "type": "string", + "required": true, + "description": "Audience/resource APIM uses when acquiring a managed identity token for the AI Policy Engine." + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "Base URL for the AI Policy Engine used by precheck, content-check, and log callbacks." + } + ] +} diff --git a/policies/templates/subscription-key-ai/policy.xml b/policies/templates/subscription-key-ai/policy.xml new file mode 100644 index 00000000..c0c02781 --- /dev/null +++ b/policies/templates/subscription-key-ai/policy.xml @@ -0,0 +1,255 @@ + + + + + + + + + + + + = 0) { + var rest = path.Substring(idx + marker.Length); + var slash = rest.IndexOf('/'); + if (slash >= 0) { return rest.Substring(0, slash); } + return rest; + } + try { + var body = context.Variables.GetValueOrDefault("requestBody"); + if (!string.IsNullOrEmpty(body)) { + var json = Newtonsoft.Json.Linq.JObject.Parse(body); + var model = json["model"]?.ToString(); + if (!string.IsNullOrEmpty(model)) { return model; } + } + } catch { } + var segments = path.Split('/'); + if (segments.Length > 1) { return segments[segments.Length - 1]; } + return "unknown"; + }" /> + + + + + + @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + GET + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + + + + + + application/json + @{ + var resp = context.Variables["precheckResponse"] as IResponse; + return resp != null ? resp.Body.As() : "{\"error\":\"Pre-check failed — could not reach authorization service\"}"; + } + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + @(((IResponse)context.Variables["precheckResponse"]).Body.As()) + + + + + + application/json + {"error":"Pre-authorization check failed"} + + + + + + (preserveContent: true)["routedDeployment"]?.ToString())" /> + (preserveContent: true)["routingPolicyId"]?.ToString())" /> + + + + + + + + + + + + + + + + + + @{ + var rawBody = context.Variables.GetValueOrDefault("requestBody"); + var requestBody = Newtonsoft.Json.Linq.JObject.Parse(rawBody); + if (requestBody["stream"] != null && (bool)requestBody["stream"] == true) { + requestBody["stream_options"] = Newtonsoft.Json.Linq.JObject.Parse(@"{""include_usage"":true}"); + } + return requestBody.ToString(); + } + + + + + + + + + + + l.Trim().StartsWith("data:") && l.Contains("\"usage\"") && !l.Contains("[DONE]")) + .LastOrDefault(); + if (chunkLine != null) { + int index = chunkLine.IndexOf('{'); + string jsonPart = chunkLine.Substring(index); + return Newtonsoft.Json.Linq.JObject.Parse(jsonPart); + } + return null; + } else { + return Newtonsoft.Json.Linq.JObject.Parse(txt); + } + }" /> + + + + + {{ContainerAppUrl}}/api/log + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var requestBodyRaw = context.Variables.GetValueOrDefault("requestBody"); + var parsedResponseString = context.Variables.GetValueOrDefault("parsedResponseString"); + var deploymentId = context.Variables.GetValueOrDefault("deploymentId"); + var requestedDeploymentId = context.Variables.ContainsKey("originalDeploymentId") + ? context.Variables.GetValueOrDefault("originalDeploymentId") + : deploymentId; + object requestBodyValue = null; + if (!string.IsNullOrEmpty(requestBodyRaw)) { + try { + requestBodyValue = Newtonsoft.Json.Linq.JToken.Parse(requestBodyRaw); + } catch { + requestBodyValue = requestBodyRaw; + } + } + object responseBodyValue = null; + if (!string.IsNullOrEmpty(parsedResponseString)) { + try { + responseBodyValue = Newtonsoft.Json.Linq.JToken.Parse(parsedResponseString); + } catch { + responseBodyValue = parsedResponseString; + } + } + var payload = new Newtonsoft.Json.Linq.JObject(); + payload.Add(new Newtonsoft.Json.Linq.JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("audience", (string)context.Variables.GetValueOrDefault("audience", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestBody", requestBodyValue)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("responseBody", responseBodyValue)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("deploymentId", deploymentId ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + + @(context.LastError.Source) + + + @(context.LastError.Reason) + + + @(context.LastError.Message) + + + @(context.LastError.Scope) + + + @(context.LastError.Section) + + + @(context.LastError.Path) + + + @(context.LastError.PolicyId) + + + @(context.Response.StatusCode.ToString()) + + + diff --git a/policies/templates/subscription-key-ai/template.json b/policies/templates/subscription-key-ai/template.json new file mode 100644 index 00000000..3d51f764 --- /dev/null +++ b/policies/templates/subscription-key-ai/template.json @@ -0,0 +1,20 @@ +{ + "id": "subscription-key-ai", + "displayName": "Subscription Key — AI", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "ContainerAppAudience", + "type": "string", + "required": true, + "description": "Audience/resource APIM uses when acquiring a managed identity token for the AI Policy Engine." + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "Base URL for the AI Policy Engine used by precheck and log callbacks." + } + ] +} diff --git a/src/AIPolicyEngine.Api/AIPolicyEngine.Api.csproj b/src/AIPolicyEngine.Api/AIPolicyEngine.Api.csproj index 64287cbc..49132ccf 100644 --- a/src/AIPolicyEngine.Api/AIPolicyEngine.Api.csproj +++ b/src/AIPolicyEngine.Api/AIPolicyEngine.Api.csproj @@ -19,6 +19,7 @@ + @@ -29,6 +30,13 @@ + + + + diff --git a/src/aipolicyengine-ui/src/App.tsx b/src/aipolicyengine-ui/src/App.tsx index 05a088dc..bded33c1 100644 --- a/src/aipolicyengine-ui/src/App.tsx +++ b/src/aipolicyengine-ui/src/App.tsx @@ -10,47 +10,96 @@ import { Export } from "./pages/Export" import { ClientDetail } from "./pages/ClientDetail" import { RoutingPolicies } from "./pages/RoutingPolicies" import { RequestBilling } from "./pages/RequestBilling" +import { Apis } from "./pages/Apis" import { loginRequest } from "./auth/msalConfig" import { fetchPlans } from "./api" import type { PlanData, BillingMode } from "./types" import { Button } from "./components/ui/button" import { Activity, LogIn } from "lucide-react" +const TAB_PATHS = { + dashboard: "/", + clients: "/clients", + plans: "/plans", + pricing: "/pricing", + routing: "/routing", + apis: "/apis", + requests: "/request-billing", + export: "/export", +} as const + +type TabId = keyof typeof TAB_PATHS + +function resolveTabFromPathname(pathname: string): TabId { + const normalizedPath = pathname === "/" ? "/" : pathname.replace(/\/$/, "") + const matchingEntry = Object.entries(TAB_PATHS).find(([, path]) => path === normalizedPath) + return (matchingEntry?.[0] as TabId | undefined) ?? "dashboard" +} + function App() { - const [activeTab, setActiveTab] = useState("dashboard") + const [activeTab, setActiveTab] = useState(() => resolveTabFromPathname(window.location.pathname)) const [selectedClient, setSelectedClient] = useState<{ clientAppId: string; tenantId: string } | null>(null) const [plans, setPlans] = useState([]) const isAuthenticated = useIsAuthenticated() const { instance, inProgress } = useMsal() - const loadPlans = useCallback(async () => { - try { - const res = await fetchPlans() - setPlans(res.plans ?? []) - } catch { - // Plans may not be loaded yet — billing mode defaults to token + const handleTabChange = useCallback((tab: string) => { + const nextTab = (tab in TAB_PATHS ? tab : "dashboard") as TabId + setActiveTab(nextTab) + + const nextPath = TAB_PATHS[nextTab] + if (window.location.pathname !== nextPath) { + window.history.pushState({}, "", nextPath) + } + }, []) + + useEffect(() => { + const handlePopState = () => { + setActiveTab(resolveTabFromPathname(window.location.pathname)) + setSelectedClient(null) } + + window.addEventListener("popstate", handlePopState) + return () => window.removeEventListener("popstate", handlePopState) }, []) useEffect(() => { - if (isAuthenticated) loadPlans() - }, [isAuthenticated, loadPlans]) + if (!isAuthenticated) return + + let cancelled = false + + void (async () => { + try { + const res = await fetchPlans() + if (!cancelled) { + setPlans(res.plans ?? []) + } + } catch { + if (!cancelled) { + setPlans([]) + } + } + })() + + return () => { + cancelled = true + } + }, [isAuthenticated]) - // Adaptive billing mode const billingMode: BillingMode = useMemo(() => { - if (plans.length === 0) return 'token' - const hasMultiplier = plans.some(p => p.useMultiplierBilling) - const hasToken = plans.some(p => !p.useMultiplierBilling) - if (hasMultiplier && hasToken) return 'hybrid' - if (hasMultiplier) return 'multiplier' - return 'token' + if (plans.length === 0) return "token" + const hasMultiplier = plans.some((plan) => plan.useMultiplierBilling) + const hasToken = plans.some((plan) => !plan.useMultiplierBilling) + if (hasMultiplier && hasToken) return "hybrid" + if (hasMultiplier) return "multiplier" + return "token" }, [plans]) if (inProgress !== InteractionStatus.None) { return (
- +

Authenticating…

@@ -60,16 +109,18 @@ function App() { if (!isAuthenticated) { return (
-
- -
-

AI Policy Engine Dashboard

-

Sign in with your organization account to access the dashboard.

+
+
+ +
+

AI Policy Engine Dashboard

+

Sign in with your organization account to access the dashboard.

+
+
-
) @@ -77,19 +128,20 @@ function App() { if (selectedClient) { return ( - { setSelectedClient(null); setActiveTab(tab); }} billingMode={billingMode}> + { setSelectedClient(null); handleTabChange(tab) }} billingMode={billingMode}> setSelectedClient(null)} /> ) } return ( - + {activeTab === "dashboard" && setSelectedClient({ clientAppId, tenantId })} />} {activeTab === "clients" && setSelectedClient({ clientAppId, tenantId })} />} {activeTab === "plans" && } {activeTab === "pricing" && } {activeTab === "routing" && } + {activeTab === "apis" && } {activeTab === "requests" && setSelectedClient({ clientAppId, tenantId })} />} {activeTab === "export" && } diff --git a/src/aipolicyengine-ui/src/api.ts b/src/aipolicyengine-ui/src/api.ts index d04534c6..fbf57ea8 100644 --- a/src/aipolicyengine-ui/src/api.ts +++ b/src/aipolicyengine-ui/src/api.ts @@ -2,7 +2,7 @@ import { InteractionRequiredAuthError, PublicClientApplication, type SilentReque import { msalConfig, loginRequest } from "./auth/msalConfig"; import type { ChargebackResponse, QuotasResponse, QuotaUpdateRequest, QuotaData, PlansResponse, PlanCreateRequest, PlanUpdateRequest, PlanData, ClientsResponse, ClientAssignRequest, ClientUsageResponse, ClientTracesResponse, UsageSummaryResponse, RequestLogsResponse, ModelPricingResponse, ModelPricingCreateRequest, ModelPricing, ExportPeriodsResponse, DeploymentsResponse, RoutingPoliciesResponse, ModelRoutingPolicy, ModelRoutingPolicyCreateRequest, ModelRoutingPolicyUpdateRequest, RequestSummaryResponse } from "./types"; -const API_BASE = import.meta.env.VITE_API_URL || ""; +export const API_BASE = import.meta.env.VITE_API_URL || ""; const msalInstance = new PublicClientApplication(msalConfig); let redirectInFlight = false; @@ -49,7 +49,7 @@ async function getToken(): Promise { } } -async function authFetch(url: string, options: RequestInit = {}): Promise { +export async function authFetch(url: string, options: RequestInit = {}): Promise { let token: string | null = null; try { token = await getToken(); @@ -69,7 +69,7 @@ async function authFetch(url: string, options: RequestInit = {}): Promise { +export async function parseErrorMessage(res: Response, fallback: string): Promise { const body = await res.json().catch(() => null); return body?.error || body?.message || `${fallback}: ${res.statusText}`; } @@ -129,7 +129,7 @@ export async function createPlan(data: PlanCreateRequest): Promise { return res.json(); } -export async function updatePlan(planId: string, data: PlanUpdateRequest): Promise { +export async function updatePlan(planId: string, data: PlanUpdateRequest): Promise { const res = await authFetch(`${API_BASE}/api/plans/${encodeURIComponent(planId)}`, { method: "PUT", body: JSON.stringify(data), @@ -151,7 +151,7 @@ export async function fetchClients(): Promise { return res.json(); } -export async function assignClient(clientAppId: string, tenantId: string, data: ClientAssignRequest): Promise { +export async function assignClient(clientAppId: string, tenantId: string, data: ClientAssignRequest): Promise { const res = await authFetch(`${API_BASE}/api/clients/${encodeURIComponent(clientAppId)}/${encodeURIComponent(tenantId)}`, { method: "PUT", body: JSON.stringify(data), diff --git a/src/aipolicyengine-ui/src/api/apim.ts b/src/aipolicyengine-ui/src/api/apim.ts new file mode 100644 index 00000000..771fc7cc --- /dev/null +++ b/src/aipolicyengine-ui/src/api/apim.ts @@ -0,0 +1,92 @@ +import { API_BASE, authFetch, parseErrorMessage } from "../api.ts" +import type { + ApisResponse, + ApiOperationsResponse, + ApplyPolicyRequest, + ApplyPolicyResponse, + ClearPolicyResponse, + HttpError, + PolicyDocumentResponse, + TemplatesResponse, +} from "../types/apim" + +function apiPolicyPath(apiId: string): string { + return `/api/apim/apis/${encodeURIComponent(apiId)}/policy` +} + +function operationPolicyPath(apiId: string, operationId: string): string { + return `/api/apim/apis/${encodeURIComponent(apiId)}/operations/${encodeURIComponent(operationId)}/policy` +} + +async function buildHttpError(res: Response, fallback: string): Promise { + const message = await parseErrorMessage(res, fallback) + const error = new Error(message) as HttpError + error.status = res.status + error.body = await res.clone().json().catch(() => null) + return error +} + +async function requestJson(path: string, fallback: string, options: RequestInit = {}): Promise { + const res = await authFetch(`${API_BASE}${path}`, options) + if (!res.ok) { + throw await buildHttpError(res, fallback) + } + + if (res.status === 204) { + return undefined as T + } + + return res.json() as Promise +} + +export function fetchApimApis(): Promise { + return requestJson("/api/apim/apis", "Failed to fetch APIs") +} + +export function fetchApimOperations(apiId: string): Promise { + return requestJson( + `/api/apim/apis/${encodeURIComponent(apiId)}/operations`, + "Failed to fetch API operations", + ) +} + +export function fetchApiPolicy(apiId: string): Promise { + return requestJson(apiPolicyPath(apiId), "Failed to fetch API policy") +} + +export function fetchOperationPolicy(apiId: string, operationId: string): Promise { + return requestJson( + operationPolicyPath(apiId, operationId), + "Failed to fetch operation policy", + ) +} + +export function fetchApimTemplates(): Promise { + return requestJson("/api/apim/templates", "Failed to fetch policy templates") +} + +export function applyApiPolicy(apiId: string, data: ApplyPolicyRequest): Promise { + return requestJson(apiPolicyPath(apiId), "Failed to apply API policy", { + method: "POST", + body: JSON.stringify(data), + }) +} + +export function applyOperationPolicy(apiId: string, operationId: string, data: ApplyPolicyRequest): Promise { + return requestJson(operationPolicyPath(apiId, operationId), "Failed to apply operation policy", { + method: "POST", + body: JSON.stringify(data), + }) +} + +export function clearApiPolicy(apiId: string): Promise { + return requestJson(apiPolicyPath(apiId), "Failed to clear API policy", { + method: "DELETE", + }) +} + +export function clearOperationPolicy(apiId: string, operationId: string): Promise { + return requestJson(operationPolicyPath(apiId, operationId), "Failed to clear operation policy", { + method: "DELETE", + }) +} diff --git a/src/aipolicyengine-ui/src/components/Layout.tsx b/src/aipolicyengine-ui/src/components/Layout.tsx index 1df5388a..3b208a2f 100644 --- a/src/aipolicyengine-ui/src/components/Layout.tsx +++ b/src/aipolicyengine-ui/src/components/Layout.tsx @@ -24,6 +24,7 @@ export function Layout({ children, activeTab, onTabChange, billingMode = 'token' { id: "plans", label: "Plans" }, { id: "pricing", label: "Pricing" }, { id: "routing", label: "Routing" }, + { id: "apis", label: "APIs" }, // Adaptive: show Request Billing only when at least one plan uses multiplier billing ...(billingMode !== 'token' ? [{ id: "requests", label: "Request Billing" }] : []), { id: "export", label: "Export" }, diff --git a/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx b/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx new file mode 100644 index 00000000..3f1b91ae --- /dev/null +++ b/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx @@ -0,0 +1,146 @@ +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { Card, CardContent, CardHeader, CardTitle } from "../ui/card" +import { ChevronDown, ChevronRight, Globe, RefreshCcw, Workflow } from "lucide-react" +import type { ApimApiSummary, ApimOperationSummary } from "../../types/apim" + +interface ApiTreeProps { + apis: ApimApiSummary[] + expandedApiIds: string[] + loadingOperationApiIds: string[] + operationsByApi: Record + operationErrors: Record + selectedKey?: string + onApiToggle: (api: ApimApiSummary) => void + onApiSelect: (api: ApimApiSummary) => void + onOperationSelect: (api: ApimApiSummary, operation: ApimOperationSummary) => void + onRetryOperations: (api: ApimApiSummary) => void +} + +function isExpanded(expandedApiIds: string[], apiId: string): boolean { + return expandedApiIds.includes(apiId) +} + +function isLoading(loadingOperationApiIds: string[], apiId: string): boolean { + return loadingOperationApiIds.includes(apiId) +} + +export function ApiTree({ + apis, + expandedApiIds, + loadingOperationApiIds, + operationsByApi, + operationErrors, + selectedKey, + onApiToggle, + onApiSelect, + onOperationSelect, + onRetryOperations, +}: ApiTreeProps) { + return ( + + + + + APIs + {apis.length} + + + + {apis.length === 0 ? ( +
+ No APIs available from APIM yet. +
+ ) : ( +
    + {apis.map((api) => { + const expanded = isExpanded(expandedApiIds, api.id) + const operations = operationsByApi[api.id] ?? [] + const loading = isLoading(loadingOperationApiIds, api.id) + const operationError = operationErrors[api.id] + const apiKey = `api:${api.id}` + + return ( +
  • +
    + + +
    + + {expanded && ( +
    + {loading &&
    Loading operations…
    } + + {!loading && operationError && ( +
    + {operationError} + +
    + )} + + {!loading && !operationError && operations.length === 0 && ( +
    No operations found.
    + )} + + {!loading && !operationError && operations.map((operation) => { + const operationKey = `operation:${api.id}:${operation.id}` + return ( + + ) + })} +
    + )} +
  • + ) + })} +
+ )} +
+
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx b/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx new file mode 100644 index 00000000..2ca9a1c4 --- /dev/null +++ b/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx @@ -0,0 +1,241 @@ +import { useEffect, useMemo, useState } from "react" +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { Dialog, DialogClose, DialogHeader, DialogTitle } from "../ui/dialog" +import { Input } from "../ui/input" +import type { ApimTemplateSummary, TemplateParameterDefinition } from "../../types/apim" + +interface AssignTemplateFormProps { + open: boolean + onOpenChange: (open: boolean) => void + targetKind: "api" | "operation" + templates: ApimTemplateSummary[] + initialTemplateId?: string + initialParameters?: Record + planDefaults?: Record + submitting: boolean + onSubmit: (payload: { templateId: string; parameters: Record }) => Promise +} + +function toInputValue(value: string | number | null | undefined): string { + if (value === null || value === undefined) return "" + return String(value) +} + +function buildInitialValues( + template: ApimTemplateSummary, + initialParameters: Record | undefined, + planDefaults: Record | undefined, +): Record { + return Object.fromEntries( + template.parameters.map((parameter) => { + const existingValue = initialParameters?.[parameter.name] + const planValue = planDefaults?.[parameter.name] + const fallbackValue = parameter.default + return [parameter.name, toInputValue(existingValue ?? planValue ?? fallbackValue)] + }), + ) +} + +function parameterPlaceholder(parameter: TemplateParameterDefinition): string { + if (parameter.description) return parameter.description + return parameter.type === "int" ? "Enter a whole number" : "Enter a value" +} + +export function AssignTemplateForm({ + open, + onOpenChange, + targetKind, + templates, + initialTemplateId, + initialParameters, + planDefaults, + submitting, + onSubmit, +}: AssignTemplateFormProps) { + const [selectedTemplateId, setSelectedTemplateId] = useState("") + const [parameterValues, setParameterValues] = useState>({}) + const [formError, setFormError] = useState(null) + + const filteredTemplates = useMemo(() => { + const exactMatches = templates.filter( + (template) => template.scope === targetKind || template.scope === "both", + ) + return exactMatches.length > 0 ? exactMatches : templates + }, [targetKind, templates]) + + const selectedTemplate = useMemo( + () => filteredTemplates.find((template) => template.id === selectedTemplateId) ?? null, + [filteredTemplates, selectedTemplateId], + ) + + useEffect(() => { + if (!open) return + + const preferredTemplateId = + initialTemplateId && filteredTemplates.some((template) => template.id === initialTemplateId) + ? initialTemplateId + : filteredTemplates[0]?.id ?? "" + + let cancelled = false + queueMicrotask(() => { + if (cancelled) return + setSelectedTemplateId(preferredTemplateId) + + const template = filteredTemplates.find((item) => item.id === preferredTemplateId) + setParameterValues(template ? buildInitialValues(template, initialParameters, planDefaults) : {}) + setFormError(null) + }) + + return () => { + cancelled = true + } + }, [filteredTemplates, initialParameters, initialTemplateId, open, planDefaults]) + + const missingRequiredFields = useMemo(() => { + if (!selectedTemplate) return [] + + return selectedTemplate.parameters + .filter((parameter) => parameter.required && !parameterValues[parameter.name]?.trim()) + .map((parameter) => parameter.name) + }, [parameterValues, selectedTemplate]) + + const handleTemplateChange = (templateId: string) => { + setSelectedTemplateId(templateId) + setFormError(null) + const template = filteredTemplates.find((item) => item.id === templateId) + setParameterValues(template ? buildInitialValues(template, initialParameters, planDefaults) : {}) + } + + const handleApply = async () => { + if (!selectedTemplate) { + setFormError("Choose a template before applying.") + return + } + + if (missingRequiredFields.length > 0) { + setFormError(`Complete all required fields: ${missingRequiredFields.join(", ")}`) + return + } + + const parameters = Object.fromEntries( + selectedTemplate.parameters.map((parameter) => { + const rawValue = parameterValues[parameter.name] ?? "" + if (parameter.type === "int") { + return [parameter.name, Number(rawValue)] + } + return [parameter.name, rawValue.trim()] + }), + ) + + await onSubmit({ templateId: selectedTemplate.id, parameters }) + } + + return ( + + onOpenChange(false)} /> + + Assign policy template + + +
+
+
+
+

1. Choose a template

+

+ Showing templates usable for {targetKind === "api" ? "API-level" : "operation-level"} assignment. +

+
+ {targetKind} +
+ +
+ +
+

2. Configure parameters

+ {!selectedTemplate ? ( +

Select a template to configure its parameters.

+ ) : selectedTemplate.parameters.length === 0 ? ( +

This template does not require parameters.

+ ) : ( +
+ {selectedTemplate.parameters.map((parameter) => ( +
+
+ + {parameter.required && Required} + {parameter.type} +
+ {parameter.description && ( +

{parameter.description}

+ )} + { + const nextValue = event.target.value + setParameterValues((current) => ({ ...current, [parameter.name]: nextValue })) + }} + required={parameter.required} + placeholder={parameterPlaceholder(parameter)} + /> + {parameter.default !== undefined && parameter.default !== null && ( +

Default: {String(parameter.default)}

+ )} + {planDefaults?.[parameter.name] !== undefined && ( +

Plan default: {String(planDefaults[parameter.name])}

+ )} +
+ ))} +
+ )} +
+ +
+

3. Apply

+

+ Applying is asynchronous. The page will poll for status updates until the assignment finishes. +

+
+ + {formError && ( +
+ {formError} +
+ )} + +
+ + +
+
+
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/apis/PolicyAssignmentPanel.tsx b/src/aipolicyengine-ui/src/components/apis/PolicyAssignmentPanel.tsx new file mode 100644 index 00000000..81fef985 --- /dev/null +++ b/src/aipolicyengine-ui/src/components/apis/PolicyAssignmentPanel.tsx @@ -0,0 +1,262 @@ +import { useEffect, useMemo, useState } from "react" +import { AlertTriangle, CheckCircle2, Code2, LoaderCircle, RefreshCcw, ShieldCheck, Trash2, Wand2 } from "lucide-react" +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { Card, CardContent, CardHeader, CardTitle } from "../ui/card" +import { Dialog, DialogClose, DialogHeader, DialogTitle } from "../ui/dialog" +import type { ApimTemplateSummary, PolicyAssignment, PolicyAssignmentStatus, PolicyDocumentResponse } from "../../types/apim" + +interface SelectedTargetSummary { + key: string + kind: "api" | "operation" + title: string + subtitle: string +} + +interface PolicyAssignmentPanelProps { + selectedTarget: SelectedTargetSummary | null + policyDocument: PolicyDocumentResponse | null + policyLoading: boolean + policyError: string | null + templates: ApimTemplateSummary[] + busy: boolean + onAssign: () => void + onClear: () => Promise + onRetry: () => void +} + +function formatDate(value?: string | null): string { + if (!value) return "—" + const date = new Date(value) + if (Number.isNaN(date.getTime())) return value + return date.toLocaleString() +} + +function getStatusPresentation(status?: PolicyAssignmentStatus) { + switch (status) { + case "synced": + return { label: "Synced", variant: "green" as const, icon: CheckCircle2 } + case "pending": + return { label: "Pending", variant: "blue" as const, icon: LoaderCircle, spinning: true } + case "applying": + return { label: "Applying", variant: "blue" as const, icon: LoaderCircle, spinning: true } + case "failed": + return { label: "Failed", variant: "red" as const, icon: AlertTriangle } + default: + return { label: "Unassigned", variant: "secondary" as const, icon: ShieldCheck } + } +} + +function statusDetail(assignment: PolicyAssignment | null): string | null { + if (!assignment) return null + return assignment.errorMessage +} + +export function PolicyAssignmentPanel({ + selectedTarget, + policyDocument, + policyLoading, + policyError, + templates, + busy, + onAssign, + onClear, + onRetry, +}: PolicyAssignmentPanelProps) { + const [showXml, setShowXml] = useState(false) + const [confirmClearOpen, setConfirmClearOpen] = useState(false) + + useEffect(() => { + const timeoutId = window.setTimeout(() => { + setShowXml(false) + setConfirmClearOpen(false) + }, 0) + + return () => { + window.clearTimeout(timeoutId) + } + }, [selectedTarget?.key]) + + const assignment = policyDocument?.assignment ?? null + const assignmentStatus = getStatusPresentation(assignment?.status) + const templateMap = useMemo( + () => Object.fromEntries(templates.map((template) => [template.id, template])), + [templates], + ) + const resolvedTemplate = assignment?.templateId ? templateMap[assignment.templateId] : undefined + const resolvedTemplateName = resolvedTemplate?.displayName ?? assignment?.templateId ?? "None" + const resolvedTemplateVersion = assignment?.templateVersion ?? resolvedTemplate?.version ?? "—" + const detail = statusDetail(assignment) + + if (!selectedTarget) { + return ( + + + +
+

Select an API or operation

+

Choose an item from the left tree to inspect its APIM policy assignment.

+
+
+
+ ) + } + + return ( + <> + + +
+
+ + {selectedTarget.kind === "api" ? "API" : "Operation"} + + {selectedTarget.subtitle} +
+ {selectedTarget.title} +
+
+ + + +
+
+ + {policyLoading ? ( +
+ + Loading policy details… +
+ ) : policyError ? ( +
+
+ +
+

{policyError}

+
+ +
+
+ ) : ( + <> +
+
+

Status

+
+ + + {assignmentStatus.label} + +
+
+
+

Template

+

{resolvedTemplateName}

+

Version {resolvedTemplateVersion}

+
+
+

Last applied

+

{formatDate(assignment?.lastAppliedAt)}

+
+
+

Applied by

+

{assignment?.appliedBy || "—"}

+
+
+ + {detail && ( +
+ {detail} +
+ )} + +
+
+

Current assignment

+
+
+ {!assignment ? ( +
+ No template is currently assigned. +
+ ) : Object.keys(assignment.parameters ?? {}).length === 0 ? ( +
+ This assignment has no parameter overrides. +
+ ) : ( +
+ {Object.entries(assignment.parameters).map(([key, value]) => ( +
+
{key}
+
{value === null ? "null" : String(value)}
+
+ ))} +
+ )} +
+
+ + {showXml && ( +
+
+

Live APIM XML

+
+
+
+                      {policyDocument?.currentXml?.trim() || "No current XML returned."}
+                    
+
+
+ )} + + )} +
+
+ + + setConfirmClearOpen(false)} /> + + Clear policy assignment + +
+

+ This will replace the policy with passthrough — are you sure? +

+
+ + +
+
+
+ + ) +} diff --git a/src/aipolicyengine-ui/src/components/ui/badge.tsx b/src/aipolicyengine-ui/src/components/ui/badge.tsx index 0c385a93..7cabbb34 100644 --- a/src/aipolicyengine-ui/src/components/ui/badge.tsx +++ b/src/aipolicyengine-ui/src/components/ui/badge.tsx @@ -1,3 +1,4 @@ +/* eslint-disable react-refresh/only-export-components */ import * as React from "react" import { cva, type VariantProps } from "class-variance-authority" import { cn } from "../../lib/utils" diff --git a/src/aipolicyengine-ui/src/components/ui/button.tsx b/src/aipolicyengine-ui/src/components/ui/button.tsx index f8710d9a..e3f8715f 100644 --- a/src/aipolicyengine-ui/src/components/ui/button.tsx +++ b/src/aipolicyengine-ui/src/components/ui/button.tsx @@ -1,3 +1,4 @@ +/* eslint-disable react-refresh/only-export-components */ import * as React from "react" import { cva, type VariantProps } from "class-variance-authority" import { cn } from "../../lib/utils" diff --git a/src/aipolicyengine-ui/src/components/ui/tabs.tsx b/src/aipolicyengine-ui/src/components/ui/tabs.tsx index 4f070354..7b2c8de0 100644 --- a/src/aipolicyengine-ui/src/components/ui/tabs.tsx +++ b/src/aipolicyengine-ui/src/components/ui/tabs.tsx @@ -74,7 +74,8 @@ interface TabsContentProps extends React.HTMLAttributes { _onTabChange?: (value: string) => void } -function TabsContent({ className, value, _activeTab, _onTabChange: _, children, ...props }: TabsContentProps) { +function TabsContent({ className, value, _activeTab, _onTabChange, children, ...props }: TabsContentProps) { + void _onTabChange if (_activeTab !== value) return null return (
diff --git a/src/aipolicyengine-ui/src/context/ThemeProvider.tsx b/src/aipolicyengine-ui/src/context/ThemeProvider.tsx index 52fb7f70..c10d41ff 100644 --- a/src/aipolicyengine-ui/src/context/ThemeProvider.tsx +++ b/src/aipolicyengine-ui/src/context/ThemeProvider.tsx @@ -1,3 +1,4 @@ +/* eslint-disable react-refresh/only-export-components */ import { createContext, useContext, useEffect, useState, type ReactNode } from "react" type Theme = "dark" | "light" | "system" diff --git a/src/aipolicyengine-ui/src/pages/Apis.tsx b/src/aipolicyengine-ui/src/pages/Apis.tsx new file mode 100644 index 00000000..4f6659cd --- /dev/null +++ b/src/aipolicyengine-ui/src/pages/Apis.tsx @@ -0,0 +1,580 @@ +import { useCallback, useEffect, useMemo, useState } from "react" +import { useMsal } from "@azure/msal-react" +import { AlertTriangle, Network, RefreshCcw } from "lucide-react" +import { ApiTree } from "../components/apis/ApiTree" +import { AssignTemplateForm } from "../components/apis/AssignTemplateForm" +import { PolicyAssignmentPanel } from "../components/apis/PolicyAssignmentPanel" +import { Badge } from "../components/ui/badge" +import { Button } from "../components/ui/button" +import { fetchPlans } from "../api" +import { + applyApiPolicy, + applyOperationPolicy, + clearApiPolicy, + clearOperationPolicy, + fetchApiPolicy, + fetchApimApis, + fetchApimOperations, + fetchApimTemplates, + fetchOperationPolicy, +} from "../api/apim" +import type { PlanData } from "../types" +import type { + ApimApiSummary, + ApimOperationSummary, + ApimTemplateSummary, + ApplyPolicyRequest, + HttpError, + PolicyAssignmentStatus, + PolicyDocumentResponse, +} from "../types/apim" + +interface SelectedApiTarget { + kind: "api" + api: ApimApiSummary +} + +interface SelectedOperationTarget { + kind: "operation" + api: ApimApiSummary + operation: ApimOperationSummary +} + +type SelectedTarget = SelectedApiTarget | SelectedOperationTarget + +interface ToastState { + message: string + retryLabel?: string + onRetry?: () => void +} + +const PLAN_PARAMETER_RESOLVERS: Record number | undefined> = { + NonAiRequestsPerMinute: (plan) => plan.requestsPerMinuteLimit, + RequestsPerMinuteLimit: (plan) => plan.requestsPerMinuteLimit, + NonAiMonthlyRequestQuota: (plan) => plan.monthlyRequestQuota, + MonthlyRequestQuota: (plan) => plan.monthlyRequestQuota, + TokensPerMinuteLimit: (plan) => plan.tokensPerMinuteLimit, + MonthlyTokenQuota: (plan) => plan.monthlyTokenQuota, +} + +function targetKey(target: SelectedTarget): string { + return target.kind === "api" ? `api:${target.api.id}` : `operation:${target.api.id}:${target.operation.id}` +} + +function targetSummary(target: SelectedTarget) { + if (target.kind === "api") { + return { + key: targetKey(target), + kind: "api" as const, + title: target.api.displayName, + subtitle: `/${target.api.path}`, + } + } + + return { + key: targetKey(target), + kind: "operation" as const, + title: target.operation.displayName, + subtitle: `${target.operation.method} ${target.operation.urlTemplate}`, + } +} + +function getStatus(error: unknown): number | undefined { + return typeof error === "object" && error !== null && "status" in error + ? (error as HttpError).status + : undefined +} + +function getErrorMessage(error: unknown, fallback: string): string { + return error instanceof Error ? error.message : fallback +} + +function isPollingStatus(status?: PolicyAssignmentStatus): boolean { + return status === "pending" || status === "applying" +} + +function createApplyingPolicyDocument( + current: PolicyDocumentResponse | null, + target: SelectedTarget, + payload: ApplyPolicyRequest, + template: ApimTemplateSummary | undefined, + assignmentId: string, + appliedBy: string, +): PolicyDocumentResponse { + const now = new Date().toISOString() + + return { + assignment: { + id: assignmentId, + apiId: target.api.id, + operationId: target.kind === "api" ? null : target.operation.id, + apiDisplayName: target.api.displayName, + templateId: payload.templateId, + templateVersion: template?.version ?? current?.assignment?.templateVersion ?? "", + parameters: payload.parameters, + generatedXmlHash: current?.assignment?.generatedXmlHash ?? null, + lastAppliedAt: current?.assignment?.lastAppliedAt ?? null, + appliedBy: current?.assignment?.appliedBy || appliedBy, + status: "applying", + errorMessage: null, + createdAt: current?.assignment?.createdAt ?? now, + updatedAt: now, + }, + currentXml: current?.currentXml ?? "", + } +} + +// TODO(contract): the backend contract does not identify which plan's defaults should hydrate API assignment forms, so only values shared across plans are prefilled. +function derivePlanDefaults(plans: PlanData[]): Record { + const defaults: Record = {} + + for (const [parameterName, resolver] of Object.entries(PLAN_PARAMETER_RESOLVERS)) { + const resolvedValues = plans + .map((plan) => resolver(plan)) + .filter((value): value is number => value !== undefined) + + if (resolvedValues.length === 0) continue + + const uniqueValues = Array.from(new Set(resolvedValues)) + if (uniqueValues.length === 1) { + defaults[parameterName] = uniqueValues[0] + } + } + + return defaults +} + +export function Apis() { + const { accounts } = useMsal() + const [apis, setApis] = useState([]) + const [templates, setTemplates] = useState([]) + const [plans, setPlans] = useState([]) + const [initialLoading, setInitialLoading] = useState(true) + const [initialError, setInitialError] = useState(null) + const [accessDeniedMessage, setAccessDeniedMessage] = useState(null) + const [toast, setToast] = useState(null) + + const [expandedApiIds, setExpandedApiIds] = useState([]) + const [loadingOperationApiIds, setLoadingOperationApiIds] = useState([]) + const [operationsByApi, setOperationsByApi] = useState>({}) + const [operationErrors, setOperationErrors] = useState>({}) + + const [selectedTarget, setSelectedTarget] = useState(null) + const [policyDocument, setPolicyDocument] = useState(null) + const [policyLoading, setPolicyLoading] = useState(false) + const [policyError, setPolicyError] = useState(null) + + const [assignFormOpen, setAssignFormOpen] = useState(false) + const [submittingAssignment, setSubmittingAssignment] = useState(false) + const [clearingAssignment, setClearingAssignment] = useState(false) + + const adminRoleClaims = useMemo(() => { + const firstAccount = accounts[0] + const roles = firstAccount?.idTokenClaims?.roles + return Array.isArray(roles) ? roles : [] + }, [accounts]) + + const lacksExplicitAdminRole = adminRoleClaims.length > 0 && !adminRoleClaims.includes("AIPolicy.Admin") + const selectedSummary = selectedTarget ? targetSummary(selectedTarget) : null + const selectedKey = selectedTarget ? targetKey(selectedTarget) : undefined + const busy = submittingAssignment || clearingAssignment + const planDefaults = useMemo(() => derivePlanDefaults(plans), [plans]) + + const showToast = useCallback((message: string, onRetry?: () => void, retryLabel = "Retry") => { + setToast({ message, onRetry, retryLabel: onRetry ? retryLabel : undefined }) + }, []) + + const handleAccessError = useCallback(() => { + setAccessDeniedMessage("You need AIPolicy.Admin role to use this page") + }, []) + + const loadPolicy = useCallback(async (target: SelectedTarget, options?: { silent?: boolean }) => { + if (!options?.silent) { + setPolicyLoading(true) + setPolicyError(null) + } + + try { + const nextPolicyDocument = target.kind === "api" + ? await fetchApiPolicy(target.api.id) + : await fetchOperationPolicy(target.api.id, target.operation.id) + + setPolicyDocument(nextPolicyDocument) + setPolicyError(null) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + handleAccessError() + setPolicyDocument(null) + return + } + + const message = getErrorMessage(error, "Failed to load policy details") + setPolicyError(message) + showToast(message, () => { + void loadPolicy(target) + }) + } finally { + if (!options?.silent) { + setPolicyLoading(false) + } + } + }, [handleAccessError, showToast]) + + const loadOperations = useCallback(async (api: ApimApiSummary) => { + if (loadingOperationApiIds.includes(api.id)) return + + setLoadingOperationApiIds((current) => [...current, api.id]) + setOperationErrors((current) => ({ ...current, [api.id]: null })) + + try { + const response = await fetchApimOperations(api.id) + setOperationsByApi((current) => ({ ...current, [api.id]: response.operations ?? [] })) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + handleAccessError() + return + } + + const message = getErrorMessage(error, `Failed to load operations for ${api.displayName}`) + setOperationErrors((current) => ({ ...current, [api.id]: message })) + showToast(message, () => { + void loadOperations(api) + }) + } finally { + setLoadingOperationApiIds((current) => current.filter((apiId) => apiId !== api.id)) + } + }, [handleAccessError, loadingOperationApiIds, showToast]) + + const loadInitialData = useCallback(async () => { + setInitialLoading(true) + setInitialError(null) + setAccessDeniedMessage(null) + + try { + const [apisResponse, templatesResponse, plansResponse] = await Promise.all([ + fetchApimApis(), + fetchApimTemplates(), + fetchPlans().catch(() => ({ plans: [] })), + ]) + + const nextApis = apisResponse.apis ?? [] + setApis(nextApis) + setTemplates(templatesResponse.templates ?? []) + setPlans(plansResponse.plans ?? []) + setOperationErrors({}) + setOperationsByApi({}) + setExpandedApiIds([]) + setInitialError(null) + + setSelectedTarget((current) => { + if (!current && nextApis.length > 0) { + return { kind: "api", api: nextApis[0] } + } + + if (!current) return null + + const matchingApi = nextApis.find((api) => api.id === current.api.id) + if (!matchingApi) { + return nextApis.length > 0 ? { kind: "api", api: nextApis[0] } : null + } + + if (current.kind === "api") { + return { kind: "api", api: matchingApi } + } + + const existingOperations = operationsByApi[matchingApi.id] ?? [] + const matchingOperation = existingOperations.find((operation) => operation.id === current.operation.id) + return matchingOperation + ? { kind: "operation", api: matchingApi, operation: matchingOperation } + : { kind: "api", api: matchingApi } + }) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + handleAccessError() + setApis([]) + setTemplates([]) + setPlans([]) + return + } + + const message = getErrorMessage(error, "Failed to load APIM data") + setInitialError(message) + showToast(message, () => { + void loadInitialData() + }) + } finally { + setInitialLoading(false) + } + }, [handleAccessError, operationsByApi, showToast]) + + useEffect(() => { + if (lacksExplicitAdminRole) { + setAccessDeniedMessage("You need AIPolicy.Admin role to use this page") + setInitialLoading(false) + return + } + + void loadInitialData() + }, [lacksExplicitAdminRole, loadInitialData]) + + useEffect(() => { + if (!selectedTarget || accessDeniedMessage) { + setPolicyDocument(null) + setPolicyError(null) + return + } + + void loadPolicy(selectedTarget) + }, [accessDeniedMessage, loadPolicy, selectedTarget]) + + useEffect(() => { + if (!selectedTarget || !isPollingStatus(policyDocument?.assignment?.status)) return + + const timeoutId = window.setTimeout(() => { + void loadPolicy(selectedTarget, { silent: true }) + }, 2000) + + return () => window.clearTimeout(timeoutId) + }, [loadPolicy, policyDocument?.assignment?.status, selectedTarget]) + + const handleApiToggle = (api: ApimApiSummary) => { + setExpandedApiIds((current) => { + const alreadyExpanded = current.includes(api.id) + if (alreadyExpanded) { + return current.filter((apiId) => apiId !== api.id) + } + + return [...current, api.id] + }) + + if (!operationsByApi[api.id]) { + void loadOperations(api) + } + } + + const handleApiSelect = (api: ApimApiSummary) => { + setSelectedTarget({ kind: "api", api }) + if (!expandedApiIds.includes(api.id)) { + setExpandedApiIds((current) => [...current, api.id]) + } + if (!operationsByApi[api.id]) { + void loadOperations(api) + } + } + + const handleOperationSelect = (api: ApimApiSummary, operation: ApimOperationSummary) => { + if (!expandedApiIds.includes(api.id)) { + setExpandedApiIds((current) => [...current, api.id]) + } + setSelectedTarget({ kind: "operation", api, operation }) + } + + const handleApplyTemplate = async (payload: ApplyPolicyRequest) => { + if (!selectedTarget) return + + setSubmittingAssignment(true) + + try { + const response = selectedTarget.kind === "api" + ? await applyApiPolicy(selectedTarget.api.id, payload) + : await applyOperationPolicy(selectedTarget.api.id, selectedTarget.operation.id, payload) + + const template = templates.find((item) => item.id === payload.templateId) + const appliedBy = accounts[0]?.username ?? "" + setPolicyDocument((current) => createApplyingPolicyDocument(current, selectedTarget, payload, template, response.assignmentId, appliedBy)) + setPolicyError(null) + setAssignFormOpen(false) + showToast(`Policy apply accepted (${response.status}). Polling for the latest status…`) + void loadPolicy(selectedTarget, { silent: true }) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + handleAccessError() + return + } + + const message = getErrorMessage(error, "Failed to apply policy template") + setPolicyError(message) + showToast(message, () => { + void handleApplyTemplate(payload) + }) + throw error + } finally { + setSubmittingAssignment(false) + } + } + + const handleClearAssignment = async () => { + if (!selectedTarget) return + + setClearingAssignment(true) + try { + if (selectedTarget.kind === "api") { + await clearApiPolicy(selectedTarget.api.id) + } else { + await clearOperationPolicy(selectedTarget.api.id, selectedTarget.operation.id) + } + + setPolicyDocument((current) => ({ + assignment: null, + currentXml: current?.currentXml ?? "", + })) + setPolicyError(null) + showToast("Policy assignment cleared.") + void loadPolicy(selectedTarget, { silent: true }) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + handleAccessError() + return + } + + const message = getErrorMessage(error, "Failed to clear policy assignment") + setPolicyError(message) + showToast(message, () => { + void handleClearAssignment() + }) + throw error + } finally { + setClearingAssignment(false) + } + } + + if (accessDeniedMessage) { + return ( +
+
+ +
+

APIs

+

Manage APIM template assignments for APIs and operations.

+
+
+ +
+
+ +
+

{accessDeniedMessage}

+

Ask an administrator to grant the AIPolicy.Admin role, then refresh this page.

+
+
+
+
+ ) + } + + return ( +
+
+
+ +
+

APIs

+

+ Browse APIM APIs, inspect current assignments, and apply policy templates without editing XML. +

+
+
+
+ {Object.keys(planDefaults).length > 0 && Plan defaults available} + +
+
+ + {initialError && ( +
+
+ + {initialError} + +
+
+ )} + +
+
+ {initialLoading ? ( +
+ Loading APIs… +
+ ) : ( + { + void loadOperations(api) + }} + /> + )} +
+ + setAssignFormOpen(true)} + onClear={handleClearAssignment} + onRetry={() => { + if (selectedTarget) { + void loadPolicy(selectedTarget) + } + }} + /> +
+ + {selectedTarget && ( + + )} + + {toast && ( +
+
+ +
+

{toast.message}

+
+ {toast.onRetry && ( + + )} + +
+
+
+
+ )} +
+ ) +} diff --git a/src/aipolicyengine-ui/src/pages/Export.tsx b/src/aipolicyengine-ui/src/pages/Export.tsx index 721d20c4..352571db 100644 --- a/src/aipolicyengine-ui/src/pages/Export.tsx +++ b/src/aipolicyengine-ui/src/pages/Export.tsx @@ -108,8 +108,8 @@ export function Export() { setDownloadError(null) try { await downloadBillingSummary(selectedSummaryPeriod.year, selectedSummaryPeriod.month) - } catch (err: any) { - setDownloadError(err?.message ?? "Download failed") + } catch (err: unknown) { + setDownloadError(err instanceof Error ? err.message : "Download failed") } finally { setDownloading(false) } @@ -123,8 +123,8 @@ export function Export() { setDownloadError(null) try { await downloadClientAudit(clientAppId, tenantId, selectedAuditPeriod.year, selectedAuditPeriod.month) - } catch (err: any) { - setDownloadError(err?.message ?? "Download failed") + } catch (err: unknown) { + setDownloadError(err instanceof Error ? err.message : "Download failed") } finally { setDownloading(false) } diff --git a/src/aipolicyengine-ui/src/types/apim.ts b/src/aipolicyengine-ui/src/types/apim.ts new file mode 100644 index 00000000..e39df8ca --- /dev/null +++ b/src/aipolicyengine-ui/src/types/apim.ts @@ -0,0 +1,87 @@ +export type PolicyAssignmentStatus = "synced" | "pending" | "applying" | "failed" + +export type TemplateParameterType = "string" | "int" + +export interface ApimApiSummary { + id: string + displayName: string + path: string + serviceUrl: string + isCurrent: boolean +} + +export interface ApimOperationSummary { + id: string + displayName: string + method: string + urlTemplate: string +} + +export interface TemplateParameterDefinition { + name: string + type: TemplateParameterType + required: boolean + description: string + default: string | number | null +} + +export interface ApimTemplateSummary { + id: string + displayName: string + version: string + parameters: TemplateParameterDefinition[] + scope: string +} + +export interface PolicyAssignment { + id: string + apiId: string + operationId: string | null + apiDisplayName: string + templateId: string + templateVersion: string + parameters: Record + generatedXmlHash: string | null + lastAppliedAt: string | null + appliedBy: string + status: PolicyAssignmentStatus + errorMessage: string | null + createdAt: string + updatedAt: string +} + +export interface ApisResponse { + apis: ApimApiSummary[] +} + +export interface ApiOperationsResponse { + operations: ApimOperationSummary[] +} + +export interface PolicyDocumentResponse { + assignment: PolicyAssignment | null + currentXml: string +} + +export interface TemplatesResponse { + templates: ApimTemplateSummary[] +} + +export interface ApplyPolicyRequest { + templateId: string + parameters: Record +} + +export interface ApplyPolicyResponse { + assignmentId: string + status: PolicyAssignmentStatus +} + +export interface ClearPolicyResponse { + status: string +} + +export interface HttpError extends Error { + status?: number + body?: unknown +} From 016b654309b5ecfe29f6458785e2d967b08472da Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 13:39:17 -0400 Subject: [PATCH 02/14] fix(apim): bind APIM_RESOURCE_ID via Apim__ResourceId for ASP.NET Core nested config The Container App env var was named APIM_RESOURCE_ID (single underscore), but ASP.NET Core's EnvironmentVariablesConfigurationProvider only maps nested config keys (like Apim:ResourceId) from env vars with double underscores (Apim__ResourceId). At runtime the binding silently failed, leaving Apim:ResourceId empty and throwing InvalidOperationException from ApimCatalogService when the first request hit /api/apim/apis. Rename the Terraform env var to Apim__ResourceId so it binds correctly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/freamon/history.md | 3 ++- .../inbox/freamon-apim-config-binding-fix.md | 23 +++++++++++++++++++ infra/terraform/modules/compute/main.tf | 2 +- 3 files changed, 26 insertions(+), 2 deletions(-) create mode 100644 .squad/decisions/inbox/freamon-apim-config-binding-fix.md diff --git a/.squad/agents/freamon/history.md b/.squad/agents/freamon/history.md index 71701bae..8ffaffe2 100644 --- a/.squad/agents/freamon/history.md +++ b/.squad/agents/freamon/history.md @@ -60,4 +60,5 @@ For detailed work items, see: - `Azure.ResourceManager.ApiManagement` 1.3.x works cleanly with `ArmClient` + `DefaultAzureCredential`, but the APIM resource handle should be created lazily from `Apim:ResourceId` so unrelated app startup/tests do not fail when APIM is unconfigured. Read policy XML with `PolicyExportFormat.RawXml` and write with `PolicyContentFormat.RawXml` to preserve round-trippable XML instead of fragment-expanded output. - Template loading is safest as a repo-shipped library under `policies/templates/{id}/` with `policy.xml` + `template.json`. Validate manifests against placeholders discovered by regex before serving them, then render with exact `{{Placeholder}}` replacement, normalize typed parameters (`string`, `int`, etc.), reject unknown/unfilled inputs, and parse the rendered XML to confirm a `` root before any apply call. - Async apply is better implemented as an in-process channel + `BackgroundService` than ad-hoc `Task.Run` from endpoints. Endpoints persist the desired assignment as `pending`, enqueue a scope work item, and return 202 immediately; the worker flips to `applying`, re-renders from stored parameters, applies through the SDK, computes `generatedXmlHash` on success, and records `failed/errorMessage` on exceptions. Startup replay of `pending`/`applying` items should be best-effort so tests or partial environments do not stop the host. -- For Bunk: the APIM seams are now interface-first (`IApimCatalogService`, `ITemplateLibraryService`, `IPolicyAssignmentRepository`, `IApimPolicyApplyService`) and the worker logic is isolated in `ApimPolicyApplyService.ProcessAssignmentAsync`. Unit tests can exercise template rendering and apply orchestration without live Azure; recorded/live APIM coverage should focus on `ApimCatalogService` method mappings and the raw-XML policy format behavior. \ No newline at end of file +- For Bunk: the APIM seams are now interface-first (`IApimCatalogService`, `ITemplateLibraryService`, `IPolicyAssignmentRepository`, `IApimPolicyApplyService`) and the worker logic is isolated in `ApimPolicyApplyService.ProcessAssignmentAsync`. Unit tests can exercise template rendering and apply orchestration without live Azure; recorded/live APIM coverage should focus on `ApimCatalogService` method mappings and the raw-XML policy format behavior. +- ASP.NET Core binds nested options like Apim:ResourceId from environment variables that use double underscores (Apim__ResourceId), not single underscores. When Terraform wires Container App settings for ApimManagementOptions.ResourceId, use the double-underscore form or the API will see an empty resource ID at runtime (src/AIPolicyEngine.Api/Services/ApimManagement/ApimManagementOptions.cs, infra/terraform/modules/compute/main.tf). diff --git a/.squad/decisions/inbox/freamon-apim-config-binding-fix.md b/.squad/decisions/inbox/freamon-apim-config-binding-fix.md new file mode 100644 index 00000000..654fcc9f --- /dev/null +++ b/.squad/decisions/inbox/freamon-apim-config-binding-fix.md @@ -0,0 +1,23 @@ +# 2026-05-21 — APIM ResourceId env binding convention + +**Owner:** Freamon +**Status:** proposed +**Requested by:** Zack Way + +## Context + +`ApimManagementOptions` binds from the `Apim` configuration section, so the API expects the APIM resource ID at config key `Apim:ResourceId`. The Container App Terraform wiring used `APIM_RESOURCE_ID`, which ASP.NET Core does not translate into a nested config key. + +## Decision + +Use the standard ASP.NET Core environment-variable convention for nested keys: `Apim__ResourceId` (double underscore). Keep the C# options binding unchanged and make Terraform emit the conventional key. + +## Why + +- Matches the default `EnvironmentVariablesConfigurationProvider` behavior. +- Keeps the application code strict and idiomatic instead of adding one-off alias handling. +- Prevents silent runtime misbinding when infrastructure sets nested config values. + +## Impact + +Future Terraform and deployment wiring for APIM management should use `Apim__ResourceId` whenever populating `Apim:ResourceId`. diff --git a/infra/terraform/modules/compute/main.tf b/infra/terraform/modules/compute/main.tf index e662a3d4..74c6106a 100644 --- a/infra/terraform/modules/compute/main.tf +++ b/infra/terraform/modules/compute/main.tf @@ -223,7 +223,7 @@ resource "azurerm_container_app" "this" { } env { - name = "APIM_RESOURCE_ID" + name = "Apim__ResourceId" value = var.apim_resource_id } From 38afb059add1019296416b28f0f2bab710af7127 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 13:45:56 -0400 Subject: [PATCH 03/14] chore(squad): log APIM config-binding fix + merge decision inbox MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Orchestration log: 2026-05-21T17-43-57Z-freamon.md (APIM_RESOURCE_ID → Apim__ResourceId hotfix) - Session log: 2026-05-21T17-43-57Z-apim-config-binding-fix.md - Merge decision into Active Decisions: ASP.NET Core nested config convention (double-underscore env vars) - Delete inbox file: freamon-apim-config-binding-fix.md - Cross-agent context: append to sydnor/history.md and kima/history.md All 295 tests pass. Audit complete: no other single-underscore ASP.NET Core nested-config mismatches. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/kima/history.md | 20 +- .squad/agents/sydnor/history.md | 20 ++ .squad/decisions.md | 11 + .../archive/bunk-non-ai-test-plan.md | 23 ++ .../mcnulty-apim-management-architecture.md | 303 ++++++++++++++++++ .../mcnulty-non-ai-api-limits-architecture.md | 226 +++++++++++++ .../archive/sydnor-non-ai-apim-policy.md | 108 +++++++ .../inbox/freamon-apim-config-binding-fix.md | 23 -- .squad/files/apim-management-api-contract.md | 139 ++++++++ .squad/files/non-ai-limits-test-plan.md | 194 +++++++++++ .../non-ai-paused/entra-jwt-rest-policy.md | 234 ++++++++++++++ .../non-ai-paused/entra-jwt-rest-policy.xml | 124 +++++++ .../2026-05-14T12-40-01Z-fix-reply-urls.md | 11 + ...6-05-14_18-15-41UTC-infra-fixes-shipped.md | 9 + ...05-21T17-43-57Z-apim-config-binding-fix.md | 10 + ...Z-fresh-deploy-postprovision-regression.md | 17 + ...60515T165232Z-fresh-deploy-app-role-gap.md | 17 + .../2026-05-14T12-40-01Z-sydnor.md | 44 +++ ...-14_18-15-41UTC-coordinator-commit-push.md | 35 ++ .../2026-05-21T17-43-57Z-freamon.md | 43 +++ ...518Z-sydnor-postprovision-tf-output-fix.md | 36 +++ ...T165232Z-sydnor-app-role-assignment-fix.md | 37 +++ .../SKILL.md | 78 +++++ .../SKILL.md | 270 ++++++++++++++++ .../container-app-ui-api-url-wiring/SKILL.md | 29 ++ 25 files changed, 2037 insertions(+), 24 deletions(-) create mode 100644 .squad/decisions/archive/bunk-non-ai-test-plan.md create mode 100644 .squad/decisions/archive/mcnulty-apim-management-architecture.md create mode 100644 .squad/decisions/archive/mcnulty-non-ai-api-limits-architecture.md create mode 100644 .squad/decisions/archive/sydnor-non-ai-apim-policy.md delete mode 100644 .squad/decisions/inbox/freamon-apim-config-binding-fix.md create mode 100644 .squad/files/apim-management-api-contract.md create mode 100644 .squad/files/non-ai-limits-test-plan.md create mode 100644 .squad/files/non-ai-paused/entra-jwt-rest-policy.md create mode 100644 .squad/files/non-ai-paused/entra-jwt-rest-policy.xml create mode 100644 .squad/log/2026-05-14T12-40-01Z-fix-reply-urls.md create mode 100644 .squad/log/2026-05-14_18-15-41UTC-infra-fixes-shipped.md create mode 100644 .squad/log/2026-05-21T17-43-57Z-apim-config-binding-fix.md create mode 100644 .squad/log/20260515T164518Z-fresh-deploy-postprovision-regression.md create mode 100644 .squad/log/20260515T165232Z-fresh-deploy-app-role-gap.md create mode 100644 .squad/orchestration-log/2026-05-14T12-40-01Z-sydnor.md create mode 100644 .squad/orchestration-log/2026-05-14_18-15-41UTC-coordinator-commit-push.md create mode 100644 .squad/orchestration-log/2026-05-21T17-43-57Z-freamon.md create mode 100644 .squad/orchestration-log/20260515T164518Z-sydnor-postprovision-tf-output-fix.md create mode 100644 .squad/orchestration-log/20260515T165232Z-sydnor-app-role-assignment-fix.md create mode 100644 .squad/skills/azd-postprovision-app-registration-reply-urls/SKILL.md create mode 100644 .squad/skills/azd-postprovision-app-role-assignment/SKILL.md create mode 100644 .squad/skills/container-app-ui-api-url-wiring/SKILL.md diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index 71b54ca3..a34ea236 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -34,4 +34,22 @@ For detailed work items, see: - For list/detail admin pages, the current pattern is Tailwind + local state: left tree/list in a `Card`, right details/actions in a second `Card`, dialogs for destructive/assignment flows, and inline fixed-position toast messaging for retryable network failures. - APIM status polling is UI-driven: after a 202 apply response, set optimistic `applying` state and poll `GET .../policy` every 2 seconds until status leaves `pending`/`applying`. - Template parameter defaults should prefer the current assignment, then template defaults, and only shared plan-level values; there is no contract yet to map a specific plan to an API assignment, so avoid guessing per-plan defaults. -- The SPA now maps top-level tabs to pathname routes in `App.tsx` (including `/apis`) without adding a router dependency; keep using this lightweight history API pattern unless the app adopts React Router later. \ No newline at end of file +- The SPA now maps top-level tabs to pathname routes in `App.tsx` (including `/apis`) without adding a router dependency; keep using this lightweight history API pattern unless the app adopts React Router later. + +## 2026-05-21 — ASP.NET Core Nested Configuration Convention (FYI) + +**Informational Context for Future Backend Config** + +Freamon fixed a config-binding bug in the APIM infrastructure: the env var `APIM_RESOURCE_ID` does not bind to nested config keys in ASP.NET Core. The standard convention is **double underscore**: `Apim__ResourceId`. + +**Pattern for Reference:** +- C# class `ApimManagementOptions` bound to section `"Apim"` +- Config key in code: `Apim:ResourceId` (colon) +- Environment variable: `Apim__ResourceId` (double underscore) + +**If Frontend Consumes Similar Config Later:** +- Backend will emit env vars using this convention (e.g., `Foo__Bar__Baz` for nested settings) +- When frontend reads backend config, expect the same pattern +- This is idiomatic ASP.NET Core, not a special case + +**Full decision merged into `.squad/decisions.md`.** \ No newline at end of file diff --git a/.squad/agents/sydnor/history.md b/.squad/agents/sydnor/history.md index d72b468b..441b719c 100644 --- a/.squad/agents/sydnor/history.md +++ b/.squad/agents/sydnor/history.md @@ -273,3 +273,23 @@ This session resolves AADSTS500113 (no reply address) and timeout/stale URL issu - Native APIM status codes differ by policy: `rate-limit-by-key` returns 429, while `quota-by-key` returns 403 when the quota is exhausted. **Coordination Note:** McNulty's current inbox proposal prefers `/api/precheck-rest` as the primary enforcement model. The draft policy still shipped with the requested native-APIM default and a switchable commented alternative so the coordinator can align the final direction later. + +### 2026-05-21 — ASP.NET Core Nested Configuration Convention: Double-Underscore Env Vars + +**Informational Context for APIM Terraform Wiring (Sydnor's Original Work)** + +Freamon discovered and fixed a config-binding bug in the APIM_RESOURCE_ID wiring: the Container App Terraform was emitting `APIM_RESOURCE_ID`, but ASP.NET Core's `EnvironmentVariablesConfigurationProvider` does not translate single underscores into nested config keys. The standard convention for nested keys in ASP.NET Core is **double underscore**: `Apim__ResourceId`. + +**The Pattern:** +- C# class: `ApimManagementOptions` bound to config section `"Apim"` +- Config key: `Apim:ResourceId` (colon in code, underscore in env) +- Environment variable: `Apim__ResourceId` (double underscore) +- **NOT** `APIM_RESOURCE_ID` (single underscore, all caps — this will not bind) + +**Key Lesson for Infrastructure Writers:** +- When you wire env vars in Terraform for a .NET app, always use the **double-underscore convention** for nested config keys: `Apim__ResourceId`, not `Apim_ResourceId` or `APIM_RESOURCE_ID` +- The pattern applies to all nested config: `Foo__Bar__Baz` for config section `Foo:Bar:Baz` +- Single underscores and all-caps patterns are common in shell scripts and Terraform, but ASP.NET Core expects double underscores specifically + +**Decision Merged into `.squad/decisions.md`:** All future APIM Terraform wiring must use `Apim__ResourceId` when populating the nested config key. Freamon audited 200+ env vars and found no other mismatches. + diff --git a/.squad/decisions.md b/.squad/decisions.md index bb029057..55c84251 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -12,6 +12,17 @@ ## Active Decisions +### 2026-05-21T17:43:57Z: APIM ResourceId env binding convention +**By:** Freamon (Backend Dev) +**Status:** Accepted +**What:** Use the standard ASP.NET Core environment-variable convention for nested configuration keys: `Apim__ResourceId` (double underscore) instead of `APIM_RESOURCE_ID`. `ApimManagementOptions` binds from the `Apim` configuration section, expecting the APIM resource ID at config key `Apim:ResourceId`. +**Why:** +- Matches the default `EnvironmentVariablesConfigurationProvider` behavior (no custom alias handling needed). +- Keeps the application code strict and idiomatic. +- Prevents silent runtime misbinding when infrastructure sets nested config values. +**Impact:** All future Terraform and deployment wiring for APIM management must use `Apim__ResourceId` when populating `Apim:ResourceId`. +**Audit Result:** Scanned all 200+ env vars in application configuration; no other single-underscore ASP.NET Core nested-config mismatches found. + ### 2026-04-17T15:52:16Z: User directive — Agent365 SDK integration **By:** Zack Way (via Copilot) **Status:** Accepted diff --git a/.squad/decisions/archive/bunk-non-ai-test-plan.md b/.squad/decisions/archive/bunk-non-ai-test-plan.md new file mode 100644 index 00000000..efce2a8f --- /dev/null +++ b/.squad/decisions/archive/bunk-non-ai-test-plan.md @@ -0,0 +1,23 @@ +# Decision Inbox: Non-AI API Limits Test Coverage Strategy + +**Date:** 2026-05-21 +**By:** Bunk (Tester / QA) +**Status:** Proposed + +## What + +Draft non-AI API usage-limit coverage in three layers once implementation lands: + +1. **Endpoint tests first** in `src/AIPolicyEngine.Tests/EndpointTests.cs` for plan CRUD validation, basic non-AI precheck allow/deny behavior, and AI-regression checks. +2. **Focused integration tests** in a new non-AI-specific integration class for counter isolation, cache/Cosmos round-trip, rollover behavior, and failure-mode assertions. +3. **NBomber load coverage** by extending `src/AIPolicyEngine.LoadTest/Program.cs` with a high-throughput non-AI precheck scenario across multiple customer keys. + +## Why + +This matches the repository's existing testing layout: endpoint behavior in `EndpointTests`, scenario-heavy integration logic under `Integration/`, and performance verification in the dedicated load-test project. It also keeps the draft adaptable while architecture is still being finalized, because endpoint names and body shapes can change without forcing a rewrite of the overall coverage strategy. + +## Open Dependencies + +- Final sign-off on the non-AI endpoint contract and schema fields. +- Clear semantics for `0` values, rejected-request counter behavior, and quota reset rules. +- Prefer a clock seam/time provider so rollover tests do not rely on real waits. diff --git a/.squad/decisions/archive/mcnulty-apim-management-architecture.md b/.squad/decisions/archive/mcnulty-apim-management-architecture.md new file mode 100644 index 00000000..8cb057de --- /dev/null +++ b/.squad/decisions/archive/mcnulty-apim-management-architecture.md @@ -0,0 +1,303 @@ +# APIM Policy Management Architecture + +**Author:** McNulty (Lead / Architect) +**Date:** 2026-05-16 +**Status:** Proposal — awaiting Zack approval +**Requested by:** Zack Way + +--- + +## TL;DR — Consequential Decisions + +1. **Tier B (template apply).** Users pick templates, fill params, engine generates + applies XML. No raw XML editor (too risky for v1). +2. **Non-AI work reshapes:** Sydnor's `entra-jwt-rest-policy.xml` ships as-is but becomes the **seed template**. Convert `{{NonAiRequestsPerMinute}}` etc. to template parameters. Precheck-rest endpoint stays (dashboard visibility > APIM-native counters), but limits are now *assignable from the APIM management UI*, not just the Plans page. +3. **SDK choice:** `Azure.ResourceManager.ApiManagement` via managed identity. No Terraform in the runtime loop. +4. **Storage:** New document type in existing `configuration` container. No new Cosmos container. + +--- + +## 1. Scope — Tier B (Template Apply) + +**Recommendation: Tier B.** Rationale: +- Tier A (read-only) doesn't solve the problem ("without editing XML files"). +- Tier C (raw XML editor) is a foot-gun — invalid XML instantly breaks APIs. Defer to M4+. +- Tier D (drift detection) is a nice-to-have layered on later, not a launch requirement. + +Tier B means: the engine ships a **template library**. The UI lets admins select a template, fill parameter values (RPM, quota, audiences, backend URL), and click Apply. The engine renders final XML and pushes it to APIM via SDK. Existing `.xml` files in `policies/` become seed templates. + +--- + +## 2. Identity & Permissions + +### Managed Identity → APIM + +The Container App's system-assigned managed identity needs a **custom role** scoped to the APIM instance: + +``` +Actions: + Microsoft.ApiManagement/service/apis/read + Microsoft.ApiManagement/service/apis/operations/read + Microsoft.ApiManagement/service/apis/policies/read + Microsoft.ApiManagement/service/apis/policies/write + Microsoft.ApiManagement/service/apis/policies/delete + Microsoft.ApiManagement/service/apis/operations/policies/read + Microsoft.ApiManagement/service/apis/operations/policies/write + Microsoft.ApiManagement/service/apis/operations/policies/delete +``` + +**Not** `API Management Service Contributor` (too broad — includes delete APIs, manage subscriptions, etc.). + +Terraform defines this custom role in `infra/terraform/modules/gateway/` and assigns it to the Container App identity. + +### End-User Authorization + +Reuse existing `AIPolicy.Admin` app role. Policy management is an admin-only action. No new role needed for v1. + +### Multi-Tenant + +The APIM instance is **single-tenant from the engine's perspective** — one engine manages one APIM. If the customer has multiple APIM instances, they deploy multiple engine instances. No cross-tenant API visibility concern. + +--- + +## 3. APIM SDK vs ARM REST vs Terraform + +**Decision: `Azure.ResourceManager.ApiManagement` NuGet package.** + +Rationale: +- Idiomatic .NET, `DefaultAzureCredential` works with managed identity out of the box. +- Strongly typed — `ApiManagementApiResource`, `PolicyContractData`, etc. +- Terraform is declarative/offline — fundamentally wrong for "user clicks Apply in a UI." +- Raw ARM REST means hand-rolling auth token management and error parsing. + +Package: `Azure.ResourceManager.ApiManagement` (GA, stable, current version ~1.3.0). No preview risk. + +Configuration: store APIM resource ID as app setting `APIM_RESOURCE_ID` (format: `/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ApiManagement/service/{name}`). + +--- + +## 4. Storage Model + +### Location + +Existing `configuration` container in Cosmos. New document type with partition key = `"policy-assignment"`. + +### Document Shape + +```json +{ + "id": "pa:{apiId}:{operationId|_all}", + "partitionKey": "policy-assignment", + "apiId": "azure-openai-jwt-based-api", + "operationId": null, + "apiDisplayName": "Azure OpenAI Service API", + "templateId": "entra-jwt-ai", + "templateVersion": "1.0", + "parameters": { + "ExpectedAudience": "api://abc123", + "ContainerAppUrl": "https://myapp.azurecontainerapps.io", + "ContainerAppAudience": "api://def456" + }, + "generatedXmlHash": "sha256:abcdef...", + "lastAppliedAt": "2026-05-16T10:00:00Z", + "appliedBy": "user@contoso.com", + "status": "synced", + "createdAt": "2026-05-16T09:55:00Z", + "updatedAt": "2026-05-16T10:00:00Z" +} +``` + +### Status Values + +- `synced` — XML in APIM matches what we generated. +- `pending` — assignment saved but not yet applied (apply failed or deferred). +- `drifted` — detected that APIM XML differs from our generated hash (future M4). + +### Drift Handling (deferred to M4) + +On next apply or manual "sync check," compare APIM's current policy XML hash against `generatedXmlHash`. If different → mark `drifted`. Apply always **overwrites** — the engine is authoritative once it owns an API's policy. + +--- + +## 5. Policy Template Library + +### Ship-in-the-box Templates + +| ID | Description | Source | +|---|---|---| +| `entra-jwt-ai` | JWT auth + precheck + routing + AI logging | `entra-jwt-policy.xml` | +| `entra-jwt-ai-dlp` | Above + DLP content-check | `entra-jwt-policy-dlp.xml` | +| `subscription-key-ai` | Sub-key auth + precheck + routing + AI logging | `subscription-key-policy.xml` | +| `subscription-key-ai-dlp` | Above + DLP | `subscription-key-policy-dlp.xml` | +| `entra-jwt-rest` | JWT auth + rate-limit + quota + REST logging | `entra-jwt-rest-policy.xml` | + +### Template Format + +**`{{placeholder}}` substitution.** Simple, proven (APIM named values already use this syntax). No DSL, no T4. + +Each template is an XML file with `{{ParamName}}` tokens plus a companion `template.json` manifest: + +```json +{ + "id": "entra-jwt-rest", + "displayName": "Entra JWT — Non-AI REST (Rate Limit + Quota)", + "version": "1.0", + "parameters": [ + { "name": "ExpectedAudience", "type": "string", "required": true, "description": "Token audience" }, + { "name": "NonAiRequestsPerMinute", "type": "int", "required": true, "default": 60 }, + { "name": "NonAiMonthlyRequestQuota", "type": "int", "required": true, "default": 10000 }, + { "name": "NonAiBackendUrl", "type": "string", "required": true }, + { "name": "ContainerAppUrl", "type": "string", "required": true }, + { "name": "ContainerAppAudience", "type": "string", "required": true } + ], + "scope": "api" +} +``` + +### Repo Location + +`policies/templates/` — each template gets a folder: +``` +policies/templates/ + entra-jwt-ai/ + policy.xml + template.json + entra-jwt-rest/ + policy.xml + template.json + ... +``` + +--- + +## 6. Intersection with Non-AI API Limits (CRITICAL) + +### What changes: + +| Aspect | Before (non-AI spec) | After (this architecture) | +|---|---|---| +| XML file | Static `policies/entra-jwt-rest-policy.xml` | Becomes seed for `entra-jwt-rest` template | +| Limit values | APIM named values set at deploy time | Template parameters, configurable per-API from UI | +| `precheck-rest` endpoint | Exists as alternative (commented) | Still exists — option B for customers who need dashboard-visible real-time counters | +| User config surface | Plans page (flat fields) | Plans page sets *plan defaults*; APIM management page applies *per-API* | + +### Directive for Sydnor: + +**Ship `entra-jwt-rest-policy.xml` as-is.** It's done and correct. Immediately after merge, it gets copied into `policies/templates/entra-jwt-rest/policy.xml` with `{{placeholder}}` tokens preserved (they're already there). No wasted work. + +### Cohesive model: + +- **Plans page** → defines plan-level default limits (NonAiRequestsPerMinute, NonAiMonthlyRequestQuota). These are the *defaults* used when a template is applied without overrides. +- **APIM Management page** → assigns templates to APIs/operations. Parameter values can override plan defaults or use them. +- **Non-AI limits flow:** Plan defines limits → admin assigns `entra-jwt-rest` template to a non-AI API → engine renders XML with plan limits as default param values → applies to APIM. + +### Precheck-rest stays: + +APIM-native `rate-limit-by-key` is the **default** in the template (simple, no extra hop). But the template XML keeps the commented precheck-rest block. If a customer needs dashboard-level real-time enforcement (not just post-hoc log accounting), they toggle a template parameter `EnforcementMode: native|engine` that uncomments the precheck-rest block and comments out native limits. This is a v2 enhancement — for v1, native APIM enforcement is the default. + +--- + +## 7. Backend API Surface + +All endpoints require `AIPolicy.Admin` role. + +``` +GET /api/apim/apis + → 200: { apis: [{ id, displayName, path, serviceUrl, isCurrent }] } + +GET /api/apim/apis/{apiId}/operations + → 200: { operations: [{ id, displayName, method, urlTemplate }] } + +GET /api/apim/apis/{apiId}/policy + → 200: { assignment: PolicyAssignment | null, currentXml: string } + +GET /api/apim/apis/{apiId}/operations/{operationId}/policy + → 200: { assignment: PolicyAssignment | null, currentXml: string } + +GET /api/apim/templates + → 200: { templates: [{ id, displayName, version, parameters, scope }] } + +POST /api/apim/apis/{apiId}/policy + Body: { templateId, parameters: { key: value } } + → 202: { assignmentId, status: "applying" } + (async — APIM apply can take 5-30s) + +POST /api/apim/apis/{apiId}/operations/{operationId}/policy + Body: { templateId, parameters: { key: value } } + → 202: { assignmentId, status: "applying" } + +DELETE /api/apim/apis/{apiId}/policy + → 200: { status: "cleared" } + Behavior: removes engine assignment, sets APIM policy to ... (passthrough) +``` + +**Apply is async (202).** APIM is slow. The UI polls `GET .../policy` and checks `assignment.status` for `synced` or `failed`. + +--- + +## 8. Frontend Shape + +**New top-level page: "APIs"** (between "Routing Policies" and "Export" in the nav). + +Layout: +- Left panel: tree view — list of APIs from APIM. Click an API to expand its operations. +- Right panel: selected item's policy details. + - Shows: current template assignment (if any), parameter values, last-applied timestamp, status badge (synced/pending/failed). + - Action: "Assign Template" button → opens a form: pick template from dropdown, fill parameter fields (with plan defaults pre-populated where applicable), click Apply. + - Action: "Clear Assignment" → reverts to passthrough. + +Fits the existing pattern: `Plans.tsx`, `RoutingPolicies.tsx`, `Pricing.tsx` are all list+detail pages with forms. + +--- + +## 9. Sequencing & Dependencies + +| Milestone | Scope | Agent | Depends On | +|---|---|---|---| +| **M1** | Read-only catalog: `GET /api/apim/apis`, `GET .../operations`, custom role + identity in Terraform | Sydnor (infra) + Freamon (endpoints) | Sydnor finishes non-AI XML | +| **M2** | Template library in repo, `GET /api/apim/templates`, template rendering service | Freamon | M1 | +| **M3** | Apply flow: `POST .../policy`, Cosmos storage, async apply via SDK | Freamon | M1, M2 | +| **M4** | UI "APIs" page — tree view, assign template, status display | Kima | M1, M2, M3 | +| **M5** | Operation-level granularity (operation-scoped apply) | Freamon + Kima | M3, M4 | +| **M6** | Drift detection (background poll, status: drifted) | Freamon | M3 | + +**Recommended next deliverable: M1–M4.** Gets the full loop working for API-level policies. Operation-level and drift are fast follow-ons. + +--- + +## 10. Risks & Open Questions + +| Risk | Mitigation | +|---|---| +| Bad XML breaks API immediately | Validate rendered XML against APIM schema before apply. On failure, store `status: failed` with error. Consider a "preview XML" step in UI. | +| APIM apply latency (5-30s) | Async 202 pattern. UI shows spinner + polls. | +| APIM revision targeting | Always target the `current` revision. If customer uses revisions, that's out of scope for v1. | +| Rollback story | Store previous `generatedXmlHash` + XML. "Revert" re-applies prior version. v1: manual revert via re-assigning prior template params. | +| Drift from portal edits | v1: no detection. v2 (M6): periodic hash comparison. | +| Cost of APIM SDK calls | List APIs is cheap (cached client-side 60s). Apply is infrequent. No polling cost until M6. | + +**Open question for Zack:** Do we need multi-APIM support (one engine managing N APIM instances), or is 1:1 sufficient? Recommendation: 1:1 for v1, config array for v2. + +--- + +## 11. Test Scope + +| Layer | Strategy | +|---|---| +| Template rendering | Unit tests: given template XML + params → assert rendered XML. No Azure dependency. Pure string substitution. | +| APIM SDK integration | **Recorded HTTP fixtures** via `Azure.Core.TestFramework` (same pattern as other Azure SDK tests). Record once against real APIM, replay in CI. | +| Apply flow (end-to-end) | Integration test with a mock `IApimPolicyService` interface. Verify Cosmos document created, status transitions. | +| UI | Component tests for the APIs page (React Testing Library). Mock API responses. | +| Destructive/live | Manual smoke test in Zack's test environment. Not automated in CI — too expensive and slow. | + +Bunk writes unit tests for template rendering and the apply orchestrator. Integration tests use recorded fixtures — no live APIM in CI. + +--- + +## Appendix: APIM Resource ID Configuration + +Add to Container App environment: +``` +APIM_RESOURCE_ID=/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ApiManagement/service/{name} +``` + +Terraform outputs this from the gateway module. Postprovision wires it into the Container App config. diff --git a/.squad/decisions/archive/mcnulty-non-ai-api-limits-architecture.md b/.squad/decisions/archive/mcnulty-non-ai-api-limits-architecture.md new file mode 100644 index 00000000..d0fc4800 --- /dev/null +++ b/.squad/decisions/archive/mcnulty-non-ai-api-limits-architecture.md @@ -0,0 +1,226 @@ +# Architecture: Non-AI API Usage Limits + +**Date:** 2026-05-16 +**Author:** McNulty (Lead / Architect) +**Status:** Proposal — awaiting Zack's approval +**Requested by:** Zack Way + +--- + +## Context + +The policy engine currently enforces AI-centric limits: token quotas (monthly), token rate limits (per minute), and request rate limits (per minute) — all scoped to OpenAI/Foundry deployments. Zack wants to extend it so a Plan can also enforce limits on **non-AI REST APIs** fronted by APIM. Two limits: + +1. **Requests per minute** — short-window throttle +2. **Requests per month** — monthly cap (like `MonthlyTokenQuota` but counting raw HTTP requests) + +A single plan can cover BOTH AI and non-AI APIs simultaneously. + +--- + +## 1. Schema + +### Decision: Flat fields on `PlanData` (NOT a sub-object) + +**Rationale:** The existing schema is flat. `TokensPerMinuteLimit`, `RequestsPerMinuteLimit`, `MonthlyTokenQuota` — all sit at root level. Introducing a nested `NonAiLimits` object would be inconsistent and would force a breaking JSON shape change for the frontend. Flat fields are also simpler for CosmosDB partial updates and Redis hash storage. + +### New fields on `PlanData`: + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `NonAiRequestsPerMinute` | `int` | `0` | Max non-AI requests per minute per customer. 0 = unlimited. | +| `NonAiMonthlyRequestQuota` | `long` | `0` | Max non-AI requests per billing period. 0 = unlimited. | + +### New fields on `PlanCreateRequest`: + +| Field | Type | Nullable | Default | +|-------|------|----------|---------| +| `NonAiRequestsPerMinute` | `int?` | Yes | null (maps to 0) | +| `NonAiMonthlyRequestQuota` | `long?` | Yes | null (maps to 0) | + +### New fields on `PlanUpdateRequest`: + +| Field | Type | Nullable | +|-------|------|----------| +| `NonAiRequestsPerMinute` | `int?` | Yes (null = no change) | +| `NonAiMonthlyRequestQuota` | `long?` | Yes (null = no change) | + +### New fields on `ClientPlanAssignment` (usage tracking): + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `NonAiCurrentPeriodRequests` | `long` | `0` | Non-AI requests consumed this billing period. | + +### `PlansResponse` — No change + +It already returns `List`, so new fields flow through automatically. + +### Interaction with existing AI fields + +Both sets of limits are **independent**. A plan with `MonthlyTokenQuota = 1M` and `NonAiMonthlyRequestQuota = 10000` enforces both simultaneously. The precheck endpoints are separate (see Enforcement below), so there's no collision. The billing period is shared — `CurrentPeriodStart` governs rollover for both AI and non-AI counters. + +--- + +## 2. Enforcement Model + +### Decision: Option A — New `/api/precheck-rest/{clientAppId}/{tenantId}` endpoint + +**Rejected alternatives:** + +- **Option B (extend existing precheck with discriminator):** Pollutes the AI hot path. The existing precheck does deployment extraction, routing evaluation, TPM counters — none of which apply to non-AI. Mixing concerns increases latency and complexity. +- **Option C (pure APIM-side enforcement):** Loses dashboard visibility. The whole point of this engine is centralized observability + chargeback. If APIM does enforcement alone, the dashboard can't show non-AI usage trends, and there's no audit trail. + +**Why Option A:** + +1. Clean separation — non-AI precheck does exactly two things: check rate limit, check monthly quota. +2. Dashboard visibility — the engine sees every non-AI request (incrementing counters in Redis), enabling usage dashboards. +3. Consistent APIM contract — same pattern as AI precheck (APIM calls backend, gets 200/429, acts accordingly). Sydnor's policy is a near-copy. +4. The endpoint is fast — two Redis operations (INCR + GET), no routing/token logic. + +**Endpoint contract:** + +``` +GET /api/precheck-rest/{clientAppId}/{tenantId} +Authorization: Bearer {managed-identity-token} + +Response 200: +{ + "status": "authorized", + "clientAppId": "...", + "tenantId": "...", + "plan": "...", + "currentRpm": 5, + "rpmLimit": 100, + "currentMonthlyRequests": 4500, + "monthlyRequestLimit": 10000 +} + +Response 429: +{ + "error": "Rate limit exceeded — non-AI requests per minute" | "Non-AI monthly request quota exceeded", + "limit": ..., + "current": ... +} + +Response 401: Client not authorized (no plan assigned) +``` + +--- + +## 3. Counter Storage + +### Decision: Redis counters with billing-period tracking in Cosmos (existing pattern) + +| Counter | Storage | Mechanism | +|---------|---------|-----------| +| Requests/minute | Redis | `INCR` on key `ratelimit:nonai-rpm:{clientAppId}:{tenantId}:{minuteWindow}` with 120s TTL (same pattern as AI RPM) | +| Requests/month | Redis (hot) + Cosmos (durable) | Increment `NonAiCurrentPeriodRequests` on `ClientPlanAssignment`. Redis cache updated immediately; Cosmos synced on log ingest (same as `CurrentPeriodUsage` pattern). | + +**Why not APIM built-in `rate-limit-by-key`?** + +- No dashboard visibility +- No billing-period-aware monthly counter +- Can't share state across multiple APIM instances in different regions (Redis is centralized) + +**Billing period rollover:** Same logic as AI — `BillingPeriodCalculator.GetCurrentPeriodStartUtc()` detects new period; precheck treats `NonAiCurrentPeriodRequests` as 0 when period has rolled. + +### New Redis key: + +```csharp +// In RedisKeys.cs: +public static string RateLimitNonAiRpm(string clientAppId, string tenantId, long minuteWindow) + => $"ratelimit:nonai-rpm:{clientAppId}:{tenantId}:{minuteWindow}"; +``` + +--- + +## 4. APIM Policy Contract + +Sydnor will create `policies/entra-jwt-rest-policy.xml` (and optionally a subscription-key variant). The policy shape: + +1. **``** — Same Entra ID validation as AI policy (multi-tenant, audience check) +2. **Extract claims** — `tenantId`, `clientAppId` (same `` blocks) +3. **``** — Get MSI token for Container App +4. **`` to `/api/precheck-rest/{clientAppId}/{tenantId}`** — Same pattern as AI precheck call +5. **`` on response** — 401/403/429 → return to caller. 200 → proceed. +6. **``** — Route to the actual non-AI backend (configurable via named value `{{NonAiBackendId}}`) +7. **Outbound: `` to `/api/log-rest`** — Fire-and-forget usage log (records the request for dashboard/audit). Lightweight payload: `{ clientAppId, tenantId, timestamp, apiPath, statusCode }`. + +**Key difference from AI policy:** No `deploymentId` extraction, no routing evaluation, no response token parsing. Much simpler. + +**New log endpoint (Freamon):** `POST /api/log-rest` — Accepts non-AI request metadata, increments `NonAiCurrentPeriodRequests` on the client assignment. Same fire-and-forget pattern as AI `/api/log`. + +--- + +## 5. Frontend Contract + +Kima adds to the Plan create/edit form: + +### New form section: "Non-AI API Limits" + +| Field | Input type | Label | Validation | +|-------|-----------|-------|------------| +| `NonAiRequestsPerMinute` | Number input | "Requests per Minute (Non-AI)" | Integer ≥ 0. 0 = unlimited. | +| `NonAiMonthlyRequestQuota` | Number input | "Monthly Request Quota (Non-AI)" | Integer ≥ 0. 0 = unlimited. | + +**Placement:** After the existing "Rate Limits" section, before "Routing". Grouped under a visual section header "Non-AI API Limits". + +**No toggle needed.** If both fields are 0, non-AI limits are effectively disabled. This matches the pattern for `TokensPerMinuteLimit` (0 = unlimited). A toggle adds unnecessary state. + +**Client detail view:** Show `NonAiCurrentPeriodRequests` usage alongside existing token usage in the customer dashboard. + +--- + +## 6. Backward Compatibility + +### Decision: Zero-value defaults, no schema versioning needed + +- **Existing plans in Cosmos** don't have these fields. When deserialized, .NET assigns default values (`0` for int/long). This means: existing plans have non-AI limits disabled (0 = unlimited = no enforcement). This is correct — no existing plan suddenly gets rate-limited. +- **`NonAiCurrentPeriodRequests`** on `ClientPlanAssignment` defaults to `0`. Existing documents missing this field deserialize cleanly. +- **No migration script needed.** CosmosDB is schema-less. New fields appear when plans are updated via the API. +- **API contract is additive only** — new optional fields on create/update requests. Existing clients that don't send these fields get default behavior (no non-AI limits). + +--- + +## 7. Test Scope + +Bunk must cover: + +### New code paths (MUST): + +1. **Precheck-rest endpoint — rate limit enforced:** Send N+1 requests in same minute window → Nth returns 200, N+1th returns 429 with correct error. +2. **Precheck-rest endpoint — monthly quota enforced:** Client at quota → returns 429. Client below quota → returns 200. +3. **Precheck-rest endpoint — billing period rollover:** Client at quota, but period has rolled → returns 200 (counter resets). +4. **Precheck-rest endpoint — unauthorized client:** No plan assignment → 401. +5. **Precheck-rest endpoint — 0 = unlimited:** Plan with `NonAiRequestsPerMinute = 0` → never rate-limited. Same for monthly. +6. **Log-rest endpoint:** Increments `NonAiCurrentPeriodRequests` correctly. +7. **Plan CRUD:** Create plan with non-AI fields → read back → fields present. Update only non-AI fields → other fields unchanged. + +### Regression (MUST): + +8. **AI precheck unaffected:** Existing AI precheck tests still pass — no changes to that endpoint. +9. **Plan serialization roundtrip:** Plans with both AI and non-AI fields serialize/deserialize correctly. +10. **Billing period calculator:** Shared logic still works for both AI and non-AI counters. + +### Integration (SHOULD, if time allows): + +11. **Redis key isolation:** Non-AI RPM keys don't collide with AI RPM keys. +12. **Concurrent requests:** Multiple simultaneous precheck-rest calls correctly increment without race conditions (Redis INCR is atomic). + +--- + +## Summary of Assignments + +| Agent | Scope | +|-------|-------| +| **Freamon** | Add fields to `PlanData`, `PlanCreateRequest`, `PlanUpdateRequest`, `ClientPlanAssignment`. Implement `/api/precheck-rest/{clientAppId}/{tenantId}` endpoint. Implement `/api/log-rest` endpoint. Add `RedisKeys.RateLimitNonAiRpm`. Wire DI. | +| **Kima** | Add "Non-AI API Limits" section to Plan form (create + edit). Show `NonAiCurrentPeriodRequests` in client detail. | +| **Sydnor** | Create `policies/entra-jwt-rest-policy.xml` following the contract above. Optionally `policies/subscription-key-rest-policy.xml`. | +| **Bunk** | Write tests covering all 12 scenarios above. | + +--- + +## Open Questions (None blocking — these are future scope) + +- **Per-API granularity:** Should different non-AI APIs within the same plan have different limits? (Answer: Not now. Start with plan-level limits. If needed later, add an `ApiLimits` dictionary keyed by API identifier.) +- **Overbilling for non-AI:** Should there be an `AllowNonAiOverbilling` + `NonAiOverageRate`? (Answer: Defer. Start with hard cap. Overbilling is a Phase 2 feature if customers request it.) diff --git a/.squad/decisions/archive/sydnor-non-ai-apim-policy.md b/.squad/decisions/archive/sydnor-non-ai-apim-policy.md new file mode 100644 index 00000000..1a30fe03 --- /dev/null +++ b/.squad/decisions/archive/sydnor-non-ai-apim-policy.md @@ -0,0 +1,108 @@ +# Decision: Non-AI APIM Policy Contract + +**Date:** 2026-05-21 +**Author:** Sydnor (Infra/DevOps) +**Status:** Draft — sample policy created, coordinator alignment still needed +**Requested by:** Zack Way + +--- + +## Summary + +A draft APIM policy now exists for non-AI REST APIs at `policies/entra-jwt-rest-policy.xml`. + +### Default sample contract + +- Entra JWT validation mirrors the existing `entra-jwt-policy.xml` pattern. +- Customer identity is derived as `customerKey = {clientAppId}:{tenantId}`. +- Short-window enforcement uses native APIM `rate-limit-by-key`. +- Monthly enforcement uses native APIM `quota-by-key` with a 30-day fixed window. +- The backend is routed with `https://{{NonAiBackendUrl}}`. +- Outbound accounting is always fire-and-forget to `POST {{ContainerAppUrl}}/api/log-rest`. + +### Commented alternative already included in the XML + +- `GET {{ContainerAppUrl}}/api/precheck-rest/{clientAppId}?tenantId={tenantId}` +- APIM managed identity token in `Authorization: Bearer {token}` +- APIM returns the policy-engine body on 429 + +This means the policy file is switchable without redesigning the rest of the contract. + +--- + +## Required APIM named values + +| Named value | Purpose | +|---|---| +| `EntraTenantId` | Shared Entra-policy-family named value retained for consistency/future hardening | +| `ExpectedAudience` | Required `aud` claim for caller JWTs | +| `NonAiRequestsPerMinute` | Native APIM short-window call limit | +| `NonAiMonthlyRequestQuota` | Native APIM 30-day quota limit | +| `NonAiBackendUrl` | Protected REST backend host/FQDN | +| `ContainerAppUrl` | Policy-engine base URL | +| `ContainerAppAudience` | Managed identity audience/resource for policy-engine calls | + +--- + +## Stable backend target + +### `/api/log-rest` + +The sample policy always emits a fire-and-forget POST to `/api/log-rest` with this payload shape: + +```json +{ + "tenantId": "...", + "clientAppId": "...", + "customerKey": "{clientAppId}:{tenantId}", + "requestPath": "/some/path", + "statusCode": 200, + "latencyMs": 37, + "correlationId": "..." +} +``` + +Backend implication: Freamon can implement `/api/log-rest` against this payload now, regardless of whether the team keeps native APIM enforcement or switches to precheck-rest. + +### Optional `/api/precheck-rest` + +If the coordinator chooses centralized enforcement later, the XML already expects: + +- Method: `GET` +- Route: `/api/precheck-rest/{clientAppId}` +- Query: `tenantId={tenantId}` +- Auth: APIM managed identity bearer token for `{{ContainerAppAudience}}` +- Response: `200` to allow, `429` to block + +The commented block intentionally keeps the contract minimal. + +--- + +## APIM-specific constraints discovered during drafting + +1. **`quota-by-key` uses a fixed window, not a calendar month.** + - The sample uses `renewal-period="2592000"` (30 days). + - Exact billing-period or calendar-month alignment requires custom counters outside native APIM quota. + +2. **`quota-by-key` cannot take runtime expressions for `calls` or `renewal-period`.** + - Deployment automation can render/import the policy with concrete values. + - A request-time `send-request` config lookup cannot directly feed `quota-by-key`. + +3. **Native APIM counters live in APIM, not Redis.** + - Dashboards built from `/api/log-rest` are derived/aggregated metrics, not authoritative real-time counter state. + +4. **Native APIM response codes differ by policy.** + - `rate-limit-by-key` blocks with `429`. + - `quota-by-key` blocks with `403`. + +--- + +## Coordination note + +McNulty's current inbox architecture proposal prefers `/api/precheck-rest` as the **primary** model and rejects pure APIM-side enforcement. This draft intentionally ships the requested hybrid/native-default XML anyway, but the coordinator should reconcile: + +- **Primary in sample:** native APIM `rate-limit-by-key` + `quota-by-key` +- **Primary in McNulty proposal:** policy-engine `/api/precheck-rest` +- **Backend placeholder difference:** this sample uses `NonAiBackendUrl`; McNulty's draft mentions `NonAiBackendId` + +The XML already includes the alternative block so the team can pivot with minimal policy churn once the final architecture decision lands. diff --git a/.squad/decisions/inbox/freamon-apim-config-binding-fix.md b/.squad/decisions/inbox/freamon-apim-config-binding-fix.md deleted file mode 100644 index 654fcc9f..00000000 --- a/.squad/decisions/inbox/freamon-apim-config-binding-fix.md +++ /dev/null @@ -1,23 +0,0 @@ -# 2026-05-21 — APIM ResourceId env binding convention - -**Owner:** Freamon -**Status:** proposed -**Requested by:** Zack Way - -## Context - -`ApimManagementOptions` binds from the `Apim` configuration section, so the API expects the APIM resource ID at config key `Apim:ResourceId`. The Container App Terraform wiring used `APIM_RESOURCE_ID`, which ASP.NET Core does not translate into a nested config key. - -## Decision - -Use the standard ASP.NET Core environment-variable convention for nested keys: `Apim__ResourceId` (double underscore). Keep the C# options binding unchanged and make Terraform emit the conventional key. - -## Why - -- Matches the default `EnvironmentVariablesConfigurationProvider` behavior. -- Keeps the application code strict and idiomatic instead of adding one-off alias handling. -- Prevents silent runtime misbinding when infrastructure sets nested config values. - -## Impact - -Future Terraform and deployment wiring for APIM management should use `Apim__ResourceId` whenever populating `Apim:ResourceId`. diff --git a/.squad/files/apim-management-api-contract.md b/.squad/files/apim-management-api-contract.md new file mode 100644 index 00000000..fcfbbfd4 --- /dev/null +++ b/.squad/files/apim-management-api-contract.md @@ -0,0 +1,139 @@ +# APIM Management API Contract + +Auth: every endpoint requires the `AIPolicy.Admin` app role via existing `AdminPolicy` authorization. + +## GET /api/apim/apis +- 200 OK +```json +{ + "apis": [ + { + "id": "azure-openai-jwt-based-api", + "displayName": "Azure OpenAI Service API", + "path": "openai", + "serviceUrl": "https://contoso.openai.azure.com", + "isCurrent": true + } + ] +} +``` + +## GET /api/apim/apis/{apiId}/operations +- 200 OK +```json +{ + "operations": [ + { + "id": "chat-completions", + "displayName": "Chat Completions", + "method": "POST", + "urlTemplate": "/deployments/{deploymentId}/chat/completions" + } + ] +} +``` + +## GET /api/apim/apis/{apiId}/policy +## GET /api/apim/apis/{apiId}/operations/{operationId}/policy +- 200 OK +```json +{ + "assignment": { + "id": "pa:azure-openai-jwt-based-api:_all", + "apiId": "azure-openai-jwt-based-api", + "operationId": null, + "apiDisplayName": "Azure OpenAI Service API", + "templateId": "entra-jwt-ai", + "templateVersion": "1.0", + "parameters": { + "ExpectedAudience": "api://abc123", + "ContainerAppUrl": "https://engine.contoso.com", + "ContainerAppAudience": "api://def456" + }, + "generatedXmlHash": "sha256:abcdef0123456789", + "lastAppliedAt": "2026-05-16T10:00:00Z", + "appliedBy": "user@contoso.com", + "status": "synced", + "errorMessage": null, + "createdAt": "2026-05-16T09:55:00Z", + "updatedAt": "2026-05-16T10:00:00Z" + }, + "currentXml": "..." +} +``` +- `assignment` may be `null` when the engine has no saved assignment for the API/operation. +- `currentXml` is the raw APIM policy XML at the requested scope; it is an empty string when no explicit scope policy exists. +- `status` values: `pending`, `applying`, `synced`, `failed`. +- `errorMessage` is populated when the last async apply failed. + +## GET /api/apim/templates +- 200 OK +```json +{ + "templates": [ + { + "id": "entra-jwt-ai", + "displayName": "Entra JWT — AI", + "version": "1.0", + "scope": "api", + "parameters": [ + { + "name": "ExpectedAudience", + "type": "string", + "required": true, + "description": "Expected aud claim in incoming tokens", + "default": null + }, + { + "name": "ContainerAppUrl", + "type": "string", + "required": true, + "description": "AI Policy Engine base URL", + "default": null + } + ] + } + ] +} +``` + +## POST /api/apim/apis/{apiId}/policy +## POST /api/apim/apis/{apiId}/operations/{operationId}/policy +Request body: +```json +{ + "templateId": "entra-jwt-ai", + "parameters": { + "ExpectedAudience": "api://abc123", + "ContainerAppUrl": "https://engine.contoso.com", + "ContainerAppAudience": "api://def456" + } +} +``` +Responses: +- 202 Accepted +```json +{ + "assignmentId": "pa:azure-openai-jwt-based-api:_all", + "status": "applying" +} +``` +- 400 Bad Request for missing template/parameters or render validation failures. +- 404 Not Found when API or operation does not exist in APIM. +- 500 Internal Server Error for unexpected apply orchestration failures. + +## DELETE /api/apim/apis/{apiId}/policy +## DELETE /api/apim/apis/{apiId}/operations/{operationId}/policy +- 200 OK +```json +{ + "status": "cleared" +} +``` +- Behavior: deletes the saved engine assignment and writes a passthrough APIM policy at the requested scope: +```xml + +``` +- API-scope DELETE returns 404 when the APIM API does not exist. +- Operation-scope DELETE returns 404 when the APIM API or operation does not exist. +- 500 Internal Server Error for unexpected clear failures. diff --git a/.squad/files/non-ai-limits-test-plan.md b/.squad/files/non-ai-limits-test-plan.md new file mode 100644 index 00000000..954fc721 --- /dev/null +++ b/.squad/files/non-ai-limits-test-plan.md @@ -0,0 +1,194 @@ +# Non-AI API Limits Test Plan + +> Drafted from requirements plus current proposal in `.squad/decisions/inbox/mcnulty-non-ai-api-limits-architecture.md`. This is a scenario list only, not test code. Adjust endpoint names, payload shapes, and observability assertions once Freamon's implementation lands. + +## Test Cases + +### Plan CRUD — schema validation + +### 1. `CreatePlan_WithAiAndNonAiLimits_RoundTripsBothLimits` +- **Setup:** Authenticated admin request; payload includes existing AI fields plus proposed non-AI fields (`NonAiRequestsPerMinute`, `NonAiMonthlyRequestQuota`). +- **Action:** `POST /api/plans`, then `GET /api/plans` or read created plan by returned id. +- **Expected result:** `201 Created`; response includes both AI and non-AI values unchanged; subsequent read returns the same values; no existing AI fields are dropped. +- **Layer:** integration + +### 2. `CreatePlan_WithOnlyNonAiLimits_Succeeds` +- **Setup:** Authenticated admin request; payload sets non-AI limits only and leaves AI-specific limits at existing defaults/optional values. +- **Action:** `POST /api/plans`. +- **Expected result:** `201 Created`; non-AI fields persist; AI fields deserialize to current defaults and remain valid for existing AI flows. +- **Layer:** integration + +### 3. `CreatePlan_WithOnlyAiLimits_DefaultsNonAiFields` +- **Setup:** Authenticated admin request; payload contains only today's AI fields. +- **Action:** `POST /api/plans` and read the created plan back. +- **Expected result:** `201 Created`; non-AI fields default sensibly. Current McNulty proposal is `0 = unlimited`; confirm that before coding the assertion. +- **Layer:** integration + +### 4. `UpdatePlan_AddNonAiLimits_PreservesExistingAiFields` +- **Setup:** Existing plan already has AI quota/rate-limit fields populated; non-AI fields absent/defaulted. +- **Action:** `PUT /api/plans/{planId}` with only non-AI fields populated. +- **Expected result:** `200 OK`; new non-AI values are stored; prior AI values are unchanged; response round-trips the full merged plan. +- **Layer:** integration + +### 5. `CreateOrUpdatePlan_NegativeNonAiLimits_Returns400` +- **Setup:** Payload sets `NonAiRequestsPerMinute < 0` and/or `NonAiMonthlyRequestQuota < 0`. +- **Action:** `POST /api/plans` and `PUT /api/plans/{planId}`. +- **Expected result:** `400 BadRequest`; validation error identifies invalid non-AI field(s); no plan is created/updated. +- **Layer:** integration + +### 6. `CreateOrUpdatePlan_ZeroNonAiLimits_UsesApprovedZeroSemantics` +- **Setup:** Payload sets one or both non-AI fields to `0`. +- **Action:** Create/update a plan, then exercise the non-AI enforcement path. +- **Expected result:** Architecture question. McNulty's current proposal says `0 = unlimited`; if that is approved, create/update succeeds and enforcement never blocks on that field. If the team instead chooses `0 = blocked`, tests must assert immediate denial. +- **Layer:** integration + +### Rate limit enforcement (per minute) + +### 7. `NonAiPrecheck_FirstRequestUnderPerMinuteLimit_Returns200` +- **Setup:** Client assigned to plan with `NonAiRequestsPerMinute = N` where `N > 1`; non-AI monthly usage below quota; per-minute counter empty. +- **Action:** Call the proposed non-AI precheck endpoint (currently expected to be `GET /api/precheck-rest/{clientAppId}/{tenantId}`) once. +- **Expected result:** `200 OK`; response shows authorized status and current RPM = 1 (or equivalent observable counter); no monthly quota violation. +- **Layer:** integration + +### 8. `NonAiPrecheck_RequestAtPerMinuteBoundary_Returns200` +- **Setup:** Same plan; existing non-AI RPM state already at `N - 1` within the active minute window. +- **Action:** Send the Nth non-AI request in the same minute. +- **Expected result:** `200 OK`; request at the exact boundary is still allowed; counter reflects N. +- **Layer:** integration + +### 9. `NonAiPrecheck_RequestBeyondPerMinuteLimit_Returns429` +- **Setup:** Existing non-AI RPM state already at `N` within the active minute window. +- **Action:** Send the (N+1)th non-AI request in the same minute. +- **Expected result:** `429 TooManyRequests`; response/body clearly indicates non-AI per-minute rate limiting; blocked request does not create usable capacity loss beyond the approved ceiling. Exact counter semantics (saturate at N vs increment then reject) must be confirmed because current AI precheck increments before rejecting. +- **Layer:** integration + +### 10. `NonAiPrecheck_AfterMinuteWindowExpires_AllowsNextRequest` +- **Setup:** Client previously hit the non-AI per-minute limit in minute window T. +- **Action:** Advance time to T+60s (prefer fake clock/time provider, not real sleeps) and send the next non-AI request. +- **Expected result:** `200 OK`; short-window counter resets for the new minute; response starts the new minute at 1. +- **Layer:** integration + +### 11. `NonAiPrecheck_DifferentCustomerKeys_AreRateLimitedIndependently` +- **Setup:** Two customers with different `(clientAppId, tenantId)` pairs share the same plan and same non-AI per-minute limit; customer A is already over limit, customer B is below limit. +- **Action:** Send one request as A and one as B. +- **Expected result:** A gets `429`; B gets `200`; counters/keys remain isolated by the customer key. +- **Layer:** integration + +### 12. `NonAiPrecheck_UnsetPerMinuteLimit_DoesNotThrottle` +- **Setup:** Plan leaves `NonAiRequestsPerMinute` null/unset on create or update (or stored as approved default). +- **Action:** Send multiple non-AI requests within one minute. +- **Expected result:** No rate-limit block occurs; monthly quota remains the only non-AI gate. Current proposal maps null to `0 = unlimited`; confirm that before coding. +- **Layer:** integration + +### 13. `AiAndNonAiRequests_UseIndependentRateCounters` +- **Setup:** Plan has both AI RPM limits and non-AI RPM limits enabled; customer has active AI traffic and non-AI traffic. +- **Action:** Send AI precheck/log traffic and non-AI precheck/log traffic in alternating order. +- **Expected result:** AI requests only affect AI counters; non-AI requests only affect non-AI counters; exhausting one bucket does not decrement or block the other. +- **Layer:** integration + +### Monthly quota enforcement + +### 14. `NonAiMonthlyQuota_RequestBelowLimit_Returns200AndIncrementsCounter` +- **Setup:** Plan has `NonAiMonthlyRequestQuota = M`; customer's non-AI monthly count is below M. +- **Action:** Send one allowed non-AI request through precheck + logging path. +- **Expected result:** `200 OK`; durable monthly counter increments by 1; usage surfaced in response/log/dashboard state if exposed. +- **Layer:** integration + +### 15. `NonAiMonthlyQuota_RequestAtBoundary_Returns200` +- **Setup:** Customer monthly counter is `M - 1`. +- **Action:** Send one non-AI request. +- **Expected result:** `200 OK`; counter becomes exactly M; next over-limit request is the first rejection. +- **Layer:** integration + +### 16. `NonAiMonthlyQuota_RequestBeyondLimit_ReturnsConfiguredRejectionStatus` +- **Setup:** Customer monthly counter is already M. +- **Action:** Send one more non-AI request. +- **Expected result:** Prefer `429 TooManyRequests` for consistency with current AI precheck and McNulty's proposal; flag `403 Forbidden` as an alternative only if the team wants quota exhaustion to mean hard authorization denial. +- **Layer:** integration + +### 17. `NonAiMonthlyQuota_CounterPersistsAcrossMinuteWindows` +- **Setup:** Customer has consumed some non-AI monthly requests and the short-window RPM key has expired. +- **Action:** Advance only the minute window and send another non-AI request. +- **Expected result:** Per-minute counter resets, but monthly counter continues accumulating from prior value rather than resetting. +- **Layer:** integration + +### 18. `NonAiMonthlyQuota_SameClientDifferentTenant_AreTrackedIndependently` +- **Setup:** Same `clientAppId`, two different `tenantId` values, shared plan, different monthly usage states. +- **Action:** Send one request for each tenant. +- **Expected result:** Each tenant uses its own monthly counter; exhausting tenant A does not block tenant B. +- **Layer:** integration + +### 19. `NonAiMonthlyQuota_PeriodRollover_UsesApprovedResetModel` +- **Setup:** Customer is at or over non-AI monthly quota near the end of a billing period. +- **Action:** Advance time past the selected reset boundary and send the next non-AI request. +- **Expected result:** Architecture question. McNulty's proposal uses the existing billing-period calculator (`BillingCycleStartDay` / calendar-style billing period), not APIM's rolling quota window. Confirm reset model before coding. +- **Layer:** integration + +### Integration with existing AI flows (regression) + +### 20. `Plan_WithNonAiLimits_DoesNotChangeAiQuotaOrRateLimitBehavior` +- **Setup:** Plan has both existing AI limits and new non-AI limits enabled; baseline AI behavior is already covered by current precheck tests. +- **Action:** Re-run existing AI precheck scenarios on the mixed plan. +- **Expected result:** Existing AI status codes and response fields remain unchanged; no regression in token quota, TPM, RPM, routing, or allowed deployment enforcement. +- **Layer:** integration + +### 21. `NonAiRequests_SucceedWhenAiQuotaIsExhaustedButNonAiQuotaRemains` +- **Setup:** Same customer is over AI token/request limits but still under non-AI monthly and per-minute limits. +- **Action:** Call AI precheck, then call non-AI precheck. +- **Expected result:** AI request is rejected per existing rules; non-AI request still succeeds because counters are independent. +- **Layer:** integration + +### 22. `PlanRepository_NonAiFields_RoundTripThroughCosmos` +- **Setup:** Persist a plan containing non-AI fields through the Cosmos-backed repository/cache path. +- **Action:** Upsert plan, clear Redis cache, then read it back. +- **Expected result:** New non-AI fields survive Cosmos serialization/deserialization and repopulate Redis correctly alongside existing fields. +- **Layer:** integration + +### 23. `UpdatePlan_NonAiLimits_RefreshesCachedPlan` +- **Setup:** Seed cached plan in Redis, then update its non-AI fields through plan CRUD. +- **Action:** Call `PUT /api/plans/{planId}`, then fetch the plan again through the repository-backed endpoint. +- **Expected result:** Returned plan reflects updated non-AI values immediately; stale cached values are not served after update. +- **Layer:** integration + +### Edge cases & failure modes + +### 24. `NonAiLogEndpoint_FireAndForgetFailure_DoesNotBlockPrimaryApiResponse` +- **Setup:** APIM/non-AI pipeline uses fire-and-forget outbound logging (current McNulty proposal: `send-one-way-request` to `POST /api/log-rest`); backend log endpoint is unavailable or times out. +- **Action:** Send an otherwise authorized non-AI API request through the APIM flow. +- **Expected result:** Primary API response is still returned to the caller; failure is logged/metriced for operators; no outage is caused by the logging sidecar path. +- **Layer:** integration + +### 25. `NonAiPrecheck_PlanLookupFailure_UsesApprovedFallbackStrategy` +- **Setup:** Simulate Cosmos read failure when resolving the assigned plan, with and without a warm Redis cache entry. +- **Action:** Call the non-AI precheck endpoint. +- **Expected result:** Open architecture question. Preferred test split is: cached plan available -> precheck still succeeds using cache; no cached plan -> explicit fail-closed response (`500` or `503`) with no counter mutation. Final status code needs sign-off. +- **Layer:** integration + +### 26. `NonAiPrecheck_ConcurrentRequestsAtBoundary_DoNotDoubleSpendCapacity` +- **Setup:** Plan has a small non-AI RPM limit; counter sits at `N - 1`; fire multiple parallel non-AI precheck requests for the same customer. +- **Action:** Send concurrent requests that compete for the final slot. +- **Expected result:** At most one request consumes the final allowed slot; remaining requests are rejected; no race allows more than N successful requests in the minute. Also verify monthly counters do not over-increment for rejected requests. +- **Layer:** load + +### Load test scenarios + +### 27. `NonAiPrecheck_Load_1000RpsAcrossTenKeys_EnforcesLimitsWithoutCounterDrift` +- **Setup:** 10 customer keys, each on a plan with `NonAiRequestsPerMinute = 100`; steady 1000 requests/sec for 2 minutes; non-AI monthly quota high enough not to interfere. +- **Action:** Run an NBomber scenario similar to the existing load-test `precheck` scenario, but rotate across 10 customer keys and hit the non-AI precheck path. +- **Expected result:** Approximately `10 keys x 100 requests/min x 2 minutes = 2000` successful responses and the remainder `429`; no counter drift between observed successes and stored counters; p99 latency stays under the team's agreed target; no hotspot causes one key to steal capacity from another. +- **Layer:** load + +## Open Questions for Architecture + +- Confirm McNulty's proposed schema and contract: flat `PlanData` fields (`NonAiRequestsPerMinute`, `NonAiMonthlyRequestQuota`) plus `ClientPlanAssignment.NonAiCurrentPeriodRequests`, with dedicated `/api/precheck-rest` and `/api/log-rest` endpoints. +- Zero-value semantics: keep `0 = unlimited` (McNulty proposal) or treat `0` as fully blocked? +- Over-limit status code: standardize on `429` (proposal/current AI pattern) or use `403` for monthly quota exhaustion? +- Reset model: shared billing-period reset via `BillingPeriodCalculator`, fixed calendar month, custom billing-cycle day, or rolling 30-day window? +- Counter mutation semantics on rejected requests: should a blocked per-minute request saturate at N or increment past N before rejection? Should rejected monthly-quota requests increment the durable monthly counter? +- Failure policy when plan lookup/storage fails: serve from cache if possible, and if not possible fail closed with `500` or `503`? +- Observability contract for APIM-native or hybrid enforcement: if some limits live in APIM policy, what telemetry or headers make counters/assertions testable? +- Fire-and-forget logging contract: do failed `/api/log-rest` calls only emit warnings/metrics, or should they trigger retries/dead-letter handling? + +## Notes on Test Code + +These scenarios fit the existing layout and naming already in `src/AIPolicyEngine.Tests/EndpointTests.cs` (`CreatePlan_*`, `UpdatePlan_*`, `Precheck_*`) plus the simulation-style integration tests under `src/AIPolicyEngine.Tests/Integration/` such as `PrecheckRoutingIntegrationTests.cs` and `CosmosPersistenceResilienceTests.cs`. The likely implementation path is: extend `EndpointTests` for plan CRUD and basic non-AI precheck status cases, add a dedicated integration test class for non-AI counter isolation/reset/failure behavior, and extend `src/AIPolicyEngine.LoadTest/Program.cs` with a non-AI precheck scenario. New helpers/fixtures will likely be a seeded non-AI RPM key helper, a way to seed/read `NonAiCurrentPeriodRequests`, and ideally a controllable clock/time provider so minute-window and billing-period rollover tests do not depend on real time. diff --git a/.squad/files/non-ai-paused/entra-jwt-rest-policy.md b/.squad/files/non-ai-paused/entra-jwt-rest-policy.md new file mode 100644 index 00000000..9f177e7b --- /dev/null +++ b/.squad/files/non-ai-paused/entra-jwt-rest-policy.md @@ -0,0 +1,234 @@ +# APIM Policy Analysis: `entra-jwt-rest-policy.xml` + +This policy protects a non-AI REST backend with **Entra JWT validation**, **native APIM request throttling/quota**, and **fire-and-forget usage logging** back to the AI Policy Engine. It mirrors the existing Entra JWT policy style, but removes AI-specific deployment routing and token accounting. + +--- + +## Overview + +### Default enforcement path (this sample) + +- **Authentication:** Validate the caller's Entra access token. +- **Identity extraction:** Read `tenantId` and `clientAppId`, then build `customerKey = {clientAppId}:{tenantId}`. +- **Short-window protection:** `rate-limit-by-key` enforces requests per minute. +- **Monthly cap:** `quota-by-key` enforces a 30-day fixed-window request quota. +- **Forwarding:** Route to the protected non-AI backend. +- **Accounting:** Post request outcome metadata to `/api/log-rest` without adding latency to the caller. + +### Alternative enforcement path (commented in XML) + +The XML also includes a commented `send-request` block for a policy-engine-enforced path using `/api/precheck-rest/{clientAppId}`. If the team chooses centralized enforcement later, comment out the native APIM limit policies and uncomment that block. + +--- + +## 1. `` — Request Processing + +### 1.1 `base` + +`` keeps higher-scope APIM policies active, matching the established policy pattern in this repo. + +### 1.2 JWT validation + +The policy reuses the same validation shape as `entra-jwt-policy.xml`: + +- `Authorization` header is required. +- OpenID metadata comes from `https://login.microsoftonline.com/common/.well-known/openid-configuration`. +- The `aud` claim must equal `{{ExpectedAudience}}`. + +This keeps the REST policy aligned with the existing multi-tenant Entra gateway behavior. + +### 1.3 Claim extraction and customer key + +Three variables are created: + +| Variable | Source | Purpose | +|---|---|---| +| `tenantId` | `tid` | Caller tenant isolation | +| `clientAppId` | `azp` with `appid` fallback | App identity for delegated and app-only flows | +| `customerKey` | `{clientAppId}:{tenantId}` | Shared APIM counter key + log identity | + +`customerKey` is multi-tenant safe: the same app ID in two different tenants produces two different keys. + +### 1.4 Native requests-per-minute throttle + +```xml + +``` + +Why native APIM here: + +- Fast, gateway-local enforcement. +- No precheck dependency on the hot path. +- Simple operational model for a draft/sample policy. + +### 1.5 Native monthly quota + +```xml + +``` + +Important trade-off: + +- `quota-by-key` uses a **fixed window**, not a true calendar month. +- This sample uses `2592000` seconds (30 days) because APIM cannot calendar-align this natively. +- If the business must reset on exact month boundaries, the team will need custom counters outside native `quota-by-key`. + +Also note the HTTP behavior difference: + +- `rate-limit-by-key` exceeds to **429**. +- `quota-by-key` exceeds to **403**. + +### 1.6 Backend routing + +```xml + +``` + +This forwards the request to the protected REST backend. The sample uses a named value so Terraform or deployment automation can point the same policy at different backends. + +### 1.7 Managed identity for policy-engine calls + +The policy acquires a managed identity token for `{{ContainerAppAudience}}`. That token is used for outbound logging and is also what the commented `/api/precheck-rest` alternative would use. + +### 1.8 Supplying the limit values + +The XML comments call out two ways to source the native APIM limits: + +1. **Default:** use APIM named values (`{{NonAiRequestsPerMinute}}`, `{{NonAiMonthlyRequestQuota}}`). +2. **Generated policy path:** have deployment automation fetch config from the policy engine and render/import the XML with concrete values. + +A runtime `send-request` config lookup is **not** enough for `quota-by-key`, because APIM does not allow policy expressions in that policy's `calls` or `renewal-period` attributes. + +--- + +## 2. `` — Pass-through + +```xml + + + +``` + +No custom backend-stage logic is required. + +--- + +## 3. `` — Fire-and-forget logging + +The outbound section mirrors the existing Entra policy's pattern by using a non-blocking call to the policy engine: + +```xml + + {{ContainerAppUrl}}/api/log-rest + ... + +``` + +Payload fields in the sample: + +- `clientAppId` +- `tenantId` +- `customerKey` +- `requestPath` +- `statusCode` +- `latencyMs` +- `correlationId` + +Because this is a one-way request, the caller does not wait for the log pipeline to finish. + +### Accounting implication + +The policy engine receives the event stream for dashboards and audit, but **APIM owns the live counters**. That means dashboard totals are derived from logs and may not exactly match APIM's real-time limit state at any instant. + +--- + +## 4. `` + +The sample keeps `on-error` minimal: + +```xml + + + +``` + +That matches the request for a lightweight draft focused on the policy contract. + +--- + +## 5. Named values to configure + +Set these named values in APIM before attaching the policy: + +| Named value | Purpose | +|---|---| +| `EntraTenantId` | Shared Entra policy-family value already provisioned by Terraform; retained for consistency/future hardening | +| `ExpectedAudience` | Required `aud` claim for incoming tokens | +| `NonAiRequestsPerMinute` | Native APIM per-minute limit | +| `NonAiMonthlyRequestQuota` | Native APIM 30-day quota | +| `NonAiBackendUrl` | Protected REST backend host/FQDN | +| `ContainerAppUrl` | AI Policy Engine base URL | +| `ContainerAppAudience` | Managed identity audience/resource for policy-engine calls | + +--- + +## 6. How to deploy + +1. **Create/update the APIM named values** listed above. +2. **Import `policies/entra-jwt-rest-policy.xml`** into the API or operation that fronts the non-AI backend. +3. **Prefer API scope** unless you intentionally want different operations to use different policies. +4. **Do not attach duplicate copies at multiple nested scopes** unless you intentionally want multiple increments. +5. **Verify managed identity access** so APIM can call `{{ContainerAppUrl}}/api/log-rest`. + +### Scope guidance + +- Attach at **API scope** if the whole REST API should share one request budget per customer. +- Attach at **operation scope** only when a subset of operations should be protected. +- If you reuse the exact same `customerKey` across multiple APIs, APIM will share the counters for that key unless you change the key composition. + +--- + +## 7. Trade-offs vs. policy-engine-enforced precheck + +### Native APIM default (this sample) + +**Pros** +- Lowest gateway-path latency. +- No precheck dependency before backend forwarding. +- Simple APIM-native operational model. + +**Cons** +- Live counters exist only inside APIM. +- Dashboard numbers are derived from logs, not authoritative limit state. +- Monthly quota is fixed-window, not billing-period aware. +- Native quota exhaustion returns **403**, not **429**. + +### Policy-engine precheck alternative + +**Pros** +- Central counter state in Redis/policy engine. +- Easier to align with billing periods and future chargeback rules. +- Dashboard can reflect enforcement state more directly. + +**Cons** +- Adds a hard dependency and latency on the request path. +- Requires `/api/precheck-rest/{clientAppId}` availability on every protected request. +- More moving parts to troubleshoot. + +--- + +## 8. Customer key design + +`customerKey = {clientAppId}:{tenantId}` + +Why this matters: + +- **Tenant-safe:** avoids collisions when the same application ID exists in multiple tenants. +- **Stable:** derived entirely from JWT claims already present in the request. +- **Reusable:** the same key works for APIM counters and downstream accounting logs. + +If the team later needs per-API isolation instead of a shared pool, prefix the key with an API identifier when generating the policy. diff --git a/.squad/files/non-ai-paused/entra-jwt-rest-policy.xml b/.squad/files/non-ai-paused/entra-jwt-rest-policy.xml new file mode 100644 index 00000000..40334a83 --- /dev/null +++ b/.squad/files/non-ai-paused/entra-jwt-rest-policy.xml @@ -0,0 +1,124 @@ + + + + + + + + + + {{ExpectedAudience}} + + + + + + + + + + + + + + + + + + + + + + + + {{ContainerAppUrl}}/api/log-rest + POST + + application/json + + + @("Bearer " + (string)context.Variables["msi-access-token"]) + + @{ + var latencyMs = (long)(DateTime.UtcNow - context.Timestamp).TotalMilliseconds; + var payload = new Newtonsoft.Json.Linq.JObject(); + payload.Add(new Newtonsoft.Json.Linq.JProperty("tenantId", (string)context.Variables.GetValueOrDefault("tenantId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("clientAppId", (string)context.Variables.GetValueOrDefault("clientAppId", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("customerKey", (string)context.Variables.GetValueOrDefault("customerKey", ""))); + payload.Add(new Newtonsoft.Json.Linq.JProperty("requestPath", context.Request.Url.Path)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("statusCode", context.Response.StatusCode)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("latencyMs", latencyMs)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); + return payload.ToString(); + } + + + + + + diff --git a/.squad/log/2026-05-14T12-40-01Z-fix-reply-urls.md b/.squad/log/2026-05-14T12-40-01Z-fix-reply-urls.md new file mode 100644 index 00000000..2dece08a --- /dev/null +++ b/.squad/log/2026-05-14T12-40-01Z-fix-reply-urls.md @@ -0,0 +1,11 @@ +# Session Log — Fix Reply URLs (2026-05-14T12:40:01Z) + +**Agent:** Sydnor (Infra/DevOps) +**Task:** Fix AADSTS500113 redirect URI issue +**Status:** ✅ FIXED + +## Summary + +Postprovision scripts were registering redirect URIs on the wrong app registration. Fixed both `postprovision.ps1` and `postprovision.sh` to use the Terraform-managed app ID (`api_app_id`) instead of the legacy app ID (`CONTAINER_APP_CLIENT_ID`). Verified redirect URI is now on the correct app. + +**Awaiting:** Zack login retry confirmation. diff --git a/.squad/log/2026-05-14_18-15-41UTC-infra-fixes-shipped.md b/.squad/log/2026-05-14_18-15-41UTC-infra-fixes-shipped.md new file mode 100644 index 00000000..f9526945 --- /dev/null +++ b/.squad/log/2026-05-14_18-15-41UTC-infra-fixes-shipped.md @@ -0,0 +1,9 @@ +# Session Log — Infra Fixes Shipped + +**Date:** 2026-05-14T18:15:41Z + +Coordinator shipped infra fixes (commit 3156888d) after successful `azd up` validation. Two commits pushed to fix/spa-publish-and-terraform-migration: +- 3156888d: Infra fixes (main.tfvars.json, preprovision/postprovision scripts) +- 241a662b: Squad CLI tooling + +User plans teardown + re-deploy for final validation. No new agent spawns this turn. diff --git a/.squad/log/2026-05-21T17-43-57Z-apim-config-binding-fix.md b/.squad/log/2026-05-21T17-43-57Z-apim-config-binding-fix.md new file mode 100644 index 00000000..2a5c539e --- /dev/null +++ b/.squad/log/2026-05-21T17-43-57Z-apim-config-binding-fix.md @@ -0,0 +1,10 @@ +# 2026-05-21T17:43:57Z — APIM Config-Binding Hotfix + +**Agent:** Freamon +**Request:** Zack Way + +## Session Summary + +Hotfix completed on `seiggy/feature/apim-policy-management`. Fixed APIM_RESOURCE_ID env var binding to use standard ASP.NET Core convention `Apim__ResourceId`. All 295 tests pass. Audited env vars; no other mismatches found. Decision recorded in `.squad/decisions/inbox/`. + +**Commit:** 016b6543 (pushed to PR #32) diff --git a/.squad/log/20260515T164518Z-fresh-deploy-postprovision-regression.md b/.squad/log/20260515T164518Z-fresh-deploy-postprovision-regression.md new file mode 100644 index 00000000..37e8217e --- /dev/null +++ b/.squad/log/20260515T164518Z-fresh-deploy-postprovision-regression.md @@ -0,0 +1,17 @@ +# Session Log: Fresh Deploy Postprovision Regression + +**Timestamp:** 2026-05-15T16:45:18Z +**Session Phase:** Validation + Bug Fix +**Agent Lead:** Sydnor + +## Issue + +Fresh `azd up` deploy succeeded but portal login failed (AADSTS500113). Root cause: postprovision script used stale Bicep-era variable names instead of Terraform outputs. + +## Fix + +Updated `scripts/postprovision.ps1` to use correct Terraform output variable names. Reran postprovision hook. Redirect URI now registered on Terraform-managed API app. + +## Outcome + +✅ Portal operational. Redirect URI verified. Awaiting user login test. diff --git a/.squad/log/20260515T165232Z-fresh-deploy-app-role-gap.md b/.squad/log/20260515T165232Z-fresh-deploy-app-role-gap.md new file mode 100644 index 00000000..2d02058f --- /dev/null +++ b/.squad/log/20260515T165232Z-fresh-deploy-app-role-gap.md @@ -0,0 +1,17 @@ +# Session Log — Fresh Deploy App Role Gap + +**Date:** 2026-05-15T16:52:32Z + +## Issue + +HTTP 403 on routing-policies endpoints after successful login on fresh `azd up`. Root cause: deploying user lacked `AIPolicy.Admin` app role (Terraform only assigns to service principals). + +## Resolution + +1. Granted Zack role immediately via Graph API +2. Updated postprovision scripts to auto-assign deploying user +3. User must logout/login for token refresh + +## Validation + +Awaiting Zack's confirmation. Will validate via fresh `azd up` after token refresh confirmed. diff --git a/.squad/orchestration-log/2026-05-14T12-40-01Z-sydnor.md b/.squad/orchestration-log/2026-05-14T12-40-01Z-sydnor.md new file mode 100644 index 00000000..88e3bf43 --- /dev/null +++ b/.squad/orchestration-log/2026-05-14T12-40-01Z-sydnor.md @@ -0,0 +1,44 @@ +# Orchestration Log — Sydnor (2026-05-14T12:40:01Z) + +## Agent +- **Name:** Sydnor +- **Role:** Infra/DevOps +- **Requested by:** Zack Way + +## Task +Fix AADSTS500113 — postprovision scripts were targeting wrong app for redirect URI registration. + +## Outcome +✅ **SUCCESS** + +### Root Cause +- The repository has TWO app registrations for the API: + - **Terraform-managed** (`api_app_id` = d5bd33f4-09b1-4602-af88-29c5ec7728e0) + - **Legacy/manual** (`CONTAINER_APP_CLIENT_ID` = 625db56c-f5cc-4ee5-954d-6775c709055e) +- UI is configured to use the Terraform-managed app (`VITE_AZURE_CLIENT_ID=d5bd33f4...`) +- But postprovision scripts were updating the LEGACY app, not the Terraform-managed app +- Result: UI tries to auth against Terraform-managed app, which has no redirect URIs → AADSTS500113 + +### Fix Applied +- Updated `scripts/postprovision.ps1` to use `api_app_id` (Terraform-managed app) instead of `CONTAINER_APP_CLIENT_ID` +- Updated `scripts/postprovision.sh` to use `api_app_id` (Terraform-managed app) instead of `CONTAINER_APP_CLIENT_ID` +- Also fixed error code reference: AADSTS50011 → AADSTS500113 (correct error code) + +### Verification +- Ran `azd hooks run postprovision` successfully +- Confirmed redirect URI `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainers.io` is now registered on the Terraform-managed API app (d5bd33f4...) +- The UI's MSAL config uses `redirectUri: window.location.origin`, so login flow will work when UI is served from Container App + +### Files Touched +- `scripts/postprovision.ps1` — Changed `CONTAINER_APP_CLIENT_ID` → `api_app_id` +- `scripts/postprovision.sh` — Changed `CONTAINER_APP_CLIENT_ID` → `api_app_id` +- `.squad/agents/sydnor/history.md` — Added learning entry (section 2026-05-14 — Redirect URI Registration) + +### Files Held (Await Zack Greenlight After Login Works) +- `azure.yaml` — Terraform IaC provider configuration +- `infra/terraform/main.tfvars.json` — Template variables +- `scripts/postprovision.ps1` — Updated redirect URI app +- `scripts/postprovision.sh` — Updated redirect URI app + +### Status +✅ **Pending:** Zack login retry confirmation. The fix is code-complete and verified; awaiting user validation that the login now works end-to-end. diff --git a/.squad/orchestration-log/2026-05-14_18-15-41UTC-coordinator-commit-push.md b/.squad/orchestration-log/2026-05-14_18-15-41UTC-coordinator-commit-push.md new file mode 100644 index 00000000..e3672575 --- /dev/null +++ b/.squad/orchestration-log/2026-05-14_18-15-41UTC-coordinator-commit-push.md @@ -0,0 +1,35 @@ +# Orchestration Log — Coordinator Commit + Push + +**Date:** 2026-05-14T18:15:41Z +**Agent:** Coordinator +**Branch:** fix/spa-publish-and-terraform-migration + +## Summary + +Coordinator committed and pushed two commits after infrastructure fixes validation via `azd up`. + +## Commits Shipped + +1. **3156888d** — Infra fixes: main.tfvars.json, pre/postprovision scripts + - `infra/terraform/main.tfvars.json` — azd Terraform provider variable template + - `scripts/preprovision.ps1` / `scripts/preprovision.sh` — UI env file generation + VITE_API_URL empty string (same-origin pattern) + - `scripts/postprovision.ps1` / `scripts/postprovision.sh` — Redirect URI registration on Terraform-managed app (api_app_id) + - Status: ✅ Validated via `azd up` — 77 Azure resources provisioned (9m59s), no errors + +2. **241a662b** — Squad CLI tooling + - `.squad/` directory infrastructure updates + - Status: Not detailed in manifest + +## Validation Directive + +**User (Zack) Directive Satisfied:** "Always validate infra fixes before committing" (2026-05-14 decision) + +- Coordinator ran `azd up` BEFORE committing +- Full infrastructure provisioned successfully +- Container App FQDN operational: `https://ca-h75aielsaei6q.proudsky-ba978644.eastus2.azurecontainerapps.io` +- Redirect URI registered on correct app (api_app_id = d5bd33f4-09b1-4602-af88-29c5ec7728e0) +- UI-to-API URL wiring: VITE_API_URL empty string (relative URLs, same-origin) + +## Next Steps + +User (Zack) plans teardown + re-deploy from scratch as final validation before PR merge. diff --git a/.squad/orchestration-log/2026-05-21T17-43-57Z-freamon.md b/.squad/orchestration-log/2026-05-21T17-43-57Z-freamon.md new file mode 100644 index 00000000..7a25b22b --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T17-43-57Z-freamon.md @@ -0,0 +1,43 @@ +# 2026-05-21 — Freamon Hotfix Run: APIM Config-Binding Bug + +**Agent:** Freamon +**Mode:** Background +**Request:** Zack Way +**Branch:** `seiggy/feature/apim-policy-management` +**PR:** #32 + +## Goal + +Fix APIM_RESOURCE_ID env var binding bug in ASP.NET Core configuration. + +## Issue + +`ApimManagementOptions` expects the APIM resource ID at config key `Apim:ResourceId`. The Container App Terraform wiring used `APIM_RESOURCE_ID`, which ASP.NET Core does not translate into the nested config key. The standard convention for nested keys is double underscore: `Apim__ResourceId`. + +## Outcome + +✅ **Hotfix completed and pushed to seiggy/feature/apim-policy-management** + +- **Commit:** 016b6543 +- **Change:** Updated Terraform env var binding from `APIM_RESOURCE_ID` to `Apim__ResourceId` +- **Tests:** All 295 tests pass +- **Audit:** Scanned all env vars; no other single-underscore ASP.NET Core nested-config mismatches found +- **Decision:** Documented APIM ResourceId env binding convention (standard ASP.NET Core double-underscore) + +## Files Modified + +- `infra/terraform/main.tf` — Terraform env var binding for APIM resource ID + +## Process + +1. Identified config binding mismatch between Container App env var and ASP.NET Core `EnvironmentVariablesConfigurationProvider` behavior +2. Applied hotfix: switched to conventional double-underscore syntax +3. Verified all 295 tests pass +4. Audited all other env vars for similar mismatches +5. Recorded decision for future APIM Terraform wiring + +## Notes + +- No breaking changes to application code +- Prevents silent runtime misbinding when infrastructure sets nested config values +- Aligns with ASP.NET Core idiomatic configuration patterns diff --git a/.squad/orchestration-log/20260515T164518Z-sydnor-postprovision-tf-output-fix.md b/.squad/orchestration-log/20260515T164518Z-sydnor-postprovision-tf-output-fix.md new file mode 100644 index 00000000..20a86764 --- /dev/null +++ b/.squad/orchestration-log/20260515T164518Z-sydnor-postprovision-tf-output-fix.md @@ -0,0 +1,36 @@ +# Orchestration Log: Sydnor Postprovision Terraform Output Fix + +**Timestamp:** 2026-05-15T16:45:18Z +**Agent:** Sydnor (Infra/DevOps) +**Status:** ✅ Complete + +## Summary + +Sydnor debugged and fixed a critical postprovision regression where fresh `azd up` deploys silently failed to register redirect URIs on the API app registration, causing AADSTS500113 login failures. + +**Root Cause:** Postprovision script queried Bicep-era environment variable names (`AZURE_RESOURCE_GROUP`, `COSMOS_ENDPOINT`) instead of actual Terraform output variable names (`resource_group_name`, `cosmos_endpoint`). Variable name mismatch caused silent script failure on fresh deploys. + +**Resolution:** Updated `scripts/postprovision.ps1` lines 11–12 to use correct Terraform output variable names. Reran `azd hooks run postprovision` successfully. Verified redirect URI registration on Terraform-managed API app. + +## Files Modified + +- `scripts/postprovision.ps1` — Fixed resource group + cosmos endpoint variable names + +## Verification + +- ✅ Fresh deploy portal loads cleanly +- ✅ API endpoint reachable (HTTP 401 auth required) +- ✅ Redirect URI registered on correct app (`4eda37fc-969c-4262-8569-ddcd68aa0370`) +- ✅ `azd hooks run postprovision` completes without errors + +## Validation Status + +Awaiting Zack's login confirmation in browser before commit. + +## Decision(s) Merged + +- **2026-05-15 Decision:** Postprovision scripts must use Terraform output variable names (not azd built-in names) + +## Cross-Agent Notes + +No other agents involved. Fix is isolated to postprovision infra workflow. diff --git a/.squad/orchestration-log/20260515T165232Z-sydnor-app-role-assignment-fix.md b/.squad/orchestration-log/20260515T165232Z-sydnor-app-role-assignment-fix.md new file mode 100644 index 00000000..1b4e5754 --- /dev/null +++ b/.squad/orchestration-log/20260515T165232Z-sydnor-app-role-assignment-fix.md @@ -0,0 +1,37 @@ +# Orchestration Log — Sydnor App Role Assignment Fix + +**Date:** 2026-05-15T16:52:32Z +**Agent:** Sydnor (Infra/DevOps) +**Batch:** Fresh-deploy app role gap remediation + +## Summary + +Sydnor diagnosed HTTP 403 errors on routing-policies endpoints after fresh `azd up` deployment. Root cause: `AIPolicy.Admin` app role was assigned by Terraform only to service principals, not to human users. Fixed via two complementary approaches: + +1. **Immediate:** Granted Zack the role manually via Graph API (`az rest` POST to appRoleAssignedTo) +2. **Durable:** Updated `scripts/postprovision.ps1` and `scripts/postprovision.sh` to auto-assign deploying user to `AIPolicy.Admin` app role, making the portal usable immediately after `azd up` + +## Implementation Details + +- **Postprovision Pattern:** Query signed-in user via `az ad signed-in-user show`, idempotently assign via Graph API +- **Fail-Safe:** Script continues if assignment fails; user can assign manually later +- **Token Warning:** User must log out/login to refresh token and receive role claims +- **Files Modified:** + - `scripts/postprovision.ps1` (lines 133+) + - `scripts/postprovision.sh` (lines 158+) + +## Status + +- ✅ Manual role assignment succeeded +- ✅ Postprovision scripts updated +- ⚠️ Awaiting Zack's logout/login validation before commit + +## Decision Document + +See `.squad/decisions/inbox/sydnor-app-role-assignment-postprovision.md` (merged to decisions.md) + +## Next Steps + +1. Zack validates token refresh (logout/login) and confirms portal access +2. Commit postprovision scripts +3. Validate via fresh `azd up` in test tenant diff --git a/.squad/skills/azd-postprovision-app-registration-reply-urls/SKILL.md b/.squad/skills/azd-postprovision-app-registration-reply-urls/SKILL.md new file mode 100644 index 00000000..ff95daee --- /dev/null +++ b/.squad/skills/azd-postprovision-app-registration-reply-urls/SKILL.md @@ -0,0 +1,78 @@ +# Skill: Azure AD App Registration Redirect URI Setup in azd Postprovision Hooks + +**Category:** Infrastructure / DevOps / Azure +**Applies to:** Projects using azd, Terraform, and Entra ID (Azure AD) app registrations with SPA/web redirect URIs +**Related:** MSAL.js authentication, Container Apps, SPA deployment + +## Problem + +When deploying SPAs (Single Page Applications) to Azure that use MSAL.js for authentication: + +1. The deployed URL isn't known until AFTER infrastructure is provisioned +2. Entra ID app registrations require redirect URIs to be registered BEFORE the app can complete authentication flows +3. Users get AADSTS500113 ("No reply address is registered for the application") if redirect URIs aren't set correctly +4. Multiple app registrations may exist (Terraform-managed, legacy, manual), causing confusion about which one to configure + +## Solution Pattern + +Use azd postprovision hooks to: + +1. Query the ACTUAL deployed URL from Azure (not from Terraform state, which can be stale) +2. Identify the CORRECT app registration (the one the SPA is configured to use) +3. Register the deployed URL as a redirect URI using Microsoft Graph API +4. Make the operation idempotent (safe to re-run without duplicating URIs) + +## Implementation + +See `.squad/decisions/inbox/sydnor-redirect-uri-postprovision-pattern.md` for full implementation details. + +Key script locations: +- `scripts/postprovision.ps1` (Windows) +- `scripts/postprovision.sh` (Linux/macOS) + +## Verification + +After running the postprovision hook: + +```bash +# Get the app ID +APP_ID=$(azd env get-values | grep api_app_id | cut -d= -f2 | tr -d '"') + +# Check registered redirect URIs +az ad app show --id "$APP_ID" --query "{displayName:displayName,spa:spa.redirectUris}" -o json +``` + +## Common Issues + +### AADSTS500113: No reply address is registered for the application + +**Cause:** The redirect URI is missing or doesn't match the URL the SPA is served from. + +**Fix:** Run `azd hooks run postprovision` to register the redirect URI. + +### Redirect URI registered on the wrong app + +**Cause:** Multiple app registrations exist. The postprovision script is using the wrong `APP_ID`. + +**Fix:** Check which app the SPA is configured to use (e.g., `VITE_AZURE_CLIENT_ID` in .env files) and ensure the postprovision script uses the matching Terraform output variable. + +### Postprovision silently skips redirect URI registration + +**Cause (fresh deploy):** Postprovision script queries the wrong environment variable name. Terraform outputs use snake_case (e.g., `resource_group_name`), but the script may query SCREAMING_CASE azd built-ins (e.g., `AZURE_RESOURCE_GROUP`). + +**Symptom:** `azd hooks run postprovision` prints "Skipping: AZURE_RESOURCE_GROUP not set" or similar, even though the resource group exists. + +**Fix:** Update postprovision script to use Terraform's exact output variable names: +```powershell +# WRONG (queries azd built-in, not Terraform output) +$resourceGroup = azd env get-values | Select-String "^AZURE_RESOURCE_GROUP=" + +# CORRECT (queries Terraform output) +$resourceGroup = azd env get-values | Select-String "^resource_group_name=" +``` + +**Why this matters:** On fresh deploys, only Terraform's output names are guaranteed to be in the azd environment. On update deploys, both may exist from prior manual runs, masking the bug. + +## Tags + +#azure #entra-id #app-registration #redirect-uri #msal #spa #azd #postprovision #graph-api #container-apps diff --git a/.squad/skills/azd-postprovision-app-role-assignment/SKILL.md b/.squad/skills/azd-postprovision-app-role-assignment/SKILL.md new file mode 100644 index 00000000..4b801b25 --- /dev/null +++ b/.squad/skills/azd-postprovision-app-role-assignment/SKILL.md @@ -0,0 +1,270 @@ +# Skill: Entra App Role Assignment for Deploying User in azd Postprovision Hooks + +**Category:** Infrastructure / DevOps / Azure +**Applies to:** Projects using azd, Terraform, Entra ID (Azure AD) app registrations with app roles for authorization +**Related:** ASP.NET Core authorization policies, MSAL.js authentication, role-based access control (RBAC) + +## Problem + +When deploying ASP.NET Core APIs to Azure that use Entra ID app roles for authorization: + +1. API endpoints are protected with `[Authorize(Policy = "...")]` attributes that require specific app roles (e.g., `AIPolicy.Admin`) +2. Terraform can define app roles on the app registration and assign them to service principals, but **cannot assign roles to the deploying user** (user object ID is unknown at Terraform apply time) +3. Users who deploy via `azd up` can successfully authenticate (log in), but receive **HTTP 403 Forbidden** on endpoints requiring app roles because their token has no roles +4. Manual role assignment via Azure Portal or Graph API is a poor first-run experience + +**403 vs 401:** +- **HTTP 401 Unauthorized** = authentication failed (no token, expired token, wrong audience, missing redirect URI) +- **HTTP 403 Forbidden** = authenticated BUT not authorized (token is valid but missing required scope/role/claim) + +## Solution Pattern + +Use azd postprovision hooks to: + +1. Query the current signed-in user's object ID (the person who ran `azd up`) +2. Identify the API app registration (from Terraform outputs in azd env) +3. Query the target app role ID (e.g., `AIPolicy.Admin`) from the app registration +4. Assign the user to the app role using Microsoft Graph API (`appRoleAssignedTo`) +5. Make the operation idempotent (check if user already has the role before assigning) +6. Warn the user that they must log out and log back in to receive a fresh token with the role + +## Implementation + +### Postprovision Script Pattern (PowerShell) + +```powershell +# Get the API app ID from azd environment (Terraform output) +$appId = azd env get-values | Select-String "^api_app_id=" | ForEach-Object { $_.Line -replace "^api_app_id=", "" } | ForEach-Object { $_.Trim('"') } + +# Get the service principal object ID +$spObjectId = az ad sp show --id $appId --query "id" -o tsv 2>$null + +# Get the target app role ID (e.g., AIPolicy.Admin) +$adminRoleId = az ad sp show --id $appId --query "appRoles[?value=='AIPolicy.Admin'].id | [0]" -o tsv 2>$null + +# Get the current user's object ID +$userObjectId = az ad signed-in-user show --query "id" -o tsv 2>$null + +# Check if user already has the role (idempotent check) +$existingAssignments = az rest --method GET ` + --uri "https://graph.microsoft.com/v1.0/servicePrincipals/$spObjectId/appRoleAssignedTo" ` + --query "value[?principalId=='$userObjectId' && appRoleId=='$adminRoleId']" ` + -o json 2>$null | ConvertFrom-Json + +if ($existingAssignments.Count -gt 0) { + Write-Host "AIPolicy.Admin role already assigned — skipping." -ForegroundColor Green +} else { + # Assign the role + $bodyObj = @{ + principalId = $userObjectId + resourceId = $spObjectId + appRoleId = $adminRoleId + } + $bodyJson = $bodyObj | ConvertTo-Json -Compress + $tmp = New-TemporaryFile + Set-Content -Path $tmp -Value $bodyJson -Encoding utf8 + az rest --method POST ` + --uri "https://graph.microsoft.com/v1.0/servicePrincipals/$spObjectId/appRoleAssignedTo" ` + --headers "Content-Type=application/json" ` + --body "@$tmp" -o none 2>$null + Remove-Item $tmp -Force + if ($LASTEXITCODE -eq 0) { + Write-Host " ✓ AIPolicy.Admin role assigned successfully." -ForegroundColor Green + Write-Host " ⚠ User must log out and log back in to receive a fresh token with the Admin role." -ForegroundColor Yellow + } +} +``` + +### Postprovision Script Pattern (Bash) + +```bash +# Get the API app ID from azd environment (Terraform output) +APP_ID=$(azd env get-values 2>/dev/null | grep "^api_app_id=" | sed "s/^api_app_id=//" | tr -d '"' || true) + +# Get the service principal object ID +SP_OBJECT_ID=$(az ad sp show --id "$APP_ID" --query "id" -o tsv 2>/dev/null || true) + +# Get the target app role ID (e.g., AIPolicy.Admin) +ADMIN_ROLE_ID=$(az ad sp show --id "$APP_ID" --query "appRoles[?value=='AIPolicy.Admin'].id | [0]" -o tsv 2>/dev/null || true) + +# Get the current user's object ID +USER_OBJECT_ID=$(az ad signed-in-user show --query "id" -o tsv 2>/dev/null || true) + +# Check if user already has the role (idempotent check) +EXISTING_COUNT=$(az rest --method GET \ + --uri "https://graph.microsoft.com/v1.0/servicePrincipals/$SP_OBJECT_ID/appRoleAssignedTo" \ + --query "value[?principalId=='$USER_OBJECT_ID' && appRoleId=='$ADMIN_ROLE_ID'] | length(@)" \ + -o tsv 2>/dev/null || echo "0") + +if [ "$EXISTING_COUNT" -gt 0 ]; then + echo "AIPolicy.Admin role already assigned — skipping." +else + # Assign the role + BODY=$(python3 -c "import json; print(json.dumps({'principalId': '$USER_OBJECT_ID', 'resourceId': '$SP_OBJECT_ID', 'appRoleId': '$ADMIN_ROLE_ID'}))" 2>/dev/null || echo "") + TMP=$(mktemp) + echo "$BODY" > "$TMP" + if az rest --method POST \ + --uri "https://graph.microsoft.com/v1.0/servicePrincipals/$SP_OBJECT_ID/appRoleAssignedTo" \ + --headers "Content-Type=application/json" \ + --body "@$TMP" -o none 2>/dev/null; then + echo " ✓ AIPolicy.Admin role assigned successfully." + echo " ⚠ User must log out and log back in to receive a fresh token with the Admin role." + fi + rm -f "$TMP" +fi +``` + +### ASP.NET Core Authorization Policy (C#) + +```csharp +// Program.cs or Startup.cs +builder.Services.AddAuthorizationBuilder() + .AddPolicy("AdminPolicy", policy => + policy.RequireRole("AIPolicy.Admin")) + .SetFallbackPolicy(new AuthorizationPolicyBuilder() + .RequireAuthenticatedUser() + .Build()); + +// Endpoint registration +routes.MapGet("/api/routing-policies", ListPolicies) + .RequireAuthorization("AdminPolicy"); +``` + +### Terraform App Role Definition + +```hcl +# infra/terraform/modules/identity/main.tf + +resource "random_uuid" "role_admin" {} + +resource "azuread_application" "api" { + display_name = "AI Policy API" + sign_in_audience = "AzureADMyOrg" + + app_role { + id = random_uuid.role_admin.result + display_name = "AI Policy Admin" + description = "Full administrative access to the AI Policy system" + value = "AIPolicy.Admin" + allowed_member_types = ["Application", "User"] + enabled = true + } +} +``` + +## Verification + +After running the postprovision hook: + +```powershell +# Get the app ID +$APP_ID = azd env get-values | Select-String "^api_app_id=" | ForEach-Object { $_.Line -replace "^api_app_id=", "" } | ForEach-Object { $_.Trim('"') } + +# Get the service principal object ID +$SP_OBJECT_ID = az ad sp show --id $APP_ID --query "id" -o tsv + +# List all app role assignments (users and service principals with roles) +az rest --method GET --uri "https://graph.microsoft.com/v1.0/servicePrincipals/$SP_OBJECT_ID/appRoleAssignedTo" --query "value[].{principal:principalDisplayName,role:appRoleId}" -o table +``` + +## Token Refresh Requirement + +**CRITICAL:** App role assignments do NOT affect existing tokens. Tokens are issued by Entra ID at login time and cached by MSAL for ~1 hour (default). + +Users must obtain a fresh token after role assignment: +- **Option 1:** Log out and log back in (MSAL clears token cache on logout) +- **Option 2:** Wait for token expiry (~1 hour by default) +- **Option 3:** Clear browser storage manually (MSAL stores tokens in localStorage) + +MSAL token cache keys: +- `msal..token.keys` +- `msal..idtoken` +- `msal..accesstoken` + +## Common Issues + +### HTTP 403 after successful login + +**Cause:** User is authenticated (token is valid) but their token has no app roles (role assignment missing or token not refreshed). + +**Symptoms:** +- Login succeeds (redirect URI works, no AADSTS errors) +- API returns HTTP 403 on endpoints requiring app roles +- Browser developer tools show no `roles` claim in the ID token or access token + +**Fix:** +1. Check if the user has the role assigned: + ```powershell + az rest --method GET --uri "https://graph.microsoft.com/v1.0/servicePrincipals//appRoleAssignedTo" --query "value[?principalId=='']" + ``` +2. If missing, run `azd hooks run postprovision` to assign the role +3. **Log out and log back in** to receive a fresh token + +### Postprovision silently skips app role assignment + +**Cause:** Script queries the wrong environment variable name (e.g., `AZURE_RESOURCE_GROUP` instead of `resource_group_name`). See the redirect URI skill for details. + +**Fix:** Ensure postprovision script uses Terraform's exact output variable names (snake_case). + +### "Insufficient privileges to complete the operation" + +**Cause:** The deploying user's Entra ID account lacks permission to assign app roles (requires `AppRoleAssignment.ReadWrite.All` or `RoleManagement.ReadWrite.Directory` Graph API permissions). + +**Common scenarios:** +- Guest users in a tenant (external accounts) +- Non-admin users in locked-down tenants + +**Workarounds:** +1. Request a tenant admin to run `azd hooks run postprovision` +2. Request a tenant admin to assign the role manually via Azure Portal: + - Go to Enterprise Applications → AI Policy API → Users and groups → Add user/group + - Select the user and assign the "AI Policy Admin" role +3. Use a service principal with sufficient permissions for `azd up` (advanced) + +### Only the deploying user has admin access + +**Cause:** Postprovision only assigns the role to the person who ran `azd up`. + +**Fix (per-user onboarding):** Each team member can run: +```bash +azd hooks run postprovision +``` +This assigns the role to their account (idempotent — safe to re-run). + +**Fix (bulk assignment):** A tenant admin can assign roles to a security group via Azure Portal: +- Go to Enterprise Applications → AI Policy API → Users and groups → Add user/group +- Select a security group (e.g., "AI Policy Admins") +- Assign the "AI Policy Admin" role +- Add users to the security group (users inherit the role) + +## Why Postprovision (Not Terraform) + +- **Terraform requires user object ID at apply time** — not available for the deploying user (dynamic at runtime) +- **Postprovision has access to the deploying user** via `az ad signed-in-user show` (uses the already-authenticated az CLI context) +- **Consistent with existing patterns** — postprovision already handles redirect URI registration (another runtime-dependent value) + +## App Roles vs Scopes (Key Differences) + +| Concept | App Roles | Scopes (Delegated Permissions) | +|---------|-----------|--------------------------------| +| **Purpose** | Assign roles to users/apps | Grant delegated permissions to call APIs | +| **Claim in token** | `roles` array | `scp` string (space-separated) | +| **Defined on** | Resource app (API app) | Resource app (API app) | +| **Assigned to** | Users, Groups, Service Principals | Client app (via `required_resource_access`) | +| **Consent** | Admin assigns role | User/admin grants consent | +| **ASP.NET Core** | `.RequireRole("AIPolicy.Admin")` | `.RequireScope("access_as_user")` | +| **OAuth2/OIDC** | Not part of OAuth2 spec (Microsoft extension) | Standard OAuth2 `scope` parameter | +| **Use case** | Authorization (what can this user DO) | Authentication (what APIs can this app call) | + +**Rule of thumb:** Use **app roles** for coarse-grained authorization (Admin, User, Viewer). Use **scopes** for API-to-API calls (delegated access). + +## References + +- **Graph API:** `POST /servicePrincipals/{id}/appRoleAssignedTo` (assigns user/app to app role) +- **Graph API:** `GET /servicePrincipals/{id}/appRoleAssignedTo` (list all role assignments) +- **Microsoft Docs:** [App roles in Azure AD](https://learn.microsoft.com/en-us/azure/active-directory/develop/howto-add-app-roles-in-azure-ad-apps) +- **Microsoft Docs:** [Role-based authorization in ASP.NET Core](https://learn.microsoft.com/en-us/aspnet/core/security/authorization/roles) + +## Tags + +#azure #entra-id #app-roles #authorization #rbac #http-403 #msal #azd #postprovision #graph-api #terraform #asp-net-core diff --git a/.squad/skills/container-app-ui-api-url-wiring/SKILL.md b/.squad/skills/container-app-ui-api-url-wiring/SKILL.md new file mode 100644 index 00000000..6d8bd1c0 --- /dev/null +++ b/.squad/skills/container-app-ui-api-url-wiring/SKILL.md @@ -0,0 +1,29 @@ +# Container App UI-to-API URL Wiring + +## Pattern + +When deploying a React SPA alongside a .NET API in the same Azure Container App with `azd`, use **relative URLs** for API calls to avoid hardcoding FQDNs. + +## Why This Matters + +Azure Container Apps FQDNs are assigned AFTER `azd provision` completes. If the UI needs the API URL at build time (e.g., for Vite `import.meta.env.VITE_API_URL`), there is a timing problem: +- **Preprovision** hook runs BEFORE provisioning → CA FQDN unknown +- **UI build** happens inside `dotnet publish` → needs config at that time +- **Postprovision** hook runs AFTER provisioning → too late for build-time config + +## Solution: Same-Origin Relative URLs + +If the UI is served FROM the same Container App as the API, use relative URLs. + +## Files Modified (This Project) + +- `scripts/preprovision.ps1` — Added `VITE_API_URL=` to SPA env file generation +- `scripts/preprovision.sh` — Added `VITE_API_URL=` to SPA env file generation +- `src/aipolicyengine-ui/.env.production.local` — Generated file (git-ignored) + +## Reference + +- Symptom: User reported dashboard timing out on API calls, no inbound requests in container logs +- Root cause: UI had stale hardcoded API URL from previous CA deployment +- Fix: Switched to same-origin relative URLs, regenerated config, redeployed +- Validation: curl confirmed API reachable, UI now calls correct endpoint From 06c32fcb7a67a6be7de0ff7d6738ed49c338ad7f Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 13:55:33 -0400 Subject: [PATCH 04/14] fix(ui): stop /apis page infinite render loop on policy load Root cause: the initial APIM bootstrap callback depended on operationsByApi while also resetting that state, which changed the callback identity and retriggered the mount effect. Fix: keep the latest operations map in a ref so loadInitialData can reconcile the current selection without depending on the mutable operationsByApi object. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/kima/history.md | 2 ++ .../decisions/inbox/kima-apis-render-loop.md | 7 +++++ .../react-render-loop-debugging/SKILL.md | 26 +++++++++++++++++++ src/aipolicyengine-ui/src/pages/Apis.tsx | 11 +++++--- 4 files changed, 43 insertions(+), 3 deletions(-) create mode 100644 .squad/decisions/inbox/kima-apis-render-loop.md create mode 100644 .squad/skills/react-render-loop-debugging/SKILL.md diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index a34ea236..b6fe2314 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -13,6 +13,8 @@ ## Learnings *Core learnings consolidated in Core Context section above (see git history for detailed entries).* +- 2026-05-21: The `/apis` page render loop came from `loadInitialData` depending on `operationsByApi` while also resetting that state to a fresh `{}`, which changed the callback identity and re-fired the mount effect forever. +- Rule: if an effect triggers a callback that mutates local maps/arrays, keep the callback keyed to stable inputs and read the latest collection through a ref or stable ID instead of adding the collection itself to the callback deps. ## Archived Learnings (Pre-May 2026) diff --git a/.squad/decisions/inbox/kima-apis-render-loop.md b/.squad/decisions/inbox/kima-apis-render-loop.md new file mode 100644 index 00000000..44864bb8 --- /dev/null +++ b/.squad/decisions/inbox/kima-apis-render-loop.md @@ -0,0 +1,7 @@ +# Kima — /apis render-loop guardrail + +- **Date:** 2026-05-21 +- **Scope:** `src/aipolicyengine-ui/src/pages/Apis.tsx` +- **Root cause:** The page bootstrapping effect depended on `loadInitialData`, and `loadInitialData` depended on `operationsByApi` even though it also reset `operationsByApi` to a fresh object. That changed the callback identity after every fetch and re-triggered the effect indefinitely. +- **Decision / convention:** Callbacks invoked by mount or refresh effects must depend only on stable values. When they need the latest map/array state for reconciliation, read it through a ref or derive stable IDs first. +- **Why it matters:** This keeps admin pages from self-triggering fetch loops when they reset cached child collections during refresh. diff --git a/.squad/skills/react-render-loop-debugging/SKILL.md b/.squad/skills/react-render-loop-debugging/SKILL.md new file mode 100644 index 00000000..ca66ad8d --- /dev/null +++ b/.squad/skills/react-render-loop-debugging/SKILL.md @@ -0,0 +1,26 @@ +--- +name: "react-render-loop-debugging" +description: "Diagnose and fix React render loops caused by unstable effect or callback dependencies" +domain: "frontend, react, debugging" +confidence: "high" +source: "earned — from fixing the APIM /apis page infinite render loop" +--- + +## Context +Use this skill when a React page keeps re-rendering, refetching, or polling forever after mount or selection changes. + +## Patterns +1. Trace the full effect chain: `useEffect` -> callback/hook -> `setState` -> dependency churn. +2. If a callback is invoked by a mount effect, do not make that callback depend on maps, arrays, or objects that the callback also resets or replaces. +3. Use refs to read the latest mutable collections inside async callbacks when you need current reconciliation data without changing callback identity. +4. Prefer stable IDs/strings in polling and fetch effects over whole selected objects. +5. Check retry toasts and polling callbacks for recursive calls that capture unstable objects. + +## Examples +- `src/aipolicyengine-ui/src/pages/Apis.tsx`: `loadInitialData` originally depended on `operationsByApi` and also called `setOperationsByApi({})`, so the mount effect keyed to `loadInitialData` re-fired continuously. +- Fix pattern: mirror `operationsByApi` into `operationsByApiRef.current` and read the ref inside the callback, leaving the callback deps stable. + +## Anti-Patterns +- `useEffect(() => { void loadInitialData() }, [loadInitialData])` when `loadInitialData` depends on state it also recreates. +- Depending on inline objects/arrays or selected entity objects when only an ID is needed. +- Polling effects that re-arm on every render because status inputs are not stable. diff --git a/src/aipolicyengine-ui/src/pages/Apis.tsx b/src/aipolicyengine-ui/src/pages/Apis.tsx index 4f6659cd..01b17ffb 100644 --- a/src/aipolicyengine-ui/src/pages/Apis.tsx +++ b/src/aipolicyengine-ui/src/pages/Apis.tsx @@ -1,4 +1,4 @@ -import { useCallback, useEffect, useMemo, useState } from "react" +import { useCallback, useEffect, useMemo, useRef, useState } from "react" import { useMsal } from "@azure/msal-react" import { AlertTriangle, Network, RefreshCcw } from "lucide-react" import { ApiTree } from "../components/apis/ApiTree" @@ -158,6 +158,7 @@ export function Apis() { const [loadingOperationApiIds, setLoadingOperationApiIds] = useState([]) const [operationsByApi, setOperationsByApi] = useState>({}) const [operationErrors, setOperationErrors] = useState>({}) + const operationsByApiRef = useRef>({}) const [selectedTarget, setSelectedTarget] = useState(null) const [policyDocument, setPolicyDocument] = useState(null) @@ -180,6 +181,10 @@ export function Apis() { const busy = submittingAssignment || clearingAssignment const planDefaults = useMemo(() => derivePlanDefaults(plans), [plans]) + useEffect(() => { + operationsByApiRef.current = operationsByApi + }, [operationsByApi]) + const showToast = useCallback((message: string, onRetry?: () => void, retryLabel = "Retry") => { setToast({ message, onRetry, retryLabel: onRetry ? retryLabel : undefined }) }, []) @@ -284,7 +289,7 @@ export function Apis() { return { kind: "api", api: matchingApi } } - const existingOperations = operationsByApi[matchingApi.id] ?? [] + const existingOperations = operationsByApiRef.current[matchingApi.id] ?? [] const matchingOperation = existingOperations.find((operation) => operation.id === current.operation.id) return matchingOperation ? { kind: "operation", api: matchingApi, operation: matchingOperation } @@ -308,7 +313,7 @@ export function Apis() { } finally { setInitialLoading(false) } - }, [handleAccessError, operationsByApi, showToast]) + }, [handleAccessError, showToast]) useEffect(() => { if (lacksExplicitAdminRole) { From 7ffa4f30dc33735a1ff4240cee2ffe7983ea73de Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 13:58:27 -0400 Subject: [PATCH 05/14] chore(squad): log /apis render loop fix + react skill - Orchestration log: 2026-05-21T18-35-00Z-kima.md documents the infinite render loop fix - Session log: 2026-05-21T18-35-00Z-apis-render-loop-fix.md brief summary - Merged decision inbox: kima-apis-render-loop.md into decisions.md as 2026-05-21T18:35:00Z entry - Cross-agent update: appended render-loop skill note to bunk/history.md with test coverage suggestion - Deleted inbox file after merge Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 8 ++++ .squad/decisions.md | 11 +++++ .../decisions/inbox/kima-apis-render-loop.md | 7 --- ...26-05-21T18-35-00Z-apis-render-loop-fix.md | 23 ++++++++++ .../2026-05-21T18-35-00Z-kima.md | 44 +++++++++++++++++++ 5 files changed, 86 insertions(+), 7 deletions(-) delete mode 100644 .squad/decisions/inbox/kima-apis-render-loop.md create mode 100644 .squad/log/2026-05-21T18-35-00Z-apis-render-loop-fix.md create mode 100644 .squad/orchestration-log/2026-05-21T18-35-00Z-kima.md diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index 1eb6a6be..8e10b28b 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -451,3 +451,11 @@ When writing tests for deployed infrastructure: - For Azure.ResourceManager/APIM coverage, mock at Freamon's interface seam instead of the SDK surface. Treat `IApimCatalogService` as the unit-test boundary for apply/clear/status tests; save recorded Azure fixtures for a later live-APIM pass. - Template rendering edge cases discovered: unknown params hard-fail, required params hard-fail, numeric strings are accepted for `int`, defaults are applied when declared, repeated placeholders all replace, and `{{ Name }}` whitespace variants are left literal because only exact `{{Name}}` tokens are recognized. - The shipped APIM templates contain policy-expression syntax (`As`, nested quotes, leading comments) that `XDocument.Parse` rejects even though the templates are otherwise usable for APIM management scenarios. Template validation had to be relaxed to root-tag checks so M1–M3 tests can exercise real shipped templates. +### 2026-05-21 — Cross-Agent Note: React Render-Loop Debugging & Apis.tsx Test Coverage + +**From:** Kima (UI Developer) +**Note:** New skill available: .squad/skills/react-render-loop-debugging/SKILL.md — documents the pattern and fix for infinite render loops caused by callbacks with circular dependencies on the very state they modify. + +**Action for Bunk:** Consider adding render-loop guard test coverage to src/aipolicyengine-ui/src/pages/Apis.tsx (e.g., assertion that the fetch function is called ≤ N times during mount/load). This would catch future regressions where the component re-fetches more than expected. Pattern: wrap render in ct(), mount component, spy on fetch function, verify call count ≤ expected threshold. + +**Context:** Kima fixed an infinite re-fetch loop in Apis.tsx by stabilizing the loadInitialData callback and reading latest state via a ref. See decisions.md entry 2026-05-21T18:35:00Z for full decision. diff --git a/.squad/decisions.md b/.squad/decisions.md index 55c84251..ccd922f1 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -12,6 +12,17 @@ ## Active Decisions +### 2026-05-21T18:35:00Z: React effect callback stabilization — /apis render-loop guardrail +**By:** Kima (UI Developer) +**Status:** Implemented +**What:** Fix infinite render loop in `Apis.tsx` by eliminating circular dependency in `loadInitialData` useCallback. The callback depended on `operationsByApi` AND reset it to a fresh object, causing the callback identity to change after every fetch and re-trigger the mount effect indefinitely. +**Solution:** Stabilize the callback by removing the circular dependency and reading the latest operations map via a ref (mutable reference that doesn't trigger re-runs). This maintains all original fetch and update behavior while preventing re-trigger loops. +**Why:** This pattern is common in pages that cache child collections and refresh them on demand. Callbacks invoked by mount or refresh effects must depend only on stable values. When they need the latest map/array state for reconciliation, read it through a ref or derive stable IDs first. +**Files Modified:** `src/aipolicyengine-ui/src/pages/Apis.tsx` +**Validation:** `npm run build` ✅, `npm run lint` ✅ +**Skill:** Kima wrote `.squad/skills/react-render-loop-debugging/SKILL.md` for future reference. +**Cross-Agent Note:** Bunk flagged for render-loop guard test coverage in Apis.tsx (e.g., assertion that fetch is called ≤ N times during mount/load). + ### 2026-05-21T17:43:57Z: APIM ResourceId env binding convention **By:** Freamon (Backend Dev) **Status:** Accepted diff --git a/.squad/decisions/inbox/kima-apis-render-loop.md b/.squad/decisions/inbox/kima-apis-render-loop.md deleted file mode 100644 index 44864bb8..00000000 --- a/.squad/decisions/inbox/kima-apis-render-loop.md +++ /dev/null @@ -1,7 +0,0 @@ -# Kima — /apis render-loop guardrail - -- **Date:** 2026-05-21 -- **Scope:** `src/aipolicyengine-ui/src/pages/Apis.tsx` -- **Root cause:** The page bootstrapping effect depended on `loadInitialData`, and `loadInitialData` depended on `operationsByApi` even though it also reset `operationsByApi` to a fresh object. That changed the callback identity after every fetch and re-triggered the effect indefinitely. -- **Decision / convention:** Callbacks invoked by mount or refresh effects must depend only on stable values. When they need the latest map/array state for reconciliation, read it through a ref or derive stable IDs first. -- **Why it matters:** This keeps admin pages from self-triggering fetch loops when they reset cached child collections during refresh. diff --git a/.squad/log/2026-05-21T18-35-00Z-apis-render-loop-fix.md b/.squad/log/2026-05-21T18-35-00Z-apis-render-loop-fix.md new file mode 100644 index 00000000..dd7622a1 --- /dev/null +++ b/.squad/log/2026-05-21T18-35-00Z-apis-render-loop-fix.md @@ -0,0 +1,23 @@ +# Session Log: 2026-05-21 — /apis render-loop fix + +**Agent:** Kima +**Context:** Fix infinite render loop in Apis.tsx +**Branch:** `seiggy/feature/apim-policy-management` +**Commit:** `06c32fcb` + +## Summary + +Eliminated infinite re-fetch loop by stabilizing useCallback and using ref for latest state. + +## What Was Done + +1. Identified circular dependency: `loadInitialData` callback depended on `operationsByApi` while also resetting it +2. Refactored to read `operationsByApi` via ref instead of dependency +3. Verified no loops with build and lint passing +4. Created react render-loop debugging skill for future reference + +## Validation + +✅ Builds +✅ Lints +✅ No test regressions diff --git a/.squad/orchestration-log/2026-05-21T18-35-00Z-kima.md b/.squad/orchestration-log/2026-05-21T18-35-00Z-kima.md new file mode 100644 index 00000000..fbc57637 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T18-35-00Z-kima.md @@ -0,0 +1,44 @@ +# Orchestration Log: 2026-05-21 — Kima — /apis render-loop fix + +**Agent:** Kima (UI Developer) +**Status:** ✅ Complete +**Branch:** `seiggy/feature/apim-policy-management` +**Commit:** `06c32fcb` + +## Summary + +Fixed an infinite render loop on the `/apis` page caused by a circular dependency in the mount effect's callback. + +## Root Cause + +The `loadInitialData` useCallback depended on `operationsByApi` AND reset it to a fresh object within the same function body. This caused: +1. After first fetch: `loadInitialData` identity changes (because `operationsByApi` ref changed) +2. Mount effect re-triggers (dependency changed) +3. `loadInitialData` runs again, resetting `operationsByApi` again +4. Loop continues indefinitely + +## Solution + +- Stabilized `loadInitialData` by removing the circular dependency on `operationsByApi` +- Read the latest operations map via a ref (mutable reference that doesn't trigger effect re-runs) +- Preserved all original fetch and update behavior + +## Files Changed + +- `src/aipolicyengine-ui/src/pages/Apis.tsx` + +## Validation + +- ✅ `npm run build` (includes `tsc -b`) +- ✅ `npm run lint` +- ✅ No new test failures +- ✅ Manual verification: no re-fetch loops on page load + +## Decision Captured + +See `decisions.md` entry: **2026-05-21 — Kima — /apis render-loop guardrail** + +## Cross-Agent Updates + +- Kima wrote `.squad/skills/react-render-loop-debugging/SKILL.md` +- Bunk flagged for render-loop guard test coverage in `Apis.tsx` From 3aeea0538ba0486ddc4e6da226837248ade85d86 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 14:04:14 -0400 Subject: [PATCH 06/14] fix(ui): polish /apis tree overflow + assign template modal layout - ApiTree: Add min-w-0 flex-1 truncate to API name, flex-shrink-0 on badge - ApiTree: Remove redundant serviceUrl display (path is sufficient) - ApiTree: Simplify operation rows - show method badge + urlTemplate only (removes duplicate verb text) - AssignTemplateForm: Add min-w-0 + truncate to param label, flex-shrink-0 on badges - AssignTemplateForm: Change param grid to sm:grid-cols-2 for better modal fit - Dialog: Add overflow-x-hidden to prevent horizontal scrollbar Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../src/components/apis/ApiTree.tsx | 14 +++++--------- .../src/components/apis/AssignTemplateForm.tsx | 10 +++++----- src/aipolicyengine-ui/src/components/ui/dialog.tsx | 2 +- 3 files changed, 11 insertions(+), 15 deletions(-) diff --git a/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx b/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx index 3f1b91ae..19861fb3 100644 --- a/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx +++ b/src/aipolicyengine-ui/src/components/apis/ApiTree.tsx @@ -82,13 +82,10 @@ export function ApiTree({ aria-current={selectedKey === apiKey ? "page" : undefined} >
- {api.displayName} - {api.isCurrent && Current} + {api.displayName} + {api.isCurrent && Current}
/{api.path} - {api.serviceUrl && ( - {api.serviceUrl} - )}
@@ -123,12 +120,11 @@ export function ApiTree({ aria-current={selectedKey === operationKey ? "page" : undefined} > -
+
- {operation.displayName} - {operation.method} + {operation.method} + {operation.urlTemplate}
-
{operation.urlTemplate}
) diff --git a/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx b/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx index 2ca9a1c4..597c943e 100644 --- a/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx +++ b/src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx @@ -173,15 +173,15 @@ export function AssignTemplateForm({ ) : selectedTemplate.parameters.length === 0 ? (

This template does not require parameters.

) : ( -
+
{selectedTemplate.parameters.map((parameter) => ( -
+
-
{parameter.description && (

{parameter.description}

diff --git a/src/aipolicyengine-ui/src/components/ui/dialog.tsx b/src/aipolicyengine-ui/src/components/ui/dialog.tsx index a0fc0bd5..afc5ea1d 100644 --- a/src/aipolicyengine-ui/src/components/ui/dialog.tsx +++ b/src/aipolicyengine-ui/src/components/ui/dialog.tsx @@ -13,7 +13,7 @@ function Dialog({ open, onOpenChange, children }: DialogProps) { return (
onOpenChange(false)} /> -
+
{children}
From 22cc565a0e81fa4da027616e6ba725f6a19dd03b Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 14:07:09 -0400 Subject: [PATCH 07/14] chore(squad): log /apis UI polish + tailwind skill - Orchestration log: Kima's /apis layout fixes (ApiTree, AssignTemplateForm, dialog) - Session log: Brief summary of UI polish completion - Cross-agent: Append tailwind-flex-truncate-pattern SKILL note to Bunk's history - Commit from latest main (3aeea053) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 9 ++ .squad/agents/kima/history.md | 15 +++- .../2026-05-21T18-05-38Z-apis-ui-polish.md | 10 +++ .../2026-05-21T18-05-38Z-kima.md | 29 +++++++ .../tailwind-flex-truncate-pattern/SKILL.md | 85 +++++++++++++++++++ 5 files changed, 147 insertions(+), 1 deletion(-) create mode 100644 .squad/log/2026-05-21T18-05-38Z-apis-ui-polish.md create mode 100644 .squad/orchestration-log/2026-05-21T18-05-38Z-kima.md create mode 100644 .squad/skills/tailwind-flex-truncate-pattern/SKILL.md diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index 8e10b28b..1a29d3ca 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -459,3 +459,12 @@ When writing tests for deployed infrastructure: **Action for Bunk:** Consider adding render-loop guard test coverage to src/aipolicyengine-ui/src/pages/Apis.tsx (e.g., assertion that the fetch function is called ≤ N times during mount/load). This would catch future regressions where the component re-fetches more than expected. Pattern: wrap render in ct(), mount component, spy on fetch function, verify call count ≤ expected threshold. **Context:** Kima fixed an infinite re-fetch loop in Apis.tsx by stabilizing the loadInitialData callback and reading latest state via a ref. See decisions.md entry 2026-05-21T18:35:00Z for full decision. + +### 2026-05-21 — Cross-Agent Note: Tailwind Flex/Truncate Pattern for UI Components + +**From:** Kima (UI Developer) +**Note:** New skill available: `.squad/skills/tailwind-flex-truncate-pattern/SKILL.md` — documents layout pattern combining `min-w-0`, `flex-shrink-0`, `flex-1`, and `truncate` for preventing row/card overflow in flex containers and handling badge positioning. + +**Action for Bunk:** Consider applying this pattern to UI component test coverage (ApiTree.tsx, AssignTemplateForm.tsx) to verify no text overflow regressions when grid resizes or content grows. Pattern examples: responsive badge placement with fixed widths, label truncation with dynamic form fields. + +**Context:** Kima fixed `/apis` page layout bugs (ApiTree row overflow, AssignTemplateForm param card overlap, modal horizontal scroll) by applying this Tailwind pattern systematically. Commit `3aeea053` on `seiggy/feature/apim-policy-management`. diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index b6fe2314..436acf4a 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -54,4 +54,17 @@ Freamon fixed a config-binding bug in the APIM infrastructure: the env var `APIM - When frontend reads backend config, expect the same pattern - This is idiomatic ASP.NET Core, not a special case -**Full decision merged into `.squad/decisions.md`.** \ No newline at end of file +**Full decision merged into `.squad/decisions.md`.** + +## 2026-05-22 — Flex+Truncate Pattern for Badge/Title Rows + +**Layout Bug Fix Session:** +- Fixed text overflow in API tree rows and modal parameter cards +- Pattern: when a title and badge(s) share a flex row, the title needs `min-w-0 flex-1 truncate` and badges need `flex-shrink-0` +- Without `min-w-0`, flex items won't shrink below intrinsic content width, causing overflow +- Removed redundant `serviceUrl` from API tree (path is sufficient, URL cluttered the row) +- Simplified operation rows: show method badge + urlTemplate instead of duplicated `displayName` + verb + badge +- Modal horizontal scroll fixed with `overflow-x-hidden` on dialog container +- Parameter card grid changed from `md:grid-cols-2` to `sm:grid-cols-2` for narrower modal viewport fit + +**Rule:** For any flex row with text + badges: `TextLabel` \ No newline at end of file diff --git a/.squad/log/2026-05-21T18-05-38Z-apis-ui-polish.md b/.squad/log/2026-05-21T18-05-38Z-apis-ui-polish.md new file mode 100644 index 00000000..efff1b20 --- /dev/null +++ b/.squad/log/2026-05-21T18-05-38Z-apis-ui-polish.md @@ -0,0 +1,10 @@ +# Session: /apis UI Polish +**Date:** 2026-05-21T18:05:38Z +**Agent:** Kima +**Branch:** seiggy/feature/apim-policy-management + +## Summary +Fixed UI layout bugs on `/apis` page (ApiTree row overflow, AssignTemplateForm param card overlap, modal horizontal scroll). Commit `3aeea053` pushed. New Tailwind flex/truncate pattern captured in `.squad/skills/tailwind-flex-truncate-pattern/SKILL.md`. + +## Status +✅ Complete diff --git a/.squad/orchestration-log/2026-05-21T18-05-38Z-kima.md b/.squad/orchestration-log/2026-05-21T18-05-38Z-kima.md new file mode 100644 index 00000000..2470ff4d --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T18-05-38Z-kima.md @@ -0,0 +1,29 @@ +# Orchestration: Kima @ 2026-05-21T18:05:38Z + +## Agent +**Kima** — UI polish & layout fix + +## Scope +Fixed `/apis` page UI layout bugs per Zack's screenshots. + +## Changes +- **ApiTree.tsx**: Min-w-0 + flex-shrink-0 on badge; dropped redundant serviceUrl + duplicate verb text +- **AssignTemplateForm.tsx**: Min-w-0/flex-1/truncate on label; flex-shrink-0 on badges; sm:grid-cols-2 +- **dialog.tsx**: Added overflow-x-hidden for modal horizontal scroll fix + +## Artifacts +- Files: `src/aipolicyengine-ui/src/components/apis/ApiTree.tsx`, `src/aipolicyengine-ui/src/components/apis/AssignTemplateForm.tsx`, `src/aipolicyengine-ui/src/components/ui/dialog.tsx` +- Skill: `.squad/skills/tailwind-flex-truncate-pattern/SKILL.md` (created) + +## Validation +- `npm run build` ✅ +- `npm run lint` ✅ + +## Branch +`seiggy/feature/apim-policy-management` (PR #32) + +## Commit +`3aeea053` + +## Requestor +Zack Way diff --git a/.squad/skills/tailwind-flex-truncate-pattern/SKILL.md b/.squad/skills/tailwind-flex-truncate-pattern/SKILL.md new file mode 100644 index 00000000..8ff9274a --- /dev/null +++ b/.squad/skills/tailwind-flex-truncate-pattern/SKILL.md @@ -0,0 +1,85 @@ +--- +name: "tailwind-flex-truncate-pattern" +description: "Proper flex truncation for title+badge rows in Tailwind CSS" +domain: "ui-layout" +confidence: "high" +source: "earned" +--- + +## Context +When building UI components with text labels alongside badges, pills, or action buttons in a flex row, text often overflows or prevents proper truncation. This is a common issue in tree views, list items, card headers, and modal forms. + +## Patterns + +### Title + Badge Row +```tsx +
+ {title} + Label +
+``` + +**Key classes:** +- `min-w-0` — allows flex item to shrink below intrinsic content width (critical for truncation) +- `flex-1` — title takes remaining space after badge +- `truncate` — adds `text-overflow: ellipsis; overflow: hidden; white-space: nowrap` +- `flex-shrink-0` — badge never shrinks, always shows full text + +### Form Label + Multiple Badges +```tsx +
+ + {isRequired && Required} + {type} +
+``` + +### Nested Flex Container +When the flex row is inside another flex container (e.g., a button or clickable area), the parent also needs `min-w-0`: +```tsx + +``` + +## Examples + +### Before (broken) +```tsx +
+ {api.displayName} + Current +
+``` +Result: Badge pushes text, but text doesn't truncate properly — overflows container. + +### After (fixed) +```tsx +
+ {api.displayName} + Current +
+``` +Result: Text truncates with ellipsis, badge stays visible. + +## Anti-Patterns + +❌ **Missing `min-w-0`** — without it, flexbox respects intrinsic width and won't truncate +```tsx +{text} // Won't truncate! +``` + +❌ **Badge without `flex-shrink-0`** — badge may shrink and hide text on narrow containers +```tsx +{label} // May shrink awkwardly +``` + +❌ **Using `whitespace-nowrap` without container constraints** — causes horizontal overflow +```tsx +
{longText}
// Overflows parent +``` + +❌ **Setting `overflow-hidden` on wrong element** — must be on the element with `truncate`, not a distant parent From 828d6c0ce02debe0d142ac76f92a828f9b2c8fd4 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 17:31:40 -0400 Subject: [PATCH 08/14] docs(squad): record AAA architecture decisions + kick off implementation - Merge McNulty's AAA per-client architecture (M1-M6 phasing) - Merge pre/post endpoint contracts addendum - Archive both decision docs to archive/ - Cross-note Freamon (M1-M3 in flight) and Bunk (test matrix in flight) - Capture Zack's pre/post integration directive - Append AAA kickoff note to Kima (M6 UI pending) and Sydnor (no infra changes) - Bump scoped-override-cascade SKILL confidence to medium (architecture approved) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 13 + .squad/agents/freamon/history.md | 19 + .squad/agents/kima/history.md | 25 +- .squad/agents/mcnulty/history.md | 10 + .squad/agents/sydnor/history.md | 16 + .squad/decisions.md | 40 ++ .../archive/mcnulty-aaa-per-client-arch.md | 387 +++++++++++++ ...mcnulty-aaa-pre-post-endpoint-contracts.md | 522 ++++++++++++++++++ ...-21T21-28-06Z-aaa-architecture-greenlit.md | 22 + .../2026-05-21T21-28-06Z-bunk.md | 25 + .../2026-05-21T21-28-06Z-freamon.md | 23 + .../2026-05-21T21-28-06Z-mcnulty.md | 29 + .../skills/scoped-override-cascade/SKILL.md | 63 +++ 13 files changed, 1193 insertions(+), 1 deletion(-) create mode 100644 .squad/decisions/archive/mcnulty-aaa-per-client-arch.md create mode 100644 .squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md create mode 100644 .squad/log/2026-05-21T21-28-06Z-aaa-architecture-greenlit.md create mode 100644 .squad/orchestration-log/2026-05-21T21-28-06Z-bunk.md create mode 100644 .squad/orchestration-log/2026-05-21T21-28-06Z-freamon.md create mode 100644 .squad/orchestration-log/2026-05-21T21-28-06Z-mcnulty.md create mode 100644 .squad/skills/scoped-override-cascade/SKILL.md diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index 1a29d3ca..0b53a67a 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -468,3 +468,16 @@ When writing tests for deployed infrastructure: **Action for Bunk:** Consider applying this pattern to UI component test coverage (ApiTree.tsx, AssignTemplateForm.tsx) to verify no text overflow regressions when grid resizes or content grows. Pattern examples: responsive badge placement with fixed widths, label truncation with dynamic form fields. **Context:** Kima fixed `/apis` page layout bugs (ApiTree row overflow, AssignTemplateForm param card overlap, modal horizontal scroll) by applying this Tailwind pattern systematically. Commit `3aeea053` on `seiggy/feature/apim-policy-management`. + +### 2026-05-21 — Cross-Agent Note: AAA Architecture M1-M3 Kickoff + +**From:** McNulty (Architect) → All agents +**Note:** AAA per-client access-profile architecture APPROVED by Zack. Freamon and Bunk now in-flight on parallel implementation. + +**For Bunk:** 21-test matrix in flight — Access Profile resolver cascade logic (6 levels), precheck backward compat guards (with/without apiId), log integration (AccessProfileId/PlanId context flow), template render diffs (all 5 templates), end-to-end cascade flow. + +**For Kima:** M6 UI (`/access` page) pending — will start after M3 precheck contract is firm. Page layout: client selector, API grid with per-operation drill-down, assign form with Plan/Routing/Deployment selectors, bulk assign action. + +**For Sydnor:** No new Terraform changes expected for AAA work itself — infrastructure is done. M5 template updates are pure APIM policy XML changes (not infrastructure); Sydnor may assist with template version bump and APIM SDK testing if needed. + +**Context:** Full architecture at `.squad/decisions/archive/mcnulty-aaa-per-client-arch.md` (387 lines) and pre/post contracts at `.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md` (522 lines). Decisions merged to `.squad/decisions.md` entry 2026-05-21T21:28:06Z. diff --git a/.squad/agents/freamon/history.md b/.squad/agents/freamon/history.md index 8ffaffe2..5ceb0166 100644 --- a/.squad/agents/freamon/history.md +++ b/.squad/agents/freamon/history.md @@ -37,6 +37,25 @@ API fully functional with: Backend is feature-complete and awaiting infrastructure deployment. +## 2026-05-21 — AAA Per-Client Architecture (M1-M6 Kickoff) + +**Status:** In-flight (kicked off parallel to architecture approval) + +**Assigned Work:** M1-M3 implementation (Access Profiles authorization layer) +- **M1:** AccessProfile model + Cosmos repository + IAccessProfileResolver service with cascade logic +- **M2:** Admin CRUD endpoints (`/api/access-profiles/*`) + bulk assign +- **M3:** Precheck endpoint integration — inject resolver, add apiId/operationId query param parsing, extend response with planId/accessProfileId/allowedDeployments, backward-compat fallback + +**Supporting M4-M5 (coordination with Sydnor):** +- **M4:** Log-ingest endpoint — add AccessProfileId + PlanId + ApiId + OperationId to audit trail +- **M5:** Template updates — all 5 APIM templates: add apiId/operationId variables, extend precheck URL, extract profile metadata from response, add fields to log payload, version bump 1.0→1.1 + +**Architecture:** Most-specific-wins cascade resolution: `(client+operation)` > `(client+api)` > `(client+global)` > `ClientPlanAssignment` (level 4). Backward-compatible (no breaking changes to existing clients/templates). + +**Test Coverage (Bunk):** 21 tests planned — cascade logic, backward compat, integration flows. + +**Next Phase:** UI (`/access` page, Kima) starts after M2 API contracts firm. + ## Learnings *Core learnings consolidated in Core Context section above (see git history for detailed entries).* diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index 436acf4a..e0b8c94a 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -67,4 +67,27 @@ Freamon fixed a config-binding bug in the APIM infrastructure: the env var `APIM - Modal horizontal scroll fixed with `overflow-x-hidden` on dialog container - Parameter card grid changed from `md:grid-cols-2` to `sm:grid-cols-2` for narrower modal viewport fit -**Rule:** For any flex row with text + badges: `TextLabel` \ No newline at end of file +**Rule:** For any flex row with text + badges: `TextLabel` + +## 2026-05-21 — AAA M6 UI Pending (Access Profile Admin Page) + +**Status:** Pending — awaiting M3 precheck contract finalization + +**Scope:** Build `/access` page (new admin UI for Access Profiles) + +**Layout & Components:** +- **Top:** Client selector (dropdown/search from existing `GET /api/clients`) +- **Main grid:** APIs (rows) with columns: Plan, Routing Policy, Deployments allowed, Enable toggle +- **Drill-down:** Click API row to expand operations with per-operation overrides +- **Add/Edit form:** Select Plan (dropdown from existing plans), optionally select Routing Policy, optionally restrict deployments +- **Bulk action:** "Apply to multiple APIs" — select APIs from checklist, assign same profile to all in one shot + +**Reuse:** Plan selector dropdown (already built for client assignment), Routing Policy selector (already built for Plans page) + +**Client First Workflow:** Primary user journey is "configure THIS client's access to various APIs" — not "which clients use this API". So the layout starts with client selector, then shows their API matrix. + +**Integration:** POST/PUT/DELETE via `/api/access-profiles/*` (Freamon M2). Trigger profile creation when form submits. + +**Validation:** Contract awaits M3 precheck integration (apiId/operationId handling) and M4 log-ingest (audit trail). + +**Next:** Start after M2 API contracts firm (2-3 days out). \ No newline at end of file diff --git a/.squad/agents/mcnulty/history.md b/.squad/agents/mcnulty/history.md index f548765e..f8801f16 100644 --- a/.squad/agents/mcnulty/history.md +++ b/.squad/agents/mcnulty/history.md @@ -57,6 +57,16 @@ All backend features (routing, pricing, observability) complete and tested. Infr - Storage: existing `configuration` container, new `policy-assignment` partition key document type. - Spec delivered to `.squad/decisions/inbox/mcnulty-apim-management-architecture.md`. +**2026-05-21 — AAA Per-Client Endpoint Authorization Architecture:** +- **Three-layer mental model confirmed:** Transport (APIM template → installs XML) → Authorization (Access Profiles → resolves which Plan/Routing applies) → Enforcement (Precheck → enforces quotas, rate limits, routing). Each layer is independent and composable. +- **Resolution is a cascade, not a rules engine:** Most-specific match wins (`client+operation` > `client+api` > `client+global` > `ClientPlanAssignment` fallback). No merging between levels. Deterministic, cacheable, debuggable. +- **Backward-compatible by design:** If `apiId`/`operationId` query params are absent from precheck call, resolver falls through to existing `ClientPlanAssignment` logic. Zero migration needed for existing deployments. +- **Reusable "policy-on-top-of-policy" pattern:** When adding scoped overrides to a global default, use a cascade document with composite ID (`{scope}:{entity}:{qualifier}`), point-read by ID at each level, first-match-wins. Same pattern can apply to future features (e.g., per-API pricing overrides, per-operation DLP policies). +- **Template integration is a query param addition, not structural change:** APIM has `context.Api.Id` and `context.Operation.Id` natively available. Passing them to precheck is a one-line URL append in the template. Doesn't require template re-architecture. +- Spec delivered to `.squad/decisions/inbox/mcnulty-aaa-per-client-arch.md`. +- **Endpoint contract addendum:** Pre/post endpoint integration is first-class scope. Precheck gets `apiId`/`operationId` as query params (backward-compat: absent = legacy path). Response gains `planId`/`accessProfileId`. Log endpoint gains `AccessProfileId`/`PlanId`/`ApiId`/`OperationId` fields. Profile ID flows via APIM `context.Variables` slot (precheck response → variable extraction → log payload). Resolver lives ONLY in precheck; log endpoint trusts the passed-in planId. +- Addendum spec: `.squad/decisions/inbox/mcnulty-aaa-pre-post-endpoint-contracts.md`. + *Core learnings consolidated in Core Context section above (see git history for detailed entries).* ## Archived Learnings (Pre-May 2026) diff --git a/.squad/agents/sydnor/history.md b/.squad/agents/sydnor/history.md index 441b719c..b865d4c0 100644 --- a/.squad/agents/sydnor/history.md +++ b/.squad/agents/sydnor/history.md @@ -28,6 +28,22 @@ All backend phases complete and tested (235+ tests passing): - Terraform provider configured in `azure.yaml` with variable substitution template (`main.tfvars.json`) - Authentication aligned: azd + az CLI on same tenant (99e1e9a1-3a8f-4088-ad5d-60be65ecc59a) - All services operational: Container App API, APIM gateway, Cosmos DB, Redis Enterprise, Key Vault, Log Analytics + +## 2026-05-21 — AAA Infrastructure: No Terraform Changes Expected + +**Status:** Pending — infrastructure is complete + +**Coordination Note:** AAA per-client authorization layer (M1-M6) does NOT require new Terraform changes. Infrastructure is already deployed: +- Cosmos DB (configuration container holds AccessProfiles) +- Redis (resolver caching) +- APIM with 5 base/DLP policy templates +- Aspire orchestration (API ready for new endpoints) + +**Sydnor Role in AAA:** +- **M5 Template Updates:** May assist with template version bump (1.0 → 1.1) and APIM SDK testing if needed, but templates are APIM policy XML changes, not infrastructure +- **No Breaking Changes:** AAA is fully backward-compatible; existing clients/APIs continue working without modification + +**Next:** Monitor Freamon's M1-M5 delivery; assist with template testing/deployment when M5 reaches APIM staging. - App IDs registered via Terraform (api_app_id: d5bd33f4-09b1-4602-af88-29c5ec7728e0) **Current Issue (2026-05-14): AADSTS500113 — Reply URL Mismatch** diff --git a/.squad/decisions.md b/.squad/decisions.md index ccd922f1..26edeac2 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -12,6 +12,46 @@ ## Active Decisions +### 2026-05-21T21:28:06Z: User directive — AAA access-profile architecture approved (M1-M6) +**By:** Zack Way (via McNulty proposal review) +**Status:** Approved +**What:** Zack greenlit McNulty's AAA per-client endpoint authorization architecture (Access Profiles) with recommended defaults on all 6 open questions: +- **Client Identity Model:** Keep dual pattern (Entra JWT + subscription-key) for v1; unify in v2 if needed +- **Routing Paired with Plan:** `planId` is REQUIRED (routing without quota enforcement is meaningless) +- **Client Lifecycle Drift:** (B) Orphaned profiles are harmless for v1; (C) show stale badge in UI for v2 +- **Multi-Tenancy Scope:** Global to engine (same client across APIs); Access Profiles add per-API scoping +- **Naming:** "Access Profile" (RADIUS analogy: NAS-Port=endpoint, User-Name=clientAppId:tenantId, Service-Type=Plan+Routing) +- **Default-Deny vs Default-Allow:** No API-default profiles for v1 (explicit client registration is a feature, not a bug) + +**Architecture Summary:** +- New document type: `AccessProfile` (Cosmos `configuration` container, partition key `"access-profile"`) +- Resolution cascade: `(client+operation)` > `(client+api)` > `(client+global)` > legacy `ClientPlanAssignment` (level 4) +- Backward-compatible precheck integration: `apiId`/`operationId` query params optional; fall through to legacy if absent +- New admin endpoints: `/api/access-profiles/*` (list, get, create, update, delete, bulk) +- UI: `/access` page (client selector, API grid, per-operation drill-down, assign form) — Kima, starts after M3/M4 contract firm + +**Phasing (M1-M6):** +- **M1:** AccessProfile model + Cosmos repo + IAccessProfileResolver cascade service (Freamon) +- **M2:** Admin CRUD endpoints + bulk assign (Freamon) +- **M3:** Precheck integration — apiId/operationId param parsing, extended response (planId, accessProfileId, allowedDeployments) (Freamon) +- **M4:** Log-ingest integration — flow AccessProfileId + PlanId + ApiId + OperationId through to audit trail (Freamon) +- **M5:** Template updates — all 5 APIM templates get apiId/operationId variables + precheck URL extension + profile extraction + log payload updates, version bump 1.0→1.1 (Freamon/Sydnor) +- **M6:** UI `/access` page (Kima) — parallel to M1-M5 after M2 API ready + +**Test Coverage (Bunk):** 21 tests anticipated — resolver cascade (6 levels), precheck backward compat, log integration, template render, end-to-end flow + +**Why:** This architecture enables per-client per-API policy overrides while remaining fully backward-compatible. Existing templates/clients keep working unchanged. Access Profiles sit above transport layer (APIM templates) and above enforcement layer (precheck/log) — cleanly layered authorization. + +**Files:** Archived decisions: +- `.squad/decisions/archive/mcnulty-aaa-per-client-arch.md` — Full 387-line architecture spec +- `.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md` — Full 522-line endpoint contracts addendum + +### 2026-05-21T14:16:20Z: User directive — AAA pre/post endpoint integration scope (CAPTURED) +**By:** Zack Way (via Copilot) +**Status:** Captured (merged into approved architecture) +**What:** The new AAA per-client access-profile layer MUST integrate into the pre/post (precheck + log) endpoints. The endpoints must accept the API/operation context, resolve via Access Profiles (most-specific-wins cascade), and use the resolved Plan/Routing for enforcement and accounting — not just the legacy global ClientPlanAssignment. +**Why:** Architecture scope confirmation — captured for team memory. + ### 2026-05-21T18:35:00Z: React effect callback stabilization — /apis render-loop guardrail **By:** Kima (UI Developer) **Status:** Implemented diff --git a/.squad/decisions/archive/mcnulty-aaa-per-client-arch.md b/.squad/decisions/archive/mcnulty-aaa-per-client-arch.md new file mode 100644 index 00000000..51b0c935 --- /dev/null +++ b/.squad/decisions/archive/mcnulty-aaa-per-client-arch.md @@ -0,0 +1,387 @@ +# AAA Per-Client Endpoint Authorization Architecture + +**Author:** McNulty (Lead / Architect) +**Date:** 2026-05-21 +**Status:** Proposal — awaiting Zack approval +**Requested by:** Zack Way + +--- + +## TL;DR — Consequential Decisions + +1. **New "Access Profile" document type.** Per-client, per-endpoint policy bindings stored in Cosmos `configuration` container with partition key `"access-profile"`. +2. **Resolution is a lookup, not a rules engine.** Most-specific match wins: `(client + operation)` > `(client + api)` > `(client + global)` > endpoint default. Deterministic, cacheable, no regex. +3. **Strictly additive to M1-M4 APIM template work.** `PolicyAssignment` doesn't change. Access Profiles sit *above* it — they resolve which Plan/Routing to enforce *when the request arrives*, not what XML is installed. +4. **Client identity is the existing `clientAppId:tenantId` composite key.** No new identity abstraction in v1. Both Entra JWT clients and subscription-key clients use the same key (sub-key clients already use `context.Subscription.Name:key-based`). +5. **AAA naming: "Access Profile."** RADIUS analogy: NAS-Port = API endpoint, User-Name = clientAppId:tenantId, Service-Type = Plan+Routing policy. The Access Profile is the RADIUS Access-Accept record. + +--- + +## 1. Mental Model Confirmation + +The three-layer architecture is correct: + +| Layer | Responsibility | What Ships Here | +|-------|---------------|-----------------| +| **Transport** (APIM template assignment) | Install XML that calls our REST endpoints with `clientAppId`, `tenantId`, `deploymentId`, body | M1-M4 `PolicyAssignment` — already built. Installs the "wiring." | +| **Authorization** (Access Profiles — THIS PROPOSAL) | Given `(client, apiId, operationId)`, resolve WHICH Plan and WHICH Routing policy apply | New `AccessProfile` doc. New resolution service. New admin endpoints. | +| **Enforcement** (Precheck / Log Ingest) | Given a resolved Plan+Routing, enforce rate limits, quotas, deployment access, route rewrites | Existing `PrecheckEndpoints.cs`, `RoutingEvaluator.cs`, `LogIngestEndpoints.cs` — unchanged. | + +**Key insight from reading the code:** Today the enforcement layer gets its Plan from `ClientPlanAssignment.PlanId` — a flat, global binding. There is no concept of "this client gets Plan X for API-A but Plan Y for API-B." The `ClientPlanAssignment` is a **global** client-to-plan binding. Access Profiles add the **endpoint-scoped** override layer on top of that global default. + +--- + +## 2. Data Model — Access Profile + +### Document Shape + +```json +{ + "id": "ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}", + "partitionKey": "access-profile", + "clientAppId": "my-app-client-id", + "tenantId": "contoso-tenant-id", + "apiId": "azure-openai-jwt-based-api", + "operationId": null, + "planId": "enterprise-plan", + "routingPolicyId": "cost-optimized-routing", + "allowedDeployments": ["gpt-4o", "gpt-4o-mini"], + "enabled": true, + "createdAt": "2026-05-21T10:00:00Z", + "updatedAt": "2026-05-21T10:00:00Z", + "createdBy": "admin@contoso.com" +} +``` + +### Fields + +| Field | Type | Required | Description | +|-------|------|----------|-------------| +| `id` | string | auto | Composite key: `ap:{clientAppId}:{tenantId}:{apiId}:{operationId\|_all}` | +| `partitionKey` | string | auto | Always `"access-profile"` | +| `clientAppId` | string | yes | From JWT `azp`/`appid` or APIM subscription name | +| `tenantId` | string | yes | From JWT `tid` or `"key-based"` for sub-key clients | +| `apiId` | string | yes | APIM API ID (matches `PolicyAssignment.ApiId`). Use `"_global"` for a client-wide default. | +| `operationId` | string? | no | APIM operation ID. Null = applies to entire API. | +| `planId` | string | yes | Reference to `PlanData.Id` — the billing/quota plan for this scope | +| `routingPolicyId` | string? | no | Reference to `ModelRoutingPolicy.Id`. Null = inherit from plan. | +| `allowedDeployments` | string[] | no | Override deployment allowlist. Empty = inherit from plan. | +| `enabled` | bool | yes | Toggle without deleting | +| `createdAt` | DateTime | auto | | +| `updatedAt` | DateTime | auto | | +| `createdBy` | string | auto | Admin UPN | + +### Cosmos Strategy + +- **Container:** existing `configuration` (same as Plans, Routing Policies, PolicyAssignments) +- **Partition key:** `"access-profile"` — all profiles in one logical partition for efficient cross-client queries (admin page lists all) +- **ID format:** deterministic composite → enables point reads (fastest Cosmos operation) +- **Expected volume:** low hundreds to low thousands. Single-partition is fine at this scale. + +### Relationship Diagram + +``` +ClientPlanAssignment (global default) + └── planId → PlanData + └── modelRoutingPolicyOverride → ModelRoutingPolicy + +AccessProfile (endpoint-scoped override) ← NEW + └── planId → PlanData + └── routingPolicyId → ModelRoutingPolicy + └── clientAppId:tenantId → ClientPlanAssignment (same client) + └── apiId → PolicyAssignment.ApiId (same API) + +PolicyAssignment (APIM template) + └── templateId → which XML is installed + └── apiId → APIM API +``` + +--- + +## 3. Resolution Algorithm + +### Lookup Order (most-specific wins) + +When `/api/precheck/{clientAppId}/{tenantId}` is called, the engine resolves the effective policy through this cascade: + +1. **Operation-specific profile:** `ap:{clientAppId}:{tenantId}:{apiId}:{operationId}` — exact operation override +2. **API-specific profile:** `ap:{clientAppId}:{tenantId}:{apiId}:_all` — covers all operations on this API +3. **Global client profile:** `ap:{clientAppId}:{tenantId}:_global:_all` — client-wide default +4. **Legacy fallback:** `ClientPlanAssignment.PlanId` — today's global binding (backward-compatible) + +**First match wins. No merging. No inheritance between levels.** A profile at level 1 completely determines Plan + Routing for that request; we don't partially inherit from level 4. + +### How Precheck Gets `apiId` and `operationId` + +**This is the critical integration point.** Today the APIM templates pass `clientAppId` and `tenantId` to precheck. They do NOT pass `apiId` or `operationId`. + +**Required change:** The APIM policy templates must pass `apiId` and `operationId` as query parameters to precheck: + +``` +/api/precheck/{clientAppId}/{tenantId}?deploymentId=gpt-4o&apiId=my-api&operationId=chat-completions +``` + +The APIM policy has access to `context.Api.Id` and `context.Operation.Id` natively. We add two `set-variable` statements and append them to the precheck URL. **This is a template parameter change, not a structural change** — existing templates continue to work without `apiId`/`operationId` (precheck falls through to level 4). + +### Fall-Through Behavior + +- If no Access Profile matches AND no `ClientPlanAssignment` exists → **401 Unauthorized** (today's behavior, unchanged) +- If no Access Profile matches BUT `ClientPlanAssignment` exists → use global plan (today's behavior, unchanged) +- If Access Profile exists with `enabled: false` → skip it, fall to next level + +### Caching Strategy + +- **Redis:** Cache resolved Access Profiles with key `access-profile:{clientAppId}:{tenantId}:{apiId}:{operationId}` and 30-second TTL (same as routing policy cache today) +- **In-memory:** Hot path uses `ConcurrentDictionary` with 30-second TTL (same pattern as `RoutingPolicyCache` in PrecheckEndpoints.cs) +- **Invalidation:** Admin writes invalidate Redis key immediately. In-memory cache expires naturally (30s staleness is acceptable for admin config changes) + +### Worked Example + +**Setup:** +- Client `app-123:tenant-456` has a global `ClientPlanAssignment` with `PlanId = "basic-plan"` +- An `AccessProfile` exists: `ap:app-123:tenant-456:openai-api:_all` → `planId = "premium-plan", routingPolicyId = "fast-routing"` +- Another `AccessProfile`: `ap:app-123:tenant-456:openai-api:chat-completions` → `planId = "unlimited-plan"` + +**Request 1:** Client calls `openai-api`, operation `embeddings` +- Check level 1: `ap:app-123:tenant-456:openai-api:embeddings` → not found +- Check level 2: `ap:app-123:tenant-456:openai-api:_all` → **MATCH** → use `premium-plan` + `fast-routing` + +**Request 2:** Client calls `openai-api`, operation `chat-completions` +- Check level 1: `ap:app-123:tenant-456:openai-api:chat-completions` → **MATCH** → use `unlimited-plan`, no routing override (null → inherit from plan) + +**Request 3:** Client calls `internal-api`, operation `anything` +- Check level 1: not found +- Check level 2: not found +- Check level 3: `ap:app-123:tenant-456:_global:_all` → not found (none created) +- Check level 4: `ClientPlanAssignment.PlanId = "basic-plan"` → **FALLBACK** → use `basic-plan` + +--- + +## 4. API Surface + +All endpoints require `AdminPolicy` authorization. Prefix: `/api/access-profiles`. + +``` +GET /api/access-profiles + Query: ?clientAppId=&tenantId=&apiId= (all optional filters) + → 200: { profiles: AccessProfile[] } + +GET /api/access-profiles/{profileId} + → 200: AccessProfile + → 404 + +POST /api/access-profiles + Body: { clientAppId, tenantId, apiId, operationId?, planId, routingPolicyId?, allowedDeployments?, enabled? } + → 201: AccessProfile + → 409: if exact scope already exists + → 400: if planId doesn't exist + +PUT /api/access-profiles/{profileId} + Body: { planId?, routingPolicyId?, allowedDeployments?, enabled? } + → 200: AccessProfile + → 404 + +DELETE /api/access-profiles/{profileId} + → 204 + → 404 + +POST /api/access-profiles/bulk + Body: { profiles: [{ clientAppId, tenantId, apiId, operationId?, planId, ... }] } + → 200: { created: int, failed: [{ index, error }] } + (Admin assigns same plan to multiple clients for an API in one shot) +``` + +### Client List Source + +Clients already exist as `ClientPlanAssignment` documents (partition key `"client"`). The existing `GET /api/clients` endpoint returns them. **No new client identity management needed** — Access Profiles reference the same `clientAppId:tenantId` key that's already in the system. + +### Resolution Endpoint (internal — called by precheck, not exposed to UI) + +The resolution logic lives in a new `IAccessProfileResolver` service injected into `PrecheckEndpoints`. No separate HTTP endpoint needed — it's an internal service call on the hot path. + +--- + +## 5. UI Implications (Kima) + +### Recommended Approach: New `/access` Page + +**Not** bolted onto the existing `/apis` page (which is about APIM template assignment — transport layer). Access Profiles are a different concern (authorization layer). + +**Layout:** +- **Top:** Client selector (dropdown/search of existing clients from `GET /api/clients`) +- **Main grid:** APIs (rows) × Columns: Plan, Routing Policy, Deployments, Status toggle +- **Drill-down:** Click an API row to expand operations with per-operation overrides +- **Action:** "Assign" button opens a form: select Plan (from existing Plans), optionally select Routing Policy, optionally restrict deployments +- **Bulk action:** "Apply to multiple APIs" — select APIs from checklist, assign same profile + +**Alternative considered:** Tree view (APIs → operations → clients). Rejected because the primary workflow is "configure THIS client's access to various APIs" — client-first, not API-first. + +**Reuse:** The Plan selector dropdown already exists in the client assignment flow. The Routing Policy selector exists in the Plans page. Kima reuses both. + +--- + +## 6. Intersection with In-Flight APIM Template Work + +### `PolicyAssignment` — NO CHANGES NEEDED + +The `PolicyAssignment` doc type is purely about which XML template is installed on which APIM API. Access Profiles are orthogonal — they don't change what's installed, they change what gets resolved *when the installed policy calls precheck*. + +### Template XML — MINOR ADDITIVE CHANGE + +The APIM policy templates need to pass `apiId` and `operationId` to the precheck URL. This is a **backward-compatible addition**: + +```xml + + + + + +@((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + + (string)context.Variables["clientAppId"] + "/" + + (string)context.Variables["tenantId"] + + "?deploymentId=" + (string)context.Variables["deploymentId"] + + "&apiId=" + (string)context.Variables["apiId"] + + "&operationId=" + (string)context.Variables["operationId"]) +``` + +**This does NOT require re-applying existing templates.** If `apiId`/`operationId` are absent from the precheck request, the resolver falls through to the global `ClientPlanAssignment` (level 4). Old templates keep working. + +### Timing + +This work can start immediately. It doesn't block the in-flight PR #32. The template changes ship in the NEXT template version bump (existing `templateVersion: "1.0"` keeps working). + +--- + +## 7. Open Questions for Zack + +### Q1: Client Identity Model + +Today we have two patterns: +- **Entra JWT clients:** `clientAppId` = JWT `azp`/`appid`, `tenantId` = JWT `tid` +- **Subscription-key clients:** `clientAppId` = APIM subscription name, `tenantId` = `"key-based"` + +**Do we keep this dual model, or unify into an engine-owned "client" abstraction?** + +My recommendation: Keep the dual model for v1. It works, the precheck endpoint already handles both, and adding an abstraction layer adds complexity without solving a real problem today. Unify in v2 if we need cross-auth-type client identity. + +### Q2: Are Routing Rules Always Paired with a Plan? + +Can an Access Profile specify ONLY a routing policy override without changing the Plan? Or must every profile have a `planId`? + +My recommendation: `planId` is REQUIRED. A routing policy without quota/rate enforcement is meaningless in this engine — you'd get unlimited access. If Zack wants "same plan, different routing," the profile specifies the same planId explicitly + a different routingPolicyId. + +### Q3: Drift Detection — Client Lifecycle + +If an APIM subscription is deleted (sub-key client) or an Entra app registration is removed, do we: +- (A) Auto-detect and prune Access Profiles? (Requires periodic reconciliation job) +- (B) Leave orphaned profiles (they're inert — precheck just won't match them)? +- (C) Show "stale" badge in UI but don't auto-delete? + +My recommendation: (B) for v1 — orphans are harmless. (C) for v2 — surface it in the UI with a manual "clean up" button. + +### Q4: Multi-Tenancy — Client Scope + +Is a client (`clientAppId:tenantId`) global to the engine, or scoped per-API? + +Based on current code: **Global.** A `ClientPlanAssignment` exists once and the same client can hit any API the template is installed on. Access Profiles add per-API scoping on top. This means the same client can have different plans for different APIs — which is exactly what Zack asked for. + +**Confirm:** Is there ever a case where the same `clientAppId:tenantId` should mean different things on different APIs? (E.g., subscription "team-alpha" on API-A is a different logical client than "team-alpha" on API-B?) If yes, we need a namespace. I assume no. + +### Q5: AAA Naming + +Options in RADIUS vernacular: +- **Access Profile** (my recommendation) — maps to RADIUS Access-Accept: "here's what this user gets on this port" +- **Service Authorization** — more formal, maps to Service-Type attribute +- **Network Access Policy** — too generic, confusable with APIM policy +- **Authorization Binding** — accurate but not evocative + +**Recommendation: "Access Profile."** Short. Clear. Scannable in UI. Maps cleanly to RADIUS mental model. + +### Q6: Default-Deny vs Default-Allow + +When a client hits an API that has NO Access Profile AND NO global `ClientPlanAssignment`: +- Today: **401 Unauthorized** (deny) +- Should we add a concept of "API default profile" — a profile with `clientAppId = "_default"` that any unrecognized client falls into? + +My recommendation: No for v1. Explicit client registration is a feature, not a bug. If Zack wants "open" APIs, they simply don't install the precheck template on those APIs. + +--- + +## 8. Phasing + +| Milestone | Scope | Agent | Depends On | +|-----------|-------|-------|------------| +| **M1** | `AccessProfile` model + Cosmos repository + `IAccessProfileResolver` service with cascade logic | Freamon | None (additive) | +| **M2** | CRUD endpoints (`/api/access-profiles/*`) + bulk assign | Freamon | M1 | +| **M3** | Precheck integration — modify `PrecheckEndpoints.cs` to call resolver when `apiId` query param present; fall through to existing behavior when absent | Freamon | M1 | +| **M4** | Template update — add `apiId`/`operationId` to precheck URL in all 5 templates + version bump | Freamon/Sydnor | M3 (needs endpoint ready to receive) | +| **M5** | UI — `/access` page: client selector, API grid, per-operation drill-down, assign form | Kima | M2 (needs CRUD API) | +| **M6** | Redis caching for resolver hot path + invalidation on write | Freamon | M3 | + +### Thin Slice (M1-M3): Ship without UI + +M1-M3 can ship behind the existing admin API. The resolver works on the hot path. Admins can CRUD profiles via API (Postman/curl). Kima builds UI in parallel. + +**Total estimate:** M1-M3 is ~2 days of Freamon work. M4 is trivial (template string changes). M5 is Kima's standard page (3-4 days based on prior pages). + +--- + +## 9. Risks & Non-Goals + +### Risks + +| Risk | Severity | Mitigation | +|------|----------|------------| +| Precheck latency increase (4 Cosmos reads in worst case) | Medium | Redis cache + in-memory cache. Realistic case is 1-2 reads (most clients have a level 2 or level 4 match). Point reads by ID are <5ms in Cosmos. | +| Stale cache serves wrong plan for up to 30s after admin change | Low | Acceptable for admin config changes. Add "changes take up to 30 seconds to propagate" note in UI. | +| Access Profiles with invalid `planId` references | Low | Validate on write (POST/PUT checks plan exists). UI uses dropdown populated from existing plans. | +| Schema migration — existing `ClientPlanAssignment` still works? | None | Fully backward-compatible. If no Access Profile matches, precheck uses `ClientPlanAssignment` exactly as today. Zero migration needed. | + +### Non-Goals (v1) + +- **Dynamic policy evaluation** (rules engine, ABAC, time-based policies) — out of scope. Deterministic lookups only. +- **Client self-service** — clients can't request their own access. Admin-only. +- **Audit trail for profile changes** — reuse existing audit log pattern but don't build a dedicated "who changed what" timeline. +- **Cross-APIM-instance profiles** — 1:1 engine-to-APIM, same as PolicyAssignment. +- **Inheritance/merging between levels** — first match wins, no partial override. Keeps resolution simple and debuggable. +- **Rate limit pooling across APIs** — each API enforces its own limits independently, even if same plan is used. + +--- + +## Appendix A: Precheck Flow (After This Work) + +``` +APIM policy XML (transport) + → calls /api/precheck/{clientAppId}/{tenantId}?deploymentId=X&apiId=Y&operationId=Z + +PrecheckEndpoints.cs: + 1. If apiId present → call AccessProfileResolver.ResolveAsync(clientAppId, tenantId, apiId, operationId) + → Returns (planId, routingPolicyId, allowedDeployments) or null + 2. If resolver returns a match → load PlanData by resolved planId + 3. If resolver returns null → fall back to ClientPlanAssignment.PlanId (today's path) + 4. Continue with existing quota/rate-limit/routing enforcement (unchanged) +``` + +## Appendix B: IAccessProfileResolver Interface + +```csharp +public interface IAccessProfileResolver +{ + /// + /// Resolves the effective access profile for a client+endpoint tuple. + /// Returns null if no profile matches (caller should fall back to ClientPlanAssignment). + /// + Task ResolveAsync( + string clientAppId, string tenantId, + string apiId, string? operationId, + CancellationToken ct = default); +} + +public sealed class ResolvedAccess +{ + public string PlanId { get; set; } = string.Empty; + public string? RoutingPolicyId { get; set; } + public List AllowedDeployments { get; set; } = []; + public string SourceProfileId { get; set; } = string.Empty; // for audit/debug +} +``` diff --git a/.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md b/.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md new file mode 100644 index 00000000..1511a167 --- /dev/null +++ b/.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md @@ -0,0 +1,522 @@ +# AAA Per-Client Architecture — Pre/Post Endpoint Contracts Addendum + +**Author:** McNulty (Lead / Architect) +**Date:** 2026-05-21 +**Status:** Addendum to approved architecture +**Parent:** `.squad/decisions/inbox/mcnulty-aaa-per-client-arch.md` + +--- + +## 1. Precheck Endpoint — Request Contract + +### Current Signature + +``` +GET /api/precheck/{clientAppId}/{tenantId}?deploymentId={deploymentId} +``` + +Route params: `clientAppId`, `tenantId` +Query params: `deploymentId` + +### New Signature (backward-compatible additions) + +``` +GET /api/precheck/{clientAppId}/{tenantId}?deploymentId={deploymentId}&apiId={apiId}&operationId={operationId}&subscriptionId={subscriptionId} +``` + +New query parameters: + +| Param | Source in APIM XML | Required | Fallback if absent | +|-------|-------------------|----------|-------------------| +| `apiId` | `context.Api.Id` | No | Resolver skips to level 4 (legacy `ClientPlanAssignment`) | +| `operationId` | `context.Operation.Id` | No | Resolver treats as `_all` (API-level match) | +| `subscriptionId` | `context.Subscription.Id` | No | Not used for resolution — informational for audit/correlation only | + +### Backward Compatibility + +The endpoint handler checks for `apiId` presence: + +```csharp +var apiId = context.Request.Query["apiId"].ToString(); +var operationId = context.Request.Query["operationId"].ToString(); + +if (!string.IsNullOrEmpty(apiId)) +{ + // NEW PATH: Access Profile resolution + var resolved = await accessProfileResolver.ResolveAsync( + clientAppId, tenantId, apiId, + string.IsNullOrEmpty(operationId) ? null : operationId, ct); + + if (resolved is not null) + { + // Use resolved.PlanId, resolved.RoutingPolicyId, resolved.AllowedDeployments + plan = await planRepo.GetAsync(resolved.PlanId); + // ... continue with resolved plan + } + else + { + // No profile matched → fall through to ClientPlanAssignment (existing code) + } +} +else +{ + // LEGACY PATH: no apiId means old template, use ClientPlanAssignment directly (unchanged) +} +``` + +**Zero breaking change.** Existing templates without `apiId`/`operationId` params hit the exact same code path as today. + +### Why `subscriptionId` (not `subscriptionName`) + +For sub-key flows, `clientAppId` is already set to `context.Subscription.Name` in the template XML. The `subscriptionId` is additive metadata — useful for: +- Audit: correlate requests to the APIM subscription resource +- Future: lifecycle events (subscription revoked → log it) + +It does NOT participate in resolution. Resolution key is always `clientAppId:tenantId`. + +--- + +## 2. Precheck Endpoint — Response Contract + +### Current Response (200 OK) + +```json +{ + "status": "authorized", + "clientAppId": "my-app", + "tenantId": "contoso-tid", + "plan": "Enterprise Plan", + "usage": 45000, + "limit": 100000, + "currentRpm": 5, + "rpmLimit": 60, + "currentTpm": 1200, + "tpmLimit": 10000, + "routedDeployment": "gpt-4o-eastus", + "requestedDeployment": "gpt-4o", + "routingPolicyId": "cost-optimized" +} +``` + +### New Response (200 OK) — additive fields + +```json +{ + "status": "authorized", + "clientAppId": "my-app", + "tenantId": "contoso-tid", + "plan": "Enterprise Plan", + "planId": "enterprise-plan", + "usage": 45000, + "limit": 100000, + "currentRpm": 5, + "rpmLimit": 60, + "currentTpm": 1200, + "tpmLimit": 10000, + "routedDeployment": "gpt-4o-eastus", + "requestedDeployment": "gpt-4o", + "routingPolicyId": "cost-optimized", + "accessProfileId": "ap:my-app:contoso-tid:openai-api:_all", + "allowedDeployments": ["gpt-4o", "gpt-4o-mini"] +} +``` + +**New fields:** + +| Field | Type | When present | Purpose | +|-------|------|-------------|---------| +| `planId` | string | Always (new) | Machine-readable plan identifier. Today we only return `plan` (display name). Templates need the ID for the log payload. | +| `accessProfileId` | string? | When resolved via Access Profile (levels 1-3) | Identifies which profile authorized this request. Null when falling through to legacy `ClientPlanAssignment`. | +| `allowedDeployments` | string[]? | When profile or plan has restrictions | Effective deployment allowlist after resolution. Null = unrestricted. | + +**Deny responses (401/403/429) — no change.** The existing error shapes are sufficient. Adding one optional field to denials: + +```json +{ + "error": "Client not authorized — no plan assigned", + "clientAppId": "my-app", + "tenantId": "contoso-tid", + "accessProfileId": null, + "deniedBy": "no-profile-no-assignment" +} +``` + +`deniedBy` values: `"no-profile-no-assignment"`, `"profile-disabled"`, `"quota-exceeded"`, `"rate-limit"`, `"deployment-denied"`, `"routing-denied"`. Purely informational for debugging. + +--- + +## 3. Log Endpoint — Request Contract + +### Current `LogIngestRequest` Fields + +```csharp +TenantId, ClientAppId, Audience, DeploymentId, +RequestBody, ResponseBody, RoutingPolicyId, CorrelationId +``` + +### New Fields (additive) + +```csharp +/// Access Profile ID that authorized this request (from precheck response). +public string? AccessProfileId { get; set; } + +/// Plan ID resolved for this request (from precheck response). +public string? PlanId { get; set; } + +/// APIM API ID for endpoint-scoped accounting. +public string? ApiId { get; set; } + +/// APIM Operation ID for operation-scoped accounting. +public string? OperationId { get; set; } +``` + +### How Profile ID Flows: Precheck → APIM → Log + +**Mechanism: APIM `context.Variables` slot.** + +The template already parses the precheck response body to extract `routedDeployment` and `routingPolicyId`. We add extraction of `accessProfileId` and `planId` using the same pattern: + +```xml + +(preserveContent: true)["accessProfileId"]?.ToString())" /> +(preserveContent: true)["planId"]?.ToString())" /> +``` + +Then in the outbound `send-one-way-request` log payload: + +```xml +payload.Add(new Newtonsoft.Json.Linq.JProperty("accessProfileId", + context.Variables.GetValueOrDefault("accessProfileId") ?? "")); +payload.Add(new Newtonsoft.Json.Linq.JProperty("planId", + context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); +payload.Add(new Newtonsoft.Json.Linq.JProperty("apiId", context.Api.Id)); +payload.Add(new Newtonsoft.Json.Linq.JProperty("operationId", context.Operation.Id)); +``` + +**Key:** `apiId` and `operationId` in the log payload come directly from `context.Api.Id` / `context.Operation.Id` — NOT from the precheck response. The precheck response carries the resolved profile metadata; the APIM context carries the request topology. + +### `AuditLogItem` Additions + +```csharp +public string? AccessProfileId { get; set; } +public string? ResolvedPlanId { get; set; } +public string? ApiId { get; set; } +public string? OperationId { get; set; } +``` + +These flow into the Cosmos audit document for full traceability. + +--- + +## 4. Resolver Placement — Decision + +**Decision: Resolve at precheck time only. Log endpoint trusts the profile id passed from APIM.** + +Justification: + +| Approach | Pros | Cons | +|----------|------|------| +| **Resolve at precheck only** (recommended) | Single hot-path lookup. Log endpoint is fire-and-forget (no blocking Cosmos read). Keeps outbound latency at zero additional hops. | If profile id is missing/corrupted in log payload, we lose the association. | +| Resolve at both precheck and log | Resilient to data loss in transit. | Doubles Cosmos reads on the hot path. Log endpoint already has the client lock contention concern — adding a resolver read compounds latency. | + +**Resilience strategy for missing profile id at log time:** + +```csharp +// In LogIngestEndpoints.cs: +// If accessProfileId is absent (old template or data loss), +// fall back to ClientPlanAssignment.PlanId for accounting (existing behavior). +var effectivePlanId = !string.IsNullOrEmpty(ingestRequest.PlanId) + ? ingestRequest.PlanId + : clientAssignment.PlanId; +``` + +The log endpoint uses `PlanId` from the request body to load the correct plan for cost calculation. If absent, it uses `ClientPlanAssignment.PlanId` (today's behavior). **No second resolver call needed.** + +**The log endpoint does NOT need the full resolver service.** It only needs to read `ingestRequest.PlanId` (a string) and call `planRepo.GetAsync(planId)`. The resolver is injected ONLY into `PrecheckEndpoints`. + +--- + +## 5. Migration / Coexistence — Decision + +**Decision: Single `IAccessProfileResolver` call that internally falls through to `ClientPlanAssignment`.** + +The resolver encapsulates the entire cascade. Precheck calls it once and gets either a `ResolvedAccess` or `null`. If `null`, precheck uses `ClientPlanAssignment` directly (the existing code path, untouched). + +### Implementation Shape + +```csharp +// In PrecheckEndpoints.cs — the new integration point: + +var apiId = context.Request.Query["apiId"].ToString(); +var operationId = context.Request.Query["operationId"].ToString(); + +// Resolution +string effectivePlanId; +string? effectiveRoutingPolicyId; +List? effectiveAllowedDeployments; +string? accessProfileId = null; + +if (!string.IsNullOrEmpty(apiId)) +{ + var resolved = await resolver.ResolveAsync(clientAppId, tenantId, apiId, + string.IsNullOrEmpty(operationId) ? null : operationId); + + if (resolved is not null) + { + effectivePlanId = resolved.PlanId; + effectiveRoutingPolicyId = resolved.RoutingPolicyId; + effectiveAllowedDeployments = resolved.AllowedDeployments; + accessProfileId = resolved.SourceProfileId; + } + else + { + // No profile → use legacy path + effectivePlanId = assignment.PlanId; + effectiveRoutingPolicyId = assignment.ModelRoutingPolicyOverride; + effectiveAllowedDeployments = assignment.AllowedDeployments; + } +} +else +{ + // Old template, no apiId → entirely legacy path (zero behavior change) + effectivePlanId = assignment.PlanId; + effectiveRoutingPolicyId = assignment.ModelRoutingPolicyOverride; + effectiveAllowedDeployments = assignment.AllowedDeployments; +} + +// Load plan using effectivePlanId (rest of precheck unchanged) +var plan = await planRepo.GetAsync(effectivePlanId); +``` + +### Why NOT merge the resolver into `ClientPlanAssignment` lookup: + +- **Separation of concerns.** `ClientPlanAssignment` is a billing/usage tracking entity (it has `CurrentPeriodUsage`, `OverbilledTokens`, etc.). Access Profiles are authorization entities. They serve different purposes. +- **The `ClientPlanAssignment` MUST still exist** for usage tracking regardless of Access Profile. Even when a profile overrides the plan, the usage counters live on `ClientPlanAssignment`. The profile says "which plan," but the assignment tracks "how much used." +- **Clean dependency:** `IAccessProfileResolver` depends only on `IRepository` + Redis cache. It has no knowledge of usage, billing periods, or rate limits. + +### Log Endpoint — No Resolver Change + +The log endpoint continues to load `ClientPlanAssignment` by `{clientAppId}:{tenantId}` (for usage counters). It loads the plan using `ingestRequest.PlanId` (from precheck→template→log) or falls back to `clientAssignment.PlanId`: + +```csharp +// Log endpoint plan resolution (line ~112-118 today): +var planId = !string.IsNullOrEmpty(ingestRequest.PlanId) + ? ingestRequest.PlanId + : clientAssignment.PlanId; +var plan = await planRepo.GetAsync(planId); +``` + +This is the **only change** to `LogIngestEndpoints.cs` — a two-line modification. The rest of the accounting logic (quota tracking, overbilling, multiplier billing) uses the resolved `plan` variable exactly as today. + +--- + +## 6. Template Updates — Exact Diffs + +All 5 templates need the same mechanical changes. Here's the diff for `entra-jwt-ai/policy.xml` (representative — others are identical in the affected sections): + +### Inbound: Add `apiId`/`operationId` variables (after existing claim extraction) + +```xml + + + +``` + +### Inbound: Append to precheck URL + +```xml + +@((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + + +@((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"] + "&apiId=" + (string)context.Variables["apiId"] + "&operationId=" + (string)context.Variables["operationId"]) +``` + +### Inbound: Extract profile metadata from precheck response (after existing `routedDeployment` extraction) + +```xml + +(preserveContent: true)["accessProfileId"]?.ToString())" /> +(preserveContent: true)["planId"]?.ToString())" /> +``` + +### Outbound: Add fields to log payload (in the `send-one-way-request` body) + +```xml + +payload.Add(new Newtonsoft.Json.Linq.JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); +payload.Add(new Newtonsoft.Json.Linq.JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); +payload.Add(new Newtonsoft.Json.Linq.JProperty("apiId", context.Variables.GetValueOrDefault("apiId") ?? "")); +payload.Add(new Newtonsoft.Json.Linq.JProperty("operationId", context.Variables.GetValueOrDefault("operationId") ?? "")); +``` + +### Template Version Bump + +All 5 `template.json` files: `"version": "1.0"` → `"version": "1.1"`. + +No new parameters needed — `apiId`/`operationId` come from APIM context (free), not from user-supplied config. + +### Which Templates Need Updates + +| Template | Precheck? | Log? | Update needed? | +|----------|-----------|------|----------------| +| `entra-jwt-ai` | ✅ | ✅ | **Yes** — all 4 diffs above | +| `entra-jwt-ai-dlp` | ✅ | ✅ | **Yes** — all 4 diffs above | +| `subscription-key-ai` | ✅ | ✅ | **Yes** — all 4 diffs above | +| `subscription-key-ai-dlp` | ✅ | ✅ | **Yes** — all 4 diffs above | +| `entra-jwt-rest` | ❌ (uses native APIM limits) | ✅ (has log-rest) | **Yes** — outbound log diff only + add `apiId`/`operationId` variables | + +--- + +## 7. Updated Milestone Breakdown + +| Milestone | Scope | Agent | Depends On | Delta from prior spec | +|-----------|-------|-------|------------|----------------------| +| **M1** | `AccessProfile` model + `CosmosAccessProfileRepository` + `IAccessProfileResolver` with cascade + Redis cache layer | Freamon | None | Unchanged | +| **M2** | CRUD endpoints (`/api/access-profiles/*`) + bulk assign | Freamon | M1 | Unchanged | +| **M3** | **Precheck integration:** Inject resolver, add `apiId`/`operationId` query param parsing, extend response with `planId`/`accessProfileId`/`allowedDeployments`, backward-compat guard | Freamon | M1 | **Expanded** — now includes response contract changes | +| **M4** | **Log integration:** Add `AccessProfileId`/`PlanId`/`ApiId`/`OperationId` to `LogIngestRequest` + `AuditLogItem`, use `ingestRequest.PlanId` for plan resolution with fallback | Freamon | M3 | **NEW milestone** — split from old M4 | +| **M5** | **Template updates:** All 5 templates get the 4-diff treatment (variables, precheck URL, profile extraction, log payload). Version bump to 1.1. | Freamon/Sydnor | M3, M4 (endpoint must accept new fields before templates send them) | **Expanded** — was "trivial", now mechanical but spans 5 files | +| **M6** | UI — `/access` page: client selector, API grid, per-operation drill-down, assign form | Kima | M2 | Unchanged (renumbered from M5) | +| **M7** | Redis caching optimization — if M1's cache layer needs tuning after load test | Freamon | M3 | Unchanged (renumbered from M6) | + +### Critical Path + +``` +M1 → M2 (UI can start) +M1 → M3 → M4 → M5 (backend pipeline, strictly serial) +M2 → M6 (UI, parallel to M3-M5) +``` + +**Thin slice for validation:** M1 + M3 + M5 (one template). Gets the full request pipeline working end-to-end for one template. M2/M4/M6 are additive. + +--- + +## 8. Test Surface + +### Resolver Unit Tests (Bunk — M1) + +| Test Case | Input | Expected | +|-----------|-------|----------| +| Operation-level match | `clientAppId=X, tenantId=Y, apiId=A, operationId=O` with profile at level 1 | Returns level 1 profile | +| API-level match (no op match) | Same client, different operationId, profile only at level 2 | Returns level 2 profile | +| Global client match | Client+API with no profiles, but `_global:_all` exists | Returns level 3 profile | +| No match → null | Client with no profiles at any level | Returns null | +| Disabled profile skipped | Level 1 exists but `enabled=false`, level 2 exists | Returns level 2 | +| First-match-wins (no merge) | Profiles at level 1 and 2 with different planIds | Returns level 1 only | + +### Precheck Integration Tests (Bunk — M3) + +| Test Case | Setup | Assert | +|-----------|-------|--------| +| Legacy path (no apiId) | Call precheck without `apiId` param | Uses `ClientPlanAssignment.PlanId`, response has no `accessProfileId` | +| Profile-resolved path | Call with `apiId`, Access Profile exists | Response contains `accessProfileId`, `planId` from profile | +| Profile fallback to legacy | Call with `apiId`, no profile matches | Uses `ClientPlanAssignment.PlanId`, `accessProfileId` is null | +| Profile with routing override | Profile has `routingPolicyId` | `routedDeployment` in response reflects profile's routing, not plan's | +| Profile with deployment restriction | Profile has `allowedDeployments: ["gpt-4o"]`, request asks for `gpt-35` | 403 with `deniedBy: "deployment-denied"` | +| Disabled profile cascade | Op-level profile disabled, API-level exists | Uses API-level profile | + +### Log Integration Tests (Bunk — M4) + +| Test Case | Setup | Assert | +|-----------|-------|--------| +| Log with profile id | `LogIngestRequest` includes `accessProfileId`, `planId` | Audit log item has both fields, plan loaded from `planId` | +| Log without profile id (legacy) | `LogIngestRequest` omits `accessProfileId`/`planId` | Falls back to `ClientPlanAssignment.PlanId` (today's behavior) | +| Log with mismatched planId | `ingestRequest.PlanId` references non-existent plan | Falls back to `clientAssignment.PlanId` | +| Audit item carries all fields | Full log with profile id, apiId, operationId | `AuditLogItem` persisted with all 4 new fields | + +### Template Render Tests (Bunk — M5) + +| Test Case | Assert | +|-----------|--------| +| Rendered XML contains `apiId` variable extraction | `? AllowedDeployments); +``` + +This is optional cleanup — the anonymous object works fine — but having a named type helps Bunk write assertions and Kima consume the API. + +--- + +## Appendix: Request Flow Diagram (After All Milestones) + +``` +Client → APIM (XML template v1.1) + │ + ├─ INBOUND: + │ set-variable: apiId = context.Api.Id + │ set-variable: operationId = context.Operation.Id + │ send-request → GET /api/precheck/{clientAppId}/{tenantId} + │ ?deploymentId=X&apiId=Y&operationId=Z + │ ← 200: { planId, accessProfileId, routedDeployment, ... } + │ set-variable: accessProfileId = response.accessProfileId + │ set-variable: resolvedPlanId = response.planId + │ (routing rewrite if needed — unchanged) + │ + ├─ BACKEND: → Azure OpenAI (or non-AI backend) + │ + └─ OUTBOUND: + send-one-way-request → POST /api/log + { clientAppId, tenantId, deploymentId, + accessProfileId, planId, apiId, operationId, + routingPolicyId, correlationId, responseBody } + ← (fire-and-forget) +``` diff --git a/.squad/log/2026-05-21T21-28-06Z-aaa-architecture-greenlit.md b/.squad/log/2026-05-21T21-28-06Z-aaa-architecture-greenlit.md new file mode 100644 index 00000000..77fc3977 --- /dev/null +++ b/.squad/log/2026-05-21T21-28-06Z-aaa-architecture-greenlit.md @@ -0,0 +1,22 @@ +# Session: AAA Architecture Greenlit +**Date:** 2026-05-21T21:28:06Z +**Participants:** McNulty (architecture), Zack Way (approval), Freamon + Bunk (M1-M3 in flight) +**Branch:** seiggy/feature/apim-policy-management + +## Summary +Zack greenlit McNulty's AAA per-client access-profile architecture. Approved all 6 open questions with recommended defaults: +- Access Profile cascade resolution (most-specific-wins) +- Backward-compatible precheck/log-ingest endpoint contracts +- New Access Profile document in Cosmos configuration container +- Integration into existing enforcement layer (no breaking changes) + +Freamon and Bunk kicked off M1-M3 implementation in parallel (repository, resolution service, endpoint integration, test matrix). + +## Status +✅ Proposal approved → M1-M3 in flight + +## Next +- Freamon: M1-M3 delivery on feature/apim-policy-management +- Bunk: 21-test matrix in parallel +- McNulty: Architect M4+ (Access Profile admin UI) +- Kima: M6 UI pending; will start after M3/M4 contract is firm diff --git a/.squad/orchestration-log/2026-05-21T21-28-06Z-bunk.md b/.squad/orchestration-log/2026-05-21T21-28-06Z-bunk.md new file mode 100644 index 00000000..a42c8368 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-28-06Z-bunk.md @@ -0,0 +1,25 @@ +# Orchestration: Bunk @ 2026-05-21T21:28:06Z + +## Agent +**Bunk** — Test coverage / QA + +## Scope +AAA test matrix (21 tests anticipated) — anticipatory coverage for M1-M3 Access Profile endpoints. + +## Status +🔄 IN-FLIGHT — kicked off parallel to architecture review + Freamon implementation + +## Expected Coverage +- Access Profile repository CRUD + edge cases +- Resolution cascade logic (6 levels: client+op, client+api, client+global, plan defaults, endpoint defaults, deny) +- Precheck endpoint with/without apiId/operationId params (backward compatibility) +- Log-ingest endpoint Access Profile context capture +- Integration: Cosmos persistence, cache invalidation, concurrent updates + +## Parallel with +- McNulty: Architecture approval (now complete) +- Freamon: M1-M3 implementation (now in flight) + +## Notes +- Will build on existing endpoint test patterns in ChargebackApiFactory +- Reuse FakeRedis + NSubstitute patterns diff --git a/.squad/orchestration-log/2026-05-21T21-28-06Z-freamon.md b/.squad/orchestration-log/2026-05-21T21-28-06Z-freamon.md new file mode 100644 index 00000000..d3c71dd4 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-28-06Z-freamon.md @@ -0,0 +1,23 @@ +# Orchestration: Freamon @ 2026-05-21T21:28:06Z + +## Agent +**Freamon** — Backend / APIM management + +## Scope +M1-M3 implementation (AAA Access Profiles integration into precheck + log-ingest endpoints). + +## Status +🔄 IN-FLIGHT — kicked off parallel to McNulty architecture review + +## Parallel with +- McNulty: Architecture proposal (now approved) +- Bunk: Test matrix (now in flight) + +## Expected Deliverables +- M1: Access Profile repository + resolution service +- M2: Precheck endpoint integration (backward-compatible apiId/operationId params) +- M3: Log-ingest endpoint integration (Access Profile context in audit trail) + +## Notes +- Uses McNulty's architecture specs from approved inbox documents +- Freamon will track on feature/apim-policy-management diff --git a/.squad/orchestration-log/2026-05-21T21-28-06Z-mcnulty.md b/.squad/orchestration-log/2026-05-21T21-28-06Z-mcnulty.md new file mode 100644 index 00000000..06def562 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-28-06Z-mcnulty.md @@ -0,0 +1,29 @@ +# Orchestration: McNulty @ 2026-05-21T21:28:06Z + +## Agent +**McNulty** — Architecture lead / AAA design + +## Scope +AAA per-client endpoint authorization architecture proposal + pre/post contract addendum. + +## Changes +- **mcnulty-aaa-per-client-arch.md**: Complete architecture proposal with 6 open questions and design decisions +- **mcnulty-aaa-pre-post-endpoint-contracts.md**: Pre/post endpoint integration contract with backward-compatible request/response schemas +- **SKILL:** `.squad/skills/scoped-override-cascade/SKILL.md` — reusable pattern for cascade resolution + +## Key Decisions +- New Access Profile document type (Cosmos `configuration` container) +- Resolution: most-specific-wins cascade `(client+op) > (client+api) > (client+global) > default` +- Backward-compatible endpoint contracts (apiId/operationId optional query params) +- Precheck and log-ingest get Access Profile context; enforcement layer unchanged + +## Artifacts +- Proposal: `.squad/decisions/inbox/mcnulty-aaa-per-client-arch.md` +- Contracts: `.squad/decisions/inbox/mcnulty-aaa-pre-post-endpoint-contracts.md` +- Directive: `.squad/decisions/inbox/copilot-directive-2026-05-21T14-16-20Z.md` + +## Status +✅ Complete (turn 0 main proposal + turn 1 pre/post addendum) + +## Reviewer +Zack Way (greenlit with recommended defaults on all 6 open questions) diff --git a/.squad/skills/scoped-override-cascade/SKILL.md b/.squad/skills/scoped-override-cascade/SKILL.md new file mode 100644 index 00000000..bf022e3e --- /dev/null +++ b/.squad/skills/scoped-override-cascade/SKILL.md @@ -0,0 +1,63 @@ +# Scoped Override Cascade + +**Confidence:** medium +**Validated:** AAA Access Profile architecture independently reviewed and approved by Zack (2026-05-21) + +## When to Use + +When adding **per-scope overrides** on top of a **global default** setting. Examples: +- Per-client-per-endpoint Plan/Routing overrides (Access Profiles) +- Per-API pricing overrides on top of global pricing +- Per-operation DLP policy overrides on top of API-level settings + +## Pattern + +### Data Model +- **Composite ID:** `{prefix}:{entity}:{scope1}:{scope2}:{scopeN|_all}` +- **Single partition key:** All override docs in one logical partition for admin queries +- **Deterministic IDs:** Enable point-reads (fastest Cosmos operation) at each cascade level + +### Resolution Algorithm +1. Define an ordered list of scopes from most-specific to least-specific +2. For each level, attempt a point-read by composite ID +3. **First match wins** — no merging, no inheritance between levels +4. Final fallback = existing global default (backward-compatible) + +### Key Principles +- **No merging:** A match at level N completely determines the result. Don't partially inherit from level N+1. +- **Backward-compatible:** If the scoping context is absent (e.g., query params not passed), the resolver skips all levels and falls through to the existing global behavior. +- **Cacheable:** Deterministic ID → Redis/in-memory cache by exact key, 30s TTL. +- **Auditable:** Store `sourceProfileId` in resolution result so logs show which override was used. + +### Cosmos Implementation + +```csharp +public interface IScopedOverrideResolver +{ + Task ResolveAsync(params string[] scopeValues); +} +``` + +Resolution attempts point-reads in order: +``` +{prefix}:{entity}:{scope1}:{scope2} → most specific +{prefix}:{entity}:{scope1}:_all → scope1 only +{prefix}:{entity}:_global:_all → entity-wide default +null → fall through to legacy +``` + +### Anti-Patterns +- ❌ Merging fields from multiple levels (complex, hard to debug) +- ❌ Regex or glob matching in scope values (unpredictable, uncacheable) +- ❌ Implicit defaults (always require explicit creation of override docs) +- ❌ Dynamic evaluation order (levels are fixed at design time) + +## Examples in This Codebase + +- `AccessProfile` (access-profile partition): resolves Plan+Routing per `(client, api, operation)` +- Future: per-API pricing tiers, per-operation content policies + +## Related Skills + +- `cosmos-repository-pattern` — base repository class used for storage +- `additive-feature-extension` — how to add features without breaking existing behavior From 3d409d2428354d036d50b3cd478d033109763ee7 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 17:41:06 -0400 Subject: [PATCH 09/14] Implement AAA access profiles Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../decisions/inbox/freamon-aaa-m1-m3-impl.md | 31 ++ .../Endpoints/AccessProfileEndpoints.cs | 336 +++++++++++++++ .../Endpoints/LogIngestEndpoints.cs | 15 +- .../Endpoints/PrecheckEndpoints.cs | 390 +++++++++++++++--- .../Models/AccessProfile.cs | 35 ++ .../Models/AccessProfileRequests.cs | 26 ++ .../Models/AccessProfileResponses.cs | 32 ++ .../Models/AuditLogDocument.cs | 12 + src/AIPolicyEngine.Api/Models/AuditLogItem.cs | 4 + .../Models/LogIngestRequest.cs | 12 + src/AIPolicyEngine.Api/Models/UsageData.cs | 12 + src/AIPolicyEngine.Api/Program.cs | 4 + .../AccessProfiles/AccessProfileResolver.cs | 63 +++ .../CosmosAccessProfileRepository.cs | 155 +++++++ .../IAccessProfileRepository.cs | 12 + .../AccessProfiles/IAccessProfileResolver.cs | 8 + .../Services/AuditLogWriter.cs | 6 +- 17 files changed, 1083 insertions(+), 70 deletions(-) create mode 100644 .squad/decisions/inbox/freamon-aaa-m1-m3-impl.md create mode 100644 src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs create mode 100644 src/AIPolicyEngine.Api/Models/AccessProfile.cs create mode 100644 src/AIPolicyEngine.Api/Models/AccessProfileRequests.cs create mode 100644 src/AIPolicyEngine.Api/Models/AccessProfileResponses.cs create mode 100644 src/AIPolicyEngine.Api/Services/AccessProfiles/AccessProfileResolver.cs create mode 100644 src/AIPolicyEngine.Api/Services/AccessProfiles/CosmosAccessProfileRepository.cs create mode 100644 src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileRepository.cs create mode 100644 src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileResolver.cs diff --git a/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md b/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md new file mode 100644 index 00000000..9f1a4de6 --- /dev/null +++ b/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md @@ -0,0 +1,31 @@ +# Freamon AAA Access Profiles M1-M3 implementation + +## Scope delivered +- Added `AccessProfile` data model and Cosmos-backed repository on the shared `configuration` container with partition key `access-profile` and deterministic IDs `ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}`. +- Added `IAccessProfileResolver` + resolver cascade for operation-specific, API-wide, client-global, then legacy client assignment fallback. +- Added admin CRUD and bulk endpoints under `/api/access-profiles` guarded by `AdminPolicy`. +- Integrated access-profile-aware precheck behavior with optional `apiId`, `operationId`, and additive response fields `planId`, `accessProfileId`, and `allowedDeployments`. +- Extended log-ingest contracts to accept optional `accessProfileId`, `planId`, `apiId`, and `operationId`, and persisted those fields to the audit stream. + +## Key implementation decisions +- Kept client metering on `ClientPlanAssignment`; access profiles resolve plan/routing/deployment policy, but precheck still requires a client assignment for quota/rate-limit state so log-ingest and precheck stay aligned. +- Preserved legacy precheck callers when `apiId` is absent, while still surfacing `planId` for additive contract compatibility. +- Used plan inheritance semantics for `routingPolicyId` and `allowedDeployments`: access-profile overrides win when populated; otherwise plan defaults apply. +- Did not touch APIM template XML or UI code. + +## Main files +- `src/AIPolicyEngine.Api/Models/AccessProfile*.cs` +- `src/AIPolicyEngine.Api/Services/AccessProfiles/*` +- `src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs` +- `src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs` +- `src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs` +- `src/AIPolicyEngine.Api/Models/LogIngestRequest.cs` +- `src/AIPolicyEngine.Api/Models/AuditLogItem.cs` +- `src/AIPolicyEngine.Api/Models/AuditLogDocument.cs` +- `src/AIPolicyEngine.Api/Services/AuditLogWriter.cs` +- `src/AIPolicyEngine.Api/Program.cs` + +## Validation +- `dotnet build src\AIPolicyEngine.Api\AIPolicyEngine.Api.csproj --nologo` +- `dotnet test src\AIPolicyEngine.Tests\AIPolicyEngine.Tests.csproj --no-restore --nologo` +- Test run passed locally: 311 succeeded, 8 skipped. diff --git a/src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs b/src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs new file mode 100644 index 00000000..5464e87c --- /dev/null +++ b/src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs @@ -0,0 +1,336 @@ +using System.Security.Claims; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using AIPolicyEngine.Api.Services.AccessProfiles; + +namespace AIPolicyEngine.Api.Endpoints; + +public static class AccessProfileEndpoints +{ + public static IEndpointRouteBuilder MapAccessProfileEndpoints(this IEndpointRouteBuilder routes) + { + routes.MapGet("/api/access-profiles", ListProfiles) + .WithName("ListAccessProfiles") + .WithDescription("List access profiles") + .RequireAuthorization("AdminPolicy") + .Produces(); + + routes.MapGet("/api/access-profiles/{profileId}", GetProfile) + .WithName("GetAccessProfile") + .WithDescription("Get a specific access profile") + .RequireAuthorization("AdminPolicy") + .Produces() + .Produces(StatusCodes.Status404NotFound); + + routes.MapPost("/api/access-profiles", CreateProfile) + .WithName("CreateAccessProfile") + .WithDescription("Create an access profile") + .RequireAuthorization("AdminPolicy") + .Produces(StatusCodes.Status201Created) + .Produces(StatusCodes.Status400BadRequest) + .Produces(StatusCodes.Status409Conflict) + .Produces(StatusCodes.Status500InternalServerError); + + routes.MapPut("/api/access-profiles/{profileId}", UpdateProfile) + .WithName("UpdateAccessProfile") + .WithDescription("Update an access profile") + .RequireAuthorization("AdminPolicy") + .Produces() + .Produces(StatusCodes.Status400BadRequest) + .Produces(StatusCodes.Status404NotFound) + .Produces(StatusCodes.Status500InternalServerError); + + routes.MapDelete("/api/access-profiles/{profileId}", DeleteProfile) + .WithName("DeleteAccessProfile") + .WithDescription("Delete an access profile") + .RequireAuthorization("AdminPolicy") + .Produces(StatusCodes.Status204NoContent) + .Produces(StatusCodes.Status404NotFound); + + routes.MapPost("/api/access-profiles/bulk", BulkCreateProfiles) + .WithName("BulkCreateAccessProfiles") + .WithDescription("Create access profiles in bulk") + .RequireAuthorization("AdminPolicy") + .Produces() + .Produces(StatusCodes.Status400BadRequest) + .Produces(StatusCodes.Status500InternalServerError); + + return routes; + } + + private static async Task ListProfiles( + string? clientAppId, + string? tenantId, + string? apiId, + IAccessProfileRepository repository, + ILogger logger) + { + try + { + var profiles = await repository.ListAsync(clientAppId, tenantId, apiId); + logger.LogInformation("Fetched {Count} access profiles", profiles.Count); + return Results.Json(new AccessProfilesResponse { Profiles = profiles }, JsonConfig.Default); + } + catch (Exception ex) + { + logger.LogError(ex, "Error fetching access profiles"); + return Results.Json(new { error = "Failed to fetch access profiles" }, statusCode: StatusCodes.Status500InternalServerError); + } + } + + private static async Task GetProfile( + string profileId, + IAccessProfileRepository repository, + ILogger logger) + { + try + { + var profile = await repository.GetAsync(profileId); + if (profile is null) + return Results.NotFound(new { error = $"Access profile '{profileId}' not found" }); + + return Results.Json(profile, JsonConfig.Default); + } + catch (Exception ex) + { + logger.LogError(ex, "Error fetching access profile {ProfileId}", profileId); + return Results.Json(new { error = "Failed to fetch access profile" }, statusCode: StatusCodes.Status500InternalServerError); + } + } + + private static async Task CreateProfile( + AccessProfileCreateRequest body, + ClaimsPrincipal user, + IAccessProfileRepository repository, + IRepository planRepository, + IRepository routingPolicyRepository, + ILogger logger) + { + try + { + var buildResult = await BuildProfileAsync(body, repository, planRepository, routingPolicyRepository, user, logger); + if (buildResult.ErrorResult is not null) + return buildResult.ErrorResult; + + var persisted = await repository.UpsertAsync(buildResult.Profile!); + logger.LogInformation("Access profile created: {ProfileId}", persisted.Id); + return Results.Json(persisted, JsonConfig.Default, statusCode: StatusCodes.Status201Created); + } + catch (Exception ex) + { + logger.LogError(ex, "Error creating access profile"); + return Results.Json(new { error = "Failed to create access profile" }, statusCode: StatusCodes.Status500InternalServerError); + } + } + + private static async Task UpdateProfile( + string profileId, + AccessProfileUpdateRequest body, + IAccessProfileRepository repository, + IRepository planRepository, + IRepository routingPolicyRepository, + ILogger logger) + { + try + { + var profile = await repository.GetAsync(profileId); + if (profile is null) + return Results.NotFound(new { error = $"Access profile '{profileId}' not found" }); + + if (body.PlanId is not null) + { + if (string.IsNullOrWhiteSpace(body.PlanId)) + return Results.BadRequest("planId cannot be empty"); + + var planExists = await planRepository.GetAsync(body.PlanId.Trim()); + if (planExists is null) + return Results.BadRequest($"Plan '{body.PlanId.Trim()}' not found"); + + profile.PlanId = body.PlanId.Trim(); + } + + if (body.RoutingPolicyId is not null) + { + var normalizedRoutingPolicyId = NormalizeOptional(body.RoutingPolicyId); + if (normalizedRoutingPolicyId is not null) + { + var routingPolicy = await routingPolicyRepository.GetAsync(normalizedRoutingPolicyId); + if (routingPolicy is null) + return Results.BadRequest($"Routing policy '{normalizedRoutingPolicyId}' not found"); + } + + profile.RoutingPolicyId = normalizedRoutingPolicyId; + } + + if (body.AllowedDeployments is not null) + profile.AllowedDeployments = NormalizeAllowedDeployments(body.AllowedDeployments); + + if (body.Enabled.HasValue) + profile.Enabled = body.Enabled.Value; + + profile.UpdatedAt = DateTime.UtcNow; + var persisted = await repository.UpsertAsync(profile); + logger.LogInformation("Access profile updated: {ProfileId}", persisted.Id); + return Results.Json(persisted, JsonConfig.Default); + } + catch (Exception ex) + { + logger.LogError(ex, "Error updating access profile {ProfileId}", profileId); + return Results.Json(new { error = "Failed to update access profile" }, statusCode: StatusCodes.Status500InternalServerError); + } + } + + private static async Task DeleteProfile( + string profileId, + IAccessProfileRepository repository, + ILogger logger) + { + try + { + var deleted = await repository.DeleteAsync(profileId); + if (!deleted) + return Results.NotFound(new { error = $"Access profile '{profileId}' not found" }); + + logger.LogInformation("Access profile deleted: {ProfileId}", profileId); + return Results.NoContent(); + } + catch (Exception ex) + { + logger.LogError(ex, "Error deleting access profile {ProfileId}", profileId); + return Results.Json(new { error = "Failed to delete access profile" }, statusCode: StatusCodes.Status500InternalServerError); + } + } + + private static async Task BulkCreateProfiles( + BulkAccessProfilesRequest body, + ClaimsPrincipal user, + IAccessProfileRepository repository, + IRepository planRepository, + IRepository routingPolicyRepository, + ILogger logger) + { + if (body.Profiles is null || body.Profiles.Count == 0) + return Results.BadRequest("At least one profile is required"); + + var response = new BulkAccessProfilesResponse(); + + for (var index = 0; index < body.Profiles.Count; index++) + { + try + { + var buildResult = await BuildProfileAsync(body.Profiles[index], repository, planRepository, routingPolicyRepository, user, logger); + if (buildResult.ErrorResult is not null) + { + response.Failed.Add(new BulkAccessProfileFailure + { + Index = index, + Error = buildResult.ErrorMessage ?? "Access profile request failed", + ProfileId = buildResult.ProfileId + }); + continue; + } + + await repository.UpsertAsync(buildResult.Profile!); + response.Created++; + } + catch (Exception ex) + { + logger.LogError(ex, "Error creating access profile at bulk index {Index}", index); + response.Failed.Add(new BulkAccessProfileFailure + { + Index = index, + Error = "Failed to create access profile", + ProfileId = TryBuildProfileId(body.Profiles[index]) + }); + } + } + + return Results.Json(response, JsonConfig.Default); + } + + private static async Task BuildProfileAsync( + AccessProfileCreateRequest body, + IAccessProfileRepository repository, + IRepository planRepository, + IRepository routingPolicyRepository, + ClaimsPrincipal user, + ILogger logger) + { + if (string.IsNullOrWhiteSpace(body.ClientAppId) || + string.IsNullOrWhiteSpace(body.TenantId) || + string.IsNullOrWhiteSpace(body.ApiId) || + string.IsNullOrWhiteSpace(body.PlanId)) + { + return new BuildProfileResult(Results.BadRequest("clientAppId, tenantId, apiId, and planId are required"), "clientAppId, tenantId, apiId, and planId are required", null, TryBuildProfileId(body)); + } + + var planId = body.PlanId.Trim(); + var plan = await planRepository.GetAsync(planId); + if (plan is null) + return new BuildProfileResult(Results.BadRequest($"Plan '{planId}' not found"), $"Plan '{planId}' not found", null, TryBuildProfileId(body)); + + var routingPolicyId = NormalizeOptional(body.RoutingPolicyId); + if (routingPolicyId is not null) + { + var routingPolicy = await routingPolicyRepository.GetAsync(routingPolicyId); + if (routingPolicy is null) + return new BuildProfileResult(Results.BadRequest($"Routing policy '{routingPolicyId}' not found"), $"Routing policy '{routingPolicyId}' not found", null, TryBuildProfileId(body)); + } + + var apiId = body.ApiId.Trim(); + var operationId = NormalizeOptional(body.OperationId); + var profileId = AccessProfile.BuildId(body.ClientAppId, body.TenantId, apiId, operationId); + + var existing = await repository.GetAsync(profileId); + if (existing is not null) + return new BuildProfileResult(Results.Conflict(new { error = $"Access profile '{profileId}' already exists" }), $"Access profile '{profileId}' already exists", null, profileId); + + var now = DateTime.UtcNow; + var profile = new AccessProfile + { + ClientAppId = body.ClientAppId.Trim(), + TenantId = body.TenantId.Trim(), + ApiId = apiId, + OperationId = operationId, + PlanId = planId, + RoutingPolicyId = routingPolicyId, + AllowedDeployments = NormalizeAllowedDeployments(body.AllowedDeployments), + Enabled = body.Enabled, + CreatedBy = GetActor(user), + CreatedAt = now, + UpdatedAt = now + }; + + return new BuildProfileResult(null, null, profile, profileId); + } + + private static List NormalizeAllowedDeployments(IEnumerable? allowedDeployments) + => (allowedDeployments ?? []) + .Where(static deployment => !string.IsNullOrWhiteSpace(deployment)) + .Select(static deployment => deployment.Trim()) + .Distinct(StringComparer.OrdinalIgnoreCase) + .ToList(); + + private static string? NormalizeOptional(string? value) + => string.IsNullOrWhiteSpace(value) ? null : value.Trim(); + + private static string GetActor(ClaimsPrincipal user) + => user.FindFirstValue("preferred_username") + ?? user.FindFirstValue(ClaimTypes.Upn) + ?? user.Identity?.Name + ?? "unknown"; + + private static string? TryBuildProfileId(AccessProfileCreateRequest body) + { + if (string.IsNullOrWhiteSpace(body.ClientAppId) || + string.IsNullOrWhiteSpace(body.TenantId) || + string.IsNullOrWhiteSpace(body.ApiId)) + { + return null; + } + + return AccessProfile.BuildId(body.ClientAppId, body.TenantId, body.ApiId, NormalizeOptional(body.OperationId)); + } + + private sealed record BuildProfileResult(IResult? ErrorResult, string? ErrorMessage, AccessProfile? Profile, string? ProfileId); +} diff --git a/src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs b/src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs index fceff859..921f54d9 100644 --- a/src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs +++ b/src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs @@ -110,11 +110,14 @@ private static async Task IngestLog( } // --- 2. Plan Lookup --- - var plan = await planRepo.GetAsync(clientAssignment.PlanId); + var resolvedPlanId = string.IsNullOrWhiteSpace(ingestRequest.PlanId) + ? clientAssignment.PlanId + : ingestRequest.PlanId.Trim(); + var plan = await planRepo.GetAsync(resolvedPlanId); if (plan is null) { - logger.LogError("Plan not found: {PlanId} for client {ClientAppId}", clientAssignment.PlanId, ingestRequest.ClientAppId); - return Results.Json(new { error = "Plan configuration not found" }, statusCode: StatusCodes.Status500InternalServerError); + logger.LogError("Plan not found: {PlanId} for client {ClientAppId}", resolvedPlanId, ingestRequest.ClientAppId); + return Results.Json(new { error = "Plan configuration not found", planId = resolvedPlanId }, statusCode: StatusCodes.Status500InternalServerError); } // --- 3. Update rate limit meters (outbound — record actual token usage) --- @@ -307,7 +310,11 @@ await db.LockExtendAsync( EffectiveRequestCost = effectiveRequestCost, TierName = tierName, MultiplierOverageCost = multiplierOverageCost, - CorrelationId = ingestRequest.CorrelationId + CorrelationId = ingestRequest.CorrelationId, + AccessProfileId = ingestRequest.AccessProfileId, + PlanId = string.IsNullOrWhiteSpace(ingestRequest.PlanId) ? clientAssignment?.PlanId : ingestRequest.PlanId?.Trim(), + ApiId = ingestRequest.ApiId, + OperationId = ingestRequest.OperationId }); return Results.Ok("Log data processed and stored successfully"); diff --git a/src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs b/src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs index 7cbd595a..82d0fed3 100644 --- a/src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs +++ b/src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs @@ -2,6 +2,7 @@ using System.Text.Json; using AIPolicyEngine.Api.Models; using AIPolicyEngine.Api.Services; +using AIPolicyEngine.Api.Services.AccessProfiles; using StackExchange.Redis; namespace AIPolicyEngine.Api.Endpoints; @@ -35,11 +36,136 @@ private static async Task Precheck( IRepository clientRepo, IRepository planRepo, IRepository routingPolicyRepo, + IAccessProfileResolver accessProfileResolver, IUsagePolicyStore usagePolicyStore, IConnectionMultiplexer redis, ILogger logger) { - // 1. Checkclient assignment exists (reads from Redis cache via CachedRepository) + var requestedDeploymentId = context.Request.Query["deploymentId"].ToString(); + var apiId = NormalizeOptional(context.Request.Query["apiId"].ToString()); + var operationId = NormalizeOptional(context.Request.Query["operationId"].ToString()); + _ = context.Request.Query["subscriptionId"].ToString(); + + if (apiId is null) + { + return await LegacyPrecheck( + clientAppId, + tenantId, + requestedDeploymentId, + clientRepo, + planRepo, + routingPolicyRepo, + usagePolicyStore, + redis); + } + + var resolved = await accessProfileResolver.ResolveAsync(clientAppId, tenantId, apiId, operationId); + var assignment = await clientRepo.GetAsync($"{clientAppId}:{tenantId}"); + if (resolved is null) + { + if (assignment is null) + { + return Results.Json( + new + { + error = "Client not authorized — no access profile or plan assigned", + clientAppId, + tenantId, + apiId, + operationId, + deniedBy = "no-profile-no-assignment" + }, + statusCode: StatusCodes.Status401Unauthorized); + } + + var fallbackPlan = await planRepo.GetAsync(assignment.PlanId); + if (fallbackPlan is null) + { + logger.LogError("Plan not found during AAA fallback precheck: {PlanId} for client {ClientAppId}/{TenantId}", assignment.PlanId, clientAppId, tenantId); + return Results.Json( + new { error = "Plan configuration not found", planId = assignment.PlanId }, + statusCode: StatusCodes.Status500InternalServerError); + } + + var fallbackAllowedDeployments = assignment.AllowedDeployments is { Count: > 0 } + ? assignment.AllowedDeployments + : fallbackPlan.AllowedDeployments; + + return await EvaluatePrecheck( + clientAppId, + tenantId, + requestedDeploymentId, + assignment, + fallbackPlan, + assignment.ModelRoutingPolicyOverride ?? fallbackPlan.ModelRoutingPolicyId, + fallbackAllowedDeployments, + routingPolicyRepo, + usagePolicyStore, + redis, + includeAccessProfileMetadata: true, + resolvedPlanId: assignment.PlanId, + accessProfileId: null, + apiId: apiId, + operationId: operationId); + } + + if (assignment is null) + { + return Results.Json( + new + { + error = "Client not authorized — no plan assigned", + clientAppId, + tenantId, + apiId, + operationId, + accessProfileId = resolved.AccessProfileId, + deniedBy = "no-client-assignment" + }, + statusCode: StatusCodes.Status401Unauthorized); + } + + var plan = await planRepo.GetAsync(resolved.PlanId); + if (plan is null) + { + logger.LogError("Plan not found during AAA precheck: {PlanId} for client {ClientAppId}/{TenantId}", resolved.PlanId, clientAppId, tenantId); + return Results.Json( + new { error = "Plan configuration not found", planId = resolved.PlanId, accessProfileId = resolved.AccessProfileId }, + statusCode: StatusCodes.Status500InternalServerError); + } + + var effectiveAllowedDeployments = resolved.AllowedDeployments is { Count: > 0 } + ? resolved.AllowedDeployments + : plan.AllowedDeployments; + + return await EvaluatePrecheck( + clientAppId, + tenantId, + requestedDeploymentId, + assignment, + plan, + resolved.RoutingPolicyId ?? plan.ModelRoutingPolicyId, + effectiveAllowedDeployments, + routingPolicyRepo, + usagePolicyStore, + redis, + includeAccessProfileMetadata: true, + resolvedPlanId: resolved.PlanId, + accessProfileId: resolved.AccessProfileId, + apiId: apiId, + operationId: operationId); + } + + private static async Task LegacyPrecheck( + string clientAppId, + string tenantId, + string requestedDeploymentId, + IRepository clientRepo, + IRepository planRepo, + IRepository routingPolicyRepo, + IUsagePolicyStore usagePolicyStore, + IConnectionMultiplexer redis) + { var clientId = $"{clientAppId}:{tenantId}"; var assignment = await clientRepo.GetAsync(clientId); if (assignment is null) @@ -49,7 +175,6 @@ private static async Task Precheck( statusCode: StatusCodes.Status401Unauthorized); } - // 2. Check plan exists (reads from Redis cache via CachedRepository) var plan = await planRepo.GetAsync(assignment.PlanId); if (plan is null) { @@ -58,45 +183,85 @@ private static async Task Precheck( statusCode: StatusCodes.Status500InternalServerError); } - // 3. Routing evaluation — determine effective deployment - var requestedDeploymentId = context.Request.Query["deploymentId"].ToString(); + var effectiveAllowedDeployments = assignment.AllowedDeployments is { Count: > 0 } + ? assignment.AllowedDeployments + : plan.AllowedDeployments; + + return await EvaluatePrecheck( + clientAppId, + tenantId, + requestedDeploymentId, + assignment, + plan, + assignment.ModelRoutingPolicyOverride ?? plan.ModelRoutingPolicyId, + effectiveAllowedDeployments, + routingPolicyRepo, + usagePolicyStore, + redis, + includeAccessProfileMetadata: false, + resolvedPlanId: assignment.PlanId); + } + + private static async Task EvaluatePrecheck( + string clientAppId, + string tenantId, + string requestedDeploymentId, + ClientPlanAssignment assignment, + PlanData plan, + string? effectivePolicyId, + IReadOnlyCollection effectiveAllowedDeployments, + IRepository routingPolicyRepo, + IUsagePolicyStore usagePolicyStore, + IConnectionMultiplexer redis, + bool includeAccessProfileMetadata, + string resolvedPlanId, + string? accessProfileId = null, + string? apiId = null, + string? operationId = null) + { string? routedDeploymentId = null; string? routingPolicyId = null; - var effectivePolicyId = assignment.ModelRoutingPolicyOverride ?? plan.ModelRoutingPolicyId; if (!string.IsNullOrEmpty(effectivePolicyId) && !string.IsNullOrEmpty(requestedDeploymentId)) { routingPolicyId = effectivePolicyId; var policy = await GetCachedRoutingPolicy(effectivePolicyId, routingPolicyRepo); - var routingResult = RoutingEvaluator.Evaluate(requestedDeploymentId, policy); if (!routingResult.IsAllowed) { - return Results.Json( - new { error = "Deployment denied by routing policy", deploymentId = requestedDeploymentId, routingPolicyId }, - statusCode: StatusCodes.Status403Forbidden); + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Deployment denied by routing policy", + deploymentId = requestedDeploymentId, + routingPolicyId, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "routing-denied" + }, + statusCode: StatusCodes.Status403Forbidden) + : Results.Json( + new { error = "Deployment denied by routing policy", deploymentId = requestedDeploymentId, routingPolicyId }, + statusCode: StatusCodes.Status403Forbidden); } if (routingResult.WasRouted) routedDeploymentId = routingResult.DeploymentId; } - // effectiveDeployment = routed target (if routing changed it), otherwise what was requested var effectiveDeployment = routedDeploymentId ?? requestedDeploymentId; - // 4. Check billing period rollover (read-only in precheck to avoid write-side effects) var usagePolicy = await usagePolicyStore.GetAsync(); var currentDateUtc = DateTime.UtcNow; var expectedPeriodStart = BillingPeriodCalculator.GetCurrentPeriodStartUtc(currentDateUtc, usagePolicy.BillingCycleStartDay); - var newBillingPeriod = - assignment.CurrentPeriodStart != expectedPeriodStart; + var newBillingPeriod = assignment.CurrentPeriodStart != expectedPeriodStart; var effectiveUsage = newBillingPeriod ? 0 : assignment.CurrentPeriodUsage; - var effectiveDeploymentUsage = newBillingPeriod - ? new Dictionary() - : assignment.DeploymentUsage; + var effectiveDeploymentUsage = newBillingPeriod ? new Dictionary() : assignment.DeploymentUsage; - // 5. Check quota (with per-deployment support) — uses effective (routed) deployment if (!plan.RollUpAllDeployments) { if (!string.IsNullOrEmpty(effectiveDeployment) && plan.DeploymentQuotas.TryGetValue(effectiveDeployment, out var deploymentLimit)) @@ -104,35 +269,73 @@ private static async Task Precheck( var deploymentUsage = effectiveDeploymentUsage.GetValueOrDefault(effectiveDeployment, 0); if (deploymentUsage >= deploymentLimit && !plan.AllowOverbilling) { - return Results.Json( - new { error = "Per-deployment quota exceeded", deploymentId = effectiveDeployment, usage = deploymentUsage, limit = deploymentLimit }, - statusCode: StatusCodes.Status429TooManyRequests); + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Per-deployment quota exceeded", + deploymentId = effectiveDeployment, + usage = deploymentUsage, + limit = deploymentLimit, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "quota-exceeded" + }, + statusCode: StatusCodes.Status429TooManyRequests) + : Results.Json( + new { error = "Per-deployment quota exceeded", deploymentId = effectiveDeployment, usage = deploymentUsage, limit = deploymentLimit }, + statusCode: StatusCodes.Status429TooManyRequests); } } } - else + else if (effectiveUsage >= plan.MonthlyTokenQuota && !plan.AllowOverbilling) { - if (effectiveUsage >= plan.MonthlyTokenQuota && !plan.AllowOverbilling) - { - return Results.Json( + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Quota exceeded", + usage = effectiveUsage, + limit = plan.MonthlyTokenQuota, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "quota-exceeded" + }, + statusCode: StatusCodes.Status429TooManyRequests) + : Results.Json( new { error = "Quota exceeded", usage = effectiveUsage, limit = plan.MonthlyTokenQuota }, statusCode: StatusCodes.Status429TooManyRequests); - } } - // Check multiplier request quota if (plan.UseMultiplierBilling && plan.MonthlyRequestQuota > 0) { var effectiveRequests = newBillingPeriod ? 0 : assignment.CurrentPeriodRequests; if (effectiveRequests >= plan.MonthlyRequestQuota && !plan.AllowOverbilling) { - return Results.Json( - new { error = "Request quota exceeded", usage = effectiveRequests, limit = plan.MonthlyRequestQuota }, - statusCode: StatusCodes.Status429TooManyRequests); + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Request quota exceeded", + usage = effectiveRequests, + limit = plan.MonthlyRequestQuota, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "request-quota-exceeded" + }, + statusCode: StatusCodes.Status429TooManyRequests) + : Results.Json( + new { error = "Request quota exceeded", usage = effectiveRequests, limit = plan.MonthlyRequestQuota }, + statusCode: StatusCodes.Status429TooManyRequests); } } - // 6. Check rate limits— deployment-scoped keys use the ROUTED deployment var db = redis.GetDatabase(); var now = DateTimeOffset.UtcNow; var minuteWindow = now.ToUnixTimeSeconds() / 60; @@ -141,7 +344,6 @@ private static async Task Precheck( if (plan.RequestsPerMinuteLimit > 0) { - // Use deployment-scoped key if we have a deployment, else fall back to legacy key var rpmKey = !string.IsNullOrEmpty(effectiveDeployment) ? RedisKeys.RateLimitRpm(clientAppId, tenantId, effectiveDeployment, minuteWindow) : RedisKeys.RateLimitRpm(clientAppId, tenantId, minuteWindow); @@ -150,9 +352,23 @@ private static async Task Precheck( await db.KeyExpireAsync(rpmKey, TimeSpan.FromSeconds(120)); if (currentRpm > plan.RequestsPerMinuteLimit) { - return Results.Json( - new { error = "Rate limit exceeded — requests per minute", limit = plan.RequestsPerMinuteLimit, current = currentRpm }, - statusCode: StatusCodes.Status429TooManyRequests); + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Rate limit exceeded — requests per minute", + limit = plan.RequestsPerMinuteLimit, + current = currentRpm, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "rpm-exceeded" + }, + statusCode: StatusCodes.Status429TooManyRequests) + : Results.Json( + new { error = "Rate limit exceeded — requests per minute", limit = plan.RequestsPerMinuteLimit, current = currentRpm }, + statusCode: StatusCodes.Status429TooManyRequests); } } @@ -164,45 +380,86 @@ private static async Task Precheck( currentTpm = (long)(await db.StringGetAsync(tpmKey)); if (currentTpm >= plan.TokensPerMinuteLimit) { - return Results.Json( - new { error = "Rate limit exceeded — tokens per minute", limit = plan.TokensPerMinuteLimit, current = currentTpm }, - statusCode: StatusCodes.Status429TooManyRequests); + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Rate limit exceeded — tokens per minute", + limit = plan.TokensPerMinuteLimit, + current = currentTpm, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "tpm-exceeded" + }, + statusCode: StatusCodes.Status429TooManyRequests) + : Results.Json( + new { error = "Rate limit exceeded — tokens per minute", limit = plan.TokensPerMinuteLimit, current = currentTpm }, + statusCode: StatusCodes.Status429TooManyRequests); } } - // 7. Check deployment access control — runs on the ROUTED deployment - if (!string.IsNullOrEmpty(effectiveDeployment)) + if (!string.IsNullOrEmpty(effectiveDeployment) && + effectiveAllowedDeployments.Count > 0 && + !effectiveAllowedDeployments.Contains(effectiveDeployment, StringComparer.OrdinalIgnoreCase)) { - var effectiveAllowedDeployments = (assignment.AllowedDeployments is { Count: > 0 }) - ? assignment.AllowedDeployments - : plan.AllowedDeployments; - - if (effectiveAllowedDeployments is { Count: > 0 } && - !effectiveAllowedDeployments.Contains(effectiveDeployment, StringComparer.OrdinalIgnoreCase)) - { - return Results.Json( + return includeAccessProfileMetadata + ? Results.Json( + new + { + error = "Deployment not allowed", + deploymentId = effectiveDeployment, + allowedDeployments = effectiveAllowedDeployments, + planId = resolvedPlanId, + accessProfileId, + apiId, + operationId, + deniedBy = "deployment-denied" + }, + statusCode: StatusCodes.Status403Forbidden) + : Results.Json( new { error = "Deployment not allowed", deploymentId = effectiveDeployment, allowedDeployments = effectiveAllowedDeployments }, statusCode: StatusCodes.Status403Forbidden); - } } - // 8. Return enriched response with routing metadata - return Results.Ok(new - { - status = "authorized", - clientAppId, - tenantId, - plan = plan.Name, - usage = effectiveUsage, - limit = plan.MonthlyTokenQuota, - currentRpm, - rpmLimit = plan.RequestsPerMinuteLimit, - currentTpm, - tpmLimit = plan.TokensPerMinuteLimit, - routedDeployment = routedDeploymentId, - requestedDeployment = requestedDeploymentId, - routingPolicyId - }); + return includeAccessProfileMetadata + ? Results.Ok(new + { + status = "authorized", + clientAppId, + tenantId, + plan = plan.Name, + planId = resolvedPlanId, + accessProfileId, + allowedDeployments = effectiveAllowedDeployments, + usage = effectiveUsage, + limit = plan.MonthlyTokenQuota, + currentRpm, + rpmLimit = plan.RequestsPerMinuteLimit, + currentTpm, + tpmLimit = plan.TokensPerMinuteLimit, + routedDeployment = routedDeploymentId, + requestedDeployment = requestedDeploymentId, + routingPolicyId + }) + : Results.Ok(new + { + status = "authorized", + clientAppId, + tenantId, + plan = plan.Name, + planId = resolvedPlanId, + usage = effectiveUsage, + limit = plan.MonthlyTokenQuota, + currentRpm, + rpmLimit = plan.RequestsPerMinuteLimit, + currentTpm, + tpmLimit = plan.TokensPerMinuteLimit, + routedDeployment = routedDeploymentId, + requestedDeployment = requestedDeploymentId, + routingPolicyId + }); } /// @@ -270,6 +527,9 @@ private static async Task ContentCheck( return Results.Ok(new { blocked = false }); } + private static string? NormalizeOptional(string? value) + => string.IsNullOrWhiteSpace(value) ? null : value.Trim(); + /// Invalidates the in-memory routing policy cache (for testing). internal static void ClearRoutingPolicyCache() => RoutingPolicyCache.Clear(); } diff --git a/src/AIPolicyEngine.Api/Models/AccessProfile.cs b/src/AIPolicyEngine.Api/Models/AccessProfile.cs new file mode 100644 index 00000000..44053f80 --- /dev/null +++ b/src/AIPolicyEngine.Api/Models/AccessProfile.cs @@ -0,0 +1,35 @@ +namespace AIPolicyEngine.Api.Models; + +/// +/// Per-client access profile that scopes plan/routing/deployment access to an API or operation. +/// Stored in Cosmos configuration container under the access-profile partition. +/// +public sealed class AccessProfile +{ + public const string PartitionKeyValue = "access-profile"; + public const string GlobalApiId = "_global"; + public const string AllOperations = "_all"; + + public string Id { get; set; } = string.Empty; + public string PartitionKey { get; set; } = PartitionKeyValue; + public string ClientAppId { get; set; } = string.Empty; + public string TenantId { get; set; } = string.Empty; + public string ApiId { get; set; } = string.Empty; + public string? OperationId { get; set; } + public string PlanId { get; set; } = string.Empty; + public string? RoutingPolicyId { get; set; } + public List AllowedDeployments { get; set; } = []; + public bool Enabled { get; set; } = true; + public string CreatedBy { get; set; } = string.Empty; + public DateTime CreatedAt { get; set; } = DateTime.UtcNow; + public DateTime UpdatedAt { get; set; } = DateTime.UtcNow; + + public static string BuildId(string clientAppId, string tenantId, string apiId, string? operationId) + { + var normalizedOperationId = string.IsNullOrWhiteSpace(operationId) + ? AllOperations + : operationId.Trim(); + + return $"ap:{clientAppId.Trim()}:{tenantId.Trim()}:{apiId.Trim()}:{normalizedOperationId}"; + } +} diff --git a/src/AIPolicyEngine.Api/Models/AccessProfileRequests.cs b/src/AIPolicyEngine.Api/Models/AccessProfileRequests.cs new file mode 100644 index 00000000..d9cd185e --- /dev/null +++ b/src/AIPolicyEngine.Api/Models/AccessProfileRequests.cs @@ -0,0 +1,26 @@ +namespace AIPolicyEngine.Api.Models; + +public sealed class AccessProfileCreateRequest +{ + public string ClientAppId { get; set; } = string.Empty; + public string TenantId { get; set; } = string.Empty; + public string ApiId { get; set; } = string.Empty; + public string? OperationId { get; set; } + public string PlanId { get; set; } = string.Empty; + public string? RoutingPolicyId { get; set; } + public List? AllowedDeployments { get; set; } + public bool Enabled { get; set; } = true; +} + +public sealed class AccessProfileUpdateRequest +{ + public string? PlanId { get; set; } + public string? RoutingPolicyId { get; set; } + public List? AllowedDeployments { get; set; } + public bool? Enabled { get; set; } +} + +public sealed class BulkAccessProfilesRequest +{ + public List Profiles { get; set; } = []; +} diff --git a/src/AIPolicyEngine.Api/Models/AccessProfileResponses.cs b/src/AIPolicyEngine.Api/Models/AccessProfileResponses.cs new file mode 100644 index 00000000..fb0363d9 --- /dev/null +++ b/src/AIPolicyEngine.Api/Models/AccessProfileResponses.cs @@ -0,0 +1,32 @@ +namespace AIPolicyEngine.Api.Models; + +public sealed class AccessProfilesResponse +{ + public List Profiles { get; set; } = []; +} + +public sealed class BulkAccessProfilesResponse +{ + public int Created { get; set; } + public List Failed { get; set; } = []; +} + +public sealed class BulkAccessProfileFailure +{ + public int Index { get; set; } + public string Error { get; set; } = string.Empty; + public string? ProfileId { get; set; } +} + +public sealed class ResolvedAccessProfile +{ + public string PlanId { get; set; } = string.Empty; + public string? RoutingPolicyId { get; set; } + public List AllowedDeployments { get; set; } = []; + public string? AccessProfileId { get; set; } + public string? SourceProfileId + { + get => AccessProfileId; + set => AccessProfileId = value; + } +} diff --git a/src/AIPolicyEngine.Api/Models/AuditLogDocument.cs b/src/AIPolicyEngine.Api/Models/AuditLogDocument.cs index c7f73aa7..4c6132d8 100644 --- a/src/AIPolicyEngine.Api/Models/AuditLogDocument.cs +++ b/src/AIPolicyEngine.Api/Models/AuditLogDocument.cs @@ -79,6 +79,18 @@ public sealed class AuditLogDocument [JsonPropertyName("tierName")] public string? TierName { get; set; } + [JsonPropertyName("accessProfileId")] + public string? AccessProfileId { get; set; } + + [JsonPropertyName("planId")] + public string? PlanId { get; set; } + + [JsonPropertyName("apiId")] + public string? ApiId { get; set; } + + [JsonPropertyName("operationId")] + public string? OperationId { get; set; } + /// /// Billing period in YYYY-MM format for efficient querying. /// diff --git a/src/AIPolicyEngine.Api/Models/AuditLogItem.cs b/src/AIPolicyEngine.Api/Models/AuditLogItem.cs index 9c83d676..3f56cf45 100644 --- a/src/AIPolicyEngine.Api/Models/AuditLogItem.cs +++ b/src/AIPolicyEngine.Api/Models/AuditLogItem.cs @@ -31,4 +31,8 @@ public sealed class AuditLogItem /// APIM request ID for deduplication in deterministic hashing. public string? CorrelationId { get; set; } + public string? AccessProfileId { get; set; } + public string? PlanId { get; set; } + public string? ApiId { get; set; } + public string? OperationId { get; set; } } diff --git a/src/AIPolicyEngine.Api/Models/LogIngestRequest.cs b/src/AIPolicyEngine.Api/Models/LogIngestRequest.cs index 3b3b86d4..63ef3be2 100644 --- a/src/AIPolicyEngine.Api/Models/LogIngestRequest.cs +++ b/src/AIPolicyEngine.Api/Models/LogIngestRequest.cs @@ -29,4 +29,16 @@ public sealed class LogIngestRequest /// APIM request ID (context.RequestId) for correlation and deduplication. public string? CorrelationId { get; set; } + + /// Resolved access profile ID from precheck. + public string? AccessProfileId { get; set; } + + /// Resolved plan ID from precheck. + public string? PlanId { get; set; } + + /// API identifier used during precheck resolution. + public string? ApiId { get; set; } + + /// Operation identifier used during precheck resolution. + public string? OperationId { get; set; } } diff --git a/src/AIPolicyEngine.Api/Models/UsageData.cs b/src/AIPolicyEngine.Api/Models/UsageData.cs index 596b0272..5cfc027e 100644 --- a/src/AIPolicyEngine.Api/Models/UsageData.cs +++ b/src/AIPolicyEngine.Api/Models/UsageData.cs @@ -9,12 +9,24 @@ public sealed class UsageData [System.Text.Json.Serialization.JsonPropertyName("prompt_tokens")] public int PromptTokens { get; set; } + [System.Text.Json.Serialization.JsonPropertyName("promptTokens")] + public int PromptTokensCamelCase { set => PromptTokens = value; } + [System.Text.Json.Serialization.JsonPropertyName("completion_tokens")] public int CompletionTokens { get; set; } + [System.Text.Json.Serialization.JsonPropertyName("completionTokens")] + public int CompletionTokensCamelCase { set => CompletionTokens = value; } + [System.Text.Json.Serialization.JsonPropertyName("total_tokens")] public int TotalTokens { get; set; } + [System.Text.Json.Serialization.JsonPropertyName("totalTokens")] + public int TotalTokensCamelCase { set => TotalTokens = value; } + [System.Text.Json.Serialization.JsonPropertyName("image_tokens")] public int ImageTokens { get; set; } + + [System.Text.Json.Serialization.JsonPropertyName("imageTokens")] + public int ImageTokensCamelCase { set => ImageTokens = value; } } diff --git a/src/AIPolicyEngine.Api/Program.cs b/src/AIPolicyEngine.Api/Program.cs index eedbbbd8..6aeddd08 100644 --- a/src/AIPolicyEngine.Api/Program.cs +++ b/src/AIPolicyEngine.Api/Program.cs @@ -5,6 +5,7 @@ using AIPolicyEngine.Api.Endpoints; using AIPolicyEngine.Api.Models; using AIPolicyEngine.Api.Services; +using AIPolicyEngine.Api.Services.AccessProfiles; using AIPolicyEngine.Api.Services.ApimManagement; using Microsoft.Identity.Web; using StackExchange.Redis; @@ -70,6 +71,8 @@ builder.Services.AddSingleton(); builder.Services.AddSingleton(); builder.Services.AddSingleton(); +builder.Services.AddSingleton(); +builder.Services.AddSingleton(); builder.Services.AddSingleton>(sp => new CachedRepository( @@ -194,6 +197,7 @@ app.MapUsagePolicyEndpoints(); app.MapDeploymentEndpoints(); app.MapRoutingPolicyEndpoints(); +app.MapAccessProfileEndpoints(); app.MapRequestBillingEndpoints(); app.MapApimManagementEndpoints(); diff --git a/src/AIPolicyEngine.Api/Services/AccessProfiles/AccessProfileResolver.cs b/src/AIPolicyEngine.Api/Services/AccessProfiles/AccessProfileResolver.cs new file mode 100644 index 00000000..5ce26c2e --- /dev/null +++ b/src/AIPolicyEngine.Api/Services/AccessProfiles/AccessProfileResolver.cs @@ -0,0 +1,63 @@ +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; + +namespace AIPolicyEngine.Api.Services.AccessProfiles; + +public sealed class AccessProfileResolver : IAccessProfileResolver +{ + private readonly IAccessProfileRepository _accessProfileRepository; + private readonly IRepository _clientRepository; + + public AccessProfileResolver( + IAccessProfileRepository accessProfileRepository, + IRepository clientRepository) + { + _accessProfileRepository = accessProfileRepository; + _clientRepository = clientRepository; + } + + public async Task ResolveAsync( + string clientAppId, + string tenantId, + string apiId, + string? operationId, + CancellationToken ct = default) + { + var normalizedOperationId = string.IsNullOrWhiteSpace(operationId) ? null : operationId.Trim(); + + if (normalizedOperationId is not null) + { + var operationProfile = await _accessProfileRepository.GetForScopeAsync(clientAppId, tenantId, apiId, normalizedOperationId, ct); + if (operationProfile is { Enabled: true }) + return ToResolved(operationProfile); + } + + var apiProfile = await _accessProfileRepository.GetForScopeAsync(clientAppId, tenantId, apiId, null, ct); + if (apiProfile is { Enabled: true }) + return ToResolved(apiProfile); + + var globalProfile = await _accessProfileRepository.GetForScopeAsync(clientAppId, tenantId, AccessProfile.GlobalApiId, null, ct); + if (globalProfile is { Enabled: true }) + return ToResolved(globalProfile); + + var clientAssignment = await _clientRepository.GetAsync($"{clientAppId}:{tenantId}", ct); + if (clientAssignment is null) + return null; + + return new ResolvedAccessProfile + { + PlanId = clientAssignment.PlanId, + RoutingPolicyId = clientAssignment.ModelRoutingPolicyOverride, + AllowedDeployments = clientAssignment.AllowedDeployments?.ToList() ?? [], + AccessProfileId = null + }; + } + + private static ResolvedAccessProfile ToResolved(AccessProfile profile) => new() + { + PlanId = profile.PlanId, + RoutingPolicyId = profile.RoutingPolicyId, + AllowedDeployments = profile.AllowedDeployments?.ToList() ?? [], + AccessProfileId = profile.Id + }; +} diff --git a/src/AIPolicyEngine.Api/Services/AccessProfiles/CosmosAccessProfileRepository.cs b/src/AIPolicyEngine.Api/Services/AccessProfiles/CosmosAccessProfileRepository.cs new file mode 100644 index 00000000..f20a29d8 --- /dev/null +++ b/src/AIPolicyEngine.Api/Services/AccessProfiles/CosmosAccessProfileRepository.cs @@ -0,0 +1,155 @@ +using System.Net; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using Microsoft.Azure.Cosmos; + +namespace AIPolicyEngine.Api.Services.AccessProfiles; + +public sealed class CosmosAccessProfileRepository : IAccessProfileRepository +{ + private readonly ConfigurationContainerProvider _provider; + private readonly ILogger _logger; + + public CosmosAccessProfileRepository( + ConfigurationContainerProvider provider, + ILogger logger) + { + _provider = provider; + _logger = logger; + } + + public async Task GetAsync(string profileId, CancellationToken ct = default) + { + await _provider.EnsureInitializedAsync(ct); + + try + { + var response = await _provider.Container.ReadItemAsync( + profileId, + new PartitionKey(AccessProfile.PartitionKeyValue), + cancellationToken: ct); + + return response.Resource; + } + catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.NotFound) + { + return null; + } + } + + public Task GetForScopeAsync( + string clientAppId, + string tenantId, + string apiId, + string? operationId, + CancellationToken ct = default) + => GetAsync(AccessProfile.BuildId(clientAppId, tenantId, apiId, operationId), ct); + + public async Task> ListAsync( + string? clientAppId = null, + string? tenantId = null, + string? apiId = null, + CancellationToken ct = default) + { + await _provider.EnsureInitializedAsync(ct); + + var queryText = "SELECT * FROM c WHERE c.partitionKey = @pk"; + var query = new QueryDefinition(queryText) + .WithParameter("@pk", AccessProfile.PartitionKeyValue); + + if (!string.IsNullOrWhiteSpace(clientAppId)) + { + queryText += " AND c.clientAppId = @clientAppId"; + query.WithParameter("@clientAppId", clientAppId.Trim()); + } + + if (!string.IsNullOrWhiteSpace(tenantId)) + { + queryText += " AND c.tenantId = @tenantId"; + query.WithParameter("@tenantId", tenantId.Trim()); + } + + if (!string.IsNullOrWhiteSpace(apiId)) + { + queryText += " AND c.apiId = @apiId"; + query.WithParameter("@apiId", apiId.Trim()); + } + + query = new QueryDefinition(queryText) + .WithParameter("@pk", AccessProfile.PartitionKeyValue); + + if (!string.IsNullOrWhiteSpace(clientAppId)) + query.WithParameter("@clientAppId", clientAppId.Trim()); + if (!string.IsNullOrWhiteSpace(tenantId)) + query.WithParameter("@tenantId", tenantId.Trim()); + if (!string.IsNullOrWhiteSpace(apiId)) + query.WithParameter("@apiId", apiId.Trim()); + + var results = new List(); + using var iterator = _provider.Container.GetItemQueryIterator( + query, + requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey(AccessProfile.PartitionKeyValue) }); + + while (iterator.HasMoreResults) + { + var response = await iterator.ReadNextAsync(ct); + results.AddRange(response); + } + + return results + .OrderBy(p => p.ClientAppId, StringComparer.OrdinalIgnoreCase) + .ThenBy(p => p.TenantId, StringComparer.OrdinalIgnoreCase) + .ThenBy(p => p.ApiId, StringComparer.OrdinalIgnoreCase) + .ThenBy(p => p.OperationId ?? AccessProfile.AllOperations, StringComparer.OrdinalIgnoreCase) + .ToList(); + } + + public async Task UpsertAsync(AccessProfile profile, CancellationToken ct = default) + { + await _provider.EnsureInitializedAsync(ct); + PrepareForCosmos(profile); + + var response = await _provider.Container.UpsertItemAsync( + profile, + new PartitionKey(AccessProfile.PartitionKeyValue), + cancellationToken: ct); + + _logger.LogInformation("Access profile upserted: {ProfileId}", profile.Id); + return response.Resource; + } + + public async Task DeleteAsync(string profileId, CancellationToken ct = default) + { + await _provider.EnsureInitializedAsync(ct); + + try + { + await _provider.Container.DeleteItemAsync( + profileId, + new PartitionKey(AccessProfile.PartitionKeyValue), + cancellationToken: ct); + return true; + } + catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.NotFound) + { + return false; + } + } + + private static void PrepareForCosmos(AccessProfile profile) + { + profile.ClientAppId = profile.ClientAppId.Trim(); + profile.TenantId = profile.TenantId.Trim(); + profile.ApiId = profile.ApiId.Trim(); + profile.OperationId = string.IsNullOrWhiteSpace(profile.OperationId) ? null : profile.OperationId.Trim(); + profile.PlanId = profile.PlanId.Trim(); + profile.RoutingPolicyId = string.IsNullOrWhiteSpace(profile.RoutingPolicyId) ? null : profile.RoutingPolicyId.Trim(); + profile.AllowedDeployments = (profile.AllowedDeployments ?? []) + .Where(static deployment => !string.IsNullOrWhiteSpace(deployment)) + .Select(static deployment => deployment.Trim()) + .Distinct(StringComparer.OrdinalIgnoreCase) + .ToList(); + profile.Id = AccessProfile.BuildId(profile.ClientAppId, profile.TenantId, profile.ApiId, profile.OperationId); + profile.PartitionKey = AccessProfile.PartitionKeyValue; + } +} diff --git a/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileRepository.cs b/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileRepository.cs new file mode 100644 index 00000000..7b56bea2 --- /dev/null +++ b/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileRepository.cs @@ -0,0 +1,12 @@ +using AIPolicyEngine.Api.Models; + +namespace AIPolicyEngine.Api.Services.AccessProfiles; + +public interface IAccessProfileRepository +{ + Task GetAsync(string profileId, CancellationToken ct = default); + Task GetForScopeAsync(string clientAppId, string tenantId, string apiId, string? operationId, CancellationToken ct = default); + Task> ListAsync(string? clientAppId = null, string? tenantId = null, string? apiId = null, CancellationToken ct = default); + Task UpsertAsync(AccessProfile profile, CancellationToken ct = default); + Task DeleteAsync(string profileId, CancellationToken ct = default); +} diff --git a/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileResolver.cs b/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileResolver.cs new file mode 100644 index 00000000..3b7c4226 --- /dev/null +++ b/src/AIPolicyEngine.Api/Services/AccessProfiles/IAccessProfileResolver.cs @@ -0,0 +1,8 @@ +using AIPolicyEngine.Api.Models; + +namespace AIPolicyEngine.Api.Services.AccessProfiles; + +public interface IAccessProfileResolver +{ + Task ResolveAsync(string clientAppId, string tenantId, string apiId, string? operationId, CancellationToken ct = default); +} diff --git a/src/AIPolicyEngine.Api/Services/AuditLogWriter.cs b/src/AIPolicyEngine.Api/Services/AuditLogWriter.cs index a3e24c22..ef3b47e5 100644 --- a/src/AIPolicyEngine.Api/Services/AuditLogWriter.cs +++ b/src/AIPolicyEngine.Api/Services/AuditLogWriter.cs @@ -141,7 +141,11 @@ private async Task FlushBatchWithRetryAsync(List batch, Cancellati RoutingPolicyId = item.RoutingPolicyId, Multiplier = item.Multiplier, EffectiveRequestCost = item.EffectiveRequestCost, - TierName = item.TierName + TierName = item.TierName, + AccessProfileId = item.AccessProfileId, + PlanId = item.PlanId, + ApiId = item.ApiId, + OperationId = item.OperationId }).ToList(); for (var attempt = 1; attempt <= MaxRetries; attempt++) From 6c858b96cb54c6f4a7ca5ffe83c9669d03ea84f7 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 17:45:56 -0400 Subject: [PATCH 10/14] Add AAA access profile tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../decisions/inbox/bunk-aaa-test-matrix.md | 100 ++++ .../AccessProfileCascadeE2ETests.cs | 135 ++++++ .../Integration/AccessProfileLogTests.cs | 192 ++++++++ .../Integration/AccessProfilePrecheckTests.cs | 226 +++++++++ .../AccessProfileResolverTests.cs | 116 +++++ .../AccessProfileTestSupport.cs | 433 ++++++++++++++++++ 6 files changed, 1202 insertions(+) create mode 100644 .squad/decisions/inbox/bunk-aaa-test-matrix.md create mode 100644 src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs create mode 100644 src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs create mode 100644 src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs create mode 100644 src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs create mode 100644 src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs diff --git a/.squad/decisions/inbox/bunk-aaa-test-matrix.md b/.squad/decisions/inbox/bunk-aaa-test-matrix.md new file mode 100644 index 00000000..925c64a6 --- /dev/null +++ b/.squad/decisions/inbox/bunk-aaa-test-matrix.md @@ -0,0 +1,100 @@ +# Bunk AAA Access Profile Test Matrix + +Date: 2026-05-21 + +## Scope delivered + +Implemented the requested 21-test AAA anticipatory matrix for Access Profiles: + +- Resolver unit tests: 6 +- Precheck integration tests: 6 +- Log integration tests: 4 +- End-to-end cascade integration test: 1 +- Pending M4 template assertions: 4 skipped + +## Files + +- `src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs` +- `src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs` + +## Matrix coverage + +### Resolver unit tests (6) + +1. Operation-specific profile beats API-wide profile +2. API-wide profile beats client-global profile +3. Client-global profile beats legacy assignment +4. No profiles returns legacy assignment +5. No profiles and no legacy assignment returns null +6. Disabled profile is skipped during cascade + +### Precheck integration tests (6) + +1. `apiId` + `operationId` path invokes resolver and returns `planId` + `accessProfileId` +2. No `apiId` uses legacy path and omits access-profile metadata +3. `apiId` with no matching profile falls back to legacy assignment +4. Disabled operation profile yields API-level profile +5. AAA response carries `allowedDeployments` +6. Legacy callers keep the current authorized response contract + +### Log integration tests (4) + +1. `PlanId` in request drives plan lookup +2. Missing `PlanId` falls back to `ClientPlanAssignment.PlanId` +3. `AccessProfileId` is written to the audit item +4. Legacy payload with no new AAA fields still works + +### End-to-end cascade test (1) + +1. Precheck resolves in order: operation -> API -> global -> legacy + +### Pending M4 template assertions (4 skipped) + +1. Template extracts `apiId` +2. Precheck URL carries `apiId` and `operationId` +3. Outbound log payload carries `accessProfileId`, `planId`, `apiId`, and `operationId` +4. Template manifest/version bumps to `1.1` + +## Contract issues / decisions surfaced + +1. **Resolver result type naming** + - Architecture appendix referenced `ResolvedAccess`. + - Implementation and approved tester contract use `ResolvedAccessProfile`. + - Current code aligns to `ResolvedAccessProfile` and keeps `SourceProfileId` as an alias of `AccessProfileId` for compatibility. + +2. **Legacy fallback ownership** + - Appendix prose suggested the resolver internally falls through to legacy `ClientPlanAssignment`. + - Sample precheck pseudocode still implied fallback might happen in the endpoint. + - Current implementation resolves legacy fallback inside `AccessProfileResolver`, which matches the approved test matrix. + +3. **Precheck additive contract** + - AAA path returns additive fields (`planId`, `accessProfileId`, `allowedDeployments`) without breaking existing success fields. + - Legacy precheck path still avoids access-profile metadata, preserving older callers. + +4. **Log plan-resolution edge case** + - The addendum discussed a possible mismatched-`planId` fallback-to-legacy case. + - The approved 21-test matrix did not require that behavior, so it is not asserted here. + - Tests cover the explicit contract only: supplied `PlanId` wins, otherwise legacy assignment wins. + +## Validation + +Ran: + +`dotnet test src\AIPolicyEngine.Tests\AIPolicyEngine.Tests.csproj --no-restore --nologo` + +Result: + +- Total: 320 +- Succeeded: 312 +- Failed: 0 +- Skipped: 8 + +AAA-specific result: + +- Total: 21 +- Succeeded: 17 +- Skipped: 4 pending M4 +- Failed: 0 diff --git a/src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs b/src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs new file mode 100644 index 00000000..d732f7de --- /dev/null +++ b/src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs @@ -0,0 +1,135 @@ +using System.Net; +using System.Text.Json; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using AIPolicyEngine.Tests.Services.AccessProfiles; +using Microsoft.AspNetCore.TestHost; +using Microsoft.Extensions.DependencyInjection; +using static AIPolicyEngine.Tests.Services.AccessProfiles.AccessProfileTestSupport; + +namespace AIPolicyEngine.Tests.Integration; + +public sealed class AccessProfileCascadeE2ETests : IClassFixture +{ + private readonly ChargebackApiFactory _factory; + private static readonly JsonSerializerOptions JsonOpts = JsonConfig.Default; + private const string TenantId = "tenant-1"; + private const string ProfileClientAppId = "cascade-client"; + private const string LegacyClientAppId = "cascade-legacy-client"; + + public AccessProfileCascadeE2ETests(ChargebackApiFactory factory) + { + _factory = factory; + _factory.Redis.Clear(); + } + + [Fact] + public async Task Precheck_EndToEnd_CascadesOperationApiGlobalThenLegacy() + { + SeedPlan(CreatePlan("operation-plan", "Operation Plan")); + SeedPlan(CreatePlan("api-plan", "API Plan")); + SeedPlan(CreatePlan("global-plan", "Global Plan")); + SeedPlan(CreatePlan("legacy-plan", "Legacy Plan")); + + var profiledAssignment = CreateLegacyAssignment(ProfileClientAppId, "legacy-plan"); + var legacyAssignment = CreateLegacyAssignment(LegacyClientAppId, "legacy-plan"); + SeedClientAssignment(profiledAssignment); + SeedClientAssignment(legacyAssignment); + + using var resolverHarness = CreateResolverHarness( + [ + CreateAccessProfile(ProfileClientAppId, TenantId, "_global", null, "global-plan"), + CreateAccessProfile(ProfileClientAppId, TenantId, "openai-api", null, "api-plan"), + CreateAccessProfile(ProfileClientAppId, TenantId, "openai-api", "chat", "operation-plan") + ], + [ + profiledAssignment, + legacyAssignment + ]); + + using var client = CreateClient(resolverHarness.Instance); + + var operationResponse = await client.GetAsync($"/api/precheck/{ProfileClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api&operationId=chat"); + var operationJson = await ReadJsonAsync(operationResponse); + Assert.Equal(HttpStatusCode.OK, operationResponse.StatusCode); + Assert.Equal("operation-plan", operationJson.RootElement.GetProperty("planId").GetString()); + Assert.Equal($"ap:{ProfileClientAppId}:{TenantId}:openai-api:chat", operationJson.RootElement.GetProperty("accessProfileId").GetString()); + + var apiResponse = await client.GetAsync($"/api/precheck/{ProfileClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api&operationId=embeddings"); + var apiJson = await ReadJsonAsync(apiResponse); + Assert.Equal(HttpStatusCode.OK, apiResponse.StatusCode); + Assert.Equal("api-plan", apiJson.RootElement.GetProperty("planId").GetString()); + Assert.Equal($"ap:{ProfileClientAppId}:{TenantId}:openai-api:_all", apiJson.RootElement.GetProperty("accessProfileId").GetString()); + + var globalResponse = await client.GetAsync($"/api/precheck/{ProfileClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=embeddings-api&operationId=create"); + var globalJson = await ReadJsonAsync(globalResponse); + Assert.Equal(HttpStatusCode.OK, globalResponse.StatusCode); + Assert.Equal("global-plan", globalJson.RootElement.GetProperty("planId").GetString()); + Assert.Equal($"ap:{ProfileClientAppId}:{TenantId}:_global:_all", globalJson.RootElement.GetProperty("accessProfileId").GetString()); + + var legacyResponse = await client.GetAsync($"/api/precheck/{LegacyClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=embeddings-api&operationId=create"); + var legacyJson = await ReadJsonAsync(legacyResponse); + Assert.Equal(HttpStatusCode.OK, legacyResponse.StatusCode); + Assert.Equal("legacy-plan", legacyJson.RootElement.GetProperty("planId").GetString()); + Assert.True(!legacyJson.RootElement.TryGetProperty("accessProfileId", out var accessProfileId) || accessProfileId.ValueKind == JsonValueKind.Null); + } + + private HttpClient CreateClient(object resolver) + { + var factory = _factory.WithWebHostBuilder(builder => + { + builder.ConfigureTestServices(services => + { + var resolverType = RequireType("IAccessProfileResolver"); + RemoveService(services, resolverType); + services.AddSingleton(resolverType, resolver); + }); + }); + + var client = factory.CreateClient(); + client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", "test-token"); + return client; + } + + private void SeedPlan(PlanData plan) + => _factory.Redis.SeedString(RedisKeys.Plan(plan.Id), JsonSerializer.Serialize(plan, JsonOpts)); + + private void SeedClientAssignment(ClientPlanAssignment assignment) + => _factory.Redis.SeedString($"client:{assignment.ClientAppId}:{assignment.TenantId}", JsonSerializer.Serialize(assignment, JsonOpts)); + + private static PlanData CreatePlan(string id, string name) => new() + { + Id = id, + Name = name, + MonthlyRate = 99m, + MonthlyTokenQuota = 1_000_000, + TokensPerMinuteLimit = 0, + RequestsPerMinuteLimit = 0, + AllowOverbilling = true, + CostPerMillionTokens = 5m, + RollUpAllDeployments = true + }; + + private static ClientPlanAssignment CreateLegacyAssignment(string clientAppId, string planId) => new() + { + Id = $"{clientAppId}:{TenantId}", + ClientAppId = clientAppId, + TenantId = TenantId, + PlanId = planId, + DisplayName = "Cascade Client", + CurrentPeriodStart = new DateTime(2026, 05, 01, 0, 0, 0, DateTimeKind.Utc), + AllowedDeployments = [] + }; + + private static async Task ReadJsonAsync(HttpResponseMessage response) + => JsonDocument.Parse(await response.Content.ReadAsStringAsync()); + + private static void RemoveService(IServiceCollection services, Type serviceType) + { + var descriptors = services.Where(descriptor => descriptor.ServiceType == serviceType).ToList(); + foreach (var descriptor in descriptors) + { + services.Remove(descriptor); + } + } +} diff --git a/src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs b/src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs new file mode 100644 index 00000000..cab639e3 --- /dev/null +++ b/src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs @@ -0,0 +1,192 @@ +using System.Net; +using System.Text; +using System.Text.Json; +using System.Threading.Channels; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using Microsoft.AspNetCore.TestHost; +using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.Hosting; + +namespace AIPolicyEngine.Tests.Integration; + +public sealed class AccessProfileLogTests : IClassFixture +{ + private readonly ChargebackApiFactory _factory; + private static readonly JsonSerializerOptions JsonOpts = JsonConfig.Default; + private const string ClientAppId = "log-access-client"; + private const string TenantId = "tenant-1"; + + public AccessProfileLogTests(ChargebackApiFactory factory) + { + _factory = factory; + _factory.Redis.Clear(); + } + + [Fact] + public async Task LogRequest_WithPlanId_UsesPlanIdForPlanLookup() + { + SeedPlan(CreatePlan(id: "legacy-plan", useMultiplierBilling: false)); + SeedPlan(CreatePlan(id: "profile-plan", useMultiplierBilling: true)); + SeedClientAssignment(CreateClientAssignment(planId: "legacy-plan")); + + using var client = CreateClient(out _); + var response = await client.PostAsync("/api/log", CreateJsonContent(CreateLogRequest(planId: "profile-plan"))); + var updated = await ReadClientAssignmentAsync(); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.NotNull(updated); + Assert.True(updated!.CurrentPeriodRequests > 0m, "Resolved planId should enable multiplier billing request accounting."); + } + + [Fact] + public async Task LogRequest_WithoutPlanId_FallsBackToClientPlanAssignmentLookup() + { + SeedPlan(CreatePlan(id: "legacy-plan", useMultiplierBilling: true)); + SeedClientAssignment(CreateClientAssignment(planId: "legacy-plan")); + + using var client = CreateClient(out _); + var response = await client.PostAsync("/api/log", CreateJsonContent(CreateLogRequest())); + var updated = await ReadClientAssignmentAsync(); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.NotNull(updated); + Assert.True(updated!.CurrentPeriodRequests > 0m); + } + + [Fact] + public async Task LogRequest_WithAccessProfileId_RecordsItInAuditEntry() + { + SeedPlan(CreatePlan(id: "legacy-plan", useMultiplierBilling: false)); + SeedClientAssignment(CreateClientAssignment(planId: "legacy-plan")); + + using var client = CreateClient(out var channel); + var response = await client.PostAsync("/api/log", CreateJsonContent(CreateLogRequest(accessProfileId: "ap:client:tenant:openai-api:_all", planId: "legacy-plan", apiId: "openai-api", operationId: "chat"))); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.True(channel.Reader.TryRead(out var auditItem), "Audit channel should receive an item."); + Assert.Equal("ap:client:tenant:openai-api:_all", auditItem!.GetType().GetProperty("AccessProfileId")?.GetValue(auditItem)?.ToString()); + } + + [Fact] + public async Task LogRequest_WithoutNewFields_LegacyCallerStillWorks() + { + SeedPlan(CreatePlan(id: "legacy-plan", useMultiplierBilling: false)); + SeedClientAssignment(CreateClientAssignment(planId: "legacy-plan")); + + using var client = CreateClient(out _); + var response = await client.PostAsync("/api/log", CreateJsonContent(CreateLogRequest())); + var updated = await ReadClientAssignmentAsync(); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.NotNull(updated); + Assert.Equal(150, updated!.CurrentPeriodUsage); + } + + private HttpClient CreateClient(out Channel auditChannel) + { + var channel = Channel.CreateUnbounded(new UnboundedChannelOptions { SingleReader = true, SingleWriter = false }); + auditChannel = channel; + var factory = _factory.WithWebHostBuilder(builder => + { + builder.ConfigureTestServices(services => + { + RemoveHostedService(services); + RemoveService>(services); + services.AddSingleton(channel); + }); + }); + + var client = factory.CreateClient(); + client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", "test-token"); + return client; + } + + private void SeedPlan(PlanData plan) + => _factory.Redis.SeedString(RedisKeys.Plan(plan.Id), JsonSerializer.Serialize(plan, JsonOpts)); + + private void SeedClientAssignment(ClientPlanAssignment assignment) + => _factory.Redis.SeedString($"client:{assignment.ClientAppId}:{assignment.TenantId}", JsonSerializer.Serialize(assignment, JsonOpts)); + + private async Task ReadClientAssignmentAsync() + { + var value = await _factory.Redis.Database.StringGetAsync($"client:{ClientAppId}:{TenantId}"); + return !value.HasValue ? null : JsonSerializer.Deserialize((string)value!, JsonOpts); + } + + private static StringContent CreateJsonContent(object payload) + => new(JsonSerializer.Serialize(payload, JsonOpts), Encoding.UTF8, "application/json"); + + private static LogIngestRequest CreateLogRequest(string? accessProfileId = null, string? planId = null, string? apiId = null, string? operationId = null) + => new() + { + TenantId = TenantId, + ClientAppId = ClientAppId, + Audience = "api://engine", + DeploymentId = "gpt-4.1", + AccessProfileId = accessProfileId, + PlanId = planId, + ApiId = apiId, + OperationId = operationId, + CorrelationId = Guid.NewGuid().ToString("N"), + ResponseBody = new OpenAiResponseBody + { + Model = "gpt-4.1", + Object = "chat.completion", + Usage = new UsageData + { + PromptTokens = 100, + CompletionTokens = 50, + TotalTokens = 150, + ImageTokens = 0 + } + } + }; + + private static PlanData CreatePlan(string id, bool useMultiplierBilling) => new() + { + Id = id, + Name = id, + MonthlyRate = 99m, + MonthlyTokenQuota = 1_000_000, + RequestsPerMinuteLimit = 0, + TokensPerMinuteLimit = 0, + AllowOverbilling = true, + CostPerMillionTokens = 5m, + RollUpAllDeployments = true, + UseMultiplierBilling = useMultiplierBilling, + MonthlyRequestQuota = 100m, + OverageRatePerRequest = 1.0m + }; + + private static ClientPlanAssignment CreateClientAssignment(string planId) => new() + { + Id = $"{ClientAppId}:{TenantId}", + ClientAppId = ClientAppId, + TenantId = TenantId, + PlanId = planId, + DisplayName = "Log Access Client", + CurrentPeriodStart = new DateTime(2026, 05, 01, 0, 0, 0, DateTimeKind.Utc) + }; + + private static void RemoveService(IServiceCollection services) + { + var descriptors = services.Where(descriptor => descriptor.ServiceType == typeof(T)).ToList(); + foreach (var descriptor in descriptors) + { + services.Remove(descriptor); + } + } + + private static void RemoveHostedService(IServiceCollection services) where T : class + { + var descriptors = services.Where(descriptor => + descriptor.ServiceType == typeof(IHostedService) && + (descriptor.ImplementationType == typeof(T) || + (descriptor.ImplementationFactory?.Method.ReturnType?.IsAssignableFrom(typeof(T)) ?? false))).ToList(); + foreach (var descriptor in descriptors) + { + services.Remove(descriptor); + } + } +} diff --git a/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs b/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs new file mode 100644 index 00000000..5db92e32 --- /dev/null +++ b/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs @@ -0,0 +1,226 @@ +using System.Net; +using System.Text.Json; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using AIPolicyEngine.Tests.Services.AccessProfiles; +using Microsoft.AspNetCore.TestHost; +using Microsoft.Extensions.DependencyInjection; +using static AIPolicyEngine.Tests.Services.AccessProfiles.AccessProfileTestSupport; + +namespace AIPolicyEngine.Tests.Integration; + +public sealed class AccessProfilePrecheckTests : IClassFixture +{ + private readonly ChargebackApiFactory _factory; + private static readonly JsonSerializerOptions JsonOpts = JsonConfig.Default; + private const string ClientAppId = "access-client"; + private const string TenantId = "tenant-1"; + + public AccessProfilePrecheckTests(ChargebackApiFactory factory) + { + _factory = factory; + _factory.Redis.Clear(); + } + + [Fact] + public async Task Precheck_WithApiAndOperation_InvokesResolver_AndReturnsAccessProfileAndPlanId() + { + SeedPlan(CreatePlan(id: "profile-plan", name: "Profile Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + var (resolverProxy, tracker) = CreateResolverProxy((clientAppId, tenantId, apiId, operationId) => + new ResolvedAccessSnapshot( + PlanId: "profile-plan", + RoutingPolicyId: null, + AllowedDeployments: ["gpt-4o"], + SourceProfileId: $"ap:{clientAppId}:{tenantId}:{apiId}:{operationId}")); + + using var client = CreateClient(resolverProxy); + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api&operationId=chat"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Single(tracker.Calls); + Assert.Equal((ClientAppId, TenantId, "openai-api", "chat"), tracker.Calls[0]); + Assert.Equal("profile-plan", json.RootElement.GetProperty("planId").GetString()); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:openai-api:chat", json.RootElement.GetProperty("accessProfileId").GetString()); + } + + [Fact] + public async Task Precheck_WithoutApiId_UsesLegacyPath_AndOmitsAccessProfileId() + { + SeedPlan(CreatePlan(id: "legacy-plan", name: "Legacy Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + using var client = CreateClient(); + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Equal("legacy-plan", json.RootElement.GetProperty("planId").GetString()); + Assert.False(json.RootElement.TryGetProperty("accessProfileId", out var accessProfileId) && accessProfileId.ValueKind != JsonValueKind.Null); + } + + [Fact] + public async Task Precheck_WithApiIdAndNoProfile_FallsBackToLegacyAssignment() + { + SeedPlan(CreatePlan(id: "legacy-plan", name: "Legacy Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + var (resolverProxy, tracker) = CreateResolverProxy((_, _, _, _) => null); + using var client = CreateClient(resolverProxy); + + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Single(tracker.Calls); + Assert.Equal("legacy-plan", json.RootElement.GetProperty("planId").GetString()); + Assert.False(json.RootElement.TryGetProperty("accessProfileId", out var accessProfileId) && accessProfileId.ValueKind != JsonValueKind.Null); + } + + [Fact] + public async Task Precheck_DisabledOperationProfile_IsSkipped_InFavorOfApiProfile() + { + SeedPlan(CreatePlan(id: "api-plan", name: "API Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + using var resolverHarness = CreateResolverHarness( + [ + CreateAccessProfile(ClientAppId, TenantId, "openai-api", "chat", "disabled-plan", enabled: false), + CreateAccessProfile(ClientAppId, TenantId, "openai-api", null, "api-plan") + ], + CreateLegacyAssignment(planId: "legacy-plan")); + + using var client = CreateClient(resolverHarness.Instance); + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api&operationId=chat"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Equal("api-plan", json.RootElement.GetProperty("planId").GetString()); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:openai-api:_all", json.RootElement.GetProperty("accessProfileId").GetString()); + } + + [Fact] + public async Task Precheck_ResponseIncludesAllowedDeployments_FromResolvedProfile() + { + SeedPlan(CreatePlan(id: "profile-plan", name: "Profile Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + var (resolverProxy, _) = CreateResolverProxy((clientAppId, tenantId, apiId, operationId) => + new ResolvedAccessSnapshot( + PlanId: "profile-plan", + RoutingPolicyId: null, + AllowedDeployments: ["gpt-4o", "gpt-4o-mini"], + SourceProfileId: $"ap:{clientAppId}:{tenantId}:{apiId}:{operationId ?? "_all"}")); + + using var client = CreateClient(resolverProxy); + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o&apiId=openai-api"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Equal( + ["gpt-4o", "gpt-4o-mini"], + json.RootElement.GetProperty("allowedDeployments").EnumerateArray().Select(x => x.GetString()!).ToArray()); + } + + [Fact] + public async Task Precheck_BackwardCompatibleLegacyCallers_KeepCurrentResponseShape() + { + SeedPlan(CreatePlan(id: "legacy-plan", name: "Legacy Plan")); + SeedClientAssignment(CreateLegacyAssignment(planId: "legacy-plan")); + + using var client = CreateClient(); + var response = await client.GetAsync($"/api/precheck/{ClientAppId}/{TenantId}?deploymentId=gpt-4o"); + var json = await ReadJsonAsync(response); + + Assert.Equal(HttpStatusCode.OK, response.StatusCode); + Assert.Equal("authorized", json.RootElement.GetProperty("status").GetString()); + Assert.Equal(ClientAppId, json.RootElement.GetProperty("clientAppId").GetString()); + Assert.Equal(TenantId, json.RootElement.GetProperty("tenantId").GetString()); + Assert.Equal("Legacy Plan", json.RootElement.GetProperty("plan").GetString()); + Assert.Equal("gpt-4o", json.RootElement.GetProperty("requestedDeployment").GetString()); + } + + [Fact(Skip = "[Pending: M4] all templates must render extraction")] + public void TemplateRendering_Pending_ApiIdVariableExtraction() + { + } + + [Fact(Skip = "[Pending: M4] all templates must include apiId and operationId on the precheck URL")] + public void TemplateRendering_Pending_PrecheckUrlCarriesApiAndOperation() + { + } + + [Fact(Skip = "[Pending: M4] all templates must include accessProfileId/planId/apiId/operationId in outbound log payload")] + public void TemplateRendering_Pending_LogPayloadCarriesAccessProfileMetadata() + { + } + + [Fact(Skip = "[Pending: M4] template manifests should bump to version 1.1 once AAA fields ship")] + public void TemplateRendering_Pending_TemplateVersionBump() + { + } + + private HttpClient CreateClient(object? resolver = null) + { + var factory = _factory.WithWebHostBuilder(builder => + { + builder.ConfigureTestServices(services => + { + if (resolver is not null) + { + var resolverType = RequireType("IAccessProfileResolver"); + RemoveService(services, resolverType); + services.AddSingleton(resolverType, resolver); + } + }); + }); + + var client = factory.CreateClient(); + client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", "test-token"); + return client; + } + + private void SeedPlan(PlanData plan) + => _factory.Redis.SeedString(RedisKeys.Plan(plan.Id), JsonSerializer.Serialize(plan, JsonOpts)); + + private void SeedClientAssignment(ClientPlanAssignment assignment) + => _factory.Redis.SeedString($"client:{assignment.ClientAppId}:{assignment.TenantId}", JsonSerializer.Serialize(assignment, JsonOpts)); + + private static PlanData CreatePlan(string id, string name) => new() + { + Id = id, + Name = name, + MonthlyRate = 99m, + MonthlyTokenQuota = 1_000_000, + TokensPerMinuteLimit = 0, + RequestsPerMinuteLimit = 0, + AllowOverbilling = true, + CostPerMillionTokens = 5m, + RollUpAllDeployments = true + }; + + private static ClientPlanAssignment CreateLegacyAssignment(string planId) => new() + { + Id = $"{ClientAppId}:{TenantId}", + ClientAppId = ClientAppId, + TenantId = TenantId, + PlanId = planId, + DisplayName = "Legacy Client", + CurrentPeriodStart = new DateTime(2026, 05, 01, 0, 0, 0, DateTimeKind.Utc), + AllowedDeployments = [] + }; + + private static async Task ReadJsonAsync(HttpResponseMessage response) + => JsonDocument.Parse(await response.Content.ReadAsStringAsync()); + + private static void RemoveService(IServiceCollection services, Type serviceType) + { + var descriptors = services.Where(descriptor => descriptor.ServiceType == serviceType).ToList(); + foreach (var descriptor in descriptors) + { + services.Remove(descriptor); + } + } +} diff --git a/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs b/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs new file mode 100644 index 00000000..649461c8 --- /dev/null +++ b/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs @@ -0,0 +1,116 @@ +using AIPolicyEngine.Api.Models; +using static AIPolicyEngine.Tests.Services.AccessProfiles.AccessProfileTestSupport; + +namespace AIPolicyEngine.Tests.Services.AccessProfiles; + +public sealed class AccessProfileResolverTests +{ + private const string ClientAppId = "client-123"; + private const string TenantId = "tenant-123"; + + [Fact] + public async Task ResolveAsync_OperationSpecificProfile_BeatsApiWideProfile() + { + using var harness = CreateResolverHarness( + [ + CreateAccessProfile(ClientAppId, TenantId, "openai-api", null, "api-plan"), + CreateAccessProfile(ClientAppId, TenantId, "openai-api", "chat", "operation-plan") + ], + legacyAssignment: CreateLegacyAssignment(planId: "legacy-plan")); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", "chat"); + + Assert.NotNull(resolved); + Assert.Equal("operation-plan", resolved!.PlanId); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:openai-api:chat", resolved.SourceProfileId); + } + + [Fact] + public async Task ResolveAsync_ApiWideProfile_BeatsClientGlobalProfile() + { + using var harness = CreateResolverHarness( + [ + CreateAccessProfile(ClientAppId, TenantId, "_global", null, "global-plan"), + CreateAccessProfile(ClientAppId, TenantId, "openai-api", null, "api-plan") + ], + legacyAssignment: CreateLegacyAssignment(planId: "legacy-plan")); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", "embeddings"); + + Assert.NotNull(resolved); + Assert.Equal("api-plan", resolved!.PlanId); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:openai-api:_all", resolved.SourceProfileId); + } + + [Fact] + public async Task ResolveAsync_ClientGlobalProfile_BeatsLegacyAssignment() + { + using var harness = CreateResolverHarness( + [ + CreateAccessProfile(ClientAppId, TenantId, "_global", null, "global-plan", routingPolicyId: "profile-routing", allowedDeployments: ["gpt-4o"]) + ], + legacyAssignment: CreateLegacyAssignment(planId: "legacy-plan", routingPolicyId: "legacy-routing", allowedDeployments: ["gpt-4o-mini"])); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", null); + + Assert.NotNull(resolved); + Assert.Equal("global-plan", resolved!.PlanId); + Assert.Equal("profile-routing", resolved.RoutingPolicyId); + Assert.Equal(["gpt-4o"], resolved.AllowedDeployments); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:_global:_all", resolved.SourceProfileId); + } + + [Fact] + public async Task ResolveAsync_NoProfiles_ReturnsLegacyAssignment() + { + using var harness = CreateResolverHarness([], CreateLegacyAssignment(planId: "legacy-plan", routingPolicyId: "legacy-routing", allowedDeployments: ["gpt-4o-mini"])); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", "chat"); + + Assert.NotNull(resolved); + Assert.Equal("legacy-plan", resolved!.PlanId); + Assert.Equal("legacy-routing", resolved.RoutingPolicyId); + Assert.Equal(["gpt-4o-mini"], resolved.AllowedDeployments); + Assert.Null(resolved.SourceProfileId); + } + + [Fact] + public async Task ResolveAsync_NoProfilesAndNoLegacyAssignment_ReturnsNull() + { + using var harness = CreateResolverHarness([], legacyAssignment: null); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", "chat"); + + Assert.Null(resolved); + } + + [Fact] + public async Task ResolveAsync_DisabledProfile_IsSkippedDuringCascade() + { + using var harness = CreateResolverHarness( + [ + CreateAccessProfile(ClientAppId, TenantId, "openai-api", "chat", "disabled-operation-plan", enabled: false), + CreateAccessProfile(ClientAppId, TenantId, "openai-api", null, "api-plan") + ], + legacyAssignment: CreateLegacyAssignment(planId: "legacy-plan")); + + var resolved = await harness.ResolveAsync(ClientAppId, TenantId, "openai-api", "chat"); + + Assert.NotNull(resolved); + Assert.Equal("api-plan", resolved!.PlanId); + Assert.Equal($"ap:{ClientAppId}:{TenantId}:openai-api:_all", resolved.SourceProfileId); + } + + private static ClientPlanAssignment CreateLegacyAssignment(string planId, string? routingPolicyId = null, List? allowedDeployments = null) + => new() + { + Id = $"{ClientAppId}:{TenantId}", + ClientAppId = ClientAppId, + TenantId = TenantId, + PlanId = planId, + DisplayName = "Resolver Test Client", + ModelRoutingPolicyOverride = routingPolicyId, + AllowedDeployments = allowedDeployments ?? [], + CurrentPeriodStart = new DateTime(2026, 05, 01, 0, 0, 0, DateTimeKind.Utc) + }; + } diff --git a/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs b/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs new file mode 100644 index 00000000..2324cab3 --- /dev/null +++ b/src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs @@ -0,0 +1,433 @@ +using System.Reflection; +using System.Text.Json; +using AIPolicyEngine.Api.Models; +using AIPolicyEngine.Api.Services; +using Microsoft.Extensions.DependencyInjection; +using Microsoft.Extensions.Logging; +using NSubstitute; +using StackExchange.Redis; +using Xunit.Sdk; + +namespace AIPolicyEngine.Tests.Services.AccessProfiles; + +internal static class AccessProfileTestSupport +{ + private static readonly Assembly ApiAssembly = typeof(PlanData).Assembly; + + public static Type RequireType(string typeName) + => ApiAssembly.GetTypes().FirstOrDefault(type => string.Equals(type.Name, typeName, StringComparison.Ordinal)) + ?? throw new XunitException($"Missing contract type '{typeName}'."); + + public static object CreateAccessProfile( + string clientAppId, + string tenantId, + string apiId, + string? operationId, + string planId, + string? routingPolicyId = null, + IEnumerable? allowedDeployments = null, + bool enabled = true) + { + var accessProfileType = RequireType("AccessProfile"); + var accessProfile = Activator.CreateInstance(accessProfileType) + ?? throw new XunitException("Failed to construct AccessProfile model."); + + SetProperty(accessProfile, "Id", $"ap:{clientAppId}:{tenantId}:{apiId}:{operationId ?? "_all"}"); + SetProperty(accessProfile, "PartitionKey", "access-profile"); + SetProperty(accessProfile, "ClientAppId", clientAppId); + SetProperty(accessProfile, "TenantId", tenantId); + SetProperty(accessProfile, "ApiId", apiId); + SetProperty(accessProfile, "OperationId", operationId); + SetProperty(accessProfile, "PlanId", planId); + SetProperty(accessProfile, "RoutingPolicyId", routingPolicyId); + SetProperty(accessProfile, "AllowedDeployments", (allowedDeployments ?? []).ToList()); + SetProperty(accessProfile, "Enabled", enabled); + SetProperty(accessProfile, "CreatedAt", new DateTime(2026, 05, 21, 12, 0, 0, DateTimeKind.Utc)); + SetProperty(accessProfile, "UpdatedAt", new DateTime(2026, 05, 21, 12, 0, 0, DateTimeKind.Utc)); + SetProperty(accessProfile, "CreatedBy", "tester@contoso.com"); + return accessProfile; + } + + public static AccessProfileResolverHarness CreateResolverHarness( + IEnumerable accessProfiles, + ClientPlanAssignment? legacyAssignment) + => CreateResolverHarness( + accessProfiles, + legacyAssignment is null ? [] : [legacyAssignment]); + + public static AccessProfileResolverHarness CreateResolverHarness( + IEnumerable accessProfiles, + IEnumerable legacyAssignments) + { + var resolverInterface = RequireType("IAccessProfileResolver"); + var resolverImplementation = ApiAssembly.GetTypes() + .Where(type => type.IsClass && !type.IsAbstract && resolverInterface.IsAssignableFrom(type)) + .OrderBy(type => type.Name, StringComparer.Ordinal) + .FirstOrDefault() + ?? throw new XunitException("Missing concrete IAccessProfileResolver implementation."); + + var services = new ServiceCollection(); + services.AddLogging(); + services.AddMemoryCache(); + services.AddSingleton(TimeProvider.System); + + var fakeRedis = new FakeRedis(); + services.AddSingleton(fakeRedis.Multiplexer); + services.AddSingleton>(new InMemoryClientAssignmentRepository(legacyAssignments)); + + RegisterConstructorDependencies(services, resolverImplementation, accessProfiles.ToList()); + + var provider = services.BuildServiceProvider(); + var resolver = ActivatorUtilities.CreateInstance(provider, resolverImplementation); + return new AccessProfileResolverHarness(resolver, resolverInterface); + } + + public static (object Proxy, ResolverInvocationTracker Tracker) CreateResolverProxy(Func onResolve) + { + var resolverType = RequireType("IAccessProfileResolver"); + var tracker = new ResolverInvocationTracker(); + var proxy = ReflectionProxy.Create(resolverType, (method, args) => + { + if (string.Equals(method.Name, "ResolveAsync", StringComparison.Ordinal)) + { + var clientAppId = (string?)args?[0] ?? string.Empty; + var tenantId = (string?)args?[1] ?? string.Empty; + var apiId = (string?)args?[2] ?? string.Empty; + var operationId = args?.Length > 3 ? args?[3] as string : null; + tracker.Calls.Add((clientAppId, tenantId, apiId, operationId)); + var snapshot = onResolve(clientAppId, tenantId, apiId, operationId); + return CreateTaskResult(method.ReturnType, snapshot is null ? null : CreateResolvedAccess(snapshot)); + } + + throw new NotSupportedException($"Resolver proxy does not support method '{method.Name}'."); + }); + + return (proxy, tracker); + } + + public static object? CreateResolvedAccess(ResolvedAccessSnapshot snapshot) + { + var resolvedType = ApiAssembly.GetTypes().FirstOrDefault(type => string.Equals(type.Name, "ResolvedAccessProfile", StringComparison.Ordinal)) + ?? throw new XunitException("Missing contract type 'ResolvedAccessProfile'. Spec appendix still mentions 'ResolvedAccess'; align the contract."); + + var resolved = Activator.CreateInstance(resolvedType) + ?? throw new XunitException("Failed to construct ResolvedAccessProfile."); + SetProperty(resolved, "PlanId", snapshot.PlanId); + SetProperty(resolved, "RoutingPolicyId", snapshot.RoutingPolicyId); + SetProperty(resolved, "AllowedDeployments", snapshot.AllowedDeployments.ToList()); + SetProperty(resolved, "SourceProfileId", snapshot.SourceProfileId); + return resolved; + } + + public static string? GetString(object? instance, string propertyName) + => instance is null ? null : GetProperty(instance, propertyName)?.ToString(); + + public static IReadOnlyList GetStrings(object instance, string propertyName) + { + var value = GetProperty(instance, propertyName); + return value switch + { + null => [], + IEnumerable strings => strings.ToList(), + _ => throw new XunitException($"Property '{propertyName}' is not a string list.") + }; + } + + private static void RegisterConstructorDependencies(IServiceCollection services, Type implementationType, IReadOnlyList accessProfiles) + { + foreach (var parameter in implementationType.GetConstructors() + .OrderByDescending(ctor => ctor.GetParameters().Length) + .First().GetParameters()) + { + if (services.Any(descriptor => descriptor.ServiceType == parameter.ParameterType)) + { + continue; + } + + if (parameter.ParameterType == typeof(IConnectionMultiplexer) || + parameter.ParameterType == typeof(TimeProvider) || + parameter.ParameterType == typeof(ILoggerFactory)) + { + continue; + } + + if (parameter.ParameterType.IsGenericType && + parameter.ParameterType.GetGenericTypeDefinition() == typeof(ILogger<>)) + { + continue; + } + + if (parameter.ParameterType.IsGenericType && + parameter.ParameterType.GetGenericTypeDefinition() == typeof(IRepository<>) && + string.Equals(parameter.ParameterType.GenericTypeArguments[0].Name, nameof(ClientPlanAssignment), StringComparison.Ordinal)) + { + continue; + } + + if (string.Equals(parameter.ParameterType.Name, "IAccessProfileRepository", StringComparison.Ordinal)) + { + services.AddSingleton(parameter.ParameterType, CreateAccessProfileRepositoryProxy(parameter.ParameterType, accessProfiles)); + continue; + } + + if (parameter.ParameterType.IsGenericType && + parameter.ParameterType.GetGenericTypeDefinition() == typeof(IRepository<>) && + string.Equals(parameter.ParameterType.GenericTypeArguments[0].Name, "AccessProfile", StringComparison.Ordinal)) + { + var repository = CreateGenericAccessProfileRepository(parameter.ParameterType.GenericTypeArguments[0], accessProfiles); + services.AddSingleton(parameter.ParameterType, repository); + continue; + } + + if (parameter.ParameterType.IsInterface || parameter.ParameterType.IsAbstract) + { + services.AddSingleton(parameter.ParameterType, _ => Substitute.For([parameter.ParameterType], [])); + continue; + } + + if (parameter.ParameterType.GetConstructor(Type.EmptyTypes) is not null) + { + services.AddSingleton(parameter.ParameterType, Activator.CreateInstance(parameter.ParameterType)!); + continue; + } + + throw new XunitException($"Unsupported constructor dependency '{parameter.ParameterType.FullName}' for resolver '{implementationType.Name}'."); + } + } + + private static object CreateGenericAccessProfileRepository(Type accessProfileType, IReadOnlyList accessProfiles) + { + var repositoryType = typeof(ReflectionEntityRepository<>).MakeGenericType(accessProfileType); + return Activator.CreateInstance(repositoryType, [accessProfiles]) + ?? throw new XunitException("Failed to construct generic access profile repository."); + } + + private static object CreateAccessProfileRepositoryProxy(Type repositoryInterfaceType, IReadOnlyList accessProfiles) + { + var store = accessProfiles.ToDictionary(profile => GetString(profile, "Id")!, CloneUntyped, StringComparer.Ordinal); + return ReflectionProxy.Create(repositoryInterfaceType, (method, args) => + { + if (string.Equals(method.Name, "GetAllAsync", StringComparison.Ordinal) || + string.Equals(method.Name, "ListAsync", StringComparison.Ordinal)) + { + return CreateTaskResult(method.ReturnType, CreateTypedList(method.ReturnType, store.Values.ToList())); + } + + if (string.Equals(method.Name, "GetAsync", StringComparison.Ordinal) || + string.Equals(method.Name, "FindAsync", StringComparison.Ordinal) || + string.Equals(method.Name, "ResolveAsync", StringComparison.Ordinal) || + string.Equals(method.Name, "GetForScopeAsync", StringComparison.Ordinal)) + { + var match = FindProfile(store.Values, args); + return CreateTaskResult(method.ReturnType, match is null ? null : CloneUntyped(match)); + } + + if (string.Equals(method.Name, "UpsertAsync", StringComparison.Ordinal)) + { + var entity = CloneUntyped(args?[0] ?? throw new XunitException("UpsertAsync requires entity argument.")); + store[GetString(entity, "Id")!] = entity; + return CreateTaskResult(method.ReturnType, CloneUntyped(entity)); + } + + if (string.Equals(method.Name, "DeleteAsync", StringComparison.Ordinal)) + { + var id = args?.OfType().FirstOrDefault(); + var deleted = id is not null && store.Remove(id); + return CreateTaskResult(method.ReturnType, deleted); + } + + throw new NotSupportedException($"Repository proxy does not support method '{method.Name}'."); + }); + } + + private static object? FindProfile(IEnumerable accessProfiles, object?[]? args) + { + var stringArgs = (args ?? []).OfType().ToList(); + if (stringArgs.Count == 0) + { + return null; + } + + if (stringArgs.Count == 1) + { + return accessProfiles.FirstOrDefault(profile => + string.Equals(GetString(profile, "Id"), stringArgs[0], StringComparison.Ordinal)); + } + + var clientAppId = stringArgs[0]; + var tenantId = stringArgs.Count > 1 ? stringArgs[1] : null; + var apiId = stringArgs.Count > 2 ? stringArgs[2] : null; + var operationId = stringArgs.Count > 3 ? stringArgs[3] : null; + + return accessProfiles.FirstOrDefault(profile => + string.Equals(GetString(profile, "ClientAppId"), clientAppId, StringComparison.Ordinal) && + string.Equals(GetString(profile, "TenantId"), tenantId, StringComparison.Ordinal) && + string.Equals(GetString(profile, "ApiId"), apiId, StringComparison.Ordinal) && + string.Equals(GetString(profile, "OperationId") ?? "_all", operationId ?? "_all", StringComparison.Ordinal)); + } + + private static object CreateTaskResult(Type returnType, object? result) + { + if (returnType == typeof(Task)) + { + return Task.CompletedTask; + } + + if (!returnType.IsGenericType || returnType.GetGenericTypeDefinition() != typeof(Task<>)) + { + throw new XunitException($"Unsupported async return type '{returnType.FullName}'."); + } + + var resultType = returnType.GenericTypeArguments[0]; + var taskFromResult = typeof(Task).GetMethods(BindingFlags.Public | BindingFlags.Static) + .Single(method => method.Name == nameof(Task.FromResult) && method.IsGenericMethod) + .MakeGenericMethod(resultType); + return taskFromResult.Invoke(null, [result])!; + } + + private static object CreateTypedList(Type taskReturnType, IReadOnlyList items) + { + var listType = taskReturnType.GenericTypeArguments[0]; + var elementType = listType.IsGenericType ? listType.GenericTypeArguments[0] : RequireType("AccessProfile"); + var typedList = (System.Collections.IList)Activator.CreateInstance(typeof(List<>).MakeGenericType(elementType))!; + foreach (var item in items) + { + typedList.Add(CloneUntyped(item)); + } + + return typedList; + } + + private static object? GetProperty(object instance, string propertyName) + => instance.GetType().GetProperty(propertyName, BindingFlags.Public | BindingFlags.Instance | BindingFlags.IgnoreCase)?.GetValue(instance); + + private static void SetProperty(object instance, string propertyName, object? value) + { + var property = instance.GetType().GetProperty(propertyName, BindingFlags.Public | BindingFlags.Instance | BindingFlags.IgnoreCase) + ?? throw new XunitException($"Property '{propertyName}' was not found on type '{instance.GetType().Name}'."); + property.SetValue(instance, value); + } + + private static object CloneUntyped(object instance) + => JsonSerializer.Deserialize(JsonSerializer.Serialize(instance, instance.GetType(), JsonConfig.Default), instance.GetType(), JsonConfig.Default) + ?? throw new XunitException($"Failed to clone instance of '{instance.GetType().Name}'."); + + internal sealed record ResolvedAccessSnapshot(string PlanId, string? RoutingPolicyId, IReadOnlyList AllowedDeployments, string? SourceProfileId); + + internal sealed class ResolverInvocationTracker + { + public List<(string ClientAppId, string TenantId, string ApiId, string? OperationId)> Calls { get; } = []; + } + + internal sealed class AccessProfileResolverHarness(object resolver, Type interfaceType) : IDisposable + { + private readonly MethodInfo _resolveMethod = interfaceType.GetMethod("ResolveAsync") + ?? throw new XunitException("IAccessProfileResolver.ResolveAsync was not found."); + + public object Instance => resolver; + + public async Task ResolveAsync(string clientAppId, string tenantId, string apiId, string? operationId) + { + var invocation = _resolveMethod.Invoke(resolver, [clientAppId, tenantId, apiId, operationId, CancellationToken.None]) + ?? throw new XunitException("ResolveAsync returned null task."); + await (Task)invocation; + var result = invocation.GetType().GetProperty("Result")?.GetValue(invocation); + if (result is null) + { + return null; + } + + return new ResolvedAccessSnapshot( + GetString(result, "PlanId") ?? throw new XunitException("Resolved access is missing PlanId."), + GetString(result, "RoutingPolicyId"), + GetStrings(result, "AllowedDeployments"), + GetString(result, "SourceProfileId")); + } + + public void Dispose() + { + if (resolver is IDisposable disposable) + { + disposable.Dispose(); + } + } + } + + private class ReflectionProxy : DispatchProxy + { + public Func Handler { get; set; } = null!; + + protected override object? Invoke(MethodInfo? targetMethod, object?[]? args) + => Handler(targetMethod ?? throw new XunitException("Missing target method."), args); + + public static object Create(Type interfaceType, Func handler) + { + var createMethod = typeof(DispatchProxy) + .GetMethods(BindingFlags.Public | BindingFlags.Static) + .Single(method => + string.Equals(method.Name, nameof(DispatchProxy.Create), StringComparison.Ordinal) && + method.IsGenericMethodDefinition && + method.GetGenericArguments().Length == 2 && + method.GetParameters().Length == 0); + var proxy = (ReflectionProxy)createMethod.MakeGenericMethod(interfaceType, typeof(ReflectionProxy)).Invoke(null, null)!; + proxy.Handler = handler; + return proxy; + } + } + + private sealed class InMemoryClientAssignmentRepository(IEnumerable assignments) : IRepository + { + private readonly Dictionary _store = assignments + .ToDictionary(assignment => $"{assignment.ClientAppId}:{assignment.TenantId}", Clone, StringComparer.Ordinal); + + public Task GetAsync(string id, CancellationToken ct = default) + => Task.FromResult(_store.TryGetValue(id, out var value) ? Clone(value) : null); + + public Task> GetAllAsync(CancellationToken ct = default) + => Task.FromResult(_store.Values.Select(Clone).ToList()); + + public Task UpsertAsync(ClientPlanAssignment entity, CancellationToken ct = default) + { + _store[$"{entity.ClientAppId}:{entity.TenantId}"] = Clone(entity); + return Task.FromResult(Clone(entity)); + } + + public Task DeleteAsync(string id, CancellationToken ct = default) + => Task.FromResult(_store.Remove(id)); + + private static ClientPlanAssignment Clone(ClientPlanAssignment entity) + => JsonSerializer.Deserialize(JsonSerializer.Serialize(entity, JsonConfig.Default), JsonConfig.Default)!; + } + + private sealed class ReflectionEntityRepository : IRepository where T : class + { + private readonly Dictionary _store; + + public ReflectionEntityRepository(IEnumerable seed) + { + _store = seed.Cast().ToDictionary(GetId, Clone, StringComparer.Ordinal); + } + + public Task GetAsync(string id, CancellationToken ct = default) + => Task.FromResult(_store.TryGetValue(id, out var entity) ? Clone(entity) : null); + + public Task> GetAllAsync(CancellationToken ct = default) + => Task.FromResult(_store.Values.Select(Clone).ToList()); + + public Task UpsertAsync(T entity, CancellationToken ct = default) + { + _store[GetId(entity)] = Clone(entity); + return Task.FromResult(Clone(entity)); + } + + public Task DeleteAsync(string id, CancellationToken ct = default) + => Task.FromResult(_store.Remove(id)); + + private static string GetId(T entity) + => (string?)typeof(T).GetProperty("Id", BindingFlags.Public | BindingFlags.Instance | BindingFlags.IgnoreCase)?.GetValue(entity) + ?? throw new XunitException($"Type '{typeof(T).Name}' does not expose an Id property."); + + private static T Clone(T entity) + => JsonSerializer.Deserialize(JsonSerializer.Serialize(entity, JsonConfig.Default), JsonConfig.Default)!; + } +} From f6526c1b7a08f192a7d0a6a7568e10d10c097077 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 17:51:52 -0400 Subject: [PATCH 11/14] docs(squad): record AAA M1-M3 completion + M4-M5 kickoff MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Orchestration logs: Freamon (M1-M3 complete, 3d409d24), Bunk (21-test matrix complete, 6c858b96), Sydnor (M4 in-flight), Kima (M5 in-flight) - Session log: M1-M3 shipped, test baseline 295→312 passing, M4-M5 parallel in-flight - Merged inbox decisions into decisions.md (M1-M3 implementation status) - Archived full specs: freamon-aaa-m1-m3-impl.md, bunk-aaa-test-matrix.md - Updated agent histories: Freamon (M1-M3 details), Bunk (test matrix details), Sydnor (M4 scope), Kima (M5 scope) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 31 +++++++++++++ .squad/agents/freamon/history.md | 25 ++++++++++ .squad/agents/kima/history.md | 46 ++++++++++++++++++- .squad/agents/sydnor/history.md | 36 +++++++++++++++ .squad/decisions.md | 16 +++++++ .../bunk-aaa-test-matrix.md | 0 .../decisions/inbox/freamon-aaa-m1-m3-impl.md | 31 ------------- .../2026-05-21T21-48-19Z-aaa-m1-m3-shipped.md | 24 ++++++++++ .../2026-05-21T21-48-19Z-bunk.md | 38 +++++++++++++++ .../2026-05-21T21-48-19Z-freamon.md | 42 +++++++++++++++++ .../2026-05-21T21-48-19Z-kima.md | 33 +++++++++++++ .../2026-05-21T21-48-19Z-sydnor.md | 30 ++++++++++++ 12 files changed, 320 insertions(+), 32 deletions(-) rename .squad/decisions/{inbox => archive}/bunk-aaa-test-matrix.md (100%) delete mode 100644 .squad/decisions/inbox/freamon-aaa-m1-m3-impl.md create mode 100644 .squad/log/2026-05-21T21-48-19Z-aaa-m1-m3-shipped.md create mode 100644 .squad/orchestration-log/2026-05-21T21-48-19Z-bunk.md create mode 100644 .squad/orchestration-log/2026-05-21T21-48-19Z-freamon.md create mode 100644 .squad/orchestration-log/2026-05-21T21-48-19Z-kima.md create mode 100644 .squad/orchestration-log/2026-05-21T21-48-19Z-sydnor.md diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index 0b53a67a..75963185 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -481,3 +481,34 @@ When writing tests for deployed infrastructure: **For Sydnor:** No new Terraform changes expected for AAA work itself — infrastructure is done. M5 template updates are pure APIM policy XML changes (not infrastructure); Sydnor may assist with template version bump and APIM SDK testing if needed. **Context:** Full architecture at `.squad/decisions/archive/mcnulty-aaa-per-client-arch.md` (387 lines) and pre/post contracts at `.squad/decisions/archive/mcnulty-aaa-pre-post-endpoint-contracts.md` (522 lines). Decisions merged to `.squad/decisions.md` entry 2026-05-21T21:28:06Z. + +### 2026-05-21T21:48:19Z — AAA M1-M3 Test Matrix Complete + +**Status:** ✅ COMPLETE + +**Commits:** +- Bunk 21-test matrix: `6c858b96` + +**Delivered:** +- **Resolver unit tests (6):** Operation-specific > API-wide > client-global > legacy fallback cascade levels; disabled profile skip; null edge cases +- **Precheck integration tests (6):** With/without `apiId`/`operationId`; legacy backward compat (no apiId = legacy path); disabled profile handling; AAA response with `allowedDeployments` +- **Log integration tests (4):** PlanId resolution (supplied wins, fallback to legacy assignment); AccessProfileId persistence to audit trail; legacy payload compat +- **End-to-end cascade test (1):** Full precheck → log flow with cascade resolution order +- **Pending M4 template assertions (4 skipped):** Template render (apiId/operationId extraction), precheck URL diffs, log payload diffs, version bump 1.0→1.1 + +**Test Results:** +- Total: 320 +- Succeeded: 312 (↑ +17 from M1-M3 baseline) +- Skipped: 8 (4 pending M4 template assertions, 4 pre-existing) +- Failed: 0 + +**Key Test Decisions:** +- Resolver fallback logic placed inside service (not endpoint) — aligns with approved test matrix +- ResolvedAccessProfile type alias used for contract clarity +- Legacy precheck path preserved without access-profile metadata +- Plan resolution edge case (mismatched PlanId) not asserted (not in approved matrix) + +**Blocked Issues Resolved:** +- Freamon M1-M3 contract firm → all resolver/precheck/log shapes now testable +- Template diffs now visible → Sydnor can proceed with M4 APIM updates +- All assertions passing → M4 and M5 in-flight now unblocked diff --git a/.squad/agents/freamon/history.md b/.squad/agents/freamon/history.md index 5ceb0166..2be2a458 100644 --- a/.squad/agents/freamon/history.md +++ b/.squad/agents/freamon/history.md @@ -81,3 +81,28 @@ For detailed work items, see: - Async apply is better implemented as an in-process channel + `BackgroundService` than ad-hoc `Task.Run` from endpoints. Endpoints persist the desired assignment as `pending`, enqueue a scope work item, and return 202 immediately; the worker flips to `applying`, re-renders from stored parameters, applies through the SDK, computes `generatedXmlHash` on success, and records `failed/errorMessage` on exceptions. Startup replay of `pending`/`applying` items should be best-effort so tests or partial environments do not stop the host. - For Bunk: the APIM seams are now interface-first (`IApimCatalogService`, `ITemplateLibraryService`, `IPolicyAssignmentRepository`, `IApimPolicyApplyService`) and the worker logic is isolated in `ApimPolicyApplyService.ProcessAssignmentAsync`. Unit tests can exercise template rendering and apply orchestration without live Azure; recorded/live APIM coverage should focus on `ApimCatalogService` method mappings and the raw-XML policy format behavior. - ASP.NET Core binds nested options like Apim:ResourceId from environment variables that use double underscores (Apim__ResourceId), not single underscores. When Terraform wires Container App settings for ApimManagementOptions.ResourceId, use the double-underscore form or the API will see an empty resource ID at runtime (src/AIPolicyEngine.Api/Services/ApimManagement/ApimManagementOptions.cs, infra/terraform/modules/compute/main.tf). + +## 2026-05-21T21:48:19Z — AAA M1-M3 Backend Implementation Complete + +**Status:** ✅ COMPLETE + +**Commits:** +- Freamon M1-M3: `3d409d24` + +**Delivered:** +- **M1:** AccessProfile model (Cosmos `configuration` container, partition key `"access-profile"`, deterministic IDs `ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}`) +- **M2:** IAccessProfileResolver service + cascade logic (operation > api > global > legacy fallback); admin CRUD endpoints `/api/access-profiles/*`, bulk assign +- **M3:** Precheck endpoint — optional `apiId`/`operationId` params, extended response fields `planId`/`accessProfileId`/`allowedDeployments`, backward-compat fallback (no `apiId` = legacy path) +- **M3:** Log-ingest endpoint — accept + persist `accessProfileId`, `planId`, `apiId`, `operationId` to audit trail; plan inheritance for routing/deployment policy + +**Key Decisions Validated:** +- Client metering stays on ClientPlanAssignment (quota/rate-limit state); Access Profiles layer above resolves policy +- Legacy callers work unchanged (preserved backward compat) +- Plan override semantics: profile wins if populated; otherwise plan defaults + +**Test Validation:** +- ✅ `dotnet build src\AIPolicyEngine.Api\AIPolicyEngine.Api.csproj --nologo` +- ✅ `dotnet test src\AIPolicyEngine.Tests\AIPolicyEngine.Tests.csproj --no-restore --nologo` → 311 succeeded, 8 skipped (0 failed) + +**Blocked Issue Resolved (Bunk coordination):** +- Test matrix depended on M1-M3 endpoint contracts and audit trail shapes; now ready for Bunk's 21-test assertions and Sydnor's M4 template updates diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index e0b8c94a..9647567c 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -90,4 +90,48 @@ Freamon fixed a config-binding bug in the APIM infrastructure: the env var `APIM **Validation:** Contract awaits M3 precheck integration (apiId/operationId handling) and M4 log-ingest (audit trail). -**Next:** Start after M2 API contracts firm (2-3 days out). \ No newline at end of file +**Next:** Start after M2 API contracts firm (2-3 days out). + +### 2026-05-21T21:48:19Z — AAA M5 UI (`/access` Page) In-Flight + +**Status:** 🔄 IN-FLIGHT + +**Scope:** +Build the `/access` admin page for Access Profile management (new per-client, per-API authorization overrides). + +**Layout & Components:** +- **Top Section:** Client selector (dropdown/search from `/api/clients`) +- **Main Grid:** API rows with columns: + - Plan (read-only, shows from profile or inherited) + - Routing Policy (select or null for inherit) + - Deployments Allowed (multi-select or null for inherit) + - Enable/Disable toggle +- **Drill-down:** Click API row to reveal operations table with per-operation overrides (same column structure) +- **Add/Edit Form:** Modal with Plan selector, optional Routing Policy, optional deployment restrictions, submit/cancel +- **Bulk Action:** Select multiple APIs from checklist, apply the same profile to all in one shot + +**Reuse Existing Components:** +- Plan selector dropdown (`.squad/skills/` available) +- Routing Policy selector (already built for Plans page) +- Tailwind flex+truncate pattern for row overflow (`.squad/skills/tailwind-flex-truncate-pattern/SKILL.md`) +- React render-loop debugging skill (`.squad/skills/react-render-loop-debugging/SKILL.md` — avoid Apis.tsx pattern) + +**API Contract:** +- GET `/api/access-profiles` — list profiles for a client +- POST/PUT `/api/access-profiles/{id}` — create/update profile +- DELETE `/api/access-profiles/{id}` — delete profile +- Bulk endpoint (TBD via Freamon M2 spec) + +**Blockers Now Cleared:** +- ✅ Freamon M1-M3: Precheck contract finalized (apiId/operationId path ready) +- ✅ Bunk: 21-test matrix validates response shapes + +**Parallel to M4:** +- Sydnor's M4 template updates do not block UI work +- API endpoint contracts already firm + +**Next Steps:** +- Start component structure (ClientSelector, ApiGrid, OperationGrid, ProfileForm) +- Implement data fetch + caching patterns (parallel to API updates) +- Polish flex+truncate styling for API/operation rows +- Mock data for component testing before API integration \ No newline at end of file diff --git a/.squad/agents/sydnor/history.md b/.squad/agents/sydnor/history.md index b865d4c0..f42dd59c 100644 --- a/.squad/agents/sydnor/history.md +++ b/.squad/agents/sydnor/history.md @@ -309,3 +309,39 @@ Freamon discovered and fixed a config-binding bug in the APIM_RESOURCE_ID wiring **Decision Merged into `.squad/decisions.md`:** All future APIM Terraform wiring must use `Apim__ResourceId` when populating the nested config key. Freamon audited 200+ env vars and found no other mismatches. +### 2026-05-21T21:48:19Z — AAA M4 APIM Template Updates In-Flight + +**Status:** 🔄 IN-FLIGHT + +**Scope:** +Update all 5 APIM policy templates (version 1.0 → 1.1): +- `policies/entra-jwt-ai/policy.xml` +- `policies/entra-jwt-ai-dlp/policy.xml` +- `policies/subscription-key-ai/policy.xml` +- `policies/subscription-key-ai-dlp/policy.xml` +- `policies/entra-jwt-rest/policy.xml` (log-ingest only) + +**Changes per template:** +- Add APIM `set-variable` blocks to extract `apiId` and `operationId` from request context +- Extend precheck URL with `&apiId={apiId}&operationId={operationId}` query params +- Extract resolved `accessProfileId`, `planId`, `allowedDeployments` from precheck response +- Add those fields to outbound log payload (alongside existing `correlationId`, `clientAppId`, `requestCost`) +- Update template manifest version: `1.0` → `1.1` + +**Blockers Now Cleared:** +- ✅ Freamon M1-M3: Precheck/log endpoints ready with apiId/operationId param support +- ✅ Bunk: 21-test matrix complete (4 pending M4 template assertions documented) + +**Test Coverage (Bunk Pending M4):** +- Template extracts `apiId` from request context +- Precheck URL carries new params +- Outbound log payload carries AccessProfileId, PlanId, ApiId, OperationId +- Template manifest version bumped to `1.1` + +**Next Steps:** +- Implement template XML diffs +- Run Bunk's 4 pending template assertions +- Coordinate with APIM deployment/staging validation +- Parallel to Kima's M5 UI work + + diff --git a/.squad/decisions.md b/.squad/decisions.md index 26edeac2..c41b1f0b 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -12,6 +12,22 @@ ## Active Decisions +### 2026-05-21T21:48:19Z: Implementation status — AAA M1-M3 backend complete, M4-M5 parallel in-flight +**By:** Scribe (logged from orchestration) +**Status:** In-Flight +**What:** +- **M1-M3 COMPLETE (Freamon):** AccessProfile model, Cosmos repo, IAccessProfileResolver cascade, CRUD endpoints, precheck integration, log-ingest integration. Commit `3d409d24`. +- **21-test matrix COMPLETE (Bunk):** 17 passing + 4 pending M4 template assertions. Total test baseline: 320 (312 pass, 8 skip). Commit `6c858b96`. +- **M4 IN-FLIGHT (Sydnor):** APIM template updates (5 templates, version 1.0→1.1, apiId/operationId variables, precheck URL extension, log payload updates). +- **M5 IN-FLIGHT (Kima):** `/access` admin page (client selector, API grid, per-operation drill-down, assign form). + +**Validation:** +- ✅ Freamon: `dotnet build` + `dotnet test` (311 pass, 8 skip) +- ✅ Bunk: 21-test matrix (17 pass, 4 pending M4) +- Sydnor/Kima: Parallel to M3 completion; M4 blockers lifted, M5 API ready + +**Why:** Track M1-M3 delivery and verify M4/M5 can proceed without dependency deadlock. + ### 2026-05-21T21:28:06Z: User directive — AAA access-profile architecture approved (M1-M6) **By:** Zack Way (via McNulty proposal review) **Status:** Approved diff --git a/.squad/decisions/inbox/bunk-aaa-test-matrix.md b/.squad/decisions/archive/bunk-aaa-test-matrix.md similarity index 100% rename from .squad/decisions/inbox/bunk-aaa-test-matrix.md rename to .squad/decisions/archive/bunk-aaa-test-matrix.md diff --git a/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md b/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md deleted file mode 100644 index 9f1a4de6..00000000 --- a/.squad/decisions/inbox/freamon-aaa-m1-m3-impl.md +++ /dev/null @@ -1,31 +0,0 @@ -# Freamon AAA Access Profiles M1-M3 implementation - -## Scope delivered -- Added `AccessProfile` data model and Cosmos-backed repository on the shared `configuration` container with partition key `access-profile` and deterministic IDs `ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}`. -- Added `IAccessProfileResolver` + resolver cascade for operation-specific, API-wide, client-global, then legacy client assignment fallback. -- Added admin CRUD and bulk endpoints under `/api/access-profiles` guarded by `AdminPolicy`. -- Integrated access-profile-aware precheck behavior with optional `apiId`, `operationId`, and additive response fields `planId`, `accessProfileId`, and `allowedDeployments`. -- Extended log-ingest contracts to accept optional `accessProfileId`, `planId`, `apiId`, and `operationId`, and persisted those fields to the audit stream. - -## Key implementation decisions -- Kept client metering on `ClientPlanAssignment`; access profiles resolve plan/routing/deployment policy, but precheck still requires a client assignment for quota/rate-limit state so log-ingest and precheck stay aligned. -- Preserved legacy precheck callers when `apiId` is absent, while still surfacing `planId` for additive contract compatibility. -- Used plan inheritance semantics for `routingPolicyId` and `allowedDeployments`: access-profile overrides win when populated; otherwise plan defaults apply. -- Did not touch APIM template XML or UI code. - -## Main files -- `src/AIPolicyEngine.Api/Models/AccessProfile*.cs` -- `src/AIPolicyEngine.Api/Services/AccessProfiles/*` -- `src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs` -- `src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs` -- `src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs` -- `src/AIPolicyEngine.Api/Models/LogIngestRequest.cs` -- `src/AIPolicyEngine.Api/Models/AuditLogItem.cs` -- `src/AIPolicyEngine.Api/Models/AuditLogDocument.cs` -- `src/AIPolicyEngine.Api/Services/AuditLogWriter.cs` -- `src/AIPolicyEngine.Api/Program.cs` - -## Validation -- `dotnet build src\AIPolicyEngine.Api\AIPolicyEngine.Api.csproj --nologo` -- `dotnet test src\AIPolicyEngine.Tests\AIPolicyEngine.Tests.csproj --no-restore --nologo` -- Test run passed locally: 311 succeeded, 8 skipped. diff --git a/.squad/log/2026-05-21T21-48-19Z-aaa-m1-m3-shipped.md b/.squad/log/2026-05-21T21-48-19Z-aaa-m1-m3-shipped.md new file mode 100644 index 00000000..d2e978fb --- /dev/null +++ b/.squad/log/2026-05-21T21-48-19Z-aaa-m1-m3-shipped.md @@ -0,0 +1,24 @@ +# Session: AAA M1-M3 Shipped +**Date:** 2026-05-21T21:48:19Z +**Participants:** Freamon (M1-M3 backend), Bunk (21-test matrix), Sydnor + Kima (M4/M5 in flight) +**Branch:** seiggy/feature/apim-policy-management + +## Summary +M1-M3 backend implementation shipped. Access Profiles resolver, CRUD endpoints, and precheck/log integration complete. Test suite: 312 pass / 8 skip (baseline 295 → +17 active tests, 4 pending M4). + +M4 (Sydnor: template version bump + diffs) and M5 (Kima: `/access` admin UI) now in-flight in parallel. + +## Commits +- Freamon M1-M3: `3d409d24` +- Bunk 21-test matrix: `6c858b96` + +## Test Status +- Total: 320 +- Succeeded: 312 +- Skipped: 8 (4 pending M4 template assertions, 4 pre-existing) +- Failed: 0 + +## Next +- M4 (Sydnor): APIM template updates + test assertions +- M5 (Kima): `/access` page admin UI +- Parallel delivery; M4 + M5 can overlap with M3 completion diff --git a/.squad/orchestration-log/2026-05-21T21-48-19Z-bunk.md b/.squad/orchestration-log/2026-05-21T21-48-19Z-bunk.md new file mode 100644 index 00000000..372d06c2 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-48-19Z-bunk.md @@ -0,0 +1,38 @@ +# Orchestration: Bunk @ 2026-05-21T21:48:19Z + +## Agent +**Bunk** — Test coverage / AAA test matrix + +## Status +✅ COMPLETE + +## Scope Delivered +21-test AAA Access Profile matrix (17 active passing, 4 pending M4): + +- **Resolver unit tests:** 6 (cascade levels, disabled-profile skip) +- **Precheck integration tests:** 6 (with/without apiId, legacy backward compat) +- **Log integration tests:** 4 (PlanId resolution, audit trail) +- **End-to-end cascade test:** 1 (full flow) +- **Pending M4 template assertions:** 4 skipped (template render, version bump) + +## Files Created +- `src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileTestSupport.cs` +- `src/AIPolicyEngine.Tests/Services/AccessProfiles/AccessProfileResolverTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfileLogTests.cs` +- `src/AIPolicyEngine.Tests/Integration/AccessProfileCascadeE2ETests.cs` + +## Test Results +- Total: 320 (312 succeeded, 8 skipped) +- AAA total: 21 (17 succeeded, 4 skipped pending M4, 0 failed) + +## Notes +- Resolver fallback logic placed inside service (vs endpoint), per approved test matrix +- Added `ResolvedAccessProfile` type alias for contract clarity +- Legacy precheck path preserved for backward compat (no `apiId` = skip resolver) + +## Commit +`6c858b96` + +## Branch +`seiggy/feature/apim-policy-management` diff --git a/.squad/orchestration-log/2026-05-21T21-48-19Z-freamon.md b/.squad/orchestration-log/2026-05-21T21-48-19Z-freamon.md new file mode 100644 index 00000000..767ddf76 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-48-19Z-freamon.md @@ -0,0 +1,42 @@ +# Orchestration: Freamon @ 2026-05-21T21:48:19Z + +## Agent +**Freamon** — Backend / AAA M1-M3 implementation + +## Status +✅ COMPLETE + +## Scope Delivered +- **M1:** AccessProfile model + Cosmos repository (partition key `access-profile`, deterministic IDs) +- **M2:** IAccessProfileResolver service with cascade logic (operation > API > client-global > legacy fallback) +- **M3:** Admin CRUD endpoints `/api/access-profiles/*` + bulk assign +- **M3:** Precheck integration — optional `apiId`/`operationId` params, extended response (planId, accessProfileId, allowedDeployments) +- **M3:** Log-ingest integration — AccessProfileId, PlanId, ApiId, OperationId flow through to audit trail + +## Key Decisions +- Client metering stays on ClientPlanAssignment (quota/rate-limit state) +- Access Profiles resolve plan/routing/deployment policy +- Legacy precheck callers work unchanged (no `apiId` = skip resolver, use legacy path) +- Plan inheritance: profile overrides win; otherwise plan defaults apply + +## Files Modified +- `src/AIPolicyEngine.Api/Models/AccessProfile*.cs` +- `src/AIPolicyEngine.Api/Services/AccessProfiles/*` +- `src/AIPolicyEngine.Api/Endpoints/AccessProfileEndpoints.cs` +- `src/AIPolicyEngine.Api/Endpoints/PrecheckEndpoints.cs` +- `src/AIPolicyEngine.Api/Endpoints/LogIngestEndpoints.cs` +- `src/AIPolicyEngine.Api/Models/LogIngestRequest.cs` +- `src/AIPolicyEngine.Api/Models/AuditLogItem.cs` +- `src/AIPolicyEngine.Api/Models/AuditLogDocument.cs` +- `src/AIPolicyEngine.Api/Services/AuditLogWriter.cs` +- `src/AIPolicyEngine.Api/Program.cs` + +## Validation +- ✅ `dotnet build` passed +- ✅ `dotnet test` passed (311 succeeded, 8 skipped) + +## Commit +`3d409d24` + +## Branch +`seiggy/feature/apim-policy-management` diff --git a/.squad/orchestration-log/2026-05-21T21-48-19Z-kima.md b/.squad/orchestration-log/2026-05-21T21-48-19Z-kima.md new file mode 100644 index 00000000..41337c53 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-48-19Z-kima.md @@ -0,0 +1,33 @@ +# Orchestration: Kima @ 2026-05-21T21:48:19Z + +## Agent +**Kima** — UI / AAA Management Page (M5) + +## Status +🔄 IN-FLIGHT + +## Scope +M5: Build `/access` page admin UI for Access Profiles + +- Client selector (dropdown/search from existing `/api/clients`) +- API grid (rows) with columns: Plan, Routing Policy, Deployments allowed, Enable toggle +- Drill-down: Click API row to expand operations with per-operation overrides +- Add/Edit form: Select Plan, optionally select Routing Policy, optionally restrict deployments +- Bulk action: Apply same profile to multiple APIs at once + +## Reuse +- Plan selector dropdown (existing) +- Routing Policy selector (existing) +- Tailwind flex+truncate pattern (skill available: `.squad/skills/tailwind-flex-truncate-pattern/SKILL.md`) +- React render-loop debugging (skill available: `.squad/skills/react-render-loop-debugging/SKILL.md`) + +## API Contract +- Uses Freamon's `/api/access-profiles/*` endpoints (M2 CRUD) +- Response shapes from M3 precheck + M4 log integration + +## Blockers +- Freamon M1-M3 complete ✅ +- Bunk tests ready ✅ + +## Parallel with +- Sydnor's M4 template work diff --git a/.squad/orchestration-log/2026-05-21T21-48-19Z-sydnor.md b/.squad/orchestration-log/2026-05-21T21-48-19Z-sydnor.md new file mode 100644 index 00000000..822b0db6 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T21-48-19Z-sydnor.md @@ -0,0 +1,30 @@ +# Orchestration: Sydnor @ 2026-05-21T21:48:19Z + +## Agent +**Sydnor** — Infra / APIM templates (M4) + +## Status +🔄 IN-FLIGHT + +## Scope +M4: Update all 5 APIM policy templates (version 1.0 → 1.1) +- Add `apiId` and `operationId` variable extraction +- Extend precheck URL with new query params +- Extract profile metadata from precheck response +- Add AccessProfileId, PlanId, ApiId, OperationId to log payload + +## Templates +- `policies/entra-jwt-ai/policy.xml` +- `policies/entra-jwt-ai-dlp/policy.xml` +- `policies/subscription-key-ai/policy.xml` +- `policies/subscription-key-ai-dlp/policy.xml` +- `policies/entra-jwt-rest/policy.xml` (log-ingest only) + +## Blockers +- Freamon M1-M3 complete (precheck/log endpoints ready) +- Bunk tests ready (4 template assertions pending) + +## Next Steps +- Test template rendering diffs with Bunk's 4 pending assertions +- Prepare for template deployment +- Parallel with Kima's M5 UI work From 24de42b519064aa48213909b63d97be02650a69b Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 17:53:40 -0400 Subject: [PATCH 12/14] Add AAA M4 APIM template metadata wiring Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../inbox/sydnor-aaa-m4-templates.md | 34 ++++++++ .../templates/entra-jwt-ai-dlp/policy.xml | 12 ++- .../templates/entra-jwt-ai-dlp/template.json | 2 +- policies/templates/entra-jwt-ai/policy.xml | 12 ++- policies/templates/entra-jwt-ai/template.json | 2 +- policies/templates/entra-jwt-rest/policy.xml | 10 ++- .../templates/entra-jwt-rest/template.json | 2 +- .../subscription-key-ai-dlp/policy.xml | 12 ++- .../subscription-key-ai-dlp/template.json | 2 +- .../templates/subscription-key-ai/policy.xml | 12 ++- .../subscription-key-ai/template.json | 2 +- .../Integration/AccessProfilePrecheckTests.cs | 83 +++++++++++++++++-- 12 files changed, 167 insertions(+), 18 deletions(-) create mode 100644 .squad/decisions/inbox/sydnor-aaa-m4-templates.md diff --git a/.squad/decisions/inbox/sydnor-aaa-m4-templates.md b/.squad/decisions/inbox/sydnor-aaa-m4-templates.md new file mode 100644 index 00000000..eef0a17e --- /dev/null +++ b/.squad/decisions/inbox/sydnor-aaa-m4-templates.md @@ -0,0 +1,34 @@ +# Sydnor AAA M4 — APIM template XML updates + +## Scope completed +- Updated all 5 shipped APIM templates under `policies/templates/*/policy.xml` for AAA access-profile metadata propagation. +- Added `apiIdValue` / `operationIdValue` capture to every template. +- Updated the 4 AI templates (`entra-jwt-ai`, `entra-jwt-ai-dlp`, `subscription-key-ai`, `subscription-key-ai-dlp`) to append `apiId` + `operationId` to precheck calls and extract `accessProfileId` + `planId` from the precheck response. +- Updated outbound log payloads to carry `accessProfileId`, `planId`, `apiId`, and `operationId` using the existing lower-camel JSON payload convention. +- Updated `entra-jwt-rest` per McNulty’s matrix: active outbound log payload now carries the new AAA fields; the commented `precheck-rest` alternative also shows the api/operation query params plus response-field extraction for future activation. +- Bumped all 5 template manifests from version `1.0` to `1.1`. +- Activated Bunk’s 4 pending M4 tests in `AccessProfilePrecheckTests` and replaced the placeholders with concrete assertions against the shipped template files. + +## Contract alignment notes +- Precheck response field names were taken from Freamon’s shipped contract: `planId`, `accessProfileId`, `allowedDeployments`. +- Outbound log payload additions use lower-camel JSON property names (`accessProfileId`, `planId`, `apiId`, `operationId`) to match the existing APIM payload style and the endpoint contract. +- Used local APIM variable names `apiIdValue` / `operationIdValue` to avoid ambiguity with JSON property names while still emitting `apiId` / `operationId` on the wire. +- Kept AI-template response extraction variable as `resolvedPlanId` so the log payload can cleanly map to `planId` and the audit model continues to distinguish resolved-plan metadata from legacy assignment fallback. + +## Validation +- `dotnet test src\AIPolicyEngine.Tests\AIPolicyEngine.Tests.csproj --no-restore --nologo` +- Result: **320 total / 316 passed / 0 failed / 4 skipped** +- Remaining skips are the pre-existing Purview seam tests; the 4 prior M4 template skips are now active and passing. + +## Files changed +- `policies/templates/entra-jwt-ai/policy.xml` +- `policies/templates/entra-jwt-ai/template.json` +- `policies/templates/entra-jwt-ai-dlp/policy.xml` +- `policies/templates/entra-jwt-ai-dlp/template.json` +- `policies/templates/subscription-key-ai/policy.xml` +- `policies/templates/subscription-key-ai/template.json` +- `policies/templates/subscription-key-ai-dlp/policy.xml` +- `policies/templates/subscription-key-ai-dlp/template.json` +- `policies/templates/entra-jwt-rest/policy.xml` +- `policies/templates/entra-jwt-rest/template.json` +- `src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs` diff --git a/policies/templates/entra-jwt-ai-dlp/policy.xml b/policies/templates/entra-jwt-ai-dlp/policy.xml index f39b770b..bfeadbac 100644 --- a/policies/templates/entra-jwt-ai-dlp/policy.xml +++ b/policies/templates/entra-jwt-ai-dlp/policy.xml @@ -60,12 +60,14 @@ if (segments.Length > 1) { return segments[segments.Length - 1]; } return "unknown"; }" /> + + - @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + @($"{(string)context.Variables["containerAppBaseUrl"]}/api/precheck/{(string)context.Variables["clientAppId"]}/{(string)context.Variables["tenantId"]}?deploymentId={(string)context.Variables["deploymentId"]}&apiId={(string)context.Variables["apiIdValue"]}&operationId={(string)context.Variables["operationIdValue"]}") GET @("Bearer " + (string)context.Variables["msi-access-token"]) @@ -165,6 +167,10 @@ value="@(((IResponse)context.Variables["precheckResponse"]).Body.As(preserveContent: true)["routedDeployment"]?.ToString())" /> (preserveContent: true)["routingPolicyId"]?.ToString())" /> + (preserveContent: true)["accessProfileId"]?.ToString())" /> + (preserveContent: true)["planId"]?.ToString())" /> @@ -275,6 +281,10 @@ payload.Add(new JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); payload.Add(new JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); payload.Add(new JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); + payload.Add(new JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); + payload.Add(new JProperty("apiId", context.Variables.GetValueOrDefault("apiIdValue") ?? "")); + payload.Add(new JProperty("operationId", context.Variables.GetValueOrDefault("operationIdValue") ?? "")); payload.Add(new JProperty("correlationId", context.RequestId.ToString())); return payload.ToString(); } diff --git a/policies/templates/entra-jwt-ai-dlp/template.json b/policies/templates/entra-jwt-ai-dlp/template.json index 5fcab9cb..035f42a8 100644 --- a/policies/templates/entra-jwt-ai-dlp/template.json +++ b/policies/templates/entra-jwt-ai-dlp/template.json @@ -1,7 +1,7 @@ { "id": "entra-jwt-ai-dlp", "displayName": "Entra JWT — AI + DLP", - "version": "1.0", + "version": "1.1", "scope": "api", "parameters": [ { diff --git a/policies/templates/entra-jwt-ai/policy.xml b/policies/templates/entra-jwt-ai/policy.xml index 1f738a65..6951e332 100644 --- a/policies/templates/entra-jwt-ai/policy.xml +++ b/policies/templates/entra-jwt-ai/policy.xml @@ -57,12 +57,14 @@ if (segments.Length > 1) { return segments[segments.Length - 1]; } return "unknown"; }" /> + + - @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + @($"{(string)context.Variables["containerAppBaseUrl"]}/api/precheck/{(string)context.Variables["clientAppId"]}/{(string)context.Variables["tenantId"]}?deploymentId={(string)context.Variables["deploymentId"]}&apiId={(string)context.Variables["apiIdValue"]}&operationId={(string)context.Variables["operationIdValue"]}") GET @("Bearer " + (string)context.Variables["msi-access-token"]) @@ -122,6 +124,10 @@ value="@(((IResponse)context.Variables["precheckResponse"]).Body.As(preserveContent: true)["routedDeployment"]?.ToString())" /> (preserveContent: true)["routingPolicyId"]?.ToString())" /> + (preserveContent: true)["accessProfileId"]?.ToString())" /> + (preserveContent: true)["planId"]?.ToString())" /> @@ -232,6 +238,10 @@ payload.Add(new Newtonsoft.Json.Linq.JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("apiId", context.Variables.GetValueOrDefault("apiIdValue") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("operationId", context.Variables.GetValueOrDefault("operationIdValue") ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); return payload.ToString(); } diff --git a/policies/templates/entra-jwt-ai/template.json b/policies/templates/entra-jwt-ai/template.json index b300ffb6..5b629500 100644 --- a/policies/templates/entra-jwt-ai/template.json +++ b/policies/templates/entra-jwt-ai/template.json @@ -1,7 +1,7 @@ { "id": "entra-jwt-ai", "displayName": "Entra JWT — AI", - "version": "1.0", + "version": "1.1", "scope": "api", "parameters": [ { diff --git a/policies/templates/entra-jwt-rest/policy.xml b/policies/templates/entra-jwt-rest/policy.xml index 40334a83..42ab2c88 100644 --- a/policies/templates/entra-jwt-rest/policy.xml +++ b/policies/templates/entra-jwt-rest/policy.xml @@ -47,6 +47,8 @@ return !string.IsNullOrEmpty(azp) ? azp : jwt?.Claims.GetValueOrDefault("appid",""); }" /> + + @@ -65,12 +67,14 @@ 4. Keep outbound /api/log-rest so dashboards and audit stay consistent. - @("{{ContainerAppUrl}}/api/precheck-rest/" + (string)context.Variables["clientAppId"] + "?tenantId=" + (string)context.Variables["tenantId"]) + @($"{{ContainerAppUrl}}/api/precheck-rest/{(string)context.Variables["clientAppId"]}?tenantId={(string)context.Variables["tenantId"]}&apiId={(string)context.Variables["apiIdValue"]}&operationId={(string)context.Variables["operationIdValue"]}") GET @("Bearer " + (string)context.Variables["msi-access-token"]) + (preserveContent: true)["accessProfileId"]?.ToString())" /> + (preserveContent: true)["planId"]?.ToString())" /> @@ -113,6 +117,10 @@ payload.Add(new Newtonsoft.Json.Linq.JProperty("requestPath", context.Request.Url.Path)); payload.Add(new Newtonsoft.Json.Linq.JProperty("statusCode", context.Response.StatusCode)); payload.Add(new Newtonsoft.Json.Linq.JProperty("latencyMs", latencyMs)); + payload.Add(new Newtonsoft.Json.Linq.JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("apiId", context.Variables.GetValueOrDefault("apiIdValue") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("operationId", context.Variables.GetValueOrDefault("operationIdValue") ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); return payload.ToString(); } diff --git a/policies/templates/entra-jwt-rest/template.json b/policies/templates/entra-jwt-rest/template.json index c7be9c69..a992f9f3 100644 --- a/policies/templates/entra-jwt-rest/template.json +++ b/policies/templates/entra-jwt-rest/template.json @@ -1,7 +1,7 @@ { "id": "entra-jwt-rest", "displayName": "Entra JWT — Non-AI REST (Rate Limit + Quota)", - "version": "1.0", + "version": "1.1", "scope": "api", "parameters": [ { diff --git a/policies/templates/subscription-key-ai-dlp/policy.xml b/policies/templates/subscription-key-ai-dlp/policy.xml index 6b294826..5527fc10 100644 --- a/policies/templates/subscription-key-ai-dlp/policy.xml +++ b/policies/templates/subscription-key-ai-dlp/policy.xml @@ -47,12 +47,14 @@ if (segments.Length > 1) { return segments[segments.Length - 1]; } return "unknown"; }" /> + + - @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + @($"{(string)context.Variables["containerAppBaseUrl"]}/api/precheck/{(string)context.Variables["clientAppId"]}/{(string)context.Variables["tenantId"]}?deploymentId={(string)context.Variables["deploymentId"]}&apiId={(string)context.Variables["apiIdValue"]}&operationId={(string)context.Variables["operationIdValue"]}") GET @("Bearer " + (string)context.Variables["msi-access-token"]) @@ -112,6 +114,10 @@ value="@(((IResponse)context.Variables["precheckResponse"]).Body.As(preserveContent: true)["routedDeployment"]?.ToString())" /> (preserveContent: true)["routingPolicyId"]?.ToString())" /> + (preserveContent: true)["accessProfileId"]?.ToString())" /> + (preserveContent: true)["planId"]?.ToString())" /> @@ -262,6 +268,10 @@ payload.Add(new JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); payload.Add(new JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); payload.Add(new JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); + payload.Add(new JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); + payload.Add(new JProperty("apiId", context.Variables.GetValueOrDefault("apiIdValue") ?? "")); + payload.Add(new JProperty("operationId", context.Variables.GetValueOrDefault("operationIdValue") ?? "")); payload.Add(new JProperty("correlationId", context.RequestId.ToString())); return payload.ToString(); } diff --git a/policies/templates/subscription-key-ai-dlp/template.json b/policies/templates/subscription-key-ai-dlp/template.json index b11ad9a1..43aa0891 100644 --- a/policies/templates/subscription-key-ai-dlp/template.json +++ b/policies/templates/subscription-key-ai-dlp/template.json @@ -1,7 +1,7 @@ { "id": "subscription-key-ai-dlp", "displayName": "Subscription Key — AI + DLP", - "version": "1.0", + "version": "1.1", "scope": "api", "parameters": [ { diff --git a/policies/templates/subscription-key-ai/policy.xml b/policies/templates/subscription-key-ai/policy.xml index c0c02781..d3fecce4 100644 --- a/policies/templates/subscription-key-ai/policy.xml +++ b/policies/templates/subscription-key-ai/policy.xml @@ -44,12 +44,14 @@ if (segments.Length > 1) { return segments[segments.Length - 1]; } return "unknown"; }" /> + + - @((string)context.Variables["containerAppBaseUrl"] + "/api/precheck/" + (string)context.Variables["clientAppId"] + "/" + (string)context.Variables["tenantId"] + "?deploymentId=" + (string)context.Variables["deploymentId"]) + @($"{(string)context.Variables["containerAppBaseUrl"]}/api/precheck/{(string)context.Variables["clientAppId"]}/{(string)context.Variables["tenantId"]}?deploymentId={(string)context.Variables["deploymentId"]}&apiId={(string)context.Variables["apiIdValue"]}&operationId={(string)context.Variables["operationIdValue"]}") GET @("Bearer " + (string)context.Variables["msi-access-token"]) @@ -109,6 +111,10 @@ value="@(((IResponse)context.Variables["precheckResponse"]).Body.As(preserveContent: true)["routedDeployment"]?.ToString())" /> (preserveContent: true)["routingPolicyId"]?.ToString())" /> + (preserveContent: true)["accessProfileId"]?.ToString())" /> + (preserveContent: true)["planId"]?.ToString())" /> @@ -219,6 +225,10 @@ payload.Add(new Newtonsoft.Json.Linq.JProperty("requestedDeploymentId", requestedDeploymentId ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("routedDeployment", context.Variables.GetValueOrDefault("routedDeployment") ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("routingPolicyId", context.Variables.GetValueOrDefault("routingPolicyId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("accessProfileId", context.Variables.GetValueOrDefault("accessProfileId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("planId", context.Variables.GetValueOrDefault("resolvedPlanId") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("apiId", context.Variables.GetValueOrDefault("apiIdValue") ?? "")); + payload.Add(new Newtonsoft.Json.Linq.JProperty("operationId", context.Variables.GetValueOrDefault("operationIdValue") ?? "")); payload.Add(new Newtonsoft.Json.Linq.JProperty("correlationId", context.RequestId.ToString())); return payload.ToString(); } diff --git a/policies/templates/subscription-key-ai/template.json b/policies/templates/subscription-key-ai/template.json index 3d51f764..ed64b802 100644 --- a/policies/templates/subscription-key-ai/template.json +++ b/policies/templates/subscription-key-ai/template.json @@ -1,7 +1,7 @@ { "id": "subscription-key-ai", "displayName": "Subscription Key — AI", - "version": "1.0", + "version": "1.1", "scope": "api", "parameters": [ { diff --git a/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs b/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs index 5db92e32..d5ae9d31 100644 --- a/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs +++ b/src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs @@ -13,6 +13,23 @@ public sealed class AccessProfilePrecheckTests : IClassFixture extraction")] - public void TemplateRendering_Pending_ApiIdVariableExtraction() + [Fact] + public void TemplateRendering_ApiIdVariableExtraction() { + foreach (var templateId in ShippedTemplateIds) + { + var policyXml = ReadTemplatePolicy(templateId); + Assert.Contains("", policyXml, StringComparison.Ordinal); + Assert.Contains("", policyXml, StringComparison.Ordinal); + } } - [Fact(Skip = "[Pending: M4] all templates must include apiId and operationId on the precheck URL")] - public void TemplateRendering_Pending_PrecheckUrlCarriesApiAndOperation() + [Fact] + public void TemplateRendering_PrecheckUrlCarriesApiAndOperation() { + foreach (var templateId in PrecheckTemplateIds) + { + var policyXml = ReadTemplatePolicy(templateId); + Assert.Contains("&apiId={(string)context.Variables[\"apiIdValue\"]}&operationId={(string)context.Variables[\"operationIdValue\"]}", policyXml, StringComparison.Ordinal); + } } - [Fact(Skip = "[Pending: M4] all templates must include accessProfileId/planId/apiId/operationId in outbound log payload")] - public void TemplateRendering_Pending_LogPayloadCarriesAccessProfileMetadata() + [Fact] + public void TemplateRendering_LogPayloadCarriesAccessProfileMetadata() { + foreach (var templateId in ShippedTemplateIds) + { + var policyXml = ReadTemplatePolicy(templateId); + Assert.Contains("JProperty(\"accessProfileId\"", policyXml, StringComparison.Ordinal); + Assert.Contains("GetValueOrDefault(\"accessProfileId\")", policyXml, StringComparison.Ordinal); + Assert.Contains("JProperty(\"planId\"", policyXml, StringComparison.Ordinal); + Assert.Contains("GetValueOrDefault(\"resolvedPlanId\")", policyXml, StringComparison.Ordinal); + Assert.Contains("JProperty(\"apiId\"", policyXml, StringComparison.Ordinal); + Assert.Contains("GetValueOrDefault(\"apiIdValue\")", policyXml, StringComparison.Ordinal); + Assert.Contains("JProperty(\"operationId\"", policyXml, StringComparison.Ordinal); + Assert.Contains("GetValueOrDefault(\"operationIdValue\")", policyXml, StringComparison.Ordinal); + } } - [Fact(Skip = "[Pending: M4] template manifests should bump to version 1.1 once AAA fields ship")] - public void TemplateRendering_Pending_TemplateVersionBump() + [Fact] + public void TemplateRendering_TemplateVersionBump() { + foreach (var templateId in ShippedTemplateIds) + { + using var manifestJson = JsonDocument.Parse(ReadTemplateManifest(templateId)); + Assert.Equal("1.1", manifestJson.RootElement.GetProperty("version").GetString()); + } } private HttpClient CreateClient(object? resolver = null) @@ -182,6 +227,28 @@ private HttpClient CreateClient(object? resolver = null) return client; } + private static string ReadTemplatePolicy(string templateId) + => File.ReadAllText(Path.Combine(FindRepositoryRoot(), "policies", "templates", templateId, "policy.xml")); + + private static string ReadTemplateManifest(string templateId) + => File.ReadAllText(Path.Combine(FindRepositoryRoot(), "policies", "templates", templateId, "template.json")); + + private static string FindRepositoryRoot() + { + var directory = new DirectoryInfo(AppContext.BaseDirectory); + while (directory is not null) + { + if (Directory.Exists(Path.Combine(directory.FullName, "policies", "templates"))) + { + return directory.FullName; + } + + directory = directory.Parent; + } + + throw new DirectoryNotFoundException("Repository root with policies\\templates was not found."); + } + private void SeedPlan(PlanData plan) => _factory.Redis.SeedString(RedisKeys.Plan(plan.Id), JsonSerializer.Serialize(plan, JsonOpts)); From ec54c29c663d0206c5c1d4ab7c381955b2ac8a8d Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 18:05:52 -0400 Subject: [PATCH 13/14] Add access profile admin UI Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/decisions/inbox/kima-aaa-m5-ui.md | 61 ++ src/aipolicyengine-ui/src/App.tsx | 3 + .../src/api/accessProfiles.ts | 81 +++ .../src/components/Layout.tsx | 1 + .../accessProfiles/CascadeBadge.tsx | 61 ++ .../components/accessProfiles/ClientList.tsx | 100 +++ .../accessProfiles/ProfileEditor.tsx | 277 ++++++++ .../components/accessProfiles/ProfileGrid.tsx | 324 +++++++++ .../src/components/accessProfiles/types.ts | 30 + .../src/components/ui/dialog.tsx | 5 +- .../src/hooks/useApimCatalog.ts | 142 ++++ .../src/pages/AccessProfiles.tsx | 663 ++++++++++++++++++ src/aipolicyengine-ui/src/pages/Apis.tsx | 135 ++-- .../src/types/accessProfiles.ts | 52 ++ 14 files changed, 1852 insertions(+), 83 deletions(-) create mode 100644 .squad/decisions/inbox/kima-aaa-m5-ui.md create mode 100644 src/aipolicyengine-ui/src/api/accessProfiles.ts create mode 100644 src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx create mode 100644 src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx create mode 100644 src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx create mode 100644 src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx create mode 100644 src/aipolicyengine-ui/src/components/accessProfiles/types.ts create mode 100644 src/aipolicyengine-ui/src/hooks/useApimCatalog.ts create mode 100644 src/aipolicyengine-ui/src/pages/AccessProfiles.tsx create mode 100644 src/aipolicyengine-ui/src/types/accessProfiles.ts diff --git a/.squad/decisions/inbox/kima-aaa-m5-ui.md b/.squad/decisions/inbox/kima-aaa-m5-ui.md new file mode 100644 index 00000000..7ab94582 --- /dev/null +++ b/.squad/decisions/inbox/kima-aaa-m5-ui.md @@ -0,0 +1,61 @@ +# Kima AAA M5 — Access Profile admin UI + +## Scope completed +- Added a new `/access` admin page for Access Profile management in the React/Vite SPA. +- Implemented a client-first workflow: searchable client list on the left, access matrix on the right. +- Visualized all three direct override scopes plus fallback inheritance: + - client-global (`_global`) + - API-wide + - operation-level +- Added direct CRUD flows against `/api/access-profiles`: + - list by client + - point-read before edit + - create + - update + - delete +- Added bulk-create flow using `/api/access-profiles/bulk` for queued inherited cells. +- Added a shared `useApimCatalog` hook and migrated both `/access` and `/apis` to reuse the same APIM catalog loading pattern. +- Added an `Access` nav item and route wiring in the shell. +- Extended the shared dialog primitive with `contentClassName` so the profile editor can render as a larger drawer-style panel. + +## UI behavior notes +- Empty cells do not look blank; they show the currently effective cascade result before an override is created. +- Direct overrides are visually distinct from inherited values. +- Disabled direct profiles stay visible but are treated as non-winning, matching backend cascade behavior. +- API cards lazy-load operations from APIM when expanded. +- The editor supports both single-scope editing and bulk creation across queued scopes. +- The APIM `/apis` page now uses the same shared catalog hook, avoiding duplicate API/operation loading logic and preserving the render-loop-safe ref/callback pattern. + +## Contract and architecture alignment +- The UI follows McNulty’s client-first `/access` recommendation from `mcnulty-aaa-per-client-arch.md`. +- Access Profile IDs and scope modeling assume the shipped backend contract: + - `ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}` + - `_global` for client-global defaults +- Effective state shown in the grid mirrors backend precedence: + 1. operation override + 2. API-wide override + 3. client-global override + 4. client plan assignment fallback +- Effective routing and allowed deployments mirror backend fallback semantics: + - null routing policy falls back to the selected plan/default assignment + - empty deployment list falls back to the selected plan/default assignment + +## Validation +- `cd src\aipolicyengine-ui && npm run lint` +- `cd src\aipolicyengine-ui && npm run build` +- Result: both passed after the shared-hook `/apis` refactor was completed. + +## Files changed +- `src\aipolicyengine-ui\src\App.tsx` +- `src\aipolicyengine-ui\src\components\Layout.tsx` +- `src\aipolicyengine-ui\src\components\ui\dialog.tsx` +- `src\aipolicyengine-ui\src\hooks\useApimCatalog.ts` +- `src\aipolicyengine-ui\src\api\accessProfiles.ts` +- `src\aipolicyengine-ui\src\types\accessProfiles.ts` +- `src\aipolicyengine-ui\src\pages\AccessProfiles.tsx` +- `src\aipolicyengine-ui\src\pages\Apis.tsx` +- `src\aipolicyengine-ui\src\components\accessProfiles\CascadeBadge.tsx` +- `src\aipolicyengine-ui\src\components\accessProfiles\ClientList.tsx` +- `src\aipolicyengine-ui\src\components\accessProfiles\ProfileEditor.tsx` +- `src\aipolicyengine-ui\src\components\accessProfiles\ProfileGrid.tsx` +- `src\aipolicyengine-ui\src\components\accessProfiles\types.ts` diff --git a/src/aipolicyengine-ui/src/App.tsx b/src/aipolicyengine-ui/src/App.tsx index bded33c1..46ce1e8b 100644 --- a/src/aipolicyengine-ui/src/App.tsx +++ b/src/aipolicyengine-ui/src/App.tsx @@ -10,6 +10,7 @@ import { Export } from "./pages/Export" import { ClientDetail } from "./pages/ClientDetail" import { RoutingPolicies } from "./pages/RoutingPolicies" import { RequestBilling } from "./pages/RequestBilling" +import { AccessProfiles } from "./pages/AccessProfiles" import { Apis } from "./pages/Apis" import { loginRequest } from "./auth/msalConfig" import { fetchPlans } from "./api" @@ -23,6 +24,7 @@ const TAB_PATHS = { plans: "/plans", pricing: "/pricing", routing: "/routing", + access: "/access", apis: "/apis", requests: "/request-billing", export: "/export", @@ -141,6 +143,7 @@ function App() { {activeTab === "plans" && } {activeTab === "pricing" && } {activeTab === "routing" && } + {activeTab === "access" && } {activeTab === "apis" && } {activeTab === "requests" && setSelectedClient({ clientAppId, tenantId })} />} {activeTab === "export" && } diff --git a/src/aipolicyengine-ui/src/api/accessProfiles.ts b/src/aipolicyengine-ui/src/api/accessProfiles.ts new file mode 100644 index 00000000..9c0c0cae --- /dev/null +++ b/src/aipolicyengine-ui/src/api/accessProfiles.ts @@ -0,0 +1,81 @@ +import { API_BASE, authFetch, parseErrorMessage } from "../api" +import type { HttpError } from "../types/apim" +import type { + AccessProfile, + AccessProfileCreateRequest, + AccessProfileUpdateRequest, + AccessProfilesResponse, + BulkAccessProfilesRequest, + BulkAccessProfilesResponse, +} from "../types/accessProfiles" + +async function buildHttpError(res: Response, fallback: string): Promise { + const message = await parseErrorMessage(res, fallback) + const error = new Error(message) as HttpError + error.status = res.status + error.body = await res.clone().json().catch(() => null) + return error +} + +async function requestJson(path: string, fallback: string, options: RequestInit = {}): Promise { + const res = await authFetch(`${API_BASE}${path}`, options) + if (!res.ok) { + throw await buildHttpError(res, fallback) + } + + if (res.status === 204) { + return undefined as T + } + + return res.json() as Promise +} + +function toQueryString(filters: { clientAppId?: string; tenantId?: string; apiId?: string }): string { + const params = new URLSearchParams() + + if (filters.clientAppId) params.set("clientAppId", filters.clientAppId) + if (filters.tenantId) params.set("tenantId", filters.tenantId) + if (filters.apiId) params.set("apiId", filters.apiId) + + const query = params.toString() + return query ? `?${query}` : "" +} + +export function fetchAccessProfiles(filters: { clientAppId?: string; tenantId?: string; apiId?: string }): Promise { + return requestJson(`/api/access-profiles${toQueryString(filters)}`, "Failed to fetch access profiles") +} + +export function fetchAccessProfile(profileId: string): Promise { + return requestJson(`/api/access-profiles/${encodeURIComponent(profileId)}`, "Failed to fetch access profile") +} + +export function createAccessProfile(data: AccessProfileCreateRequest): Promise { + return requestJson("/api/access-profiles", "Failed to create access profile", { + method: "POST", + body: JSON.stringify(data), + }) +} + +export function updateAccessProfile(profileId: string, data: AccessProfileUpdateRequest): Promise { + return requestJson(`/api/access-profiles/${encodeURIComponent(profileId)}`, "Failed to update access profile", { + method: "PUT", + body: JSON.stringify(data), + }) +} + +export async function deleteAccessProfile(profileId: string): Promise { + const res = await authFetch(`${API_BASE}/api/access-profiles/${encodeURIComponent(profileId)}`, { + method: "DELETE", + }) + + if (!res.ok) { + throw await buildHttpError(res, "Failed to delete access profile") + } +} + +export function bulkCreateAccessProfiles(data: BulkAccessProfilesRequest): Promise { + return requestJson("/api/access-profiles/bulk", "Failed to create access profiles in bulk", { + method: "POST", + body: JSON.stringify(data), + }) +} diff --git a/src/aipolicyengine-ui/src/components/Layout.tsx b/src/aipolicyengine-ui/src/components/Layout.tsx index 3b208a2f..dc4cd34c 100644 --- a/src/aipolicyengine-ui/src/components/Layout.tsx +++ b/src/aipolicyengine-ui/src/components/Layout.tsx @@ -24,6 +24,7 @@ export function Layout({ children, activeTab, onTabChange, billingMode = 'token' { id: "plans", label: "Plans" }, { id: "pricing", label: "Pricing" }, { id: "routing", label: "Routing" }, + { id: "access", label: "Access" }, { id: "apis", label: "APIs" }, // Adaptive: show Request Billing only when at least one plan uses multiplier billing ...(billingMode !== 'token' ? [{ id: "requests", label: "Request Billing" }] : []), diff --git a/src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx b/src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx new file mode 100644 index 00000000..ec25bdca --- /dev/null +++ b/src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx @@ -0,0 +1,61 @@ +import { GitBranchPlus, Layers3 } from "lucide-react" +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { cn } from "../../lib/utils" +import type { EffectiveAccessPreview } from "./types" + +interface CascadeBadgeProps { + effective: EffectiveAccessPreview | null + bulkQueued?: boolean + onOverride: () => void + onQueueBulk?: () => void +} + +const SOURCE_BADGE_VARIANTS = { + api: "teal", + global: "cyan", + client: "amber", +} as const + +export function CascadeBadge({ effective, bulkQueued = false, onOverride, onQueueBulk }: CascadeBadgeProps) { + if (!effective || effective.source === "direct") { + return null + } + + const sourceVariant = SOURCE_BADGE_VARIANTS[effective.source] + + return ( +
+
+
+ +
+
+
+ {effective.sourceLabel} + Cascade preview +
+

{effective.sourceDescription}

+

This scope inherits until you create a direct override here.

+
+ + {onQueueBulk && ( + + )} +
+
+
+
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx b/src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx new file mode 100644 index 00000000..51bc3b8a --- /dev/null +++ b/src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx @@ -0,0 +1,100 @@ +import { useMemo, useState } from "react" +import { Search, UserRoundCog } from "lucide-react" +import { Badge } from "../ui/badge" +import { Card, CardContent, CardHeader, CardTitle } from "../ui/card" +import { Input } from "../ui/input" +import type { ClientAssignment, PlanData } from "../../types" +import { cn } from "../../lib/utils" + +interface ClientListProps { + clients: ClientAssignment[] + plans: PlanData[] + selectedClientKey: string + onSelectClient: (clientKey: string) => void +} + +function buildClientKey(client: Pick): string { + return `${client.clientAppId}|${client.tenantId}` +} + +export function ClientList({ clients, plans, selectedClientKey, onSelectClient }: ClientListProps) { + const [query, setQuery] = useState("") + + const plansById = useMemo( + () => Object.fromEntries(plans.map((plan) => [plan.id, plan])), + [plans], + ) + + const filteredClients = useMemo(() => { + const normalizedQuery = query.trim().toLowerCase() + if (!normalizedQuery) return clients + + return clients.filter((client) => { + const haystacks = [ + client.displayName, + client.clientAppId, + client.tenantId, + plansById[client.planId]?.name, + ] + + return haystacks.some((value) => value?.toLowerCase().includes(normalizedQuery)) + }) + }, [clients, plansById, query]) + + return ( + + + + + Clients + {clients.length} + +
+ + setQuery(event.target.value)} placeholder="Search client, tenant, plan…" className="pl-9" /> +
+
+ +
+ {filteredClients.length === 0 ? ( +
No clients match your search.
+ ) : ( +
    + {filteredClients.map((client) => { + const clientKey = buildClientKey(client) + const selected = clientKey === selectedClientKey + const plan = plansById[client.planId] + + return ( +
  • + +
  • + ) + })} +
+ )} +
+
+
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx b/src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx new file mode 100644 index 00000000..9c727039 --- /dev/null +++ b/src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx @@ -0,0 +1,277 @@ +import { useMemo, useState } from "react" +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { Dialog, DialogClose, DialogHeader, DialogTitle } from "../ui/dialog" +import { Input } from "../ui/input" +import { cn } from "../../lib/utils" +import type { DeploymentInfo, ModelRoutingPolicy, PlanData } from "../../types" +import type { AccessProfile } from "../../types/accessProfiles" +import type { AccessScopeTarget } from "./types" + +export interface ProfileEditorValues { + planId: string + routingPolicyId: string | null + allowedDeployments: string[] + enabled: boolean +} + +interface ProfileEditorProps { + open: boolean + mode: "single" | "bulk" + targets: AccessScopeTarget[] + existingProfile: AccessProfile | null + initialValues: ProfileEditorValues + plans: PlanData[] + routingPolicies: ModelRoutingPolicy[] + deployments: DeploymentInfo[] + saving: boolean + deleting: boolean + onClose: () => void + onSave: (values: ProfileEditorValues) => Promise + onDelete?: () => Promise +} + +function targetLabel(target: AccessScopeTarget): string { + if (target.kind === "global") return "Client-global default" + if (target.kind === "api") return `${target.apiDisplayName} · API-wide` + return `${target.apiDisplayName} · ${target.method} ${target.operationDisplayName ?? target.operationId}` +} + +export function ProfileEditor({ + open, + mode, + targets, + existingProfile, + initialValues, + plans, + routingPolicies, + deployments, + saving, + deleting, + onClose, + onSave, + onDelete, +}: ProfileEditorProps) { + const [planId, setPlanId] = useState(initialValues.planId) + const [routingPolicyId, setRoutingPolicyId] = useState(initialValues.routingPolicyId ?? "") + const [allowedDeployments, setAllowedDeployments] = useState(initialValues.allowedDeployments) + const [enabled, setEnabled] = useState(initialValues.enabled) + const [deploymentQuery, setDeploymentQuery] = useState("") + const [formError, setFormError] = useState(null) + + const filteredDeployments = useMemo(() => { + const normalizedQuery = deploymentQuery.trim().toLowerCase() + if (!normalizedQuery) return deployments + + return deployments.filter((deployment) => { + return [deployment.id, deployment.model, deployment.modelVersion] + .some((value) => value.toLowerCase().includes(normalizedQuery)) + }) + }, [deploymentQuery, deployments]) + + const toggleDeployment = (deploymentId: string) => { + setAllowedDeployments((current) => ( + current.includes(deploymentId) + ? current.filter((item) => item !== deploymentId) + : [...current, deploymentId] + )) + } + + const handleSave = async () => { + if (!planId) { + setFormError("Select a plan before saving.") + return + } + + setFormError(null) + await onSave({ + planId, + routingPolicyId: routingPolicyId || null, + allowedDeployments, + enabled, + }) + } + + return ( + { + if (!nextOpen) onClose() + }} + contentClassName="max-w-4xl lg:ml-auto lg:mr-0 lg:h-screen lg:max-h-screen lg:rounded-none lg:border-l" + > + + +
+ + {mode === "bulk" ? "Bulk create" : existingProfile ? "Edit override" : "Create override"} + + {targets.length} scope{targets.length === 1 ? "" : "s"} +
+ + {mode === "bulk" ? "Apply direct overrides to selected scopes" : "Access Profile editor"} + +

+ Leave routing blank to inherit from the chosen plan. Leave deployments empty to use that plan's deployment rules. +

+
+ +
+
+
+

Targets

+
+ {targets.map((target) => ( +
+

{targetLabel(target)}

+

+ {target.apiId}{target.operationId ? ` / ${target.operationId}` : " / _all"} +

+
+ ))} +
+
+ +
+

Plan + routing

+
+
+ + +
+ +
+ + +
+
+
+ +
+
+
+

Allowed deployments

+

Choose specific deployments or leave the list empty to inherit from the selected plan.

+
+
+ + +
+
+ +
+ setDeploymentQuery(event.target.value)} + placeholder="Filter deployments…" + /> +
+ {filteredDeployments.length === 0 ? ( +
No deployments match the current filter.
+ ) : ( + filteredDeployments.map((deployment) => { + const checked = allowedDeployments.includes(deployment.id) + return ( + + ) + }) + )} +
+
+
+
+ + +
+
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx b/src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx new file mode 100644 index 00000000..1d812e74 --- /dev/null +++ b/src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx @@ -0,0 +1,324 @@ +import { ChevronDown, ChevronRight, Layers2, RefreshCcw, Shield, Sparkles } from "lucide-react" +import { Badge } from "../ui/badge" +import { Button } from "../ui/button" +import { Card, CardContent, CardHeader, CardTitle } from "../ui/card" +import { CascadeBadge } from "./CascadeBadge" +import { cn } from "../../lib/utils" +import type { ClientAssignment, ModelRoutingPolicy, PlanData } from "../../types" +import type { AccessProfile } from "../../types/accessProfiles" +import type { ApimApiSummary } from "../../types/apim" +import type { AccessGridCellData, AccessScopeTarget } from "./types" + +interface AccessApiSection { + api: ApimApiSummary + apiCell: AccessGridCellData + operationCells: AccessGridCellData[] + directOverrideCount: number + expanded: boolean + loadingOperations: boolean + operationError: string | null +} + +interface ProfileGridProps { + client: ClientAssignment | null + globalCell: AccessGridCellData | null + sections: AccessApiSection[] + plansById: Record + routingPoliciesById: Record + queuedScopeKeys: string[] + profilesLoading: boolean + onToggleApi: (api: ApimApiSummary) => void + onRetryOperations: (api: ApimApiSummary) => void + onOpenCell: (target: AccessScopeTarget, directProfile: AccessProfile | null, effective: AccessGridCellData["effective"]) => void + onToggleQueuedScope: (target: AccessScopeTarget) => void +} + +function scopeKey(target: AccessScopeTarget): string { + return `${target.apiId}:${target.operationId ?? "_all"}` +} + +function deploymentLabel(deployments: string[]): string[] { + return deployments.length > 0 ? deployments : ["All deployments"] +} + +function sourceVariant(source: "direct" | "api" | "global" | "client"): "teal" | "cyan" | "amber" | "green" { + switch (source) { + case "api": + return "teal" + case "global": + return "cyan" + case "client": + return "amber" + default: + return "green" + } +} + +function directTone(profile: AccessProfile | null): string { + if (!profile) return "border-dashed border-slate-300/70 bg-slate-500/5 dark:border-slate-700/70 dark:bg-slate-500/10" + if (!profile.enabled) return "border-amber-400/50 bg-amber-500/10" + return "border-emerald-500/30 bg-emerald-500/10" +} + +function renderSummary( + cell: AccessGridCellData, + plansById: Record, + routingPoliciesById: Record, +) { + const effective = cell.effective + if (!effective) { + return

No direct or inherited profile resolves at this scope.

+ } + + const planName = plansById[effective.planId]?.name ?? effective.planId + const routingName = effective.routingPolicyId + ? (routingPoliciesById[effective.routingPolicyId]?.name ?? effective.routingPolicyId) + : "No routing override" + const deployments = deploymentLabel(effective.allowedDeployments) + + return ( +
+
+ {planName} + + {effective.source === "direct" ? (cell.directProfile?.enabled ? "Direct override" : "Direct override · disabled") : effective.sourceLabel} + + {effective.routingPolicyId ? ( + {routingName} + ) : ( + {routingName} + )} +
+
+ {deployments.map((deployment) => ( + + {deployment} + + ))} +
+
+ ) +} + +function ScopeRow({ + cell, + plansById, + routingPoliciesById, + queued, + onOpenCell, + onToggleQueuedScope, +}: { + cell: AccessGridCellData + plansById: Record + routingPoliciesById: Record + queued: boolean + onOpenCell: (target: AccessScopeTarget, directProfile: AccessProfile | null, effective: AccessGridCellData["effective"]) => void + onToggleQueuedScope: (target: AccessScopeTarget) => void +}) { + const { target, directProfile, effective } = cell + const summary = renderSummary(cell, plansById, routingPoliciesById) + + return ( +
+
+
+

+ {target.kind === "global" ? "Client-global" : target.operationDisplayName ?? "API-wide default"} +

+ {target.kind === "operation" && target.method && ( + {target.method} + )} +
+

+ {target.kind === "global" ? "_global / _all" : `${target.apiId} / ${target.operationId ?? "_all"}`} +

+ {target.kind === "operation" && target.urlTemplate && ( +

{target.urlTemplate}

+ )} +
+ +
+ + + {!directProfile && ( +
+ onOpenCell(target, directProfile, effective)} + onQueueBulk={() => onToggleQueuedScope(target)} + /> +
+ )} +
+
+ ) +} + +export function ProfileGrid({ + client, + globalCell, + sections, + plansById, + routingPoliciesById, + queuedScopeKeys, + profilesLoading, + onToggleApi, + onRetryOperations, + onOpenCell, + onToggleQueuedScope, +}: ProfileGridProps) { + if (!client) { + return ( + + + +
+

Select a client

+

Choose a client from the left pane to inspect endpoint-scoped Access Profiles.

+
+
+
+ ) + } + + return ( +
+ + +
+
+ {client.displayName || client.clientAppId} +

+ Client-first access control matrix with API-wide, operation-level, and client-global inheritance. +

+
+
+ {client.planId} + {client.tenantId} + {client.clientAppId} +
+
+
+ + Empty scopes show their current cascade result. Direct overrides stay green, disabled ones glow amber, and queued bulk overrides are highlighted in the inherited badge. + +
+ + + + + + Client-global profile + {profilesLoading && Refreshing…} + + + + {globalCell ? ( + + ) : ( +
Client-global scope is unavailable.
+ )} +
+
+ + {sections.map((section) => ( + + +
+ +
+ {section.directOverrideCount} direct override{section.directOverrideCount === 1 ? "" : "s"} + {section.operationCells.length} operation{section.operationCells.length === 1 ? "" : "s"} +
+
+
+ + + + {section.expanded && section.loadingOperations && ( +
+ Loading operations… +
+ )} + + {section.expanded && section.operationError && ( +
+
+ {section.operationError} + +
+
+ )} + + {section.expanded && !section.loadingOperations && !section.operationError && section.operationCells.length === 0 && ( +
+ No APIM operations were returned for this API. +
+ )} + + {section.expanded && !section.loadingOperations && !section.operationError && section.operationCells.length > 0 && ( +
+ {section.operationCells.map((cell) => ( + + ))} +
+ )} +
+
+ ))} + + {sections.length === 0 && ( + + + +
+

No APIs discovered

+

APIM has not returned any APIs yet, so there are no scopes to manage for this client.

+
+
+
+ )} +
+ ) +} diff --git a/src/aipolicyengine-ui/src/components/accessProfiles/types.ts b/src/aipolicyengine-ui/src/components/accessProfiles/types.ts new file mode 100644 index 00000000..f8344b50 --- /dev/null +++ b/src/aipolicyengine-ui/src/components/accessProfiles/types.ts @@ -0,0 +1,30 @@ +import type { AccessProfile } from "../../types/accessProfiles" + +export type CascadeSource = "direct" | "api" | "global" | "client" + +export interface AccessScopeTarget { + kind: "global" | "api" | "operation" + apiId: string + apiDisplayName: string + operationId: string | null + operationDisplayName?: string + method?: string + urlTemplate?: string +} + +export interface EffectiveAccessPreview { + source: CascadeSource + sourceLabel: string + sourceDescription: string + profileId: string | null + planId: string + routingPolicyId: string | null + allowedDeployments: string[] + enabled: boolean +} + +export interface AccessGridCellData { + target: AccessScopeTarget + directProfile: AccessProfile | null + effective: EffectiveAccessPreview | null +} diff --git a/src/aipolicyengine-ui/src/components/ui/dialog.tsx b/src/aipolicyengine-ui/src/components/ui/dialog.tsx index afc5ea1d..d446ff3a 100644 --- a/src/aipolicyengine-ui/src/components/ui/dialog.tsx +++ b/src/aipolicyengine-ui/src/components/ui/dialog.tsx @@ -6,14 +6,15 @@ interface DialogProps { open: boolean onOpenChange: (open: boolean) => void children: React.ReactNode + contentClassName?: string } -function Dialog({ open, onOpenChange, children }: DialogProps) { +function Dialog({ open, onOpenChange, children, contentClassName }: DialogProps) { if (!open) return null return (
onOpenChange(false)} /> -
+
{children}
diff --git a/src/aipolicyengine-ui/src/hooks/useApimCatalog.ts b/src/aipolicyengine-ui/src/hooks/useApimCatalog.ts new file mode 100644 index 00000000..5b0a6826 --- /dev/null +++ b/src/aipolicyengine-ui/src/hooks/useApimCatalog.ts @@ -0,0 +1,142 @@ +import { useCallback, useEffect, useRef, useState } from "react" +import { fetchApimApis, fetchApimOperations } from "../api/apim" +import type { HttpError, ApimApiSummary, ApimOperationSummary } from "../types/apim" + +interface UseApimCatalogOptions { + enabled?: boolean +} + +interface UseApimCatalogResult { + apis: ApimApiSummary[] + catalogLoading: boolean + catalogError: string | null + accessDenied: boolean + operationsByApi: Record + operationErrors: Record + loadingOperationApiIds: string[] + refreshApis: () => Promise + ensureOperationsLoaded: (api: ApimApiSummary) => Promise + refreshOperations: (api: ApimApiSummary) => Promise +} + +function getStatus(error: unknown): number | undefined { + return typeof error === "object" && error !== null && "status" in error + ? (error as HttpError).status + : undefined +} + +function getErrorMessage(error: unknown, fallback: string): string { + return error instanceof Error ? error.message : fallback +} + +export function useApimCatalog({ enabled = true }: UseApimCatalogOptions = {}): UseApimCatalogResult { + const [apis, setApis] = useState([]) + const [catalogLoading, setCatalogLoading] = useState(true) + const [catalogError, setCatalogError] = useState(null) + const [accessDenied, setAccessDenied] = useState(false) + const [operationsByApi, setOperationsByApi] = useState>({}) + const [operationErrors, setOperationErrors] = useState>({}) + const [loadingOperationApiIds, setLoadingOperationApiIds] = useState([]) + + const operationsByApiRef = useRef>({}) + const loadingOperationApiIdsRef = useRef([]) + + useEffect(() => { + operationsByApiRef.current = operationsByApi + }, [operationsByApi]) + + useEffect(() => { + loadingOperationApiIdsRef.current = loadingOperationApiIds + }, [loadingOperationApiIds]) + + const refreshApis = useCallback(async () => { + if (!enabled) { + setCatalogLoading(false) + return + } + + setCatalogLoading(true) + setCatalogError(null) + + try { + const response = await fetchApimApis() + const nextApis = response.apis ?? [] + const validApiIds = new Set(nextApis.map((api) => api.id)) + + setApis(nextApis) + setAccessDenied(false) + setOperationsByApi((current) => Object.fromEntries( + Object.entries(current).filter(([apiId]) => validApiIds.has(apiId)), + )) + setOperationErrors((current) => Object.fromEntries( + Object.entries(current).filter(([apiId]) => validApiIds.has(apiId)), + )) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + setAccessDenied(true) + setApis([]) + return + } + + setCatalogError(getErrorMessage(error, "Failed to load APIs")) + } finally { + setCatalogLoading(false) + } + }, [enabled]) + + const loadOperations = useCallback(async (api: ApimApiSummary, force = false) => { + if (!force) { + if (loadingOperationApiIdsRef.current.includes(api.id)) return + if (operationsByApiRef.current[api.id]) return + } + + setLoadingOperationApiIds((current) => current.includes(api.id) ? current : [...current, api.id]) + setOperationErrors((current) => ({ ...current, [api.id]: null })) + + try { + const response = await fetchApimOperations(api.id) + setAccessDenied(false) + setOperationsByApi((current) => ({ ...current, [api.id]: response.operations ?? [] })) + } catch (error) { + const status = getStatus(error) + if (status === 401 || status === 403) { + setAccessDenied(true) + return + } + + setOperationErrors((current) => ({ + ...current, + [api.id]: getErrorMessage(error, `Failed to load operations for ${api.displayName}`), + })) + } finally { + setLoadingOperationApiIds((current) => current.filter((apiId) => apiId !== api.id)) + } + }, []) + + useEffect(() => { + if (!enabled) { + setCatalogLoading(false) + return + } + + void refreshApis() + }, [enabled, refreshApis]) + + return { + apis, + catalogLoading, + catalogError, + accessDenied, + operationsByApi, + operationErrors, + loadingOperationApiIds, + refreshApis, + ensureOperationsLoaded: async (api) => { + await loadOperations(api) + }, + refreshOperations: async (api) => { + await loadOperations(api, true) + }, + } +} diff --git a/src/aipolicyengine-ui/src/pages/AccessProfiles.tsx b/src/aipolicyengine-ui/src/pages/AccessProfiles.tsx new file mode 100644 index 00000000..c27c4e07 --- /dev/null +++ b/src/aipolicyengine-ui/src/pages/AccessProfiles.tsx @@ -0,0 +1,663 @@ +import { useCallback, useEffect, useMemo, useRef, useState } from "react" +import { useMsal } from "@azure/msal-react" +import { AlertTriangle, RefreshCcw, ShieldCheck, Sparkles } from "lucide-react" +import { ClientList } from "../components/accessProfiles/ClientList" +import { ProfileEditor, type ProfileEditorValues } from "../components/accessProfiles/ProfileEditor" +import { ProfileGrid } from "../components/accessProfiles/ProfileGrid" +import type { AccessGridCellData, AccessScopeTarget, EffectiveAccessPreview } from "../components/accessProfiles/types" +import { Button } from "../components/ui/button" +import { fetchClients, fetchDeployments, fetchPlans, fetchRoutingPolicies } from "../api" +import { + bulkCreateAccessProfiles, + createAccessProfile, + deleteAccessProfile, + fetchAccessProfile, + fetchAccessProfiles, + updateAccessProfile, +} from "../api/accessProfiles" +import { useApimCatalog } from "../hooks/useApimCatalog" +import type { ClientAssignment, DeploymentInfo, ModelRoutingPolicy, PlanData } from "../types" +import type { ApimApiSummary } from "../types/apim" +import type { AccessProfile } from "../types/accessProfiles" + +interface ToastState { + message: string + retryLabel?: string + onRetry?: () => void +} + +interface EditorState { + mode: "single" | "bulk" + targets: AccessScopeTarget[] + existingProfile: AccessProfile | null + initialValues: ProfileEditorValues +} + +const GLOBAL_API_ID = "_global" + +function buildClientKey(client: Pick): string { + return `${client.clientAppId}|${client.tenantId}` +} + +function parseClientKey(clientKey: string): { clientAppId: string; tenantId: string } { + const [clientAppId = "", tenantId = ""] = clientKey.split("|") + return { clientAppId, tenantId } +} + +function buildScopeKey(apiId: string, operationId: string | null): string { + return `${apiId}:${operationId ?? "_all"}` +} + +function buildInitialValues(profile: AccessProfile | null, effective: EffectiveAccessPreview | null): ProfileEditorValues { + if (profile) { + return { + planId: profile.planId, + routingPolicyId: profile.routingPolicyId, + allowedDeployments: profile.allowedDeployments, + enabled: profile.enabled, + } + } + + return { + planId: effective?.planId ?? "", + routingPolicyId: effective?.routingPolicyId ?? null, + allowedDeployments: effective?.allowedDeployments ?? [], + enabled: effective?.enabled ?? true, + } +} + +function createDirectPreview( + profile: AccessProfile, + plansById: Record, +): EffectiveAccessPreview { + const plan = plansById[profile.planId] + + return { + source: "direct", + sourceLabel: "Direct override", + sourceDescription: profile.enabled + ? "This scope has its own Access Profile." + : "This scope has a stored override, but it is disabled and no longer wins in the cascade.", + profileId: profile.id, + planId: profile.planId, + routingPolicyId: profile.routingPolicyId ?? plan?.modelRoutingPolicyId ?? null, + allowedDeployments: profile.allowedDeployments.length > 0 ? profile.allowedDeployments : (plan?.allowedDeployments ?? []), + enabled: profile.enabled, + } +} + +function createInheritedPreview( + source: "api" | "global" | "client", + payload: { id?: string | null; planId: string; routingPolicyId: string | null; allowedDeployments: string[] }, + plansById: Record, +): EffectiveAccessPreview { + const plan = plansById[payload.planId] + + return { + source, + sourceLabel: source === "api" ? "API-wide" : source === "global" ? "Client-global" : "Client assignment", + sourceDescription: + source === "api" + ? "Inherited from the API-wide override for this client." + : source === "global" + ? "Inherited from the client-global default for this client." + : "Falling back to the client's base plan assignment.", + profileId: payload.id ?? null, + planId: payload.planId, + routingPolicyId: payload.routingPolicyId ?? plan?.modelRoutingPolicyId ?? null, + allowedDeployments: payload.allowedDeployments.length > 0 ? payload.allowedDeployments : (plan?.allowedDeployments ?? []), + enabled: true, + } +} + +export function AccessProfiles() { + const { accounts } = useMsal() + const [clients, setClients] = useState([]) + const [plans, setPlans] = useState([]) + const [routingPolicies, setRoutingPolicies] = useState([]) + const [deployments, setDeployments] = useState([]) + const [referenceLoading, setReferenceLoading] = useState(true) + const [referenceError, setReferenceError] = useState(null) + const [profiles, setProfiles] = useState([]) + const [profilesLoading, setProfilesLoading] = useState(false) + const [profilesError, setProfilesError] = useState(null) + const [selectedClientKey, setSelectedClientKey] = useState("") + const [expandedApiIds, setExpandedApiIds] = useState([]) + const [editorState, setEditorState] = useState(null) + const [queuedScopeKeys, setQueuedScopeKeys] = useState([]) + const [savingEditor, setSavingEditor] = useState(false) + const [deletingEditor, setDeletingEditor] = useState(false) + const [toast, setToast] = useState(null) + const [accessDenied, setAccessDenied] = useState(false) + + const selectedClientKeyRef = useRef("") + const adminRoleClaims = useMemo(() => { + const roles = accounts[0]?.idTokenClaims?.roles + return Array.isArray(roles) ? roles : [] + }, [accounts]) + const lacksExplicitAdminRole = adminRoleClaims.length > 0 && !adminRoleClaims.includes("AIPolicy.Admin") + + const { + apis, + catalogLoading, + catalogError, + accessDenied: catalogAccessDenied, + operationsByApi, + operationErrors, + loadingOperationApiIds, + refreshApis, + ensureOperationsLoaded, + refreshOperations, + } = useApimCatalog({ enabled: !lacksExplicitAdminRole }) + + useEffect(() => { + selectedClientKeyRef.current = selectedClientKey + }, [selectedClientKey]) + + const showToast = useCallback((message: string, onRetry?: () => void, retryLabel = "Retry") => { + setToast({ message, onRetry, retryLabel: onRetry ? retryLabel : undefined }) + }, []) + + const plansById = useMemo( + () => Object.fromEntries(plans.map((plan) => [plan.id, plan])), + [plans], + ) + const routingPoliciesById = useMemo( + () => Object.fromEntries(routingPolicies.map((policy) => [policy.id, policy])), + [routingPolicies], + ) + const profilesByScope = useMemo( + () => Object.fromEntries(profiles.map((profile) => [buildScopeKey(profile.apiId, profile.operationId), profile])), + [profiles], + ) + const selectedClient = useMemo( + () => clients.find((client) => buildClientKey(client) === selectedClientKey) ?? null, + [clients, selectedClientKey], + ) + + const accessDeniedMessage = lacksExplicitAdminRole || accessDenied || catalogAccessDenied + ? "You need AIPolicy.Admin role to use this page" + : null + + const loadReferenceData = useCallback(async (): Promise => { + setReferenceLoading(true) + setReferenceError(null) + + try { + const [clientsResponse, plansResponse, routingPoliciesResponse, deploymentsResponse] = await Promise.all([ + fetchClients(), + fetchPlans(), + fetchRoutingPolicies().catch(() => ({ policies: [] })), + fetchDeployments().catch(() => ({ deployments: [] })), + ]) + + const nextClients = clientsResponse.clients ?? [] + const nextSelectedClientKey = (() => { + const currentKey = selectedClientKeyRef.current + if (currentKey && nextClients.some((client) => buildClientKey(client) === currentKey)) { + return currentKey + } + + return nextClients[0] ? buildClientKey(nextClients[0]) : null + })() + + setClients(nextClients) + setPlans(plansResponse.plans ?? []) + setRoutingPolicies(routingPoliciesResponse.policies ?? []) + setDeployments(deploymentsResponse.deployments ?? []) + setSelectedClientKey(nextSelectedClientKey ?? "") + setAccessDenied(false) + return nextSelectedClientKey + } catch (error) { + const status = typeof error === "object" && error !== null && "status" in error ? Number((error as { status?: number }).status) : undefined + if (status === 401 || status === 403) { + setAccessDenied(true) + return null + } + + const message = error instanceof Error ? error.message : "Failed to load reference data" + setReferenceError(message) + showToast(message, () => { + void loadReferenceData() + }) + return null + } finally { + setReferenceLoading(false) + } + }, [showToast]) + + const loadProfiles = useCallback(async (clientKey: string) => { + const { clientAppId, tenantId } = parseClientKey(clientKey) + if (!clientAppId || !tenantId) { + setProfiles([]) + return + } + + setProfilesLoading(true) + setProfilesError(null) + + try { + const response = await fetchAccessProfiles({ clientAppId, tenantId }) + setProfiles(response.profiles ?? []) + setAccessDenied(false) + } catch (error) { + const status = typeof error === "object" && error !== null && "status" in error ? Number((error as { status?: number }).status) : undefined + if (status === 401 || status === 403) { + setAccessDenied(true) + return + } + + const message = error instanceof Error ? error.message : "Failed to load access profiles" + setProfilesError(message) + showToast(message, () => { + void loadProfiles(clientKey) + }) + } finally { + setProfilesLoading(false) + } + }, [showToast]) + + useEffect(() => { + if (lacksExplicitAdminRole) return + + void (async () => { + await loadReferenceData() + })() + }, [lacksExplicitAdminRole, loadReferenceData]) + + useEffect(() => { + if (!selectedClientKey || accessDeniedMessage) return + + void (async () => { + await loadProfiles(selectedClientKey) + })() + }, [selectedClientKey, accessDeniedMessage, loadProfiles]) + + const resolveCell = useCallback((target: AccessScopeTarget): AccessGridCellData => { + const directProfile = (profilesByScope[buildScopeKey(target.apiId, target.operationId)] ?? null) as AccessProfile | null + if (directProfile) { + return { + target, + directProfile, + effective: createDirectPreview(directProfile, plansById), + } + } + + const apiProfile = target.kind === "global" ? null : profilesByScope[buildScopeKey(target.apiId, null)] ?? null + const globalProfile = profilesByScope[buildScopeKey(GLOBAL_API_ID, null)] ?? null + + if (target.kind === "operation" && apiProfile?.enabled) { + return { + target, + directProfile: null, + effective: createInheritedPreview("api", apiProfile, plansById), + } + } + + if (target.kind !== "global" && globalProfile?.enabled) { + return { + target, + directProfile: null, + effective: createInheritedPreview("global", globalProfile, plansById), + } + } + + if (selectedClient) { + return { + target, + directProfile: null, + effective: createInheritedPreview( + "client", + { + id: null, + planId: selectedClient.planId, + routingPolicyId: selectedClient.modelRoutingPolicyOverride ?? null, + allowedDeployments: selectedClient.allowedDeployments ?? [], + }, + plansById, + ), + } + } + + return { + target, + directProfile: null, + effective: null, + } + }, [plansById, profilesByScope, selectedClient]) + + const sections = useMemo(() => { + return apis.map((api) => { + const operationCells = (operationsByApi[api.id] ?? []).map((operation) => resolveCell({ + kind: "operation", + apiId: api.id, + apiDisplayName: api.displayName, + operationId: operation.id, + operationDisplayName: operation.displayName, + method: operation.method, + urlTemplate: operation.urlTemplate, + })) + + return { + api, + apiCell: resolveCell({ + kind: "api", + apiId: api.id, + apiDisplayName: api.displayName, + operationId: null, + }), + operationCells, + directOverrideCount: profiles.filter((profile) => profile.apiId === api.id).length, + expanded: expandedApiIds.includes(api.id), + loadingOperations: loadingOperationApiIds.includes(api.id), + operationError: operationErrors[api.id] ?? null, + } + }) + }, [apis, expandedApiIds, loadingOperationApiIds, operationErrors, operationsByApi, profiles, resolveCell]) + + const globalCell = useMemo(() => resolveCell({ + kind: "global", + apiId: GLOBAL_API_ID, + apiDisplayName: "Client-global", + operationId: null, + }), [resolveCell]) + + const allCellsByScopeKey = useMemo(() => { + const entries: Array<[string, AccessGridCellData]> = [[buildScopeKey(globalCell.target.apiId, globalCell.target.operationId), globalCell]] + + for (const section of sections) { + entries.push([buildScopeKey(section.apiCell.target.apiId, section.apiCell.target.operationId), section.apiCell]) + for (const operationCell of section.operationCells) { + entries.push([buildScopeKey(operationCell.target.apiId, operationCell.target.operationId), operationCell]) + } + } + + return Object.fromEntries(entries) + }, [globalCell, sections]) + + const handleSelectClient = (clientKey: string) => { + setSelectedClientKey(clientKey) + setExpandedApiIds([]) + setQueuedScopeKeys([]) + setEditorState(null) + } + + const handleToggleApi = (api: ApimApiSummary) => { + setExpandedApiIds((current) => { + const isExpanded = current.includes(api.id) + if (isExpanded) { + return current.filter((item) => item !== api.id) + } + + return [...current, api.id] + }) + + if (!operationsByApi[api.id]) { + void ensureOperationsLoaded(api) + } + } + + const handleOpenCell = (target: AccessScopeTarget, directProfile: AccessProfile | null, effective: EffectiveAccessPreview | null) => { + void (async () => { + if (directProfile) { + try { + const freshProfile = await fetchAccessProfile(directProfile.id) + setEditorState({ + mode: "single", + targets: [target], + existingProfile: freshProfile, + initialValues: buildInitialValues(freshProfile, effective), + }) + } catch (error) { + const message = error instanceof Error ? error.message : "Failed to load the latest Access Profile" + showToast(message) + } + return + } + + setEditorState({ + mode: "single", + targets: [target], + existingProfile: null, + initialValues: buildInitialValues(null, effective), + }) + })() + } + + const handleToggleQueuedScope = (target: AccessScopeTarget) => { + const scopeKey = buildScopeKey(target.apiId, target.operationId) + setQueuedScopeKeys((current) => current.includes(scopeKey) + ? current.filter((item) => item !== scopeKey) + : [...current, scopeKey]) + } + + const handleOpenBulkEditor = () => { + const selectedCells = queuedScopeKeys + .map((scopeKey) => allCellsByScopeKey[scopeKey]) + .filter((cell): cell is AccessGridCellData => Boolean(cell)) + + if (selectedCells.length === 0) return + + setEditorState({ + mode: "bulk", + targets: selectedCells.map((cell) => cell.target), + existingProfile: null, + initialValues: buildInitialValues(null, selectedCells[0].effective), + }) + } + + const handleSaveEditor = async (values: ProfileEditorValues) => { + if (!selectedClient || !editorState) return + + setSavingEditor(true) + try { + if (editorState.mode === "bulk") { + const payload = { + profiles: editorState.targets.map((target) => ({ + clientAppId: selectedClient.clientAppId, + tenantId: selectedClient.tenantId, + apiId: target.apiId, + operationId: target.operationId, + planId: values.planId, + routingPolicyId: values.routingPolicyId, + allowedDeployments: values.allowedDeployments, + enabled: values.enabled, + })), + } + + const response = await bulkCreateAccessProfiles(payload) + const failureCount = response.failed.length + if (failureCount > 0) { + showToast(`Created ${response.created} override${response.created === 1 ? "" : "s"}; ${failureCount} failed.`) + } else { + showToast(`Created ${response.created} access override${response.created === 1 ? "" : "s"}.`) + } + setQueuedScopeKeys([]) + } else if (editorState.existingProfile) { + await updateAccessProfile(editorState.existingProfile.id, values) + showToast("Access Profile updated.") + } else { + const target = editorState.targets[0] + await createAccessProfile({ + clientAppId: selectedClient.clientAppId, + tenantId: selectedClient.tenantId, + apiId: target.apiId, + operationId: target.operationId, + planId: values.planId, + routingPolicyId: values.routingPolicyId, + allowedDeployments: values.allowedDeployments, + enabled: values.enabled, + }) + showToast("Access Profile created.") + } + + setEditorState(null) + await loadProfiles(buildClientKey(selectedClient)) + } catch (error) { + const message = error instanceof Error ? error.message : "Failed to save Access Profile" + showToast(message) + } finally { + setSavingEditor(false) + } + } + + const handleDeleteEditor = async () => { + if (!editorState?.existingProfile || !selectedClient) return + + setDeletingEditor(true) + try { + await deleteAccessProfile(editorState.existingProfile.id) + setEditorState(null) + showToast("Access Profile deleted.") + await loadProfiles(buildClientKey(selectedClient)) + } catch (error) { + const message = error instanceof Error ? error.message : "Failed to delete Access Profile" + showToast(message) + } finally { + setDeletingEditor(false) + } + } + + const handleRefresh = () => { + void (async () => { + const nextClientKey = await loadReferenceData() + await refreshApis() + const clientKeyToUse = nextClientKey ?? selectedClientKeyRef.current + if (clientKeyToUse) { + await loadProfiles(clientKeyToUse) + } + })() + } + + return ( +
+
+
+
+ +

Access Profiles

+
+

+ Configure client-global, API-wide, and operation-level access overrides. Empty cells visualize the active cascade before you commit a direct override. +

+
+
+ {queuedScopeKeys.length > 0 && ( + + )} + +
+
+ + {accessDeniedMessage && ( +
+
+ +
+

{accessDeniedMessage}

+

Ask an administrator to grant the AIPolicy.Admin role, then refresh this page.

+
+
+
+ )} + + {!accessDeniedMessage && (referenceError || catalogError || profilesError) && ( +
+
+ + {referenceError ?? catalogError ?? profilesError} + +
+
+ )} + + {!accessDeniedMessage && ( +
+
+ {referenceLoading ? ( +
+ Loading clients… +
+ ) : ( + + )} +
+ +
+ {catalogLoading && apis.length === 0 ? ( +
+ Loading APIs… +
+ ) : ( + { void refreshOperations(api) }} + onOpenCell={handleOpenCell} + onToggleQueuedScope={handleToggleQueuedScope} + /> + )} +
+
+ )} + + {editorState && !accessDeniedMessage && ( + buildScopeKey(target.apiId, target.operationId)).join(",")}:${editorState.existingProfile?.id ?? "new"}`} + open + mode={editorState.mode} + targets={editorState.targets} + existingProfile={editorState.existingProfile} + initialValues={editorState.initialValues} + plans={plans} + routingPolicies={routingPolicies} + deployments={deployments} + saving={savingEditor} + deleting={deletingEditor} + onClose={() => setEditorState(null)} + onSave={handleSaveEditor} + onDelete={editorState.existingProfile ? handleDeleteEditor : undefined} + /> + )} + + {toast && ( +
+
+ +
+

{toast.message}

+
+ {toast.onRetry && ( + + )} + +
+
+
+
+ )} +
+ ) +} diff --git a/src/aipolicyengine-ui/src/pages/Apis.tsx b/src/aipolicyengine-ui/src/pages/Apis.tsx index 01b17ffb..ced2d562 100644 --- a/src/aipolicyengine-ui/src/pages/Apis.tsx +++ b/src/aipolicyengine-ui/src/pages/Apis.tsx @@ -1,4 +1,4 @@ -import { useCallback, useEffect, useMemo, useRef, useState } from "react" +import { useCallback, useEffect, useMemo, useState } from "react" import { useMsal } from "@azure/msal-react" import { AlertTriangle, Network, RefreshCcw } from "lucide-react" import { ApiTree } from "../components/apis/ApiTree" @@ -13,11 +13,10 @@ import { clearApiPolicy, clearOperationPolicy, fetchApiPolicy, - fetchApimApis, - fetchApimOperations, fetchApimTemplates, fetchOperationPolicy, } from "../api/apim" +import { useApimCatalog } from "../hooks/useApimCatalog" import type { PlanData } from "../types" import type { ApimApiSummary, @@ -146,7 +145,6 @@ function derivePlanDefaults(plans: PlanData[]): Record { export function Apis() { const { accounts } = useMsal() - const [apis, setApis] = useState([]) const [templates, setTemplates] = useState([]) const [plans, setPlans] = useState([]) const [initialLoading, setInitialLoading] = useState(true) @@ -155,10 +153,6 @@ export function Apis() { const [toast, setToast] = useState(null) const [expandedApiIds, setExpandedApiIds] = useState([]) - const [loadingOperationApiIds, setLoadingOperationApiIds] = useState([]) - const [operationsByApi, setOperationsByApi] = useState>({}) - const [operationErrors, setOperationErrors] = useState>({}) - const operationsByApiRef = useRef>({}) const [selectedTarget, setSelectedTarget] = useState(null) const [policyDocument, setPolicyDocument] = useState(null) @@ -176,14 +170,25 @@ export function Apis() { }, [accounts]) const lacksExplicitAdminRole = adminRoleClaims.length > 0 && !adminRoleClaims.includes("AIPolicy.Admin") + const { + apis, + catalogLoading, + catalogError, + accessDenied: catalogAccessDenied, + operationsByApi, + operationErrors, + loadingOperationApiIds, + refreshApis, + ensureOperationsLoaded, + refreshOperations, + } = useApimCatalog({ enabled: !lacksExplicitAdminRole }) + const effectiveAccessDeniedMessage = accessDeniedMessage ?? (catalogAccessDenied ? "You need AIPolicy.Admin role to use this page" : null) const selectedSummary = selectedTarget ? targetSummary(selectedTarget) : null const selectedKey = selectedTarget ? targetKey(selectedTarget) : undefined const busy = submittingAssignment || clearingAssignment const planDefaults = useMemo(() => derivePlanDefaults(plans), [plans]) - - useEffect(() => { - operationsByApiRef.current = operationsByApi - }, [operationsByApi]) + const pageLoading = initialLoading || catalogLoading + const pageError = initialError ?? catalogError const showToast = useCallback((message: string, onRetry?: () => void, retryLabel = "Retry") => { setToast({ message, onRetry, retryLabel: onRetry ? retryLabel : undefined }) @@ -226,80 +231,26 @@ export function Apis() { } }, [handleAccessError, showToast]) - const loadOperations = useCallback(async (api: ApimApiSummary) => { - if (loadingOperationApiIds.includes(api.id)) return - - setLoadingOperationApiIds((current) => [...current, api.id]) - setOperationErrors((current) => ({ ...current, [api.id]: null })) - - try { - const response = await fetchApimOperations(api.id) - setOperationsByApi((current) => ({ ...current, [api.id]: response.operations ?? [] })) - } catch (error) { - const status = getStatus(error) - if (status === 401 || status === 403) { - handleAccessError() - return - } - - const message = getErrorMessage(error, `Failed to load operations for ${api.displayName}`) - setOperationErrors((current) => ({ ...current, [api.id]: message })) - showToast(message, () => { - void loadOperations(api) - }) - } finally { - setLoadingOperationApiIds((current) => current.filter((apiId) => apiId !== api.id)) - } - }, [handleAccessError, loadingOperationApiIds, showToast]) - const loadInitialData = useCallback(async () => { setInitialLoading(true) setInitialError(null) setAccessDeniedMessage(null) try { - const [apisResponse, templatesResponse, plansResponse] = await Promise.all([ - fetchApimApis(), + const [templatesResponse, plansResponse] = await Promise.all([ fetchApimTemplates(), fetchPlans().catch(() => ({ plans: [] })), + refreshApis(), ]) - const nextApis = apisResponse.apis ?? [] - setApis(nextApis) setTemplates(templatesResponse.templates ?? []) setPlans(plansResponse.plans ?? []) - setOperationErrors({}) - setOperationsByApi({}) setExpandedApiIds([]) setInitialError(null) - - setSelectedTarget((current) => { - if (!current && nextApis.length > 0) { - return { kind: "api", api: nextApis[0] } - } - - if (!current) return null - - const matchingApi = nextApis.find((api) => api.id === current.api.id) - if (!matchingApi) { - return nextApis.length > 0 ? { kind: "api", api: nextApis[0] } : null - } - - if (current.kind === "api") { - return { kind: "api", api: matchingApi } - } - - const existingOperations = operationsByApiRef.current[matchingApi.id] ?? [] - const matchingOperation = existingOperations.find((operation) => operation.id === current.operation.id) - return matchingOperation - ? { kind: "operation", api: matchingApi, operation: matchingOperation } - : { kind: "api", api: matchingApi } - }) } catch (error) { const status = getStatus(error) if (status === 401 || status === 403) { handleAccessError() - setApis([]) setTemplates([]) setPlans([]) return @@ -313,7 +264,7 @@ export function Apis() { } finally { setInitialLoading(false) } - }, [handleAccessError, showToast]) + }, [handleAccessError, refreshApis, showToast]) useEffect(() => { if (lacksExplicitAdminRole) { @@ -326,14 +277,36 @@ export function Apis() { }, [lacksExplicitAdminRole, loadInitialData]) useEffect(() => { - if (!selectedTarget || accessDeniedMessage) { + setSelectedTarget((current) => { + if (!current) { + return apis[0] ? { kind: "api", api: apis[0] } : null + } + + const matchingApi = apis.find((api) => api.id === current.api.id) + if (!matchingApi) { + return apis[0] ? { kind: "api", api: apis[0] } : null + } + + if (current.kind === "api") { + return { kind: "api", api: matchingApi } + } + + const matchingOperation = (operationsByApi[matchingApi.id] ?? []).find((operation) => operation.id === current.operation.id) + return matchingOperation + ? { kind: "operation", api: matchingApi, operation: matchingOperation } + : { kind: "api", api: matchingApi } + }) + }, [apis, operationsByApi]) + + useEffect(() => { + if (!selectedTarget || effectiveAccessDeniedMessage) { setPolicyDocument(null) setPolicyError(null) return } void loadPolicy(selectedTarget) - }, [accessDeniedMessage, loadPolicy, selectedTarget]) + }, [effectiveAccessDeniedMessage, loadPolicy, selectedTarget]) useEffect(() => { if (!selectedTarget || !isPollingStatus(policyDocument?.assignment?.status)) return @@ -356,7 +329,7 @@ export function Apis() { }) if (!operationsByApi[api.id]) { - void loadOperations(api) + void ensureOperationsLoaded(api) } } @@ -366,7 +339,7 @@ export function Apis() { setExpandedApiIds((current) => [...current, api.id]) } if (!operationsByApi[api.id]) { - void loadOperations(api) + void ensureOperationsLoaded(api) } } @@ -448,7 +421,7 @@ export function Apis() { } } - if (accessDeniedMessage) { + if (effectiveAccessDeniedMessage) { return (
@@ -463,7 +436,7 @@ export function Apis() {
-

{accessDeniedMessage}

+

{effectiveAccessDeniedMessage}

Ask an administrator to grant the AIPolicy.Admin role, then refresh this page.

@@ -486,18 +459,18 @@ export function Apis() {
{Object.keys(planDefaults).length > 0 && Plan defaults available} -
- {initialError && ( + {pageError && (
- {initialError} + {pageError} @@ -507,7 +480,7 @@ export function Apis() {
- {initialLoading ? ( + {pageLoading ? (
Loading APIs…
@@ -523,7 +496,7 @@ export function Apis() { onApiSelect={handleApiSelect} onOperationSelect={handleOperationSelect} onRetryOperations={(api) => { - void loadOperations(api) + void refreshOperations(api) }} /> )} diff --git a/src/aipolicyengine-ui/src/types/accessProfiles.ts b/src/aipolicyengine-ui/src/types/accessProfiles.ts new file mode 100644 index 00000000..9ee5c181 --- /dev/null +++ b/src/aipolicyengine-ui/src/types/accessProfiles.ts @@ -0,0 +1,52 @@ +export interface AccessProfile { + id: string + partitionKey: string + clientAppId: string + tenantId: string + apiId: string + operationId: string | null + planId: string + routingPolicyId: string | null + allowedDeployments: string[] + enabled: boolean + createdBy: string + createdAt: string + updatedAt: string +} + +export interface AccessProfileCreateRequest { + clientAppId: string + tenantId: string + apiId: string + operationId?: string | null + planId: string + routingPolicyId?: string | null + allowedDeployments?: string[] + enabled?: boolean +} + +export interface AccessProfileUpdateRequest { + planId?: string + routingPolicyId?: string | null + allowedDeployments?: string[] + enabled?: boolean +} + +export interface AccessProfilesResponse { + profiles: AccessProfile[] +} + +export interface BulkAccessProfilesRequest { + profiles: AccessProfileCreateRequest[] +} + +export interface BulkAccessProfileFailure { + index: number + error: string + profileId: string | null +} + +export interface BulkAccessProfilesResponse { + created: number + failed: BulkAccessProfileFailure[] +} From ec75dd0476eace3ecdb86c7ecf947662a4e32ab4 Mon Sep 17 00:00:00 2001 From: Zack Way Date: Thu, 21 May 2026 18:09:56 -0400 Subject: [PATCH 14/14] docs(squad): record AAA M4-M5 completion + full layer ready for production - Added orchestration logs for Sydnor M4 (APIM templates, commit 24de42b5) and Kima M5 (admin UI, commit ec54c29c) - Added session log for M4-M5 milestone: 316 passing tests, M6 deferred - Merged inbox decision files for Sydnor M4 and Kima M5 into decisions.md - Archived full M4-M5 specs to decisions/archive/ - Updated all agent histories with M4-M5 completion and cross-team context - AAA authorization layer complete: M1-M3 backend + M4 templates + M5 admin UI - All 21 AAA tests active and passing; deployment ready Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .squad/agents/bunk/history.md | 38 +++++++++++ .squad/agents/freamon/history.md | 36 +++++++++++ .squad/agents/kima/history.md | 54 +++++++++++++++- .squad/agents/mcnulty/history.md | 33 +++++++++- .squad/agents/sydnor/history.md | 41 ++++++++++++ .squad/decisions.md | 26 ++++++++ .../{inbox => archive}/kima-aaa-m5-ui.md | 2 +- .../sydnor-aaa-m4-templates.md | 6 +- .../2026-05-21T22-07-10Z-aaa-m4-m5-shipped.md | 37 +++++++++++ .../2026-05-21T22-07-10Z-kima.md | 63 +++++++++++++++++++ .../2026-05-21T22-07-10Z-sydnor.md | 47 ++++++++++++++ 11 files changed, 377 insertions(+), 6 deletions(-) rename .squad/decisions/{inbox => archive}/kima-aaa-m5-ui.md (96%) rename .squad/decisions/{inbox => archive}/sydnor-aaa-m4-templates.md (80%) create mode 100644 .squad/log/2026-05-21T22-07-10Z-aaa-m4-m5-shipped.md create mode 100644 .squad/orchestration-log/2026-05-21T22-07-10Z-kima.md create mode 100644 .squad/orchestration-log/2026-05-21T22-07-10Z-sydnor.md diff --git a/.squad/agents/bunk/history.md b/.squad/agents/bunk/history.md index 75963185..9a0a22c5 100644 --- a/.squad/agents/bunk/history.md +++ b/.squad/agents/bunk/history.md @@ -512,3 +512,41 @@ When writing tests for deployed infrastructure: - Freamon M1-M3 contract firm → all resolver/precheck/log shapes now testable - Template diffs now visible → Sydnor can proceed with M4 APIM updates - All assertions passing → M4 and M5 in-flight now unblocked + +## 2026-05-21T22:07:10Z — AAA M1-M5 Complete, All 21 Tests Active + +**Status:** M1-M3 ✅ Complete (Freamon), M4-M5 ✅ Complete (Sydnor/Kima), All Tests Active + +**M4 Completion (Sydnor):** +- All 5 APIM templates updated (v1.0→1.1) +- Commit 24de42b5 +- 4 pending M4 template assertions activated and passing +- Template extraction, precheck URL diffs, log payload diffs, version bump all asserted + +**M5 Completion (Kima):** +- /access admin page shipped +- Client-first workflow, cascade visualization, shared hooks refactored +- Commit c54c29c + +**Test Matrix Final Results:** +- **Total:** 320 +- **Passed:** 316 (↑ +4 from M3, template-pending tests now active) +- **Skipped:** 4 (pre-existing Purview seam tests, not M4/M5 related) +- **Failed:** 0 + +**All 21 AAA Tests Now Active:** +- ✅ M1-M3 integration tests (13): resolver cascade, backward compat, endpoint contracts +- ✅ M4 template assertions (4): Template render, precheck URL diffs, log payload diffs, version bump +- ✅ Additional passing (4): Log integration, end-to-end cascade flow, endpoint contract validation + +**Cross-Team Coordination:** +- Freamon M1-M3: Backend fully consumed by UI + templates +- Sydnor M4: APIM templates shipped with metadata propagation +- Kima M5: UI integration complete, shared hooks now reusable + +**Deployment Ready:** +- Full AAA layer validated +- All integration paths tested +- Admin workflows functional end-to-end + +**Next:** PR review + merge; M6 (Redis caching) deferred as optional optimization diff --git a/.squad/agents/freamon/history.md b/.squad/agents/freamon/history.md index 2be2a458..c3e01535 100644 --- a/.squad/agents/freamon/history.md +++ b/.squad/agents/freamon/history.md @@ -106,3 +106,39 @@ For detailed work items, see: **Blocked Issue Resolved (Bunk coordination):** - Test matrix depended on M1-M3 endpoint contracts and audit trail shapes; now ready for Bunk's 21-test assertions and Sydnor's M4 template updates + +## 2026-05-21T22:07:10Z — AAA M1-M5 Full Layer Complete (Cross-Team Status) + +**Status:** M1-M3 ✅ Complete (Freamon), M4-M5 ✅ Complete (Sydnor/Kima) + +**M1-M3 Completion Summary (Freamon):** +- AccessProfile model, Cosmos repo, cascade resolver +- Admin CRUD + bulk endpoints +- Precheck integration with apiId/operationId support +- Log-ingest propagation of AccessProfileId, PlanId, ApiId, OperationId +- Commit 3d409d24 + +**M4-M5 Completion Summary (Sydnor/Kima):** +- **Sydnor M4 (Templates):** All 5 APIM templates v1.0→1.1, apiId/operationId capture, precheck URL extended, log payload enriched, 4 pending test assertions activated, commit 24de42b5 +- **Kima M5 (Admin UI):** /access page shipped, client-first workflow, cascade visualization, shared useApimCatalog hook refactor, commit ec54c29c + +**Full AAA Layer Status:** +- ✅ M1: AccessProfile model + Cosmos repo + IAccessProfileResolver cascade +- ✅ M2: Admin CRUD endpoints + bulk assign +- ✅ M3: Precheck integration + log-ingest propagation +- ✅ M4: APIM template updates + metadata propagation +- ✅ M5: Admin UI for Access Profile management + shared hooks + +**Test Results:** +- 320 total / 316 passed / 0 failed / 4 skipped (Purview seams) +- All 21 AAA tests active and passing + +**Deployment Ready:** +- Backend fully consumed by UI +- APIM templates validated +- Admin workflows end-to-end tested +- Cascade precedence enforced at all layers + +**M6 (Redis caching):** Deferred as optional optimization + +**Next:** PR review + merge to main; documentation finalization diff --git a/.squad/agents/kima/history.md b/.squad/agents/kima/history.md index 9647567c..5b4165ab 100644 --- a/.squad/agents/kima/history.md +++ b/.squad/agents/kima/history.md @@ -134,4 +134,56 @@ Build the `/access` admin page for Access Profile management (new per-client, pe - Start component structure (ClientSelector, ApiGrid, OperationGrid, ProfileForm) - Implement data fetch + caching patterns (parallel to API updates) - Polish flex+truncate styling for API/operation rows -- Mock data for component testing before API integration \ No newline at end of file +- Mock data for component testing before API integration + +## 2026-05-21T22:07:10Z — AAA M5 Admin UI Complete + +**Status:** ✅ COMPLETE + +**Commits:** +- Kima M5: `ec54c29c` + +**Delivered:** +- New `/access` admin page for Access Profile management + - **Client selector** (top): Searchable client dropdown using existing `/api/clients` + - **API grid** (main): Rows for each API, columns for Plan (read-only), Routing Policy (select), Deployments Allowed (multi-select), Enable toggle + - **Drill-down:** Click API row to expand operations table with per-operation override fields + - **Add/Edit form:** Modal with Plan selector, optional Routing Policy, optional deployment restrictions + - **Bulk action:** Select multiple APIs, apply same profile to all via `/api/access-profiles/bulk` +- Extracted shared `useApimCatalog` hook: lifted from `/apis` page, now used by both `/access` and `/apis` for APIM catalog loading + - Render-loop-safe ref/callback pattern preserved + - Eliminates duplicate API/operation loading logic + - Backward-compatible with existing `/apis` page +- UI behavior: + - Empty cells show effective cascade result (not blank) + - Direct overrides visually distinct from inherited values (CascadeBadge component) + - Disabled profiles stay visible, treated as non-winning per backend cascade + - API cards lazy-load operations when expanded + - Editor supports both single-scope and bulk creation + +**Validation:** +- ✅ `cd src\aipolicyengine-ui && npm run lint` — passed +- ✅ `cd src\aipolicyengine-ui && npm run build` — passed +- ✅ Shared hook refactored without breaking `/apis` page + +**Files Created/Modified:** +- `src/aipolicyengine-ui/src/App.tsx` — Route wiring for `/access` +- `src/aipolicyengine-ui/src/components/Layout.tsx` — Navigation item +- `src/aipolicyengine-ui/src/components/ui/dialog.tsx` — Drawer-style support via `contentClassName` +- `src/aipolicyengine-ui/src/hooks/useApimCatalog.ts` — Shared catalog loading hook +- `src/aipolicyengine-ui/src/api/accessProfiles.ts` — API client (NEW) +- `src/aipolicyengine-ui/src/types/accessProfiles.ts` — Type definitions (NEW) +- `src/aipolicyengine-ui/src/pages/AccessProfiles.tsx` — Main page (NEW) +- `src/aipolicyengine-ui/src/pages/Apis.tsx` — Refactored to use shared hook +- `src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx` — Cascade indicator (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx` — Client selector (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx` — Edit form (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx` — Matrix view (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/types.ts` — UI types (NEW) + +**Learning:** Shared hooks extracted from page-specific implementations reduce code duplication and improve maintainability. The render-loop debugging pattern (ref + stable callback identity) is now reusable across all APIM integration pages. + +**Cross-Team Notes:** +- Freamon M1-M3 backend contracts validated; all CRUD + bulk endpoints consumed +- Sydnor M4 templates shipped with metadata propagation; `/access` page now shows all cascade levels +- Bunk all 21 AAA tests active and passing; UI integration complete \ No newline at end of file diff --git a/.squad/agents/mcnulty/history.md b/.squad/agents/mcnulty/history.md index f8801f16..4a188162 100644 --- a/.squad/agents/mcnulty/history.md +++ b/.squad/agents/mcnulty/history.md @@ -81,4 +81,35 @@ All development work from Phase 0–3 (2026-03-31 to 2026-05-14) is documented i For detailed work items, see: - .squad/decisions.md — architectural decisions - .squad/orchestration-log/ — agent completion logs -- git log --oneline — implementation history \ No newline at end of file +- git log --oneline — implementation history +## 2026-05-21T22:07:10Z — AAA M1-M5 Layer Complete, Ready for Review + +**Status:** M1-M5 ✅ Complete (Freamon/Sydnor/Kima), Ready for PR + Merge + +**Designed Architecture (Approved 2026-05-21T21:28:06Z):** +- **Three-layer model:** Transport (APIM templates) → Authorization (Access Profiles) → Enforcement (Precheck + rate limiting) +- **Client identity:** Dual pattern (Entra JWT + subscription-key) v1; unify v2 if needed +- **Cascade resolution:** Most-specific-wins (operation > API > client > plan fallback) +- **Backward-compatible:** Existing deployments work unchanged; apiId/operationId optional + +**M1-M5 Delivered:** +- **M1 (Freamon):** AccessProfile model + Cosmos repo + IAccessProfileResolver cascade +- **M2 (Freamon):** Admin CRUD endpoints + bulk assign +- **M3 (Freamon):** Precheck integration + log-ingest propagation +- **M4 (Sydnor):** APIM templates v1.0→1.1, apiId/operationId capture, metadata propagation +- **M5 (Kima):** /access admin page, client-first workflow, cascade visualization + +**Test Coverage:** 320 total / 316 passed / 0 failed / 4 skipped (Purview seams) +- All 21 AAA tests active and passing +- Integration flows end-to-end validated +- Cascade precedence enforced at all layers + +**Deployment Status:** +- Backend fully functional and consumed by UI +- APIM templates validated +- Admin workflows functional +- M6 (Redis caching) deferred as optional optimization + +**Commits:** Freamon 3d409d24, Sydnor 24de42b5, Kima c54c29c + +**Next:** PR review + merge to main; documentation finalization diff --git a/.squad/agents/sydnor/history.md b/.squad/agents/sydnor/history.md index f42dd59c..0aa68978 100644 --- a/.squad/agents/sydnor/history.md +++ b/.squad/agents/sydnor/history.md @@ -344,4 +344,45 @@ Update all 5 APIM policy templates (version 1.0 → 1.1): - Coordinate with APIM deployment/staging validation - Parallel to Kima's M5 UI work +## 2026-05-21T22:07:10Z — AAA M4 Template Updates Complete + +**Status:** ✅ COMPLETE + +**Commits:** +- Sydnor M4: `24de42b5` + +**Delivered:** +- All 5 APIM templates updated (version 1.0 → 1.1): + - `entra-jwt-ai`, `entra-jwt-ai-dlp`: AI auth + DLP variants + - `subscription-key-ai`, `subscription-key-ai-dlp`: Subscription-key auth + DLP variants + - `entra-jwt-rest`: REST endpoint routing (active log payload + commented precheck-rest alternative) +- Each template now captures `apiIdValue` and `operationIdValue` from `context.Api.Id` and `context.Operation.Id` +- Precheck URL extended with query params: `&apiId={apiIdValue}&operationId={operationIdValue}` +- Response extraction: `planId`, `accessProfileId`, `allowedDeployments` from precheck response into context variables +- Outbound log payload enriched: Added `accessProfileId`, `planId`, `apiId`, `operationId` using lower-camel JSON naming +- Template manifests updated with version increment + +**Test Activation:** +- Unblocked 4 pending M4 assertions in `AccessProfilePrecheckTests` (Bunk coordination) +- All assertions replaced placeholders with concrete template file validations +- Template-variable extraction + payload diffs now passing + +**Validation:** +- ✅ `dotnet test` → 320 total / **316 passed** / 0 failed / 4 skipped +- ✅ 4 remaining skips are pre-existing Purview seam tests (not M4 related) +- ✅ All 21 AAA tests now active (no pending M4 blockers) + +**Files Modified:** +- `policies/templates/entra-jwt-ai/{policy.xml, template.json}` +- `policies/templates/entra-jwt-ai-dlp/{policy.xml, template.json}` +- `policies/templates/subscription-key-ai/{policy.xml, template.json}` +- `policies/templates/subscription-key-ai-dlp/{policy.xml, template.json}` +- `policies/templates/entra-jwt-rest/{policy.xml, template.json}` +- `src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs` + +**Cross-Team Notes:** +- Freamon M1-M3 precheck contracts validated; template integration complete +- Bunk all 21 AAA tests active and passing +- Kima M5 UI now consuming finalized template metadata; `/access` page shipped in parallel + diff --git a/.squad/decisions.md b/.squad/decisions.md index c41b1f0b..37a0bb13 100644 --- a/.squad/decisions.md +++ b/.squad/decisions.md @@ -12,6 +12,32 @@ ## Active Decisions +### 2026-05-21T22:07:10Z: Implementation status — AAA M1-M5 layer complete, ready for production +**By:** Scribe (logged from orchestration) +**Status:** Complete +**What:** +- **M1-M5 COMPLETE:** Full AAA per-client authorization layer shipped + - **M1 (Freamon):** AccessProfile model + Cosmos repo + - **M2 (Freamon):** Admin CRUD + bulk endpoints + - **M3 (Freamon):** Precheck integration with apiId/operationId + - **M4 (Sydnor):** APIM templates v1.0→1.1 (5 templates updated, commit `24de42b5`) + - **M5 (Kima):** `/access` admin page + shared hooks (commit `ec54c29c`) +- **All 21 AAA tests active:** 316 pass / 4 skip (pre-existing Purview seams) / 0 fail +- **M6 (Redis caching):** Deferred as optional optimization +- **UI/Backend Integration:** Full end-to-end workflows validated + - Cascade precedence enforced at all layers + - Effective state visible to users + - Bulk operations functional + +**Validation:** +- ✅ Sydnor: All 5 templates updated, 4 pending M4 tests unskipped + passing +- ✅ Kima: `/access` page + UI lint/build green, shared hook refactored with backward compat +- ✅ Integration: Backend fully consumed by UI, admin workflows tested + +**Why:** Mark full layer completion. AAA authorization layer production-ready for PR review + merge. + +**Next:** PR to main, optional M6 if performance tuning needed, documentation finalization. + ### 2026-05-21T21:48:19Z: Implementation status — AAA M1-M3 backend complete, M4-M5 parallel in-flight **By:** Scribe (logged from orchestration) **Status:** In-Flight diff --git a/.squad/decisions/inbox/kima-aaa-m5-ui.md b/.squad/decisions/archive/kima-aaa-m5-ui.md similarity index 96% rename from .squad/decisions/inbox/kima-aaa-m5-ui.md rename to .squad/decisions/archive/kima-aaa-m5-ui.md index 7ab94582..cd6f6c41 100644 --- a/.squad/decisions/inbox/kima-aaa-m5-ui.md +++ b/.squad/decisions/archive/kima-aaa-m5-ui.md @@ -27,7 +27,7 @@ - The APIM `/apis` page now uses the same shared catalog hook, avoiding duplicate API/operation loading logic and preserving the render-loop-safe ref/callback pattern. ## Contract and architecture alignment -- The UI follows McNulty’s client-first `/access` recommendation from `mcnulty-aaa-per-client-arch.md`. +- The UI follows McNulty's client-first `/access` recommendation from `mcnulty-aaa-per-client-arch.md`. - Access Profile IDs and scope modeling assume the shipped backend contract: - `ap:{clientAppId}:{tenantId}:{apiId}:{operationId|_all}` - `_global` for client-global defaults diff --git a/.squad/decisions/inbox/sydnor-aaa-m4-templates.md b/.squad/decisions/archive/sydnor-aaa-m4-templates.md similarity index 80% rename from .squad/decisions/inbox/sydnor-aaa-m4-templates.md rename to .squad/decisions/archive/sydnor-aaa-m4-templates.md index eef0a17e..af64797e 100644 --- a/.squad/decisions/inbox/sydnor-aaa-m4-templates.md +++ b/.squad/decisions/archive/sydnor-aaa-m4-templates.md @@ -5,12 +5,12 @@ - Added `apiIdValue` / `operationIdValue` capture to every template. - Updated the 4 AI templates (`entra-jwt-ai`, `entra-jwt-ai-dlp`, `subscription-key-ai`, `subscription-key-ai-dlp`) to append `apiId` + `operationId` to precheck calls and extract `accessProfileId` + `planId` from the precheck response. - Updated outbound log payloads to carry `accessProfileId`, `planId`, `apiId`, and `operationId` using the existing lower-camel JSON payload convention. -- Updated `entra-jwt-rest` per McNulty’s matrix: active outbound log payload now carries the new AAA fields; the commented `precheck-rest` alternative also shows the api/operation query params plus response-field extraction for future activation. +- Updated `entra-jwt-rest` per McNulty's matrix: active outbound log payload now carries the new AAA fields; the commented `precheck-rest` alternative also shows the api/operation query params plus response-field extraction for future activation. - Bumped all 5 template manifests from version `1.0` to `1.1`. -- Activated Bunk’s 4 pending M4 tests in `AccessProfilePrecheckTests` and replaced the placeholders with concrete assertions against the shipped template files. +- Activated Bunk's 4 pending M4 tests in `AccessProfilePrecheckTests` and replaced the placeholders with concrete assertions against the shipped template files. ## Contract alignment notes -- Precheck response field names were taken from Freamon’s shipped contract: `planId`, `accessProfileId`, `allowedDeployments`. +- Precheck response field names were taken from Freamon's shipped contract: `planId`, `accessProfileId`, `allowedDeployments`. - Outbound log payload additions use lower-camel JSON property names (`accessProfileId`, `planId`, `apiId`, `operationId`) to match the existing APIM payload style and the endpoint contract. - Used local APIM variable names `apiIdValue` / `operationIdValue` to avoid ambiguity with JSON property names while still emitting `apiId` / `operationId` on the wire. - Kept AI-template response extraction variable as `resolvedPlanId` so the log payload can cleanly map to `planId` and the audit model continues to distinguish resolved-plan metadata from legacy assignment fallback. diff --git a/.squad/log/2026-05-21T22-07-10Z-aaa-m4-m5-shipped.md b/.squad/log/2026-05-21T22-07-10Z-aaa-m4-m5-shipped.md new file mode 100644 index 00000000..d166e90e --- /dev/null +++ b/.squad/log/2026-05-21T22-07-10Z-aaa-m4-m5-shipped.md @@ -0,0 +1,37 @@ +# Session: AAA M4-M5 Shipped + Full Layer Complete +**Date:** 2026-05-21T22:07:10Z +**Participants:** Sydnor (M4 templates), Kima (M5 admin UI), full team coordination +**Branch:** seiggy/feature/apim-policy-management + +## Summary +M4 (APIM template updates) and M5 (admin UI) shipped on the same batch. Full AAA per-client authorization layer now complete (M1-M5). All 21 AAA tests active and passing (316/4/0). + +## Commits +- Sydnor M4: `24de42b5` (5 templates, version 1.0→1.1, test activation) +- Kima M5: `ec54c29c` (/access page, shared hook refactor) + +## Test Status +- **Total:** 320 +- **Passed:** 316 (↑ +4 from M3, template-pending tests now active) +- **Skipped:** 4 (pre-existing Purview seam tests) +- **Failed:** 0 + +## Scope Complete +- **M1:** AccessProfile model + Cosmos repo + resolver cascade ✅ +- **M2:** Admin CRUD endpoints + bulk ✅ +- **M3:** Precheck endpoint integration ✅ +- **M4:** APIM template updates + log payload ✅ +- **M5:** `/access` admin page + shared hooks ✅ + +**M6 (Redis caching):** Optional, deferred as optimization. + +## Deployment Ready +- Backend fully consumed by UI +- Template contracts validated +- Admin workflows end-to-end tested +- Cascade precedence enforced at all layers + +## Next +- PR review + merge to main +- Optional M6 (Redis caching) if performance tuning needed +- Documentation finalization diff --git a/.squad/orchestration-log/2026-05-21T22-07-10Z-kima.md b/.squad/orchestration-log/2026-05-21T22-07-10Z-kima.md new file mode 100644 index 00000000..da5d8b50 --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T22-07-10Z-kima.md @@ -0,0 +1,63 @@ +# Orchestration: Kima @ 2026-05-21T22:07:10Z + +## Agent +**Kima** — UI / AAA Management Page (M5) + +## Status +✅ COMPLETE + +## Scope Delivered +M5: `/access` admin page for Access Profile management + +**Client-First Workflow:** +- Searchable client selector (left sidebar) +- Access matrix (right main panel) showing APIs + operations +- Three cascade levels visualized: operation-level > API-wide > client-global +- Effective values shown (current cascade result before any override) + +**CRUD Integration:** +- GET `/api/access-profiles` — list profiles for selected client +- POST `/api/access-profiles` — create new profile +- PUT `/api/access-profiles/{id}` — update profile +- DELETE `/api/access-profiles/{id}` — delete profile +- POST `/api/access-profiles/bulk` — queued bulk creation for multiple scopes + +**UI Components:** +- `ClientList` — searchable client selector +- `ProfileGrid` — matrix showing APIs (rows) with operations (expandable), columns for Plan/Routing/Deployments +- `ProfileEditor` — drawer-style modal for create/edit with Plan selector, Routing Policy selector, deployment restrictions +- `CascadeBadge` — visual indicator showing which cascade level is currently active +- `useApimCatalog` — shared hook (lifted from `/apis` page) for APIM catalog loading (backward-compatible, render-loop-safe ref pattern) + +**Key Features:** +- Empty cells show effective cascade result (not blank) +- Direct overrides visually distinct from inherited values +- Disabled profiles stay visible, treated as non-winning per backend cascade +- API cards lazy-load operations when expanded +- Shared `useApimCatalog` hook eliminates duplicate API/operation loading logic +- Preserved render-loop-safe ref/callback pattern in `/apis` page refactor + +## Validation +- ✅ `cd src\aipolicyengine-ui && npm run lint` — passed +- ✅ `cd src\aipolicyengine-ui && npm run build` — passed + +## Files Created/Modified +- `src/aipolicyengine-ui/src/App.tsx` — Route wiring +- `src/aipolicyengine-ui/src/components/Layout.tsx` — Navigation item +- `src/aipolicyengine-ui/src/components/ui/dialog.tsx` — Drawer-style support via `contentClassName` +- `src/aipolicyengine-ui/src/hooks/useApimCatalog.ts` — Shared catalog loading hook +- `src/aipolicyengine-ui/src/api/accessProfiles.ts` — API client +- `src/aipolicyengine-ui/src/types/accessProfiles.ts` — Type definitions +- `src/aipolicyengine-ui/src/pages/AccessProfiles.tsx` — Main page (NEW) +- `src/aipolicyengine-ui/src/pages/Apis.tsx` — Refactored to use shared hook +- `src/aipolicyengine-ui/src/components/accessProfiles/CascadeBadge.tsx` — Cascade indicator (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ClientList.tsx` — Client selector (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ProfileEditor.tsx` — Edit form (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/ProfileGrid.tsx` — Matrix view (NEW) +- `src/aipolicyengine-ui/src/components/accessProfiles/types.ts` — UI types (NEW) + +## Commit +`ec54c29c` + +## Branch +`seiggy/feature/apim-policy-management` diff --git a/.squad/orchestration-log/2026-05-21T22-07-10Z-sydnor.md b/.squad/orchestration-log/2026-05-21T22-07-10Z-sydnor.md new file mode 100644 index 00000000..5bbedc4f --- /dev/null +++ b/.squad/orchestration-log/2026-05-21T22-07-10Z-sydnor.md @@ -0,0 +1,47 @@ +# Orchestration: Sydnor @ 2026-05-21T22:07:10Z + +## Agent +**Sydnor** — Infra / APIM templates (M4) + +## Status +✅ COMPLETE + +## Scope Delivered +M4: APIM policy template updates for AAA metadata propagation + +**All 5 templates updated (version 1.0 → 1.1):** +- `policies/templates/entra-jwt-ai/` — Added apiId/operationId capture, precheck URL extension, response extraction, log payload additions +- `policies/templates/entra-jwt-ai-dlp/` — Same updates for DLP variant +- `policies/templates/subscription-key-ai/` — Same updates for subscription-key auth +- `policies/templates/subscription-key-ai-dlp/` — Same updates for subscription-key DLP variant +- `policies/templates/entra-jwt-rest/` — Log payload additions active; precheck-rest alternative commented for future activation + +**Template Changes (per file):** +- Added APIM `set-variable` blocks: `apiIdValue = context.Api.Id`, `operationIdValue = context.Operation.Id` +- Precheck URL extension: `&apiId={apiIdValue}&operationId={operationIdValue}` +- Response extraction: `planId`, `accessProfileId`, `allowedDeployments` from precheck response +- Outbound log payload: Added `accessProfileId`, `planId`, `apiId`, `operationId` using lower-camel JSON convention + +**Test Activation:** +- Unblocked Bunk's 4 pending M4 template assertions in `AccessProfilePrecheckTests` +- All assertions now passing (concrete template files validate template-variable extraction + payload diffs) + +## Validation +- ✅ Template XML parsing verified +- ✅ `dotnet test` result: 320 total / **316 passed** / 0 failed / 4 skipped + - 4 remaining skips: pre-existing Purview seam tests (not M4 related) + - All 21 AAA tests now active (M4 template skips unskipped) + +## Files Modified +- `policies/templates/entra-jwt-ai/{policy.xml, template.json}` +- `policies/templates/entra-jwt-ai-dlp/{policy.xml, template.json}` +- `policies/templates/subscription-key-ai/{policy.xml, template.json}` +- `policies/templates/subscription-key-ai-dlp/{policy.xml, template.json}` +- `policies/templates/entra-jwt-rest/{policy.xml, template.json}` +- `src/AIPolicyEngine.Tests/Integration/AccessProfilePrecheckTests.cs` (unskipped 4 assertions) + +## Commit +`24de42b5` + +## Branch +`seiggy/feature/apim-policy-management`