diff --git a/.gitignore b/.gitignore index d4e5bb9..b643784 100644 --- a/.gitignore +++ b/.gitignore @@ -90,6 +90,12 @@ agent/gitleaks-report.json .env.* .claude/settings.local.json +# ────────────────────────────────────────────── +# Claude Code plugins +# ────────────────────────────────────────────── +.mcp.json +.remember/ + # ────────────────────────────────────────────── # Misc # ────────────────────────────────────────────── diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 04cbdfd..613657f 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -54,6 +54,14 @@ repos: files: ^agent/.*\.py$ stages: [pre-commit] + - id: docs-sync + name: sync docs → Starlight mirrors + entry: bash -lc 'cd "$(git rev-parse --show-toplevel)/docs" && node scripts/sync-starlight.mjs && git add src/content/docs/' + language: system + pass_filenames: false + files: ^(docs/(design|guides)/.*\.md$|CONTRIBUTING\.md$) + stages: [pre-commit] + - id: docs-astro-check name: astro check (docs) entry: bash -lc 'cd "$(git rev-parse --show-toplevel)/docs" && ./node_modules/.bin/astro check' diff --git a/AGENTS.md b/AGENTS.md index a466d6c..b254c35 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -38,6 +38,7 @@ Handler entry tests: `cdk/test/handlers/orchestrate-task.test.ts`, `create-task. ### Common mistakes - Editing **`docs/src/content/docs/`** instead of **`docs/guides/`** or **`docs/design/`** — content is generated; sync from sources. +- Adding or editing files in **`docs/design/`** or **`docs/guides/`** without running **`cd docs && node scripts/sync-starlight.mjs`** — CI will reject ("Fail build on mutation") because the Starlight mirror files in `docs/src/content/docs/` are stale. Always commit the regenerated mirrors alongside source changes. - Changing **`cdk/.../types.ts`** without updating **`cli/src/types.ts`** — CLI and API drift. 
- Running raw **`jest`/`tsc`/`cdk`** from muscle memory — prefer **`mise //cdk:test`**, **`mise //cdk:compile`**, **`mise //cdk:synth`** (see [Commands you can use](#commands-you-can-use)). - **`MISE_EXPERIMENTAL=1`** — required for namespaced tasks like **`mise //cdk:build`** (see [CONTRIBUTING.md](./CONTRIBUTING.md)). @@ -120,7 +121,7 @@ To build or test only the CLI subproject: ## Boundaries -- **Generated docs** — If you change docs sources (`docs/guides/`, `docs/design/`, `CONTRIBUTING.md`), run `mise //docs:sync` or `mise //docs:build`. +- **Generated docs (CI will reject if stale)** — Editing files in `docs/guides/`, `docs/design/`, or `CONTRIBUTING.md` requires regenerating Starlight mirrors under `docs/src/content/docs/`. Run **`cd docs && node scripts/sync-starlight.mjs`** (fast, <1 s) or **`mise //docs:sync`**, then commit the updated mirrors alongside your source changes. The pre-commit hook `docs-sync` does this automatically when prek hooks are installed, but if you bypass hooks (e.g. `--no-verify`), CI's "Fail build on mutation" step will catch it. - **Dependencies** — Add dependencies to the owning package `package.json` (`cdk/`, `cli/`, or `docs/`), then install via workspace/root install. -- **Build before commit** — Run a full build (`mise run build`) when done so tests/synth/docs/security checks stay in sync. +- **Build before commit** — Run a full build (`mise run build`) when done so tests/synth/docs/security checks stay in sync. This is especially critical for docs changes — the build includes `//docs:sync` which regenerates Starlight mirrors, and CI will fail if the committed mirrors don't match what the build produces. - **Major changes** — Before modifying existing files in a major way (large refactors, new stacks, changing the agent contract), ask first. 
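You can tell which edits trigger the `docs-sync` hook, and therefore need regenerated mirrors, by checking the `files` pattern from `.pre-commit-config.yaml` directly. The TypeScript sketch below mirrors that regex. The helper names `pathsNeedingDocsSync` and `needsDocsSync` are hypothetical and are not part of the repository.

```typescript
// Same pattern as the `files:` key of the `docs-sync` hook in
// .pre-commit-config.yaml (pre-commit applies it as a regex search).
const DOCS_SYNC_FILES = /^(docs\/(design|guides)\/.*\.md$|CONTRIBUTING\.md$)/;

// Hypothetical helper: returns the staged paths that would trigger docs-sync.
function pathsNeedingDocsSync(stagedPaths: readonly string[]): string[] {
  return stagedPaths.filter((path) => DOCS_SYNC_FILES.test(path));
}

// Hypothetical helper: true when the Starlight mirrors must be regenerated.
function needsDocsSync(stagedPaths: readonly string[]): boolean {
  return pathsNeedingDocsSync(stagedPaths).length > 0;
}

const staged = [
  "docs/design/DEPLOYMENT_ROLES.md", // matches docs/design/*.md
  "docs/astro.config.mjs", // not Markdown: no sync needed
  "CONTRIBUTING.md", // matches the second alternative
  "cdk/src/stacks/agent.ts",
];

// Returns ["docs/design/DEPLOYMENT_ROLES.md", "CONTRIBUTING.md"]
console.log(pathsNeedingDocsSync(staged));
console.log(needsDocsSync(["README.md"])); // false
```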
diff --git a/CLAUDE.md b/CLAUDE.md index 43c994c..f84c43e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1 +1,3 @@ @AGENTS.md + +See also [README.md](./README.md) for the Claude Code plugin (`docs/abca-plugin/`), which provides interactive guided workflows for setup, deployment, repository onboarding, task submission, and troubleshooting via `/setup`, `/deploy`, `/onboard-repo`, `/submit-task`, `/status`, and `/troubleshoot` skills. Run Claude Code with `claude --plugin-dir docs/abca-plugin` to activate it. diff --git a/docs/abca-plugin/skills/deploy/SKILL.md b/docs/abca-plugin/skills/deploy/SKILL.md index ca0471c..ca7e92f 100644 --- a/docs/abca-plugin/skills/deploy/SKILL.md +++ b/docs/abca-plugin/skills/deploy/SKILL.md @@ -81,3 +81,16 @@ After a successful deploy, remind the user to: - Store/update the GitHub PAT in Secrets Manager if this is a fresh deployment - Onboard repositories via Blueprint constructs if needed - Run a smoke test: `curl -s -H "Authorization: $TOKEN" $API_URL/tasks` + +## Least-Privilege Deployment + +By default, CDK bootstrap grants `AdministratorAccess` to the CloudFormation execution role. For production or security-sensitive accounts, create the three scoped managed policies first, then re-bootstrap with them (replace `ACCOUNT` and `REGION` with your account ID and Region): + +```bash +cdk bootstrap aws://ACCOUNT/REGION \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT:policy/IaCRole-ABCA-Infrastructure" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT:policy/IaCRole-ABCA-Application" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT:policy/IaCRole-ABCA-Observability" +``` + +See `docs/design/DEPLOYMENT_ROLES.md` (path relative to the repository root) for the complete least-privilege IAM policies, trust policy, runtime role inventory, and iterative tightening recommendations.
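If you script the re-bootstrap, you can generate the repeated `--cloudformation-execution-policies` flags from the three policy names instead of typing them by hand. This sketch assumes the policies already exist under the names shown above. `bootstrapArgs` is a hypothetical helper. You could pass the resulting array to `spawn("cdk", args)` from `node:child_process`.

```typescript
// Policy names from the least-privilege section above (assumed to exist already).
const ABCA_POLICY_NAMES = [
  "IaCRole-ABCA-Infrastructure",
  "IaCRole-ABCA-Application",
  "IaCRole-ABCA-Observability",
] as const;

// Hypothetical helper: builds the argv for `cdk bootstrap`, with one
// --cloudformation-execution-policies flag per managed policy.
function bootstrapArgs(account: string, region: string): string[] {
  if (!/^\d{12}$/.test(account)) {
    throw new Error(`expected a 12-digit AWS account ID, got "${account}"`);
  }
  const args = ["bootstrap", `aws://${account}/${region}`];
  for (const name of ABCA_POLICY_NAMES) {
    args.push("--cloudformation-execution-policies", `arn:aws:iam::${account}:policy/${name}`);
  }
  return args;
}

const exampleArgs = bootstrapArgs("123456789012", "us-east-1");
console.log(["cdk", ...exampleArgs].join(" "));
```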
diff --git a/docs/abca-plugin/skills/setup/SKILL.md b/docs/abca-plugin/skills/setup/SKILL.md index a0a86b6..fa786a2 100644 --- a/docs/abca-plugin/skills/setup/SKILL.md +++ b/docs/abca-plugin/skills/setup/SKILL.md @@ -52,11 +52,17 @@ If `mise run install` fails with "yarn: command not found", Corepack wasn't acti ## Phase 3: One-Time AWS Setup +On a fresh AWS account, X-Ray needs a CloudWatch Logs resource policy before it can write spans. Run the commands below in order: the first looks up your account ID, the second creates the policy, and the third sets the trace destination: + ```bash +ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +aws logs put-resource-policy \ + --policy-name xray-spans-policy \ + --policy-document "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"XRaySpansAccess\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"xray.amazonaws.com\"},\"Action\":[\"logs:PutLogEvents\",\"logs:CreateLogGroup\",\"logs:CreateLogStream\"],\"Resource\":[\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans\",\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans:*\"]}]}" aws xray update-trace-segment-destination --destination CloudWatchLogs ``` -This must be run once per AWS account before first deployment. +These must be run once per AWS account, in each Region you deploy to, before the first deployment. If the `put-resource-policy` step is skipped, the `update-trace-segment-destination` command fails with `AccessDeniedException`.
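Hand-escaping the `--policy-document` string is error-prone. As an alternative, you can build the same policy as an object and serialize it. `xraySpansPolicy` is a hypothetical helper, and the statement contents are copied from the command above. `JSON.stringify` keeps key insertion order, so once `${ACCOUNT_ID}` is substituted, the output matches the inline string character for character.

```typescript
// Minimal typing for the single statement used by the setup step.
interface SpansStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Principal: { Service: string };
  Action: string[];
  Resource: string[];
}

interface SpansPolicy {
  Version: "2012-10-17";
  Statement: SpansStatement[];
}

// Hypothetical helper: the CloudWatch Logs resource policy that lets X-Ray write spans.
function xraySpansPolicy(accountId: string): SpansPolicy {
  const logGroup = `arn:aws:logs:*:${accountId}:log-group:aws/spans`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "XRaySpansAccess",
        Effect: "Allow",
        Principal: { Service: "xray.amazonaws.com" },
        Action: ["logs:PutLogEvents", "logs:CreateLogGroup", "logs:CreateLogStream"],
        // The log group itself plus its log streams (":*").
        Resource: [logGroup, `${logGroup}:*`],
      },
    ],
  };
}

// Compact JSON, suitable as the --policy-document argument.
const policyJson = JSON.stringify(xraySpansPolicy("123456789012"));
console.log(policyJson);
```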
## Phase 4: First Deployment diff --git a/docs/astro.config.mjs b/docs/astro.config.mjs index d5c8885..9c50391 100644 --- a/docs/astro.config.mjs +++ b/docs/astro.config.mjs @@ -40,7 +40,10 @@ export default defineConfig({ { label: 'Introduction', slug: 'index' }, { label: 'Getting Started', - items: [{ label: 'Quick Start', slug: 'getting-started/quick-start' }], + items: [ + { label: 'Quick Start', slug: 'getting-started/quick-start' }, + { label: 'Deployment Guide', slug: 'getting-started/deployment-guide' }, + ], }, { label: 'Using the Platform', diff --git a/docs/design/COST_MODEL.md b/docs/design/COST_MODEL.md index 68220ad..ffb726d 100644 --- a/docs/design/COST_MODEL.md +++ b/docs/design/COST_MODEL.md @@ -11,12 +11,16 @@ These costs are incurred regardless of task volume: | Component | Estimated cost | Notes | |---|---|---| | NAT Gateway (1×) | ~$32/month | Fixed hourly cost + data processing. Single AZ (see [COMPUTE.md - Network architecture](./COMPUTE.md)). | -| VPC Interface Endpoints (7×) | ~$50/month | $0.01/hr per endpoint per AZ. | +| VPC Interface Endpoints (7×, 2 AZs) | ~$102/month | $0.01/hr × 7 endpoints × 2 AZs × 730 hrs. | | VPC Flow Logs | ~$3/month | CloudWatch ingestion. | | DynamoDB (on-demand, idle) | ~$0/month | Pay-per-request; no cost when idle. | | CloudWatch Logs retention | ~$1–5/month | Depends on log volume. 90-day retention. | | API Gateway (idle) | ~$0/month | Pay-per-request. | -| **Total baseline** | **~$85–90/month** | | +| **Total baseline** | **~$140–150/month** | | + +### Scale-to-zero characteristics + +Most platform components are fully serverless and incur zero cost when idle: DynamoDB (PAY_PER_REQUEST), Lambda, API Gateway, ECS Fargate (when enabled; the cluster itself is free and tasks are billed only while running), AgentCore Runtime (per-session), Bedrock (per-token), and Cognito (within the free tier).
The always-on cost floor (~$140–150/month) is dominated by VPC networking infrastructure (NAT Gateway + 7 interface endpoints across 2 AZs), which is required for private subnet connectivity to AWS services and GitHub. See the [Deployment guide](../guides/DEPLOYMENT_GUIDE.md) for the full scale-to-zero breakdown. ## Per-task variable costs @@ -43,7 +47,7 @@ Assuming a typical task: 1–2 hours, Claude Sonnet, ~100K input tokens, ~20K ou | Model choice | 5–10× between Haiku and Opus | Default to Claude Sonnet; allow per-repo override. | | Turn count | Linear with turns | `max_turns` cap (default 100, configurable 1–500). | | Cost budget | Hard stop at budget | `max_budget_usd` cap (configurable $0.01–$100). Agent stops when budget is reached regardless of remaining turns. | -| Task duration | Sub-linear (compute is cheap; tokens dominate) | 8-hour max session timeout. | +| Task duration | Sub-linear (compute is cheap; tokens dominate) | AgentCore: 8-hour service limit; orchestrator: 9-hour `executionTimeout`. | | Prompt caching | 50–90% token cost reduction | Enable by default; cache system prompts and repo context. | | Concurrency | Linear with parallel tasks | Per-user and system-wide concurrency limits. | @@ -51,8 +55,8 @@ Assuming a typical task: 1–2 hours, Claude Sonnet, ~100K input tokens, ~20K ou | Scale | Tasks/month | Estimated monthly cost (infra + tasks) | |---|---|---| -| Low (1 developer) | 30–60 | $150–500 | -| Medium (small team) | 200–500 | $500–3,000 | +| Low (1 developer) | 30–60 | $200–550 | +| Medium (small team) | 200–500 | $550–3,000 | | High (org-wide) | 2,000–5,000 | $5,000–30,000 | These estimates assume Claude Sonnet with prompt caching enabled and average task complexity.
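You can re-derive the revised baseline from the line items in the baseline table. This sketch uses only the document's own estimates, not live AWS pricing. The listed items sum to roughly $138 to $142 per month. The rounded ~$140–150 total presumably leaves headroom for usage-dependent charges.

```typescript
// Estimates from the COST_MODEL.md baseline table, not live AWS pricing.
const HOURS_PER_MONTH = 730;
const ENDPOINT_USD_PER_AZ_HOUR = 0.01;

// $0.01/hr x endpoints x AZs x 730 hrs
function interfaceEndpointMonthlyUsd(endpoints: number, azs: number): number {
  return ENDPOINT_USD_PER_AZ_HOUR * endpoints * azs * HOURS_PER_MONTH;
}

const endpointsUsd = interfaceEndpointMonthlyUsd(7, 2);

// [component, low, high] in USD/month; idle pay-per-request items contribute $0.
const baselineItems: Array<[string, number, number]> = [
  ["NAT Gateway (1x)", 32, 32],
  ["VPC Interface Endpoints (7x, 2 AZs)", endpointsUsd, endpointsUsd],
  ["VPC Flow Logs", 3, 3],
  ["CloudWatch Logs retention", 1, 5],
];

let lowUsd = 0;
let highUsd = 0;
for (const [, low, high] of baselineItems) {
  lowUsd += low;
  highUsd += high;
}

// prints "interface endpoints: $102.20/month"
console.log(`interface endpoints: $${endpointsUsd.toFixed(2)}/month`);
// prints "baseline line items: $138 to $142/month"
console.log(`baseline line items: $${lowUsd.toFixed(0)} to $${highUsd.toFixed(0)}/month`);
```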
@@ -72,8 +76,8 @@ For multi-user deployments, cost should be attributable to individual users and |---|---|---| | Turn limit | `max_turns` per task | 100 | | Cost budget | `max_budget_usd` per task | None (unlimited) | -| Session timeout | Orchestrator timeout | 8 hours | -| Concurrency limit | Per-user atomic counter | 2 concurrent tasks | +| Session timeout | Orchestrator timeout | 9 hours | +| Concurrency limit | Per-user atomic counter | 3 concurrent tasks | | System concurrency | System-wide counter | Account-level AgentCore quota | ## Additional guardrails @@ -85,7 +89,8 @@ For multi-user deployments, cost should be attributable to individual users and ## Reference -- [COMPUTE.md - Network architecture](./COMPUTE.md) - VPC infrastructure cost breakdown. -- [ORCHESTRATOR.md](./ORCHESTRATOR.md) - Polling cost analysis. -- [COMPUTE.md](./COMPUTE.md) - Compute option billing models. -- [OBSERVABILITY.md](./OBSERVABILITY.md) - Cost-related metrics (`agent.cost_usd`, token usage). +- [COMPUTE.md](./COMPUTE.md) -- Compute option billing models and network architecture. +- [ORCHESTRATOR.md](./ORCHESTRATOR.md) -- Polling cost analysis. +- [OBSERVABILITY.md](./OBSERVABILITY.md) -- Cost-related metrics (`agent.cost_usd`, token usage). +- [Deployment guide](../guides/DEPLOYMENT_GUIDE.md) -- Deployment choices, scale-to-zero analysis, AWS services inventory. +- [DEPLOYMENT_ROLES.md](./DEPLOYMENT_ROLES.md) -- Least-privilege IAM policies for deployment. diff --git a/docs/design/DEPLOYMENT_ROLES.md b/docs/design/DEPLOYMENT_ROLES.md new file mode 100644 index 0000000..b533c73 --- /dev/null +++ b/docs/design/DEPLOYMENT_ROLES.md @@ -0,0 +1,702 @@ +# Deployment roles + +This document defines least-privilege IAM policies for the CloudFormation execution role used during `cdk deploy`. The default CDK bootstrap grants `AdministratorAccess` to this role; the policies below scope it to only what ABCA needs. 
+ +> **Origin**: These IAM policies were derived from a thorough review of the repository's CDK constructs, stacks, and handler code, then **validated against a live deployment** in `us-east-1` (create, update, task execution, and destroy). CloudTrail analysis identified 36 additional actions beyond the initial code review, and 7 deployment iterations refined the policies to their current form. The policies are split into three managed policies to stay under the IAM 6,144-character limit. + +## How CDK deployment roles work + +`cdk bootstrap` creates five roles (see [CDK bootstrap roles](#cdk-bootstrap-roles)); four of them are involved in deploying ABCA: + +1. **CDK Deploy Role** -- assumed by the CLI user to initiate deployment +2. **CDK File Publishing Role** -- uploads Lambda zip assets to S3 +3. **CDK Image Publishing Role** -- pushes Docker images to ECR +4. **CloudFormation Execution Role** -- assumed by CloudFormation to create/modify/delete resources + +The policies below replace the default `AdministratorAccess` policy on the **CloudFormation Execution Role**. The other bootstrap roles are scoped by the bootstrap template and do not need modification for least-privilege deployment. + +## Using these policies + +The policies are split into three IAM managed policies (each under the 6,144-character limit): + +| Policy Name | Scope | +|-------------|-------| +| `IaCRole-ABCA-Infrastructure` | CloudFormation, IAM, VPC networking, Route 53 Resolver DNS Firewall | +| `IaCRole-ABCA-Application` | DynamoDB, Lambda, API Gateway, Cognito, WAFv2, EventBridge, Secrets Manager | +| `IaCRole-ABCA-Observability` | Bedrock AgentCore, Bedrock Guardrails, CloudWatch, X-Ray, S3, ECR, KMS, SSM, STS | + +> **Placeholder substitution**: Replace `ACCOUNT_ID` with your 12-digit AWS account ID and `REGION` with your deployment region (e.g., `us-east-1`) throughout this document.
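Before you create the managed policies, you can substitute the placeholders mechanically and confirm that each result stays within the size limit. This is a minimal sketch. It assumes, per the IAM quotas documentation, that whitespace does not count toward the 6,144-character managed-policy limit. `substitutePlaceholders` and `checkManagedPolicy` are hypothetical helpers, and the example statement is illustrative rather than one of the policies in this document.

```typescript
const MANAGED_POLICY_MAX_CHARS = 6144;
const PLACEHOLDER = /\bACCOUNT_ID\b|\bREGION\b/;

// Replaces the ACCOUNT_ID and REGION placeholders used throughout this document.
function substitutePlaceholders(template: string, accountId: string, region: string): string {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error(`expected a 12-digit account ID, got "${accountId}"`);
  }
  return template.replace(/\bACCOUNT_ID\b/g, accountId).replace(/\bREGION\b/g, region);
}

// Validates JSON, rejects leftover placeholders, and measures size without whitespace.
function checkManagedPolicy(policyText: string): { chars: number; withinLimit: boolean } {
  JSON.parse(policyText); // throws on malformed JSON
  if (PLACEHOLDER.test(policyText)) {
    throw new Error("policy still contains ACCOUNT_ID or REGION");
  }
  const chars = policyText.replace(/\s/g, "").length;
  return { chars, withinLimit: chars <= MANAGED_POLICY_MAX_CHARS };
}

// Illustrative template (not one of the ABCA policies).
const template = `{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DynamoDBExample",
      "Effect": "Allow",
      "Action": ["dynamodb:DescribeTable"],
      "Resource": "arn:aws:dynamodb:REGION:ACCOUNT_ID:table/backgroundagent-dev-*"
    }
  ]
}`;

const policy = substitutePlaceholders(template, "123456789012", "us-east-1");
console.log(checkManagedPolicy(policy));
```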
+ +```bash +# Create all three policies in your account, then re-bootstrap: +cdk bootstrap aws://ACCOUNT_ID/REGION \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Infrastructure" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Application" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Observability" +``` + +The `--cloudformation-execution-policies` flag can be repeated to attach multiple policies to the CloudFormation execution role. + +## Trust policy + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "cloudformation.amazonaws.com" + }, + "Action": "sts:AssumeRole" + }, + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT_ID:root" + }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { + "sts:ExternalId": "cdk-hnb659fds" + } + } + } + ] +} +``` + +## IaCRole-ABCA + +For deploying the `backgroundagent-dev` stack. This single stack contains all platform resources including the AgentCore runtime, ECS compute (when enabled), API Gateway, Cognito, DynamoDB tables, VPC, DNS Firewall, and observability infrastructure. + +> **IAM managed policy size limit**: A single managed policy cannot exceed 6,144 characters. The permissions below are split into three policies to stay under this limit. Use all three when re-bootstrapping (see [Using these policies](#using-these-policies)). + +### IaCRole-ABCA-Infrastructure + +CloudFormation stack operations, IAM roles/policies, VPC networking, and Route 53 Resolver DNS Firewall. 
+ +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "CloudFormationSelf", + "Effect": "Allow", + "Action": [ + "cloudformation:CreateStack", + "cloudformation:UpdateStack", + "cloudformation:DeleteStack", + "cloudformation:DescribeStacks", + "cloudformation:DescribeStackEvents", + "cloudformation:DescribeStackResources", + "cloudformation:GetTemplate", + "cloudformation:GetTemplateSummary", + "cloudformation:ListStackResources", + "cloudformation:CreateChangeSet", + "cloudformation:DeleteChangeSet", + "cloudformation:DescribeChangeSet", + "cloudformation:ExecuteChangeSet", + "cloudformation:SetStackPolicy", + "cloudformation:ValidateTemplate", + "cloudformation:ListChangeSets" + ], + "Resource": [ + "arn:aws:cloudformation:*:*:stack/backgroundagent-dev/*", + "arn:aws:cloudformation:*:*:stack/CDKToolkit/*" + ] + }, + { + "Sid": "IAMRolesAndPolicies", + "Effect": "Allow", + "Action": [ + "iam:CreateRole", + "iam:DeleteRole", + "iam:GetRole", + "iam:UpdateRole", + "iam:TagRole", + "iam:UntagRole", + "iam:ListRoleTags", + "iam:AttachRolePolicy", + "iam:DetachRolePolicy", + "iam:PutRolePolicy", + "iam:DeleteRolePolicy", + "iam:GetRolePolicy", + "iam:ListRolePolicies", + "iam:ListAttachedRolePolicies", + "iam:CreatePolicy", + "iam:DeletePolicy", + "iam:GetPolicy", + "iam:GetPolicyVersion", + "iam:CreatePolicyVersion", + "iam:DeletePolicyVersion", + "iam:ListPolicyVersions", + "iam:TagPolicy", + "iam:CreateServiceLinkedRole", + "iam:ListInstanceProfilesForRole" + ], + "Resource": [ + "arn:aws:iam::*:role/backgroundagent-dev-*", + "arn:aws:iam::*:policy/backgroundagent-dev-*", + "arn:aws:iam::*:role/aws-service-role/*" + ] + }, + { + "Sid": "IAMPassRole", + "Effect": "Allow", + "Action": "iam:PassRole", + "Resource": "arn:aws:iam::*:role/backgroundagent-dev-*", + "Condition": { + "StringEquals": { + "iam:PassedToService": [ + "lambda.amazonaws.com", + "ecs-tasks.amazonaws.com", + "ecs.amazonaws.com", + "apigateway.amazonaws.com", + "logs.amazonaws.com", 
+ "bedrock.amazonaws.com", + "bedrock-agentcore.amazonaws.com", + "events.amazonaws.com", + "vpc-flow-logs.amazonaws.com" + ] + } + } + }, + { + "Sid": "VPCNetworking", + "Effect": "Allow", + "Action": [ + "ec2:CreateVpc", + "ec2:DeleteVpc", + "ec2:DescribeVpcs", + "ec2:ModifyVpcAttribute", + "ec2:CreateSubnet", + "ec2:DeleteSubnet", + "ec2:DescribeSubnets", + "ec2:CreateInternetGateway", + "ec2:DeleteInternetGateway", + "ec2:AttachInternetGateway", + "ec2:DetachInternetGateway", + "ec2:DescribeInternetGateways", + "ec2:AllocateAddress", + "ec2:ReleaseAddress", + "ec2:DescribeAddresses", + "ec2:CreateNatGateway", + "ec2:DeleteNatGateway", + "ec2:DescribeNatGateways", + "ec2:CreateRouteTable", + "ec2:DeleteRouteTable", + "ec2:DescribeRouteTables", + "ec2:AssociateRouteTable", + "ec2:DisassociateRouteTable", + "ec2:CreateRoute", + "ec2:DeleteRoute", + "ec2:CreateSecurityGroup", + "ec2:DeleteSecurityGroup", + "ec2:DescribeSecurityGroups", + "ec2:AuthorizeSecurityGroupEgress", + "ec2:RevokeSecurityGroupEgress", + "ec2:AuthorizeSecurityGroupIngress", + "ec2:RevokeSecurityGroupIngress", + "ec2:CreateVpcEndpoint", + "ec2:DeleteVpcEndpoints", + "ec2:DescribeVpcEndpoints", + "ec2:ModifyVpcEndpoint", + "ec2:CreateFlowLogs", + "ec2:DeleteFlowLogs", + "ec2:DescribeFlowLogs", + "ec2:CreateTags", + "ec2:DeleteTags", + "ec2:DescribeTags", + "ec2:DescribeAvailabilityZones", + "ec2:DescribeNetworkInterfaces", + "ec2:DescribePrefixLists", + "ec2:DescribeNetworkAcls", + "ec2:DescribeVpcAttribute", + "ec2:ModifySubnetAttribute" + ], + "Resource": "*" + }, + { + "Sid": "Route53ResolverDNSFirewall", + "Effect": "Allow", + "Action": [ + "route53resolver:CreateFirewallRuleGroup", + "route53resolver:DeleteFirewallRuleGroup", + "route53resolver:GetFirewallRuleGroup", + "route53resolver:CreateFirewallRule", + "route53resolver:DeleteFirewallRule", + "route53resolver:ListFirewallRules", + "route53resolver:UpdateFirewallRule", + "route53resolver:CreateFirewallDomainList", + 
"route53resolver:DeleteFirewallDomainList", + "route53resolver:GetFirewallDomainList", + "route53resolver:UpdateFirewallDomains", + "route53resolver:AssociateFirewallRuleGroup", + "route53resolver:DisassociateFirewallRuleGroup", + "route53resolver:GetFirewallRuleGroupAssociation", + "route53resolver:ListFirewallRuleGroupAssociations", + "route53resolver:UpdateFirewallConfig", + "route53resolver:GetFirewallConfig", + "route53resolver:TagResource", + "route53resolver:UntagResource", + "route53resolver:ListTagsForResource", + "route53resolver:CreateResolverQueryLogConfig", + "route53resolver:DeleteResolverQueryLogConfig", + "route53resolver:GetResolverQueryLogConfig", + "route53resolver:AssociateResolverQueryLogConfig", + "route53resolver:DisassociateResolverQueryLogConfig", + "route53resolver:GetResolverQueryLogConfigAssociation", + "route53resolver:ListResolverQueryLogConfigAssociations", + "route53resolver:ListResolverQueryLogConfigs" + ], + "Resource": "*" + } + ] +} +``` + +### IaCRole-ABCA-Application + +DynamoDB tables, Lambda functions, API Gateway, Cognito, WAFv2, EventBridge, and Secrets Manager. When ECS Fargate compute is enabled, add the ECS statement below to this policy. 
+ +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "DynamoDB", + "Effect": "Allow", + "Action": [ + "dynamodb:CreateTable", + "dynamodb:DeleteTable", + "dynamodb:DescribeTable", + "dynamodb:DescribeTimeToLive", + "dynamodb:UpdateTimeToLive", + "dynamodb:UpdateTable", + "dynamodb:UpdateContinuousBackups", + "dynamodb:DescribeContinuousBackups", + "dynamodb:TagResource", + "dynamodb:UntagResource", + "dynamodb:ListTagsOfResource", + "dynamodb:PutItem", + "dynamodb:UpdateItem", + "dynamodb:DescribeContributorInsights", + "dynamodb:DescribeKinesisStreamingDestination", + "dynamodb:GetResourcePolicy" + ], + "Resource": "arn:aws:dynamodb:*:*:table/backgroundagent-dev-*" + }, + { + "Sid": "Lambda", + "Effect": "Allow", + "Action": [ + "lambda:CreateFunction", + "lambda:DeleteFunction", + "lambda:GetFunction", + "lambda:GetFunctionConfiguration", + "lambda:UpdateFunctionCode", + "lambda:UpdateFunctionConfiguration", + "lambda:AddPermission", + "lambda:RemovePermission", + "lambda:GetPolicy", + "lambda:TagResource", + "lambda:UntagResource", + "lambda:ListTags", + "lambda:PublishVersion", + "lambda:CreateAlias", + "lambda:DeleteAlias", + "lambda:GetAlias", + "lambda:UpdateAlias", + "lambda:PutFunctionEventInvokeConfig", + "lambda:DeleteFunctionEventInvokeConfig", + "lambda:GetFunctionEventInvokeConfig", + "lambda:PutFunctionConcurrency", + "lambda:DeleteFunctionConcurrency", + "lambda:GetFunctionCodeSigningConfig", + "lambda:GetFunctionRecursionConfig", + "lambda:GetProvisionedConcurrencyConfig", + "lambda:GetRuntimeManagementConfig", + "lambda:ListVersionsByFunction", + "lambda:InvokeFunction" + ], + "Resource": [ + "arn:aws:lambda:*:*:function:backgroundagent-dev-*", + "arn:aws:lambda:*:*:function:backgroundagent-dev-AWS*" + ] + }, + { + "Sid": "APIGateway", + "Effect": "Allow", + "Action": [ + "apigateway:POST", + "apigateway:GET", + "apigateway:PUT", + "apigateway:PATCH", + "apigateway:DELETE", + "apigateway:TagResource", + 
"apigateway:UntagResource", + "apigateway:SetWebACL", + "apigateway:UpdateRestApiPolicy" + ], + "Resource": [ + "arn:aws:apigateway:*::/restapis", + "arn:aws:apigateway:*::/restapis/*", + "arn:aws:apigateway:*::/account", + "arn:aws:apigateway:*::/tags/*" + ] + }, + { + "Sid": "Cognito", + "Effect": "Allow", + "Action": [ + "cognito-idp:CreateUserPool", + "cognito-idp:DeleteUserPool", + "cognito-idp:DescribeUserPool", + "cognito-idp:UpdateUserPool", + "cognito-idp:CreateUserPoolClient", + "cognito-idp:DeleteUserPoolClient", + "cognito-idp:DescribeUserPoolClient", + "cognito-idp:UpdateUserPoolClient", + "cognito-idp:TagResource", + "cognito-idp:UntagResource", + "cognito-idp:ListTagsForResource", + "cognito-idp:GetUserPoolMfaConfig" + ], + "Resource": "arn:aws:cognito-idp:*:*:userpool/*" + }, + { + "Sid": "WAFv2", + "Effect": "Allow", + "Action": [ + "wafv2:CreateWebACL", + "wafv2:DeleteWebACL", + "wafv2:GetWebACL", + "wafv2:UpdateWebACL", + "wafv2:AssociateWebACL", + "wafv2:DisassociateWebACL", + "wafv2:ListTagsForResource", + "wafv2:TagResource", + "wafv2:UntagResource", + "wafv2:GetWebACLForResource" + ], + "Resource": [ + "arn:aws:wafv2:*:*:regional/webacl/*", + "arn:aws:wafv2:*:*:regional/managedruleset/*" + ] + }, + { + "Sid": "EventBridge", + "Effect": "Allow", + "Action": [ + "events:PutRule", + "events:DeleteRule", + "events:DescribeRule", + "events:PutTargets", + "events:RemoveTargets", + "events:ListTargetsByRule", + "events:TagResource", + "events:UntagResource", + "events:ListTagsForResource" + ], + "Resource": "arn:aws:events:*:*:rule/backgroundagent-dev-*" + }, + { + "Sid": "SecretsManager", + "Effect": "Allow", + "Action": [ + "secretsmanager:CreateSecret", + "secretsmanager:DeleteSecret", + "secretsmanager:DescribeSecret", + "secretsmanager:GetSecretValue", + "secretsmanager:PutSecretValue", + "secretsmanager:UpdateSecret", + "secretsmanager:TagResource", + "secretsmanager:UntagResource", + "secretsmanager:GetResourcePolicy", + 
"secretsmanager:PutResourcePolicy", + "secretsmanager:DeleteResourcePolicy" + ], + "Resource": [ + "arn:aws:secretsmanager:*:*:secret:backgroundagent-*", + "arn:aws:secretsmanager:*:*:secret:GitHubTokenSecret*" + ] + }, + { + "Sid": "SecretsManagerAccountLevel", + "Effect": "Allow", + "Action": "secretsmanager:GetRandomPassword", + "Resource": "*" + } + ] +} +``` + +### IaCRole-ABCA-Observability + +Bedrock AgentCore, Bedrock Guardrails, CloudWatch Logs/Dashboards/Alarms, X-Ray, S3 (CDK assets), KMS, ECR, SSM, and STS. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "BedrockAgentCore", + "Effect": "Allow", + "Action": [ + "bedrock-agentcore:*" + ], + "Resource": "*" + }, + { + "Sid": "BedrockGuardrailsAndLogging", + "Effect": "Allow", + "Action": [ + "bedrock:CreateGuardrail", + "bedrock:DeleteGuardrail", + "bedrock:GetGuardrail", + "bedrock:UpdateGuardrail", + "bedrock:CreateGuardrailVersion", + "bedrock:ListGuardrails", + "bedrock:TagResource", + "bedrock:UntagResource", + "bedrock:ListTagsForResource", + "bedrock:PutModelInvocationLoggingConfiguration", + "bedrock:DeleteModelInvocationLoggingConfiguration", + "bedrock:GetModelInvocationLoggingConfiguration" + ], + "Resource": "*" + }, + { + "Sid": "CloudWatchLogsAndDashboards", + "Effect": "Allow", + "Action": [ + "logs:CreateLogGroup", + "logs:DeleteLogGroup", + "logs:DescribeLogGroups", + "logs:PutRetentionPolicy", + "logs:DeleteRetentionPolicy", + "logs:TagLogGroup", + "logs:UntagLogGroup", + "logs:TagResource", + "logs:UntagResource", + "logs:ListTagsForResource", + "logs:ListTagsLogGroup", + "logs:PutResourcePolicy", + "logs:DeleteResourcePolicy", + "logs:DescribeResourcePolicies", + "cloudwatch:PutDashboard", + "cloudwatch:DeleteDashboards", + "cloudwatch:GetDashboard", + "cloudwatch:PutMetricAlarm", + "cloudwatch:DeleteAlarms", + "cloudwatch:DescribeAlarms", + "cloudwatch:TagResource", + "cloudwatch:UntagResource", + "logs:CreateDelivery", + "logs:DescribeDeliveries", + 
"logs:GetDelivery", + "logs:GetDeliveryDestination", + "logs:GetDeliveryDestinationPolicy", + "logs:GetDeliverySource", + "logs:PutDeliveryDestination", + "logs:PutDeliverySource", + "logs:DescribeIndexPolicies", + "cloudwatch:ListTagsForResource", + "logs:CreateLogDelivery", + "logs:DeleteLogDelivery", + "logs:GetLogDelivery", + "logs:UpdateLogDelivery", + "logs:ListLogDeliveries", + "logs:DeleteDelivery", + "logs:DeleteDeliverySource", + "logs:DeleteDeliveryDestination" + ], + "Resource": "*" + }, + { + "Sid": "S3CDKAssets", + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:GetBucketLocation", + "s3:ListBucket" + ], + "Resource": [ + "arn:aws:s3:::cdk-hnb659fds-assets-*", + "arn:aws:s3:::cdk-hnb659fds-assets-*/*" + ] + }, + { + "Sid": "KMSForCDKAssets", + "Effect": "Allow", + "Action": [ + "kms:CreateGrant", + "kms:Decrypt", + "kms:DescribeKey", + "kms:Encrypt", + "kms:GenerateDataKey" + ], + "Resource": "*" + }, + { + "Sid": "ECRForDockerAssets", + "Effect": "Allow", + "Action": [ + "ecr:CreateRepository", + "ecr:DescribeRepositories", + "ecr:GetAuthorizationToken", + "ecr:BatchCheckLayerAvailability", + "ecr:GetDownloadUrlForLayer", + "ecr:BatchGetImage", + "ecr:PutImage", + "ecr:InitiateLayerUpload", + "ecr:UploadLayerPart", + "ecr:CompleteLayerUpload", + "ecr:SetRepositoryPolicy", + "ecr:GetRepositoryPolicy", + "ecr:DeleteRepository", + "ecr:ListTagsForResource", + "ecr:TagResource" + ], + "Resource": [ + "arn:aws:ecr:*:*:repository/cdk-hnb659fds-container-assets-*", + "arn:aws:ecr:*:*:repository/backgroundagent-*" + ] + }, + { + "Sid": "ECRAuthToken", + "Effect": "Allow", + "Action": "ecr:GetAuthorizationToken", + "Resource": "*" + }, + { + "Sid": "XRay", + "Effect": "Allow", + "Action": [ + "xray:UpdateTraceSegmentDestination", + "xray:GetTraceSegmentDestination", + "xray:ListResourcePolicies", + "xray:PutResourcePolicy" + ], + "Resource": "*" + }, + { + "Sid": "SSMParameterStoreForCDK", + "Effect": "Allow", + "Action": [ + 
"ssm:GetParameter", + "ssm:GetParameters", + "ssm:PutParameter", + "ssm:DeleteParameter" + ], + "Resource": "arn:aws:ssm:*:*:parameter/cdk-bootstrap/*" + }, + { + "Sid": "STSForCDK", + "Effect": "Allow", + "Action": [ + "sts:AssumeRole", + "sts:GetCallerIdentity" + ], + "Resource": [ + "arn:aws:iam::*:role/cdk-hnb659fds-*" + ] + } + ] +} +``` + +### When ECS compute is enabled + +If you uncomment the ECS blocks in `cdk/src/stacks/agent.ts` to enable the Fargate compute backend, add the following statement to the `IaCRole-ABCA-Application` policy (the combined policy remains under the 6,144-character IAM limit): + +```json +{ + "Sid": "ECS", + "Effect": "Allow", + "Action": [ + "ecs:CreateCluster", + "ecs:DeleteCluster", + "ecs:DescribeClusters", + "ecs:UpdateCluster", + "ecs:UpdateClusterSettings", + "ecs:PutClusterCapacityProviders", + "ecs:RegisterTaskDefinition", + "ecs:DeregisterTaskDefinition", + "ecs:DescribeTaskDefinition", + "ecs:ListTaskDefinitions", + "ecs:TagResource", + "ecs:UntagResource", + "ecs:ListTagsForResource", + "ecs:PutAccountSetting" + ], + "Resource": "*" +} +``` + +## Runtime IAM roles (created by the stack) + +These roles are created inside the CloudFormation stack at deploy time, not by the deployer. They are documented here for a complete picture of the IAM footprint. 
+ +| Role | Assumed By | Purpose | +|------|-----------|---------| +| AgentCore Runtime execution role | AgentCore Runtime | Runs MicroVM containers; DynamoDB, Secrets Manager, CloudWatch Logs, Bedrock, AgentCore Memory access | +| BedrockLoggingRole | `bedrock.amazonaws.com` | Writes model invocation logs to CloudWatch | +| TaskOrchestrator Lambda role | Lambda | Durable orchestrator; DynamoDB, Secrets Manager, AgentCore Runtime invocation, AgentCore Memory | +| ConcurrencyReconciler Lambda role | Lambda | Scheduled reconciliation; DynamoDB scan + conditional updates | +| TaskApi Lambda roles (9-10) | Lambda | API handler functions; DynamoDB, Secrets Manager (webhook handlers), Bedrock Guardrail, Lambda invoke | +| AwsCustomResource Lambda role | Lambda | Blueprint DDB writes, Bedrock logging config, DNS firewall config | +| API Gateway CloudWatch role | API Gateway | Pushes API Gateway access logs | +| VPC Flow Log role | VPC Flow Logs | Writes flow logs to CloudWatch | +| ECS task execution role (when enabled) | ECS (pull images) | ECR image pull, CloudWatch Logs write | +| ECS task role (when enabled) | ECS (container runtime) | DynamoDB, Secrets Manager, Bedrock InvokeModel | + +### CDK bootstrap roles + +| Role | Purpose | +|------|---------| +| `cdk-hnb659fds-deploy-role-*` | Assumed by CDK CLI to initiate deployments | +| `cdk-hnb659fds-cfn-exec-role-*` | Assumed by CloudFormation to create resources (**the IaCRole-ABCA policies replace its default `AdministratorAccess` policy**) | +| `cdk-hnb659fds-file-publish-role-*` | Uploads Lambda zip assets to S3 | +| `cdk-hnb659fds-image-publish-role-*` | Pushes Docker images to ECR | +| `cdk-hnb659fds-lookup-role-*` | Context lookups (VPC, AZs, etc.)
| + +## Resource-level permission constraints + +Several services require `Resource: "*"` because they do not support resource-level permissions for create/describe operations: + +| Service | Actions Requiring `"*"` | Reason | +|---------|------------------------|--------| +| EC2 (VPC) | `Create*`, `Describe*`, `Allocate*` | VPC resource ARNs unknown at policy creation time | +| Route 53 Resolver | All DNS Firewall actions | No resource-level ARN support for firewall rule groups | +| Bedrock | Guardrail + logging config actions | Account-level APIs (`PutModelInvocationLoggingConfiguration`) | +| Bedrock AgentCore | All actions (`bedrock-agentcore:*`) | CloudFormation resource handler uses internal action names that differ from the public API; wildcard required for reliable deployment | +| CloudWatch Logs | `CreateLogGroup`, `PutResourcePolicy` | Log group ARNs unknown at policy creation; resource policies are account-scoped | +| ECS | Cluster + task definition actions | `RegisterTaskDefinition` doesn't support resource-level permissions | +| ECR | `GetAuthorizationToken` | Account-level operation | +| KMS | `CreateGrant`, `Decrypt`, `Encrypt`, `GenerateDataKey` | CDK asset encryption keys; key ARNs unknown at policy time | +| Secrets Manager | `GetRandomPassword` | Account-level API (no secret ARN); isolated in its own statement with `Resource: "*"` | +| X-Ray | `UpdateTraceSegmentDestination`, `PutResourcePolicy` | Account-level operations | + +These constraints align with the CDK Nag `AwsSolutions-IAM5` suppressions in the codebase. + +## Iterative tightening + +These policies are conservative-but-scoped starting points. To tighten further: + +1. **Deploy once with CloudTrail enabled**, then use [IAM Access Analyzer policy generation](https://docs.aws.amazon.com/IAM/latest/UserGuide/access-analyzer-policy-generation.html) to generate a least-privilege policy based on the actual API calls recorded in CloudTrail. +2. 
**Replace `*` resources** with actual ARNs after the first deploy (e.g., once you know the VPC ID, scope EC2 actions to that VPC). +3. **Add region conditions** where possible (e.g., `"aws:RequestedRegion": "us-east-1"`) to prevent cross-region resource creation. +4. **Restrict `iam:AttachRolePolicy`** with an `iam:PolicyARN` condition to limit which policies can be attached to `backgroundagent-dev-*` roles. This requires enumerating the AWS managed policies CDK attaches (e.g., `service-role/AWSLambdaBasicExecutionRole`) from a synthesized template, so it is deferred to a post-deployment tightening pass. +5. **Scope `iam:CreateServiceLinkedRole`** with an `iam:AWSServiceName` condition to limit which AWS services can have service-linked roles created. After a first deploy, check CloudTrail for which service-linked roles were actually created and restrict accordingly. +6. **Scope KMS actions** with a `kms:ResourceAliases` condition to limit `CreateGrant`, `Decrypt`, `Encrypt`, and `GenerateDataKey` to the CDK bootstrap assets key. Because `kms:ResourceAliases` is a multivalued condition key, pair it with a set operator (e.g., `"ForAnyValue:StringLike": {"kms:ResourceAliases": "alias/cdk-hnb659fds-*"}`). +7. **Use permission boundaries** on the IaC role to set an outer limit even if the policy is too broad. +8. **Review after each CDK version upgrade** -- new CDK versions may add or remove custom resources that need different permissions. + +## Reference + +- [SECURITY.md](./SECURITY.md) -- Runtime IAM, memory isolation, custom step trust boundaries. +- [COMPUTE.md](./COMPUTE.md) -- Compute backend options (AgentCore vs ECS Fargate). +- [COST_MODEL.md](./COST_MODEL.md) -- Infrastructure baseline costs and scale-to-zero analysis.
diff --git a/docs/guides/DEPLOYMENT_GUIDE.md b/docs/guides/DEPLOYMENT_GUIDE.md new file mode 100644 index 0000000..795580e --- /dev/null +++ b/docs/guides/DEPLOYMENT_GUIDE.md @@ -0,0 +1,123 @@ +# Deployment guide + +This guide covers deploying ABCA into an AWS account, including compute backend choices, scale-to-zero characteristics, and the complete AWS service inventory. For day-to-day development workflow, see the [Developer guide](./DEVELOPER_GUIDE.md). For a quick first deployment, see the [Quick start](./QUICK_START.md). For least-privilege IAM deployment roles, see [DEPLOYMENT_ROLES.md](../design/DEPLOYMENT_ROLES.md). + +## Architecture overview + +ABCA deploys as a **single CDK stack** (`backgroundagent-dev`) containing all platform resources. The stack uses a `ComputeStrategy` interface to support two compute backends within the same stack: + +| Aspect | AgentCore (default) | ECS Fargate (opt-in) | +|--------|--------------------|--------------------| +| **Compute** | Bedrock AgentCore Runtime (Firecracker MicroVMs) | ECS Fargate containers | +| **Resources** | 2 vCPU, 8 GB RAM, 2 GB max image size | 2 vCPU, 4 GB RAM | +| **Orchestration** | Durable Lambda (checkpoint/replay) | Same durable Lambda via `ComputeStrategy` | +| **Agent mode** | FastAPI server (HTTP invocation) | Batch (run-to-completion) | +| **Startup** | ~10s (warm MicroVM) | ~60-180s (Fargate cold start) | +| **Max duration** | 8 hours (AgentCore service limit) | 9 hours (orchestrator `executionTimeout`) | + +Both backends are orchestrated by the same durable Lambda function. The `ComputeStrategy` interface abstracts `startSession()`, `pollSession()`, and `stopSession()` -- the ECS strategy calls `ecs:RunTask` / `ecs:DescribeTasks` / `ecs:StopTask` directly from the Lambda. No Step Functions are used. + +ECS Fargate is currently **opt-in** -- the `EcsAgentCluster` construct is present in the stack code but commented out. To enable it, uncomment the ECS blocks in `cdk/src/stacks/agent.ts`. 
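
When debugging the opt-in ECS backend by hand, the three ECS calls named above can be reproduced from a shell. This is a minimal sketch, not the `ComputeStrategy` code: the function names only mirror the interface methods, and `CLUSTER`, `TASK_DEF`, `SUBNETS` and `SECURITY_GROUP` are placeholders you fill in from your own deployment. The real strategy probably passes a task payload (for example container overrides) that this sketch leaves out.

```bash
#!/usr/bin/env bash
# Sketch: the ECS calls behind startSession/pollSession/stopSession, reproduced
# with the AWS CLI for manual debugging of the opt-in backend. CLUSTER,
# TASK_DEF, SUBNETS (comma-separated IDs) and SECURITY_GROUP are placeholders.
# The real strategy makes these calls from the orchestrator Lambda and likely
# passes a task payload (e.g. container overrides) that this sketch omits.
set -euo pipefail

# start_session -> prints the new task ARN (ecs:RunTask)
start_session() {
  aws ecs run-task \
    --cluster "$CLUSTER" \
    --task-definition "$TASK_DEF" \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[$SUBNETS],securityGroups=[$SECURITY_GROUP],assignPublicIp=DISABLED}" \
    --query 'tasks[0].taskArn' --output text
}

# poll_session TASK_ARN -> prints lastStatus, e.g. PROVISIONING, RUNNING, STOPPED (ecs:DescribeTasks)
poll_session() {
  aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$1" \
    --query 'tasks[0].lastStatus' --output text
}

# stop_session TASK_ARN [REASON] (ecs:StopTask)
stop_session() {
  aws ecs stop-task --cluster "$CLUSTER" --task "$1" \
    --reason "${2:-manual stop}" > /dev/null
}

# is_stopped STATUS -> succeeds only for the terminal ECS task status
is_stopped() {
  [[ "$1" == "STOPPED" ]]
}
```

For example, `TASK_ARN=$(start_session)` followed by repeated `poll_session "$TASK_ARN"` calls shows the task's `lastStatus` progressing toward `RUNNING` and finally `STOPPED`. Instead of a hand-written polling loop, `aws ecs wait tasks-stopped --cluster "$CLUSTER" --tasks "$TASK_ARN"` blocks until the task stops.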
+ +## Scale-to-zero analysis + +### Components that scale to zero (pay-per-use) + +| Component | Billing Model | Idle Cost | +|-----------|--------------|-----------| +| DynamoDB (5 tables) | PAY_PER_REQUEST | $0 | +| Lambda (all functions) | Per invocation | $0 | +| API Gateway REST | Per request | $0 | +| ECS Fargate tasks (when enabled) | Per running task | $0 (cluster is free) | +| AgentCore Runtime | Per session | $0 | +| Bedrock inference | Per token | $0 | +| AgentCore Memory | Proportional to usage | ~$0 | +| Cognito | Free tier (50K MAU) | $0 | + +### Components that do not scale to zero (always-on) + +| Component | Est. Monthly Idle Cost | Why | +|-----------|----------------------|-----| +| NAT Gateway (1x) | ~$32 | $0.045/hr fixed charge | +| VPC Interface Endpoints (7x, 2 AZs) | ~$102 | $0.01/hr × 7 endpoints × 2 AZs × 730 hrs | +| WAF v2 Web ACL | ~$5 | Base monthly charge | +| CloudWatch Dashboard | ~$3 | Per-dashboard charge | +| Secrets Manager (1+ secrets) | ~$0.40/secret | Per-secret monthly | +| CloudWatch Alarms | ~$0.10/alarm | Per standard alarm | +| CloudWatch Logs retention | ~$1-5 | Storage for retained logs | +| **Total always-on baseline** | **~$140-150/month** | | + +The dominant idle cost is VPC networking: 7 interface endpoints across 2 AZs (~$102/month) plus the NAT Gateway (~$32/month). + +For the full cost model including per-task costs, see [COST_MODEL.md](../design/COST_MODEL.md). 
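
The networking rows above reduce to simple arithmetic. The sketch below recomputes them in whole mills ($0.001) so bash integer math stays exact. The hourly prices are the ones assumed in the table, and the helper names are illustrative, not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch: recompute the always-on networking floor from hourly list prices.
# The prices are the assumed figures from the table above; check current
# AWS pricing for your Region before relying on the result.
set -euo pipefail

HOURS_PER_MONTH=730

# Hourly prices in mills (1 mill = $0.001) so bash integer math stays exact.
NAT_GATEWAY_MILLS_PER_HOUR=45      # $0.045/hr per NAT Gateway
ENDPOINT_MILLS_PER_HOUR_PER_AZ=10  # $0.01/hr per interface endpoint per AZ

# mills_to_usd MILLS -> prints dollars with two decimals (sub-cent part dropped)
mills_to_usd() {
  printf '%d.%02d\n' $(( $1 / 1000 )) $(( ($1 % 1000) / 10 ))
}

# monthly_endpoint_cost ENDPOINTS AZS -> USD/month for interface endpoints
monthly_endpoint_cost() {
  mills_to_usd $(( ENDPOINT_MILLS_PER_HOUR_PER_AZ * $1 * $2 * HOURS_PER_MONTH ))
}

# monthly_nat_cost GATEWAYS -> USD/month in fixed NAT Gateway hourly charges
monthly_nat_cost() {
  mills_to_usd $(( NAT_GATEWAY_MILLS_PER_HOUR * $1 * HOURS_PER_MONTH ))
}

echo "Interface endpoints (7 x 2 AZs): \$$(monthly_endpoint_cost 7 2)"
echo "Interface endpoints (7 x 1 AZ):  \$$(monthly_endpoint_cost 7 1)"
echo "NAT Gateway (1x, hourly only):   \$$(monthly_nat_cost 1)"
```

This prints $102.20, $51.10 and $32.85. The one-AZ figure appears to be where the earlier ~$50/month endpoint estimate came from; counting both AZs roughly doubles it. NAT Gateway data-processing charges are not included.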
+ +## AWS services inventory + +### Compute + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| Bedrock AgentCore Runtime (MicroVMs) | Agent sessions (default) | Yes | +| ECS Fargate (when enabled) | Agent sessions (opt-in) | Yes | +| Lambda (Node.js 24, ARM64) | Orchestrator, API handlers, reconciler, custom resources | Yes | + +### AI/ML + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| Bedrock (Claude Sonnet 4.6, Opus 4, Haiku 4.5) | Agent reasoning, cross-region inference profiles | Yes | +| Bedrock Guardrails | Prompt injection detection on task input | Yes | +| Bedrock AgentCore Memory | Semantic + episodic extraction strategies | Yes | + +### Networking + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| VPC (public + private subnets, 2 AZs) | All compute | N/A (no direct cost) | +| NAT Gateway (1x) | Private subnet internet egress | **No** (~$32/mo) | +| VPC Interface Endpoints (7x, 2 AZs) | AWS service connectivity from private subnets | **No** (~$102/mo) | +| VPC Gateway Endpoints (2x: S3, DynamoDB) | S3 and DynamoDB connectivity | Yes (free) | +| Security Groups | HTTPS-only egress | N/A | +| Route 53 Resolver DNS Firewall | Domain allowlisting for agent egress | Minimal | + +### Storage / Database + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| DynamoDB (5 tables, PAY_PER_REQUEST) | Task state, events, concurrency, webhooks, repo config | Yes | +| S3 | CDK asset bucket, ECR image layers, FUSE session storage | Minimal | +| Secrets Manager | GitHub PAT, webhook HMAC secrets | **No** (~$0.40/secret/mo) | + +### API / Auth + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| API Gateway (REST) | Task REST API | Yes | +| Cognito User Pool | CLI/API authentication | Yes (free tier) | +| WAF v2 | API Gateway protection (managed rules + rate limiting) | **No** (~$5/mo base) | + +### Observability + 
+| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| CloudWatch Logs (multiple log groups) | Application, usage, model invocation, VPC flow, DNS query logs | **No** (storage) | +| CloudWatch Dashboard | Operational metrics visualization | **No** (~$3/mo) | +| CloudWatch Alarms | Orchestrator error alerting | **No** (~$0.10/alarm) | +| X-Ray | AgentCore Runtime tracing | Yes | + +### Infrastructure / Deployment + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| CloudFormation | Stack deployment, custom resources | N/A | +| ECR | Container image storage | Minimal | +| IAM | Roles and policies for all components | N/A | + +## Reference + +- [Quick start](./QUICK_START.md) -- Zero-to-first-PR in 6 steps. +- [Developer guide](./DEVELOPER_GUIDE.md) -- Local development, testing, repository onboarding. +- [User guide](./USER_GUIDE.md) -- API reference, CLI usage, task management. +- [DEPLOYMENT_ROLES.md](../design/DEPLOYMENT_ROLES.md) -- Least-privilege IAM policies for CloudFormation execution. +- [COST_MODEL.md](../design/COST_MODEL.md) -- Per-task costs, cost guardrails, cost at scale. +- [COMPUTE.md](../design/COMPUTE.md) -- Compute backend architecture and trade-offs. diff --git a/docs/guides/QUICK_START.md b/docs/guides/QUICK_START.md index d81feca..808ff02 100644 --- a/docs/guides/QUICK_START.md +++ b/docs/guides/QUICK_START.md @@ -6,7 +6,7 @@ Go from zero to your first agent-created pull request in about 30 minutes. This Install these before you begin: -- **AWS account** with credentials configured (`aws configure`) +- **AWS account** with credentials configured (`aws configure`). If you use named profiles, set `AWS_PROFILE` before running any commands in this guide. 
- **Docker** - for building the agent container image - **mise** - task runner ([install guide](https://mise.jdx.dev/getting-started.html)) - **AWS CDK CLI** - `npm install -g aws-cdk` (after mise is active) @@ -35,6 +35,8 @@ mise run build `mise run install` installs all JavaScript and Python dependencies across the monorepo. `mise run build` compiles the CDK app, the CLI, the agent image, and the docs site. A successful build means you are ready to deploy. +> **Note:** `mise run build` includes CDK synthesis, which queries AWS for availability zones. Your active AWS credentials must have at least `ec2:DescribeAvailabilityZones` permission, or the build will fail. If you use named profiles, make sure `AWS_PROFILE` is set before running the build. + ## Step 2 - Prepare a repository The agent works by cloning a GitHub repository, creating a branch, making code changes, running the build and tests, and opening a pull request. This means it needs **write access** to a real repository. @@ -75,7 +77,12 @@ The `repo` value must match **exactly** what you will pass to the CLI later (`ow The CDK stack deploys the full platform: API Gateway, Lambda functions (orchestrator, task CRUD, webhooks), DynamoDB tables, AgentCore Runtime, VPC with network isolation, Cognito user pool, and CloudWatch dashboards. ```bash -# One-time account setup (X-Ray destination) +# One-time account setup: allow X-Ray to write spans to CloudWatch Logs. +# On a fresh account, X-Ray needs a resource policy before the destination can be set. 
+ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +aws logs put-resource-policy \ + --policy-name xray-spans-policy \ + --policy-document "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"XRaySpansAccess\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"xray.amazonaws.com\"},\"Action\":[\"logs:PutLogEvents\",\"logs:CreateLogGroup\",\"logs:CreateLogStream\"],\"Resource\":[\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans\",\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans:*\"]}]}" aws xray update-trace-segment-destination --destination CloudWatchLogs # Bootstrap CDK (first time only) @@ -85,7 +92,7 @@ mise run //cdk:bootstrap mise run //cdk:deploy ``` -The X-Ray command is a one-time per-account setup. CDK bootstrap provisions the staging resources CDK needs (S3 bucket, IAM roles). The deploy itself takes around 10 minutes - most of the time is spent building the Docker image and provisioning the AgentCore Runtime. +The X-Ray commands are a one-time setup per account and Region. On a fresh account, run the `put-resource-policy` call first: without it, the `update-trace-segment-destination` command fails with an `AccessDeniedException` because X-Ray cannot write to the `aws/spans` log group. CDK bootstrap provisions the staging resources CDK needs (S3 bucket, IAM roles). The deploy itself takes around 10 minutes - most of the time is spent building the Docker image and provisioning the AgentCore Runtime. ## Step 4 - Store the GitHub token @@ -183,7 +190,10 @@ Here is what the platform did after you ran `bgagent submit`: |---|---|---| | `yarn: command not found` | Corepack not enabled or mise not activated in your shell | Run `eval "$(mise activate zsh)"`, then `corepack enable && corepack prepare yarn@1.22.22 --activate` | | `MISE_EXPERIMENTAL required` | Namespaced tasks need the experimental flag | `export MISE_EXPERIMENTAL=1` | -| CDK deploy fails with "X-Ray Delivery Destination..."
| Missing one-time account setup | `aws xray update-trace-segment-destination --destination CloudWatchLogs` | +| `AccessDeniedException` on `update-trace-segment-destination` | Fresh account missing CloudWatch Logs resource policy for X-Ray | Run `aws logs put-resource-policy` first (see Step 3) | +| CDK deploy fails with "X-Ray Delivery Destination..." | Missing one-time account setup | Run both X-Ray commands in Step 3 | +| `mise run build` fails with `ec2:DescribeAvailabilityZones` error | AWS credentials missing or insufficient for CDK synth | Set `AWS_PROFILE` or configure credentials with at least EC2 read access | +| CDK deploy prompts for approval and hangs | Non-interactive terminal (CI/CD, scripts) | Pass `--require-approval never` to `cdk deploy` or use an interactive terminal | | `put-secret-value` returns double-dot endpoint | `REGION` variable is empty | Set `REGION=us-east-1` (or your actual region) before running the command | | `REPO_NOT_ONBOARDED` on task submit | Blueprint `repo` does not match what you passed to the CLI | Check `cdk/src/stacks/agent.ts` - the `repo` value must be exactly `owner/repo` matching your fork | | `INSUFFICIENT_GITHUB_REPO_PERMISSIONS` | PAT is missing required permissions or is scoped to the wrong repo | Regenerate the PAT with Contents (read/write) and Pull requests (read/write) scoped to your fork, then update Secrets Manager | diff --git a/docs/scripts/sync-starlight.mjs b/docs/scripts/sync-starlight.mjs index f9a5519..c7b81f6 100644 --- a/docs/scripts/sync-starlight.mjs +++ b/docs/scripts/sync-starlight.mjs @@ -43,6 +43,7 @@ function rewriteDocsLinkTarget(target) { DEVELOPER_GUIDE: '/developer-guide/introduction', USER_GUIDE: '/using/overview', CONTRIBUTING: '/developer-guide/contributing', + DEPLOYMENT_GUIDE: '/getting-started/deployment-guide', }; /** `splitGuide` emits each `##` from DEVELOPER_GUIDE as its own page — map #anchors to those routes. 
*/ @@ -210,6 +211,12 @@ mirrorMarkdownFile( path.join('src', 'content', 'docs', 'getting-started', 'Quick-start.md'), ); +// --- Deployment Guide: mirror to getting-started/ --- +mirrorMarkdownFile( + path.join(docsRoot, 'guides', 'DEPLOYMENT_GUIDE.md'), + path.join('src', 'content', 'docs', 'getting-started', 'Deployment-guide.md'), +); + // --- Prompt Guide: mirror to customizing/ --- mirrorMarkdownFile( path.join(docsRoot, 'guides', 'PROMPT_GUIDE.md'), diff --git a/docs/src/content/docs/architecture/Cost-model.md b/docs/src/content/docs/architecture/Cost-model.md index 2a742f6..9006a8b 100644 --- a/docs/src/content/docs/architecture/Cost-model.md +++ b/docs/src/content/docs/architecture/Cost-model.md @@ -15,12 +15,16 @@ These costs are incurred regardless of task volume: | Component | Estimated cost | Notes | |---|---|---| | NAT Gateway (1×) | ~$32/month | Fixed hourly cost + data processing. Single AZ (see [COMPUTE.md - Network architecture](/architecture/compute)). | -| VPC Interface Endpoints (7×) | ~$50/month | $0.01/hr per endpoint per AZ. | +| VPC Interface Endpoints (7×, 2 AZs) | ~$102/month | $0.01/hr × 7 endpoints × 2 AZs × 730 hrs. | | VPC Flow Logs | ~$3/month | CloudWatch ingestion. | | DynamoDB (on-demand, idle) | ~$0/month | Pay-per-request; no cost when idle. | | CloudWatch Logs retention | ~$1–5/month | Depends on log volume. 90-day retention. | | API Gateway (idle) | ~$0/month | Pay-per-request. | -| **Total baseline** | **~$85–90/month** | | +| **Total baseline** | **~$140–150/month** | | + +### Scale-to-zero characteristics + +Most platform components are fully serverless and incur zero cost when idle: DynamoDB (PAY_PER_REQUEST), Lambda, API Gateway, ECS Fargate (cluster is free, when enabled), AgentCore Runtime (per-session), Bedrock (per-token), and Cognito (free tier). 
The always-on cost floor (~$140–150/month) is dominated by VPC networking infrastructure (NAT Gateway + 7 interface endpoints across 2 AZs) which is required for private subnet connectivity to AWS services and GitHub. See the [Deployment guide](/getting-started/deployment-guide) for the full scale-to-zero breakdown. ## Per-task variable costs @@ -47,7 +51,7 @@ Assuming a typical task: 1–2 hours, Claude Sonnet, ~100K input tokens, ~20K ou | Model choice | 5–10× between Haiku and Opus | Default to Claude Sonnet; allow per-repo override. | | Turn count | Linear with turns | `max_turns` cap (default 100, configurable 1–500). | | Cost budget | Hard stop at budget | `max_budget_usd` cap (configurable $0.01–$100). Agent stops when budget is reached regardless of remaining turns. | -| Task duration | Sub-linear (compute is cheap; tokens dominate) | 8-hour max session timeout. | +| Task duration | Sub-linear (compute is cheap; tokens dominate) | AgentCore: 8-hour service limit; orchestrator: 9-hour `executionTimeout`. | | Prompt caching | 50–90% token cost reduction | Enable by default; cache system prompts and repo context. | | Concurrency | Linear with parallel tasks | Per-user and system-wide concurrency limits. | @@ -55,8 +59,8 @@ Assuming a typical task: 1–2 hours, Claude Sonnet, ~100K input tokens, ~20K ou | Scale | Tasks/month | Estimated monthly cost (infra + tasks) | |---|---|---| -| Low (1 developer) | 30–60 | $150–500 | -| Medium (small team) | 200–500 | $500–3,000 | +| Low (1 developer) | 30–60 | $200–550 | +| Medium (small team) | 200–500 | $550–3,000 | | High (org-wide) | 2,000–5,000 | $5,000–30,000 | These estimates assume Claude Sonnet with prompt caching enabled and average task complexity. 
@@ -76,8 +80,8 @@ For multi-user deployments, cost should be attributable to individual users and |---|---|---| | Turn limit | `max_turns` per task | 100 | | Cost budget | `max_budget_usd` per task | None (unlimited) | -| Session timeout | Orchestrator timeout | 8 hours | -| Concurrency limit | Per-user atomic counter | 2 concurrent tasks | +| Session timeout | Orchestrator timeout | 9 hours | +| Concurrency limit | Per-user atomic counter | 3 concurrent tasks | | System concurrency | System-wide counter | Account-level AgentCore quota | ## Additional guardrails @@ -89,7 +93,8 @@ For multi-user deployments, cost should be attributable to individual users and ## Reference -- [COMPUTE.md - Network architecture](/architecture/compute) - VPC infrastructure cost breakdown. -- [ORCHESTRATOR.md](/architecture/orchestrator) - Polling cost analysis. -- [COMPUTE.md](/architecture/compute) - Compute option billing models. -- [OBSERVABILITY.md](/architecture/observability) - Cost-related metrics (`agent.cost_usd`, token usage). +- [COMPUTE.md](/architecture/compute) -- Compute option billing models and network architecture. +- [ORCHESTRATOR.md](/architecture/orchestrator) -- Polling cost analysis. +- [OBSERVABILITY.md](/architecture/observability) -- Cost-related metrics (`agent.cost_usd`, token usage). +- [Deployment guide](/getting-started/deployment-guide) -- Deployment choices, scale-to-zero analysis, AWS services inventory. +- [DEPLOYMENT_ROLES.md](/architecture/deployment-roles) -- Least-privilege IAM policies for deployment. diff --git a/docs/src/content/docs/architecture/Deployment-roles.md b/docs/src/content/docs/architecture/Deployment-roles.md new file mode 100644 index 0000000..c021ce3 --- /dev/null +++ b/docs/src/content/docs/architecture/Deployment-roles.md @@ -0,0 +1,706 @@ +--- +title: Deployment roles +--- + +# Deployment roles + +This document defines least-privilege IAM policies for the CloudFormation execution role used during `cdk deploy`. 
The default CDK bootstrap grants `AdministratorAccess` to this role; the policies below scope it to only what ABCA needs. + + > **Origin**: These IAM policies were derived from a thorough review of the repository's CDK constructs, stacks, and handler code, then **validated against a live deployment** in `us-east-1` (create, update, task execution, and destroy). CloudTrail analysis identified 36 additional actions beyond the initial code review, and 7 deployment iterations refined the policies to their current form. The policies are split into three managed policies to stay under the IAM 6,144-character limit. + + ## How CDK deployment roles work + + `cdk bootstrap` creates **five roles** that CDK uses during synthesis and deployment: + + 1. **CDK Deploy Role** -- assumed by the CLI user to initiate deployment + 2. **CDK Lookup Role** -- performs context lookups (VPC, AZs, etc.) during synthesis + 3. **CDK File Publishing Role** -- uploads Lambda zip assets to S3 + 4. **CDK Image Publishing Role** -- pushes Docker images to ECR + 5. **CloudFormation Execution Role** -- assumed by CloudFormation to create/modify/delete resources + + The policies below replace the `AdministratorAccess` policy that bootstrap attaches to the **CloudFormation Execution Role** by default. The other four roles are scoped by the bootstrap template and do not need modification for least-privilege deployment. + + ## Using these policies + + The policies are split into three IAM managed policies (each under the 6,144-character limit): + + | Policy Name | Scope | +|-------------|-------| +| `IaCRole-ABCA-Infrastructure` | CloudFormation, IAM, VPC networking, Route 53 Resolver DNS Firewall | +| `IaCRole-ABCA-Application` | DynamoDB, Lambda, API Gateway, Cognito, WAFv2, EventBridge, Secrets Manager | +| `IaCRole-ABCA-Observability` | Bedrock AgentCore, Bedrock Guardrails, CloudWatch, X-Ray, S3, ECR, KMS, SSM, STS | + + > **Placeholder substitution**: Replace `ACCOUNT_ID` with your 12-digit AWS account ID and `REGION` with your deployment region (e.g., `us-east-1`) throughout this document.
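
Before creating the policies, a quick local check can confirm that each saved document still fits the 6,144-character quota. This matters most if you later add the optional ECS statement to `IaCRole-ABCA-Application`. The sketch below assumes you have saved each JSON policy from this page to a local file; the file names are hypothetical. IAM does not count whitespace toward the quota, so the check counts only non-whitespace characters.

```bash
#!/usr/bin/env bash
# Sketch: pre-flight check that each saved policy document fits the IAM
# managed-policy quota. IAM ignores whitespace when sizing a policy, so this
# counts only non-whitespace bytes (equal to characters for ASCII JSON).
# It also drops spaces inside string values, so treat the count as approximate.
set -euo pipefail

IAM_MANAGED_POLICY_MAX_CHARS=6144

# policy_char_count FILE -> prints the number of non-whitespace characters
policy_char_count() {
  tr -d '[:space:]' < "$1" | wc -c | tr -d ' '
}

# check_policy_size FILE -> returns 0 if FILE fits the quota, 1 otherwise
check_policy_size() {
  local count
  count=$(policy_char_count "$1")
  if (( count > IAM_MANAGED_POLICY_MAX_CHARS )); then
    echo "too large: $1 has ${count} chars (max ${IAM_MANAGED_POLICY_MAX_CHARS})" >&2
    return 1
  fi
  echo "ok: $1 has ${count} chars"
}

# Hypothetical file names: save each JSON policy from this page to these files.
for f in IaCRole-ABCA-Infrastructure.json IaCRole-ABCA-Application.json IaCRole-ABCA-Observability.json; do
  if [[ -f "$f" ]]; then
    check_policy_size "$f"
  fi
done
```

Because `set -e` is active, the loop stops at the first oversized policy, which is a reasonable place to split a statement into a fourth managed policy.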
+ +```bash +# Create all three policies in your account, then re-bootstrap: +cdk bootstrap aws://ACCOUNT_ID/REGION \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Infrastructure" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Application" \ + --cloudformation-execution-policies "arn:aws:iam::ACCOUNT_ID:policy/IaCRole-ABCA-Observability" +``` + +The `--cloudformation-execution-policies` flag can be repeated to attach multiple policies to the CloudFormation execution role. + +## Trust policy + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "cloudformation.amazonaws.com" + }, + "Action": "sts:AssumeRole" + }, + { + "Effect": "Allow", + "Principal": { + "AWS": "arn:aws:iam::ACCOUNT_ID:root" + }, + "Action": "sts:AssumeRole", + "Condition": { + "StringEquals": { + "sts:ExternalId": "cdk-hnb659fds" + } + } + } + ] +} +``` + +## IaCRole-ABCA + +For deploying the `backgroundagent-dev` stack. This single stack contains all platform resources including the AgentCore runtime, ECS compute (when enabled), API Gateway, Cognito, DynamoDB tables, VPC, DNS Firewall, and observability infrastructure. + +> **IAM managed policy size limit**: A single managed policy cannot exceed 6,144 characters. The permissions below are split into three policies to stay under this limit. Use all three when re-bootstrapping (see [Using these policies](#using-these-policies)). + +### IaCRole-ABCA-Infrastructure + +CloudFormation stack operations, IAM roles/policies, VPC networking, and Route 53 Resolver DNS Firewall. 
+ +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "CloudFormationSelf", + "Effect": "Allow", + "Action": [ + "cloudformation:CreateStack", + "cloudformation:UpdateStack", + "cloudformation:DeleteStack", + "cloudformation:DescribeStacks", + "cloudformation:DescribeStackEvents", + "cloudformation:DescribeStackResources", + "cloudformation:GetTemplate", + "cloudformation:GetTemplateSummary", + "cloudformation:ListStackResources", + "cloudformation:CreateChangeSet", + "cloudformation:DeleteChangeSet", + "cloudformation:DescribeChangeSet", + "cloudformation:ExecuteChangeSet", + "cloudformation:SetStackPolicy", + "cloudformation:ValidateTemplate", + "cloudformation:ListChangeSets" + ], + "Resource": [ + "arn:aws:cloudformation:*:*:stack/backgroundagent-dev/*", + "arn:aws:cloudformation:*:*:stack/CDKToolkit/*" + ] + }, + { + "Sid": "IAMRolesAndPolicies", + "Effect": "Allow", + "Action": [ + "iam:CreateRole", + "iam:DeleteRole", + "iam:GetRole", + "iam:UpdateRole", + "iam:TagRole", + "iam:UntagRole", + "iam:ListRoleTags", + "iam:AttachRolePolicy", + "iam:DetachRolePolicy", + "iam:PutRolePolicy", + "iam:DeleteRolePolicy", + "iam:GetRolePolicy", + "iam:ListRolePolicies", + "iam:ListAttachedRolePolicies", + "iam:CreatePolicy", + "iam:DeletePolicy", + "iam:GetPolicy", + "iam:GetPolicyVersion", + "iam:CreatePolicyVersion", + "iam:DeletePolicyVersion", + "iam:ListPolicyVersions", + "iam:TagPolicy", + "iam:CreateServiceLinkedRole", + "iam:ListInstanceProfilesForRole" + ], + "Resource": [ + "arn:aws:iam::*:role/backgroundagent-dev-*", + "arn:aws:iam::*:policy/backgroundagent-dev-*", + "arn:aws:iam::*:role/aws-service-role/*" + ] + }, + { + "Sid": "IAMPassRole", + "Effect": "Allow", + "Action": "iam:PassRole", + "Resource": "arn:aws:iam::*:role/backgroundagent-dev-*", + "Condition": { + "StringEquals": { + "iam:PassedToService": [ + "lambda.amazonaws.com", + "ecs-tasks.amazonaws.com", + "ecs.amazonaws.com", + "apigateway.amazonaws.com", + "logs.amazonaws.com", 
+ "bedrock.amazonaws.com", + "bedrock-agentcore.amazonaws.com", + "events.amazonaws.com", + "vpc-flow-logs.amazonaws.com" + ] + } + } + }, + { + "Sid": "VPCNetworking", + "Effect": "Allow", + "Action": [ + "ec2:CreateVpc", + "ec2:DeleteVpc", + "ec2:DescribeVpcs", + "ec2:ModifyVpcAttribute", + "ec2:CreateSubnet", + "ec2:DeleteSubnet", + "ec2:DescribeSubnets", + "ec2:CreateInternetGateway", + "ec2:DeleteInternetGateway", + "ec2:AttachInternetGateway", + "ec2:DetachInternetGateway", + "ec2:DescribeInternetGateways", + "ec2:AllocateAddress", + "ec2:ReleaseAddress", + "ec2:DescribeAddresses", + "ec2:CreateNatGateway", + "ec2:DeleteNatGateway", + "ec2:DescribeNatGateways", + "ec2:CreateRouteTable", + "ec2:DeleteRouteTable", + "ec2:DescribeRouteTables", + "ec2:AssociateRouteTable", + "ec2:DisassociateRouteTable", + "ec2:CreateRoute", + "ec2:DeleteRoute", + "ec2:CreateSecurityGroup", + "ec2:DeleteSecurityGroup", + "ec2:DescribeSecurityGroups", + "ec2:AuthorizeSecurityGroupEgress", + "ec2:RevokeSecurityGroupEgress", + "ec2:AuthorizeSecurityGroupIngress", + "ec2:RevokeSecurityGroupIngress", + "ec2:CreateVpcEndpoint", + "ec2:DeleteVpcEndpoints", + "ec2:DescribeVpcEndpoints", + "ec2:ModifyVpcEndpoint", + "ec2:CreateFlowLogs", + "ec2:DeleteFlowLogs", + "ec2:DescribeFlowLogs", + "ec2:CreateTags", + "ec2:DeleteTags", + "ec2:DescribeTags", + "ec2:DescribeAvailabilityZones", + "ec2:DescribeNetworkInterfaces", + "ec2:DescribePrefixLists", + "ec2:DescribeNetworkAcls", + "ec2:DescribeVpcAttribute", + "ec2:ModifySubnetAttribute" + ], + "Resource": "*" + }, + { + "Sid": "Route53ResolverDNSFirewall", + "Effect": "Allow", + "Action": [ + "route53resolver:CreateFirewallRuleGroup", + "route53resolver:DeleteFirewallRuleGroup", + "route53resolver:GetFirewallRuleGroup", + "route53resolver:CreateFirewallRule", + "route53resolver:DeleteFirewallRule", + "route53resolver:ListFirewallRules", + "route53resolver:UpdateFirewallRule", + "route53resolver:CreateFirewallDomainList", + 
"route53resolver:DeleteFirewallDomainList", + "route53resolver:GetFirewallDomainList", + "route53resolver:UpdateFirewallDomains", + "route53resolver:AssociateFirewallRuleGroup", + "route53resolver:DisassociateFirewallRuleGroup", + "route53resolver:GetFirewallRuleGroupAssociation", + "route53resolver:ListFirewallRuleGroupAssociations", + "route53resolver:UpdateFirewallConfig", + "route53resolver:GetFirewallConfig", + "route53resolver:TagResource", + "route53resolver:UntagResource", + "route53resolver:ListTagsForResource", + "route53resolver:CreateResolverQueryLogConfig", + "route53resolver:DeleteResolverQueryLogConfig", + "route53resolver:GetResolverQueryLogConfig", + "route53resolver:AssociateResolverQueryLogConfig", + "route53resolver:DisassociateResolverQueryLogConfig", + "route53resolver:GetResolverQueryLogConfigAssociation", + "route53resolver:ListResolverQueryLogConfigAssociations", + "route53resolver:ListResolverQueryLogConfigs" + ], + "Resource": "*" + } + ] +} +``` + +### IaCRole-ABCA-Application + +DynamoDB tables, Lambda functions, API Gateway, Cognito, WAFv2, EventBridge, and Secrets Manager. When ECS Fargate compute is enabled, add the ECS statement below to this policy. 
+ +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "DynamoDB", + "Effect": "Allow", + "Action": [ + "dynamodb:CreateTable", + "dynamodb:DeleteTable", + "dynamodb:DescribeTable", + "dynamodb:DescribeTimeToLive", + "dynamodb:UpdateTimeToLive", + "dynamodb:UpdateTable", + "dynamodb:UpdateContinuousBackups", + "dynamodb:DescribeContinuousBackups", + "dynamodb:TagResource", + "dynamodb:UntagResource", + "dynamodb:ListTagsOfResource", + "dynamodb:PutItem", + "dynamodb:UpdateItem", + "dynamodb:DescribeContributorInsights", + "dynamodb:DescribeKinesisStreamingDestination", + "dynamodb:GetResourcePolicy" + ], + "Resource": "arn:aws:dynamodb:*:*:table/backgroundagent-dev-*" + }, + { + "Sid": "Lambda", + "Effect": "Allow", + "Action": [ + "lambda:CreateFunction", + "lambda:DeleteFunction", + "lambda:GetFunction", + "lambda:GetFunctionConfiguration", + "lambda:UpdateFunctionCode", + "lambda:UpdateFunctionConfiguration", + "lambda:AddPermission", + "lambda:RemovePermission", + "lambda:GetPolicy", + "lambda:TagResource", + "lambda:UntagResource", + "lambda:ListTags", + "lambda:PublishVersion", + "lambda:CreateAlias", + "lambda:DeleteAlias", + "lambda:GetAlias", + "lambda:UpdateAlias", + "lambda:PutFunctionEventInvokeConfig", + "lambda:DeleteFunctionEventInvokeConfig", + "lambda:GetFunctionEventInvokeConfig", + "lambda:PutFunctionConcurrency", + "lambda:DeleteFunctionConcurrency", + "lambda:GetFunctionCodeSigningConfig", + "lambda:GetFunctionRecursionConfig", + "lambda:GetProvisionedConcurrencyConfig", + "lambda:GetRuntimeManagementConfig", + "lambda:ListVersionsByFunction", + "lambda:InvokeFunction" + ], + "Resource": [ + "arn:aws:lambda:*:*:function:backgroundagent-dev-*", + "arn:aws:lambda:*:*:function:backgroundagent-dev-AWS*" + ] + }, + { + "Sid": "APIGateway", + "Effect": "Allow", + "Action": [ + "apigateway:POST", + "apigateway:GET", + "apigateway:PUT", + "apigateway:PATCH", + "apigateway:DELETE", + "apigateway:TagResource", + 
"apigateway:UntagResource", + "apigateway:SetWebACL", + "apigateway:UpdateRestApiPolicy" + ], + "Resource": [ + "arn:aws:apigateway:*::/restapis", + "arn:aws:apigateway:*::/restapis/*", + "arn:aws:apigateway:*::/account", + "arn:aws:apigateway:*::/tags/*" + ] + }, + { + "Sid": "Cognito", + "Effect": "Allow", + "Action": [ + "cognito-idp:CreateUserPool", + "cognito-idp:DeleteUserPool", + "cognito-idp:DescribeUserPool", + "cognito-idp:UpdateUserPool", + "cognito-idp:CreateUserPoolClient", + "cognito-idp:DeleteUserPoolClient", + "cognito-idp:DescribeUserPoolClient", + "cognito-idp:UpdateUserPoolClient", + "cognito-idp:TagResource", + "cognito-idp:UntagResource", + "cognito-idp:ListTagsForResource", + "cognito-idp:GetUserPoolMfaConfig" + ], + "Resource": "arn:aws:cognito-idp:*:*:userpool/*" + }, + { + "Sid": "WAFv2", + "Effect": "Allow", + "Action": [ + "wafv2:CreateWebACL", + "wafv2:DeleteWebACL", + "wafv2:GetWebACL", + "wafv2:UpdateWebACL", + "wafv2:AssociateWebACL", + "wafv2:DisassociateWebACL", + "wafv2:ListTagsForResource", + "wafv2:TagResource", + "wafv2:UntagResource", + "wafv2:GetWebACLForResource" + ], + "Resource": [ + "arn:aws:wafv2:*:*:regional/webacl/*", + "arn:aws:wafv2:*:*:regional/managedruleset/*" + ] + }, + { + "Sid": "EventBridge", + "Effect": "Allow", + "Action": [ + "events:PutRule", + "events:DeleteRule", + "events:DescribeRule", + "events:PutTargets", + "events:RemoveTargets", + "events:ListTargetsByRule", + "events:TagResource", + "events:UntagResource", + "events:ListTagsForResource" + ], + "Resource": "arn:aws:events:*:*:rule/backgroundagent-dev-*" + }, + { + "Sid": "SecretsManager", + "Effect": "Allow", + "Action": [ + "secretsmanager:CreateSecret", + "secretsmanager:DeleteSecret", + "secretsmanager:DescribeSecret", + "secretsmanager:GetSecretValue", + "secretsmanager:PutSecretValue", + "secretsmanager:UpdateSecret", + "secretsmanager:TagResource", + "secretsmanager:UntagResource", + "secretsmanager:GetResourcePolicy", + 
"secretsmanager:PutResourcePolicy", + "secretsmanager:DeleteResourcePolicy" + ], + "Resource": [ + "arn:aws:secretsmanager:*:*:secret:backgroundagent-*", + "arn:aws:secretsmanager:*:*:secret:GitHubTokenSecret*" + ] + }, + { + "Sid": "SecretsManagerAccountLevel", + "Effect": "Allow", + "Action": "secretsmanager:GetRandomPassword", + "Resource": "*" + } + ] +} +``` + +### IaCRole-ABCA-Observability + +Bedrock AgentCore, Bedrock Guardrails, CloudWatch Logs/Dashboards/Alarms, X-Ray, S3 (CDK assets), KMS, ECR, SSM, and STS. + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "BedrockAgentCore", + "Effect": "Allow", + "Action": [ + "bedrock-agentcore:*" + ], + "Resource": "*" + }, + { + "Sid": "BedrockGuardrailsAndLogging", + "Effect": "Allow", + "Action": [ + "bedrock:CreateGuardrail", + "bedrock:DeleteGuardrail", + "bedrock:GetGuardrail", + "bedrock:UpdateGuardrail", + "bedrock:CreateGuardrailVersion", + "bedrock:ListGuardrails", + "bedrock:TagResource", + "bedrock:UntagResource", + "bedrock:ListTagsForResource", + "bedrock:PutModelInvocationLoggingConfiguration", + "bedrock:DeleteModelInvocationLoggingConfiguration", + "bedrock:GetModelInvocationLoggingConfiguration" + ], + "Resource": "*" + }, + { + "Sid": "CloudWatchLogsAndDashboards", + "Effect": "Allow", + "Action": [ + "logs:CreateLogGroup", + "logs:DeleteLogGroup", + "logs:DescribeLogGroups", + "logs:PutRetentionPolicy", + "logs:DeleteRetentionPolicy", + "logs:TagLogGroup", + "logs:UntagLogGroup", + "logs:TagResource", + "logs:UntagResource", + "logs:ListTagsForResource", + "logs:ListTagsLogGroup", + "logs:PutResourcePolicy", + "logs:DeleteResourcePolicy", + "logs:DescribeResourcePolicies", + "cloudwatch:PutDashboard", + "cloudwatch:DeleteDashboards", + "cloudwatch:GetDashboard", + "cloudwatch:PutMetricAlarm", + "cloudwatch:DeleteAlarms", + "cloudwatch:DescribeAlarms", + "cloudwatch:TagResource", + "cloudwatch:UntagResource", + "logs:CreateDelivery", + "logs:DescribeDeliveries", + 
"logs:GetDelivery", + "logs:GetDeliveryDestination", + "logs:GetDeliveryDestinationPolicy", + "logs:GetDeliverySource", + "logs:PutDeliveryDestination", + "logs:PutDeliverySource", + "logs:DescribeIndexPolicies", + "cloudwatch:ListTagsForResource", + "logs:CreateLogDelivery", + "logs:DeleteLogDelivery", + "logs:GetLogDelivery", + "logs:UpdateLogDelivery", + "logs:ListLogDeliveries", + "logs:DeleteDelivery", + "logs:DeleteDeliverySource", + "logs:DeleteDeliveryDestination" + ], + "Resource": "*" + }, + { + "Sid": "S3CDKAssets", + "Effect": "Allow", + "Action": [ + "s3:GetObject", + "s3:PutObject", + "s3:GetBucketLocation", + "s3:ListBucket" + ], + "Resource": [ + "arn:aws:s3:::cdk-hnb659fds-assets-*", + "arn:aws:s3:::cdk-hnb659fds-assets-*/*" + ] + }, + { + "Sid": "KMSForCDKAssets", + "Effect": "Allow", + "Action": [ + "kms:CreateGrant", + "kms:Decrypt", + "kms:DescribeKey", + "kms:Encrypt", + "kms:GenerateDataKey" + ], + "Resource": "*" + }, + { + "Sid": "ECRForDockerAssets", + "Effect": "Allow", + "Action": [ + "ecr:CreateRepository", + "ecr:DescribeRepositories", + "ecr:GetAuthorizationToken", + "ecr:BatchCheckLayerAvailability", + "ecr:GetDownloadUrlForLayer", + "ecr:BatchGetImage", + "ecr:PutImage", + "ecr:InitiateLayerUpload", + "ecr:UploadLayerPart", + "ecr:CompleteLayerUpload", + "ecr:SetRepositoryPolicy", + "ecr:GetRepositoryPolicy", + "ecr:DeleteRepository", + "ecr:ListTagsForResource", + "ecr:TagResource" + ], + "Resource": [ + "arn:aws:ecr:*:*:repository/cdk-hnb659fds-container-assets-*", + "arn:aws:ecr:*:*:repository/backgroundagent-*" + ] + }, + { + "Sid": "ECRAuthToken", + "Effect": "Allow", + "Action": "ecr:GetAuthorizationToken", + "Resource": "*" + }, + { + "Sid": "XRay", + "Effect": "Allow", + "Action": [ + "xray:UpdateTraceSegmentDestination", + "xray:GetTraceSegmentDestination", + "xray:ListResourcePolicies", + "xray:PutResourcePolicy" + ], + "Resource": "*" + }, + { + "Sid": "SSMParameterStoreForCDK", + "Effect": "Allow", + "Action": [ + 
"ssm:GetParameter", + "ssm:GetParameters", + "ssm:PutParameter", + "ssm:DeleteParameter" + ], + "Resource": "arn:aws:ssm:*:*:parameter/cdk-bootstrap/*" + }, + { + "Sid": "STSForCDK", + "Effect": "Allow", + "Action": [ + "sts:AssumeRole", + "sts:GetCallerIdentity" + ], + "Resource": [ + "arn:aws:iam::*:role/cdk-hnb659fds-*" + ] + } + ] +} +``` + +### When ECS compute is enabled + +If you uncomment the ECS blocks in `cdk/src/stacks/agent.ts` to enable the Fargate compute backend, add the following statement to the `IaCRole-ABCA-Application` policy (the combined policy remains under the 6,144-character IAM limit): + +```json +{ + "Sid": "ECS", + "Effect": "Allow", + "Action": [ + "ecs:CreateCluster", + "ecs:DeleteCluster", + "ecs:DescribeClusters", + "ecs:UpdateCluster", + "ecs:UpdateClusterSettings", + "ecs:PutClusterCapacityProviders", + "ecs:RegisterTaskDefinition", + "ecs:DeregisterTaskDefinition", + "ecs:DescribeTaskDefinition", + "ecs:ListTaskDefinitions", + "ecs:TagResource", + "ecs:UntagResource", + "ecs:ListTagsForResource", + "ecs:PutAccountSetting" + ], + "Resource": "*" +} +``` + +## Runtime IAM roles (created by the stack) + +These roles are created inside the CloudFormation stack at deploy time, not by the deployer. They are documented here for a complete picture of the IAM footprint. 
+ +| Role | Assumed By | Purpose | +|------|-----------|---------| +| AgentCore Runtime execution role | AgentCore Runtime | Runs MicroVM containers; DynamoDB, Secrets Manager, CloudWatch Logs, Bedrock, AgentCore Memory access | +| BedrockLoggingRole | `bedrock.amazonaws.com` | Writes model invocation logs to CloudWatch | +| TaskOrchestrator Lambda role | Lambda | Durable orchestrator; DynamoDB, Secrets Manager, AgentCore Runtime invocation, AgentCore Memory | +| ConcurrencyReconciler Lambda role | Lambda | Scheduled reconciliation; DynamoDB scan + conditional updates | +| TaskApi Lambda roles (9-10) | Lambda | API handler functions; DynamoDB, Secrets Manager (webhook handlers), Bedrock Guardrail, Lambda invoke | +| AwsCustomResource Lambda role | Lambda | Blueprint DDB writes, Bedrock logging config, DNS firewall config | +| API Gateway CloudWatch role | API Gateway | Pushes API Gateway access logs | +| VPC Flow Log role | VPC Flow Logs | Writes flow logs to CloudWatch | +| ECS task execution role (when enabled) | ECS (pull images) | ECR image pull, CloudWatch Logs write | +| ECS task role (when enabled) | ECS (container runtime) | DynamoDB, Secrets Manager, Bedrock InvokeModel | + +### CDK bootstrap roles + +| Role | Purpose | +|------|---------| +| `cdk-hnb659fds-deploy-role-*` | Assumed by CDK CLI to initiate deployments | +| `cdk-hnb659fds-cfn-exec-role-*` | Assumed by CloudFormation to create resources (**this is what IaCRole-ABCA replaces**) | +| `cdk-hnb659fds-file-publish-role-*` | Uploads Lambda zip assets to S3 | +| `cdk-hnb659fds-image-publish-role-*` | Pushes Docker images to ECR | +| `cdk-hnb659fds-lookup-role-*` | Context lookups (VPC, AZs, etc.) 
| + +## Resource-level permission constraints + +Several services require `Resource: "*"` because they do not support resource-level permissions for create/describe operations: + +| Service | Actions Requiring `"*"` | Reason | +|---------|------------------------|--------| +| EC2 (VPC) | `Create*`, `Describe*`, `Allocate*` | VPC resource ARNs unknown at policy creation time | +| Route 53 Resolver | All DNS Firewall actions | No resource-level ARN support for firewall rule groups | +| Bedrock | Guardrail + logging config actions | Account-level APIs (`PutModelInvocationLoggingConfiguration`) | +| Bedrock AgentCore | All actions (`bedrock-agentcore:*`) | CloudFormation resource handler uses internal action names that differ from the public API; wildcard required for reliable deployment | +| CloudWatch Logs | `CreateLogGroup`, `PutResourcePolicy` | Log group ARNs unknown at policy creation; resource policies are account-scoped | +| ECS | Cluster + task definition actions | `RegisterTaskDefinition` doesn't support resource-level permissions | +| ECR | `GetAuthorizationToken` | Account-level operation | +| KMS | `CreateGrant`, `Decrypt`, `Encrypt`, `GenerateDataKey` | CDK asset encryption keys; key ARNs unknown at policy time | +| Secrets Manager | `GetRandomPassword` | Account-level API (no secret ARN); isolated in its own statement with `Resource: "*"` | +| X-Ray | `UpdateTraceSegmentDestination`, `PutResourcePolicy` | Account-level operations | + +These constraints align with the CDK Nag `AwsSolutions-IAM5` suppressions in the codebase. + +## Iterative tightening + +These policies are conservative-but-scoped starting points. To tighten further: + +1. **Deploy once with CloudTrail enabled**, then use [IAM Access Analyzer policy generation](https://docs.aws.amazon.com/IAM/latest/UserGuide/access-analyzer-policy-generation.html) to generate a least-privilege policy based on the actual API calls recorded in CloudTrail. +2. 
**Replace `*` resources** with actual ARNs after the first deploy (e.g., once you know the VPC ID, scope EC2 actions to that VPC). +3. **Add region conditions** where possible (e.g., a `StringEquals` condition on `"aws:RequestedRegion": "us-east-1"`) to prevent cross-region resource creation. IAM is a global service whose calls always report `us-east-1` as the requested Region, so allow that Region (or exempt IAM actions) if you deploy elsewhere. +4. **Restrict `iam:AttachRolePolicy`** with an `iam:PolicyARN` condition to limit which policies can be attached to `backgroundagent-dev-*` roles. This requires enumerating the AWS managed policies CDK attaches (e.g., `service-role/AWSLambdaBasicExecutionRole`) from a synthesized template, so it is deferred to a post-deployment tightening pass. +5. **Scope `iam:CreateServiceLinkedRole`** with an `iam:AWSServiceName` condition to limit which AWS services can have service-linked roles created. After a first deploy, check CloudTrail for which service-linked roles were actually created and restrict accordingly. +6. **Scope KMS actions** with a `kms:ResourceAliases` condition (e.g., `"ForAnyValue:StringLike": {"kms:ResourceAliases": "alias/cdk-hnb659fds-*"}`; the key is multivalued, so it needs a set operator) to limit `CreateGrant`, `Decrypt`, `Encrypt`, and `GenerateDataKey` to the CDK bootstrap key. This applies only if you bootstrapped with a customer-managed key (`cdk bootstrap --bootstrap-customer-key`), whose alias is derived from the bootstrap qualifier. +7. **Use permission boundaries** on the IaC role to set an outer limit even if the identity policy is broader than intended. +8. **Review after each CDK version upgrade** -- new CDK versions may add or remove custom resources that need different permissions. + +## Reference + +- [SECURITY.md](/architecture/security) -- Runtime IAM, memory isolation, custom step trust boundaries. +- [COMPUTE.md](/architecture/compute) -- Compute backend options (AgentCore vs ECS Fargate). +- [COST_MODEL.md](/architecture/cost-model) -- Infrastructure baseline costs and scale-to-zero analysis. 
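+
+## Example: tightened KMS statement
+
+The following sketch applies steps 3 and 6 under Iterative tightening to narrow the `KMSForCDKAssets` statement from `IaCRole-ABCA-Observability`. It rests on two assumptions:
+
+- You bootstrapped with a customer-managed key (for example, `cdk bootstrap --bootstrap-customer-key`) whose alias starts with `alias/cdk-hnb659fds-`.
+- You deploy only to `us-east-1`.
+
+Check both against your own account and CloudTrail history before adopting the statement. Because `kms:ResourceAliases` is a multivalued condition key, it is matched with the `ForAnyValue:StringLike` set operator rather than a plain `StringLike`.
+
+```json
+{
+  "Sid": "KMSForCDKAssetsTightened",
+  "Effect": "Allow",
+  "Action": [
+    "kms:CreateGrant",
+    "kms:Decrypt",
+    "kms:DescribeKey",
+    "kms:Encrypt",
+    "kms:GenerateDataKey"
+  ],
+  "Resource": "*",
+  "Condition": {
+    "ForAnyValue:StringLike": {
+      "kms:ResourceAliases": "alias/cdk-hnb659fds-*"
+    },
+    "StringEquals": {
+      "aws:RequestedRegion": "us-east-1"
+    }
+  }
+}
+```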
diff --git a/docs/src/content/docs/getting-started/Deployment-guide.md b/docs/src/content/docs/getting-started/Deployment-guide.md new file mode 100644 index 0000000..1ea3ee1 --- /dev/null +++ b/docs/src/content/docs/getting-started/Deployment-guide.md @@ -0,0 +1,127 @@ +--- +title: Deployment guide +--- + +# Deployment guide + +This guide covers deploying ABCA into an AWS account, including compute backend choices, scale-to-zero characteristics, and the complete AWS service inventory. For day-to-day development workflow, see the [Developer guide](/developer-guide/introduction). For a quick first deployment, see the [Quick start](/getting-started/quick-start). For least-privilege IAM deployment roles, see [DEPLOYMENT_ROLES.md](/architecture/deployment-roles). + +## Architecture overview + +ABCA deploys as a **single CDK stack** (`backgroundagent-dev`) containing all platform resources. The stack uses a `ComputeStrategy` interface to support two compute backends within the same stack: + +| Aspect | AgentCore (default) | ECS Fargate (opt-in) | +|--------|--------------------|--------------------| +| **Compute** | Bedrock AgentCore Runtime (Firecracker MicroVMs) | ECS Fargate containers | +| **Resources** | 2 vCPU, 8 GB RAM, 2 GB max image size | 2 vCPU, 4 GB RAM | +| **Orchestration** | Durable Lambda (checkpoint/replay) | Same durable Lambda via `ComputeStrategy` | +| **Agent mode** | FastAPI server (HTTP invocation) | Batch (run-to-completion) | +| **Startup** | ~10s (warm MicroVM) | ~60-180s (Fargate cold start) | +| **Max duration** | 8 hours (AgentCore service limit) | 9 hours (orchestrator `executionTimeout`) | + +Both backends are orchestrated by the same durable Lambda function. The `ComputeStrategy` interface abstracts `startSession()`, `pollSession()`, and `stopSession()` -- the ECS strategy calls `ecs:RunTask` / `ecs:DescribeTasks` / `ecs:StopTask` directly from the Lambda. No Step Functions are used. 
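+
+When the ECS strategy is enabled, the orchestrator Lambda's execution role needs permission for those three ECS calls. It also needs `iam:PassRole` on the task role and task execution role referenced by the task definition, because the caller of `RunTask` must be allowed to pass those roles to ECS.
+
+The policy below is a minimal sketch under that assumption. The `backgroundagent-dev-*` name patterns and the `Sid` values are hypothetical placeholders, not the names the stack generates. Compare the sketch with the synthesized template before relying on it.
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Sid": "StartAgentTasks",
+      "Effect": "Allow",
+      "Action": "ecs:RunTask",
+      "Resource": "arn:aws:ecs:*:*:task-definition/backgroundagent-dev-*:*"
+    },
+    {
+      "Sid": "PollAndStopAgentTasks",
+      "Effect": "Allow",
+      "Action": [
+        "ecs:DescribeTasks",
+        "ecs:StopTask"
+      ],
+      "Resource": "arn:aws:ecs:*:*:task/*"
+    },
+    {
+      "Sid": "PassAgentTaskRoles",
+      "Effect": "Allow",
+      "Action": "iam:PassRole",
+      "Resource": "arn:aws:iam::*:role/backgroundagent-dev-*",
+      "Condition": {
+        "StringEquals": {
+          "iam:PassedToService": "ecs-tasks.amazonaws.com"
+        }
+      }
+    }
+  ]
+}
+```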
+ +ECS Fargate is currently **opt-in** -- the `EcsAgentCluster` construct is present in the stack code but commented out. To enable it, uncomment the ECS blocks in `cdk/src/stacks/agent.ts`. + +## Scale-to-zero analysis + +### Components that scale to zero (pay-per-use) + +| Component | Billing Model | Idle Cost | +|-----------|--------------|-----------| +| DynamoDB (5 tables) | PAY_PER_REQUEST | $0 | +| Lambda (all functions) | Per invocation | $0 | +| API Gateway REST | Per request | $0 | +| ECS Fargate tasks (when enabled) | Per running task | $0 (cluster is free) | +| AgentCore Runtime | Per session | $0 | +| Bedrock inference | Per token | $0 | +| AgentCore Memory | Proportional to usage | ~$0 | +| Cognito | Free tier (50K MAU) | $0 | + +### Components that do not scale to zero (always-on) + +| Component | Est. Monthly Idle Cost | Why | +|-----------|----------------------|-----| +| NAT Gateway (1x) | ~$32 | $0.045/hr fixed charge | +| VPC Interface Endpoints (7x, 2 AZs) | ~$102 | $0.01/hr × 7 endpoints × 2 AZs × 730 hrs | +| WAF v2 Web ACL | ~$5 | Base monthly charge | +| CloudWatch Dashboard | ~$3 | Per-dashboard charge | +| Secrets Manager (1+ secrets) | ~$0.40/secret | Per-secret monthly | +| CloudWatch Alarms | ~$0.10/alarm | Per standard alarm | +| CloudWatch Logs retention | ~$1-5 | Storage for retained logs | +| **Total always-on baseline** | **~$140-150/month** | | + +The dominant idle cost is VPC networking: 7 interface endpoints across 2 AZs (~$102/month) plus the NAT Gateway (~$32/month). The hourly rates above are us-east-1 on-demand prices; other Regions differ. + +For the full cost model including per-task costs, see [COST_MODEL.md](/architecture/cost-model). 
+ +## AWS services inventory + +### Compute + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| Bedrock AgentCore Runtime (MicroVMs) | Agent sessions (default) | Yes | +| ECS Fargate (when enabled) | Agent sessions (opt-in) | Yes | +| Lambda (Node.js 24, ARM64) | Orchestrator, API handlers, reconciler, custom resources | Yes | + +### AI/ML + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| Bedrock (Claude Sonnet 4.6, Opus 4, Haiku 4.5) | Agent reasoning, cross-region inference profiles | Yes | +| Bedrock Guardrails | Prompt injection detection on task input | Yes | +| Bedrock AgentCore Memory | Semantic + episodic extraction strategies | Yes | + +### Networking + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| VPC (public + private subnets, 2 AZs) | All compute | N/A (no direct cost) | +| NAT Gateway (1x) | Private subnet internet egress | **No** (~$32/mo) | +| VPC Interface Endpoints (7x, 2 AZs) | AWS service connectivity from private subnets | **No** (~$102/mo) | +| VPC Gateway Endpoints (2x: S3, DynamoDB) | S3 and DynamoDB connectivity | Yes (free) | +| Security Groups | HTTPS-only egress | N/A | +| Route 53 Resolver DNS Firewall | Domain allowlisting for agent egress | Minimal | + +### Storage / Database + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| DynamoDB (5 tables, PAY_PER_REQUEST) | Task state, events, concurrency, webhooks, repo config | Yes | +| S3 | CDK asset bucket, ECR image layers, FUSE session storage | Minimal | +| Secrets Manager | GitHub PAT, webhook HMAC secrets | **No** (~$0.40/secret/mo) | + +### API / Auth + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| API Gateway (REST) | Task REST API | Yes | +| Cognito User Pool | CLI/API authentication | Yes (free tier) | +| WAF v2 | API Gateway protection (managed rules + rate limiting) | **No** (~$5/mo base) | + +### Observability + 
+| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| CloudWatch Logs (multiple log groups) | Application, usage, model invocation, VPC flow, DNS query logs | **No** (storage) | +| CloudWatch Dashboard | Operational metrics visualization | **No** (~$3/mo) | +| CloudWatch Alarms | Orchestrator error alerting | **No** (~$0.10/alarm) | +| X-Ray | AgentCore Runtime tracing | Yes | + +### Infrastructure / Deployment + +| Service | Used By | Scales to Zero | +|---------|---------|---------------| +| CloudFormation | Stack deployment, custom resources | N/A | +| ECR | Container image storage | Minimal | +| IAM | Roles and policies for all components | N/A | + +## Reference + +- [Quick start](/getting-started/quick-start) -- Zero-to-first-PR in 6 steps. +- [Developer guide](/developer-guide/introduction) -- Local development, testing, repository onboarding. +- [User guide](/using/overview) -- API reference, CLI usage, task management. +- [DEPLOYMENT_ROLES.md](/architecture/deployment-roles) -- Least-privilege IAM policies for CloudFormation execution. +- [COST_MODEL.md](/architecture/cost-model) -- Per-task costs, cost guardrails, cost at scale. +- [COMPUTE.md](/architecture/compute) -- Compute backend architecture and trade-offs. diff --git a/docs/src/content/docs/getting-started/Quick-start.md b/docs/src/content/docs/getting-started/Quick-start.md index 46e965b..b4c78fd 100644 --- a/docs/src/content/docs/getting-started/Quick-start.md +++ b/docs/src/content/docs/getting-started/Quick-start.md @@ -10,7 +10,7 @@ Go from zero to your first agent-created pull request in about 30 minutes. This Install these before you begin: -- **AWS account** with credentials configured (`aws configure`) +- **AWS account** with credentials configured (`aws configure`). If you use named profiles, set `AWS_PROFILE` before running any commands in this guide. 
- **Docker** - for building the agent container image - **mise** - task runner ([install guide](https://mise.jdx.dev/getting-started.html)) - **AWS CDK CLI** - `npm install -g aws-cdk` (after mise is active) @@ -39,6 +39,8 @@ mise run build `mise run install` installs all JavaScript and Python dependencies across the monorepo. `mise run build` compiles the CDK app, the CLI, the agent image, and the docs site. A successful build means you are ready to deploy. +> **Note:** `mise run build` includes CDK synthesis, which queries AWS for availability zones. Your active AWS credentials must have at least `ec2:DescribeAvailabilityZones` permission, or the build will fail. If you use named profiles, make sure `AWS_PROFILE` is set before running the build. + ## Step 2 - Prepare a repository The agent works by cloning a GitHub repository, creating a branch, making code changes, running the build and tests, and opening a pull request. This means it needs **write access** to a real repository. @@ -79,7 +81,12 @@ The `repo` value must match **exactly** what you will pass to the CLI later (`ow The CDK stack deploys the full platform: API Gateway, Lambda functions (orchestrator, task CRUD, webhooks), DynamoDB tables, AgentCore Runtime, VPC with network isolation, Cognito user pool, and CloudWatch dashboards. ```bash -# One-time account setup (X-Ray destination) +# One-time account setup: allow X-Ray to write spans to CloudWatch Logs. +# On a fresh account, X-Ray needs a resource policy before the destination can be set. 
+ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) +aws logs put-resource-policy \ + --policy-name xray-spans-policy \ + --policy-document "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Sid\":\"XRaySpansAccess\",\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"xray.amazonaws.com\"},\"Action\":[\"logs:PutLogEvents\",\"logs:CreateLogGroup\",\"logs:CreateLogStream\"],\"Resource\":[\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans\",\"arn:aws:logs:*:${ACCOUNT_ID}:log-group:aws/spans:*\"]}]}" aws xray update-trace-segment-destination --destination CloudWatchLogs # Bootstrap CDK (first time only) @@ -89,7 +96,7 @@ mise run //cdk:bootstrap mise run //cdk:deploy ``` -The X-Ray command is a one-time per-account setup. CDK bootstrap provisions the staging resources CDK needs (S3 bucket, IAM roles). The deploy itself takes around 10 minutes - most of the time is spent building the Docker image and provisioning the AgentCore Runtime. +The X-Ray commands are a one-time setup per account and Region. On a fresh account, run the `put-resource-policy` call first; without it, the `update-trace-segment-destination` command fails with an `AccessDeniedException` because X-Ray cannot write to the `aws/spans` log group. CDK bootstrap provisions the staging resources CDK needs (S3 bucket, IAM roles). The deploy itself takes around 10 minutes - most of the time is spent building the Docker image and provisioning the AgentCore Runtime. ## Step 4 - Store the GitHub token @@ -187,7 +194,10 @@ Here is what the platform did after you ran `bgagent submit`: |---|---|---| | `yarn: command not found` | Corepack not enabled or mise not activated in your shell | Run `eval "$(mise activate zsh)"`, then `corepack enable && corepack prepare yarn@1.22.22 --activate` | | `MISE_EXPERIMENTAL required` | Namespaced tasks need the experimental flag | `export MISE_EXPERIMENTAL=1` | -| CDK deploy fails with "X-Ray Delivery Destination..." 
| Missing one-time account setup | `aws xray update-trace-segment-destination --destination CloudWatchLogs` | +| `AccessDeniedException` on `update-trace-segment-destination` | Fresh account missing CloudWatch Logs resource policy for X-Ray | Run `aws logs put-resource-policy` first (see Step 3) | +| CDK deploy fails with "X-Ray Delivery Destination..." | Missing one-time account setup | Run both X-Ray commands in Step 3 | +| `mise run build` fails with `ec2:DescribeAvailabilityZones` error | AWS credentials missing or insufficient for CDK synth | Set `AWS_PROFILE` or configure credentials with at least EC2 read access | +| CDK deploy prompts for approval and hangs | Non-interactive terminal (CI/CD, scripts) | Pass `--require-approval never` to `cdk deploy` or use an interactive terminal | | `put-secret-value` returns double-dot endpoint | `REGION` variable is empty | Set `REGION=us-east-1` (or your actual region) before running the command | | `REPO_NOT_ONBOARDED` on task submit | Blueprint `repo` does not match what you passed to the CLI | Check `cdk/src/stacks/agent.ts` - the `repo` value must be exactly `owner/repo` matching your fork | | `INSUFFICIENT_GITHUB_REPO_PERMISSIONS` | PAT is missing required permissions or is scoped to the wrong repo | Regenerate the PAT with Contents (read/write) and Pull requests (read/write) scoped to your fork, then update Secrets Manager |