From bbe9aaaae033d0710e4cca2a8c4a21efc5ecda56 Mon Sep 17 00:00:00 2001
From: James <james.russo@heygen.com>
Date: Sun, 17 May 2026 00:04:40 +0000
Subject: [PATCH 1/2] docs(lambda): add docs/deploy/aws-lambda.mdx deployment
 guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

End-to-end deploy guide for the AWS Lambda surface. Covers:

  - Architecture diagram (Step Functions Plan → Map(N) → Assemble +
    the single Lambda function dispatching by Action; pulled from
    the distributed rendering plan §15.2).
  - Prerequisites table (AWS creds, SAM CLI, bun, repo checkout).
  - Three deployment paths: hyperframes lambda CLI (recommended),
    direct sam deploy against examples/aws-lambda/template.yaml,
    and HyperframesRenderStack CDK construct.
  - IAM bootstrap via hyperframes lambda policies user/role/validate.
  - Cost shape — how Lambda GB-seconds + SFN transitions roll up
    into the displayCost the progress verb prints.
  - Troubleshooting block with the typed error names operators
    actually hit (PLAN_HASH_MISMATCH, BROWSER_GPU_NOT_SOFTWARE,
    iam:CreateRole denial, stuck RUNNING, S3 Retain semantics).
  - "What's NOT in v1" callout so adopters don't burn time looking
    for webhooks / compositions verb / HDR support.

Registered under a new "Deploy" group in docs.json's Documentation
tab, sitting after Packages so the conceptual flow is "what you
can build" → "how to ship it."

No code changes.
---
 docs/deploy/aws-lambda.mdx | 190 +++++++++++++++++++++++++++++++++++++
 docs/docs.json             |   6 ++
 2 files changed, 196 insertions(+)
 create mode 100644 docs/deploy/aws-lambda.mdx
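
Note for reviewers: the worst-case figure quoted under Path 1 of the guide (`8 × (15 min × 10 GB × $0.0000167/GB-s) ≈ $1.20`) can be re-derived with a small sketch like the one below. The function name and parameters are hypothetical and are not part of the CLI or the `@hyperframes/aws-lambda` SDK. The rate is the one quoted in the guide, and the sketch assumes every concurrent invocation runs for the full 15 minutes.

```ts
// Hypothetical helper for checking the guide's arithmetic; not a CLI or SDK API.
// Assumption: each concurrent invocation runs for the full timeout.
export function worstCaseSpendUsd(
  concurrency: number,
  timeoutMinutes: number,
  memoryGb: number,
  usdPerGbSecond: number,
): number {
  // GB-seconds billed for one invocation that runs until the timeout.
  const gbSecondsPerInvocation = timeoutMinutes * 60 * memoryGb;
  return concurrency * gbSecondsPerInvocation * usdPerGbSecond;
}

// Figures quoted in the guide: 8 workers, 15 min, 10 GB, $0.0000167 per GB-second.
const capUsd = worstCaseSpendUsd(8, 15, 10, 0.0000167);
console.log(capUsd.toFixed(2)); // prints "1.20"
```

Doubling `concurrency` doubles the cap, which is why the guide suggests raising it only after sizing a typical render's chunk count.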

diff --git a/docs/deploy/aws-lambda.mdx b/docs/deploy/aws-lambda.mdx
new file mode 100644
index 000000000..59b226931
--- /dev/null
+++ b/docs/deploy/aws-lambda.mdx
@@ -0,0 +1,190 @@
+---
+title: AWS Lambda
+description: "Deploy distributed HyperFrames rendering to AWS Lambda and drive renders from a laptop or CI."
+---
+
+HyperFrames ships a first-class AWS Lambda deployment: a Step Functions standard workflow fans renders out across many parallel chunk workers, all served by one Lambda function, with intermediate artifacts in S3. Once your AWS credentials are configured, the end-to-end flow is three commands.
+
+```bash
+hyperframes lambda deploy
+hyperframes lambda render ./my-project --width 1920 --height 1080 --wait
+hyperframes lambda destroy
+```
+
+## Architecture
+
+```
+┌──────────────────────────────────────────────────────────────────┐
+│ Step Functions state machine                                     │
+│   Plan → Map(N) RenderChunk → Assemble                           │
+└──────────────────────────────────────────────────────────────────┘
+                              │ dispatches by event.Action
+                              ▼
+┌──────────────────────────────────────────────────────────────────┐
+│ One Lambda function (packages/aws-lambda/dist/handler.zip)       │
+│   handler.mjs                                                    │
+│     ├─ Action="plan"        → @hyperframes/producer/distributed  │
+│     ├─ Action="renderChunk" → @hyperframes/producer/distributed  │
+│     └─ Action="assemble"    → @hyperframes/producer/distributed  │
+│   bin/ffmpeg                — ffmpeg-static                      │
+│   node_modules/@sparticuz/chromium/ — Lambda-optimised Chromium  │
+└──────────────────────────────────────────────────────────────────┘
+                              │ pure functions over local paths
+                              ▼
+┌──────────────────────────────────────────────────────────────────┐
+│ S3 bucket — plan tarball + per-chunk outputs + final mp4         │
+└──────────────────────────────────────────────────────────────────┘
+```
+
+The Lambda handler is a thin dispatch: parse the Step Functions event, download inputs from S3 into `/tmp`, call the OSS primitive from `@hyperframes/producer/distributed`, upload outputs back, return a small JSON result. Everything heavy — capture, encode, audio mix — happens inside the OSS primitives.
+
+## Prerequisites
+
+| Tool | Why | Install |
+|------|-----|---------|
+| AWS credentials | The CLI and the deploy step both call AWS APIs. | Env vars, `~/.aws/credentials`, SSO, or IMDS: any source the default AWS credential provider chain resolves. |
+| AWS SAM CLI | `hyperframes lambda deploy/destroy` shells out to `sam deploy`/`sam delete`. | [Install guide](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html) |
+| `bun` | Used to build `packages/aws-lambda/dist/handler.zip` at deploy time. | `npm install -g bun` or [bun.sh](https://bun.sh) |
+| HyperFrames repo checkout | `lambda deploy` builds the Lambda handler ZIP from source. Adopters who deploy outside a checkout can set `HYPERFRAMES_REPO_ROOT` to point at one. | `git clone https://github.com/heygen-com/hyperframes` |
+
+## Three deployment paths
+
+### Path 1 — `hyperframes lambda` CLI (recommended)
+
+The CLI is a thin wrapper around the SAM template + the `@hyperframes/aws-lambda` SDK. For most adopters this is the right starting point.
+
+```bash
+hyperframes lambda deploy \
+  --stack-name=hyperframes-prod \
+  --region=us-east-1 \
+  --concurrency=8 \
+  --memory=10240
+```
+
+The default `--concurrency=8` is deliberately conservative for first-time users. Left at its own default, the Step Functions Map state would let an unbounded number of chunks fan out in parallel; 8 caps your worst-case spend on a runaway render at roughly `8 × (15 min × 10 GB × $0.0000167/GB-s) ≈ $1.20`. Raise it after you've sized your typical render's chunk count.
+
+After `deploy`, render anything with:
+
+```bash
+hyperframes lambda render ./my-project --width 1920 --height 1080 --wait
+```
+
+The `--wait` flag blocks and streams per-chunk progress + accrued cost; drop it to fire-and-forget, then poll with `hyperframes lambda progress <renderId>` on your own cadence.
+
+See the [CLI reference](/packages/cli#hyperframes-lambda) for full flag documentation.
+
+### Path 2 — Direct SAM deploy
+
+If you want to read the CloudFormation before you deploy, or you need to customise the topology (extra alarms, SNS subscribers, KMS keys, …), invoke SAM directly against the template at `examples/aws-lambda/template.yaml`:
+
+```bash
+cd packages/aws-lambda
+bun run build:zip                     # produces dist/handler.zip
+cd ../../examples/aws-lambda
+sam deploy \
+  --stack-name=hyperframes-prod \
+  --region=us-east-1 \
+  --resolve-s3 \
+  --capabilities CAPABILITY_IAM \
+  --no-confirm-changeset \
+  --parameter-overrides ChromeSource=sparticuz ReservedConcurrency=8
+```
+
+The template emits three CloudFormation outputs you'll need to invoke renders:
+
+- `RenderBucketName` — S3 bucket for plan tarballs + per-chunk outputs + final renders.
+- `RenderStateMachineArn` — the Step Functions standard workflow that orchestrates Plan → Map → Assemble.
+- `RenderFunctionArn` — the single Lambda function the state machine dispatches against.
+
+### Path 3 — CDK construct
+
+For users already running CDK, the `@hyperframes/aws-lambda` package exports a `HyperframesRenderStack` L2 construct that emits the same topology as the SAM template:
+
+```ts
+import { App, CfnOutput, Stack } from "aws-cdk-lib";
+import { HyperframesRenderStack } from "@hyperframes/aws-lambda/cdk";
+
+const app = new App();
+const stack = new Stack(app, "MyApp");
+const render = new HyperframesRenderStack(stack, "Render", {
+  projectName: "hyperframes",
+  lambdaMemoryMb: 10240,
+  reservedConcurrency: 8,
+  chromeSource: "sparticuz",
+});
+
+new CfnOutput(stack, "RenderBucketName", { value: render.bucket.bucketName });
+new CfnOutput(stack, "StateMachineArn", { value: render.stateMachine.stateMachineArn });
+```
+
+`aws-cdk-lib` and `constructs` are declared as **optional peer dependencies** of `@hyperframes/aws-lambda`, so consumers who only need the SDK don't pay the CDK import cost.
+
+The construct exposes `.bucket`, `.renderFunction`, and `.stateMachine` so you can wire dashboards, SNS topics, or other AWS resources alongside it without re-deriving ARNs.
+
+## IAM permissions
+
+The CLI ships a built-in IAM bootstrap to avoid the "User is not authorized to perform iam:CreateRole" first-deploy trap:
+
+```bash
+# Print an inline policy doc to attach to the IAM user that runs the CLI.
+hyperframes lambda policies user
+
+# Print { TrustRelationship, InlinePolicy } for a CloudFormation service role.
+hyperframes lambda policies role --principal=cloudformation
+
+# Validate a checked-in policy still covers the CLI's needs (exit non-zero on missing).
+hyperframes lambda policies validate ./infra/iam/hyperframes-deploy.json
+```
+
+The generated documents grant `Resource: "*"` for the CLI's required action set. After your first successful deploy you can narrow `Resource` to the deployed ARNs, which you can read from the stack's CloudFormation outputs (listed under Path 2 above). Adopters running the CLI in CI typically check the policy doc into source control and run `policies validate` as a pre-deploy step to catch drift.
+
+## Cost shape
+
+Lambda renders are billed by GB-seconds (Lambda billed duration × configured memory) plus a tiny per-state-transition fee for Step Functions standard workflows. `hyperframes lambda progress` exposes the running tally:
+
+```bash
+hyperframes lambda progress my-render-id
+# Status:    SUCCEEDED
+# Progress:  100%
+# Frames:    480 / 480
+# Lambdas:   5
+# Cost:      $0.0214 (Lambda $0.0210 + SFN $0.0004)
+# Output:    s3://hyperframes-renders/.../output.mp4
+```
+
+The cost number is best-effort: Lambda billed duration comes from the handler's own `DurationMs` return value (which SFN history surfaces in the success payload), and S3 transfer is not included. The math is in `packages/aws-lambda/src/sdk/costAccounting.ts` if you want to verify it; for the Lambda and Step Functions line items, CLI-shown values match what AWS Billing reports within rounding noise.
+
+## Troubleshooting
+
+### `sam deploy` fails with "Stack already exists"
+
+Pass the same `--stack-name` you used the first time. SAM is idempotent — re-running on an existing stack resolves to a no-op or an in-place update.
+
+### `User is not authorized to perform iam:CreateRole`
+
+The IAM credential running `lambda deploy` doesn't have permission to create the service role CloudFormation needs. Run `hyperframes lambda policies user` and attach the printed policy to your IAM user (or take the `policies role` output and have your admin create a deploy role).
+
+### `Lambda function failed: PLAN_HASH_MISMATCH`
+
+Step Functions invoked a `renderChunk` with a plan hash that didn't match the planDir on S3. Almost always means the producer version differs between the local `plan()` build and the deployed Lambda ZIP. Re-run `hyperframes lambda deploy` (which rebuilds the ZIP) and re-render.
+
+### `Lambda function failed: BROWSER_GPU_NOT_SOFTWARE`
+
+The compiled composition reads `data-gpu-mode="hardware"` (or similar). Distributed renders require `gpu-mode="software"` — hardware GL is non-deterministic across chunk boundaries. Change the composition's `data-gpu-mode` or omit it (the default is software).
+
+### Render seems stuck at `RUNNING`
+
+Most often a Lambda cold-start chain on a many-chunk render. The Lambda function's reserved concurrency caps how many chunks can run in parallel: if you set `--concurrency=4` and your render has 16 chunks, the state machine processes them in batches of 4. `hyperframes lambda progress <id>` shows how many invocations are in flight.
+
+If progress doesn't advance for more than 10 minutes, check the Step Functions execution in the AWS console. A failed Lambda invocation short-circuits the state machine, and its error includes the typed error name (`FONT_FETCH_FAILED`, `FFMPEG_VERSION_MISMATCH`, etc.).
+
+### Tearing down doesn't reclaim S3 storage
+
+The render bucket is created with CloudFormation `Retain` on delete, so `hyperframes lambda destroy` (or `sam delete`) tears down the function and state machine but leaves the bucket in place. This is intentional: it protects final-rendered MP4s from being lost when you re-deploy. To fully reclaim storage, empty and delete the bucket via the AWS console or `aws s3 rb s3://<bucket> --force` (`--force` empties the bucket before removing it).
+
+## What's NOT in the v1 surface
+
+- **Webhooks on completion.** Not in v1 — poll with `hyperframes lambda progress` or watch the Step Functions execution. A `--webhook` flag with an SNS topic is on the Phase 6c backlog.
+- **`compositions` discovery verb.** Coming separately (PR 6.10 on the plan); for now, point `lambda render` at the project directory containing your `index.html`.
+- **Multi-region.** Each `--region` is an independent stack. There is no built-in cross-region failover.
+- **HDR.** Distributed mode is SDR-only. HDR mp4 with bsf signaling is on the v1.5 backlog.
diff --git a/docs/docs.json b/docs/docs.json
index f85035cec..cb5a19452 100644
--- a/docs/docs.json
+++ b/docs/docs.json
@@ -206,6 +206,12 @@
               "packages/studio",
               "packages/cli"
             ]
+          },
+          {
+            "group": "Deploy",
+            "pages": [
+              "deploy/aws-lambda"
+            ]
           }
         ]
       },

From 4723c7d537cea548747e6f711299816a3075a065 Mon Sep 17 00:00:00 2001
From: James <james.russo@heygen.com>
Date: Sun, 17 May 2026 00:51:35 +0000
Subject: [PATCH 2/2] docs(lambda): address PR review on AWS Lambda deployment
 guide
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

One blocker + two important items from Vai's review:

  - The BROWSER_GPU_NOT_SOFTWARE troubleshooting entry pointed
    adopters at a non-existent `data-gpu-mode` composition attribute.
    Replaced with the actual root cause (Chrome launch flags +
    @sparticuz/chromium libs in the handler ZIP) and the actual
    remediation: rebuild + redeploy via `lambda deploy` (which
    always rebuilds the ZIP). The composition-attribute story
    would have sent users editing the wrong file entirely.

  - Added a `sites create` subsection under Path 1 so adopters
    running tight inner loops know how to reuse a project upload
    across many renders instead of re-tarring + re-uploading on
    each call. The CLI surface was first-class but the doc had
    been silent.

  - Added a Warning callout under Path 2 explaining that the SAM
    template's own ReservedConcurrency default is `-1` (unreserved)
    — a reader simplifying the Path 2 example by dropping the
    --parameter-overrides flag would silently switch to unreserved
    concurrency and pay the runaway-Map cost. The warning mirrors
    the cost-shape callout earlier in the page.
---
 docs/deploy/aws-lambda.mdx | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)
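
Note for reviewers: the new `sites create` subsection describes the site ID as content-addressed via a SHA-256 of the project tree. The sketch below illustrates that idea only. It assumes files are hashed in sorted path order with their lengths; the real CLI's hashing scheme, ID length, and encoding may differ. `ProjectFile` and `contentAddress` are hypothetical names.

```ts
import { createHash } from "node:crypto";

// Hypothetical in-memory view of a project tree; not a hyperframes type.
interface ProjectFile {
  path: string; // POSIX-style path relative to the project root
  bytes: Uint8Array; // raw file contents
}

// Assumed scheme: SHA-256 over (path, length, bytes) for every file, in sorted
// path order, so the same tree always yields the same ID.
export function contentAddress(files: ProjectFile[]): string {
  const sorted = [...files].sort((a, b) =>
    a.path < b.path ? -1 : a.path > b.path ? 1 : 0,
  );
  const hash = createHash("sha256");
  for (const file of sorted) {
    hash.update(file.path);
    hash.update("\0"); // separator so path and length cannot run together
    hash.update(String(file.bytes.length));
    hash.update("\0");
    hash.update(file.bytes);
  }
  return hash.digest("hex");
}
```

Because the ID depends only on paths and bytes, an unchanged tree maps to the same key, which is what lets a `HeadObject` check skip the upload.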

diff --git a/docs/deploy/aws-lambda.mdx b/docs/deploy/aws-lambda.mdx
index 59b226931..5b232b7e2 100644
--- a/docs/deploy/aws-lambda.mdx
+++ b/docs/deploy/aws-lambda.mdx
@@ -73,6 +73,20 @@ The `--wait` flag blocks and streams per-chunk progress + accrued cost; drop it
 
 See the [CLI reference](/packages/cli#hyperframes-lambda) for full flag documentation.
 
+#### Pre-staging a project with `sites create`
+
+Every `lambda render` call re-tars and re-uploads the project tree, even when nothing in it has changed. For tight inner loops (CI smoke jobs, prompt iteration in a demo flow), pre-stage the project once and reuse the upload:
+
+```bash
+hyperframes lambda sites create ./my-project
+# → Site ID: a1b2c3d4e5f60718 (content-addressed)
+
+hyperframes lambda render ./my-project --site-id=a1b2c3d4e5f60718 \
+  --width 1920 --height 1080 --wait
+```
+
+The `siteId` is content-addressed via a SHA-256 of the project tree; re-running `sites create` on an unchanged tree skips the upload via a `HeadObject` short-circuit. Pass the same `--site-id` to as many `lambda render` calls as you like, and they all reuse that single upload.
+
 ### Path 2 — Direct SAM deploy
 
 If you want to read the CloudFormation before you deploy, or you need to customise the topology (extra alarms, SNS subscribers, KMS keys, …), invoke SAM directly against the template at `examples/aws-lambda/template.yaml`:
@@ -96,6 +110,10 @@ The template emits three CloudFormation outputs you'll need to invoke renders:
 - `RenderStateMachineArn` — the Step Functions standard workflow that orchestrates Plan → Map → Assemble.
 - `RenderFunctionArn` — the single Lambda function the state machine dispatches against.
 
+<Warning>
+The SAM template's own default for `ReservedConcurrency` is `-1` (unreserved, account-default). The Path 1 CLI overrides it to `8` to keep first-time spend bounded; if you drop `ReservedConcurrency` from `--parameter-overrides` here, you get the unreserved default. Set it explicitly unless you've already sized your typical render's fan-out.
+</Warning>
+
 ### Path 3 — CDK construct
 
 For users already running CDK, the `@hyperframes/aws-lambda` package exports a `HyperframesRenderStack` L2 construct that emits the same topology as the SAM template:
@@ -170,7 +188,14 @@ Step Functions invoked a `renderChunk` with a plan hash that didn't match the pl
 
 ### `Lambda function failed: BROWSER_GPU_NOT_SOFTWARE`
 
-The compiled composition reads `data-gpu-mode="hardware"` (or similar). Distributed renders require `gpu-mode="software"` — hardware GL is non-deterministic across chunk boundaries. Change the composition's `data-gpu-mode` or omit it (the default is software).
+The handler launched Chromium but the runtime probe found a non-SwiftShader GL backend. Hardware GL is non-deterministic across chunk boundaries, so distributed renders refuse it at the runtime-image / launch-flags layer (not at the composition layer). Rebuild the handler ZIP and redeploy:
+
+```bash
+bun run --cwd packages/aws-lambda build:zip
+hyperframes lambda deploy --stack-name=<your-stack>
+```
+
+The build pipeline pins `@sparticuz/chromium` + the Chrome flags (`--use-gl=swiftshader --use-angle=swiftshader`) so a fresh deploy almost always resolves this. If it persists, your stack's Lambda function is pointing at a stale handler ZIP from a previous deploy — `lambda deploy` always rebuilds, so re-running unsticks it.
 
 ### Render seems stuck at `RUNNING`