From bc77d2f8151fabdb55ed12ebc2440ff1acf7ac19 Mon Sep 17 00:00:00 2001 From: James Date: Fri, 15 May 2026 23:09:10 +0000 Subject: [PATCH 1/3] feat(lambda): add SAM template and sample events for AWS deployment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Phase 6.2 of the distributed rendering plan (DISTRIBUTED-RENDERING-PLAN.md §15). Reference SAM template for deploying HyperFrames distributed rendering on AWS — one Lambda function in three roles, choreographed by a Step Functions standard workflow with a Map state for parallel chunk rendering. Resources created by the template: - Lambda function pointing at the Phase 6.1 ZIP - Step Functions state machine: Plan -> Map(N) RenderChunk -> Assemble - S3 bucket for plan tarballs, chunk outputs, final mp4 - IAM role for the state machine - CloudWatch alarm guarding against runaway chunk invocations Retry policy: 4 attempts, 2s initial, 2x backoff, max 60s, with the typed non-retryable error codes from plan §9.3 explicitly opted out. CodeUri points at packages/aws-lambda/dist/handler.zip; sam deploy resolves the local path and uploads to a SAM-managed bucket on first deploy. Validated: sam validate --lint passes against the template. This is part of the 8-PR Phase 6 stack; PR 6.2 of 8. 
--- .gitignore | 5 +- examples/aws-lambda/.gitignore | 3 + examples/aws-lambda/README.md | 170 ++++++++ .../aws-lambda/sample-events/assemble.json | 13 + examples/aws-lambda/sample-events/plan.json | 14 + .../sample-events/render-chunk.json | 8 + examples/aws-lambda/template.yaml | 393 ++++++++++++++++++ 7 files changed, 605 insertions(+), 1 deletion(-) create mode 100644 examples/aws-lambda/.gitignore create mode 100644 examples/aws-lambda/README.md create mode 100644 examples/aws-lambda/sample-events/assemble.json create mode 100644 examples/aws-lambda/sample-events/plan.json create mode 100644 examples/aws-lambda/sample-events/render-chunk.json create mode 100644 examples/aws-lambda/template.yaml diff --git a/.gitignore b/.gitignore index 76b9858be..93f34ba23 100644 --- a/.gitignore +++ b/.gitignore @@ -67,7 +67,10 @@ packages/producer/src/services/fontData.generated.ts # Local proof / test artifacts qa-artifacts/ my-video/ -examples/ +examples/* +# Tracked OSS examples — negations override the blanket `examples/*` ignore. +!examples/aws-lambda +!examples/aws-lambda/** packages/studio/data/ .desloppify/ .worktrees/ diff --git a/examples/aws-lambda/.gitignore b/examples/aws-lambda/.gitignore new file mode 100644 index 000000000..912214835 --- /dev/null +++ b/examples/aws-lambda/.gitignore @@ -0,0 +1,3 @@ +# SAM CLI state — written by `sam deploy --guided`, contains user choices. +samconfig.toml +.aws-sam/ diff --git a/examples/aws-lambda/README.md b/examples/aws-lambda/README.md new file mode 100644 index 000000000..fa6ebf701 --- /dev/null +++ b/examples/aws-lambda/README.md @@ -0,0 +1,170 @@ +# AWS Lambda + Step Functions deployment + +Reference SAM template for deploying HyperFrames distributed rendering on +AWS. One Lambda function, three roles (Plan / RenderChunk / Assemble), +choreographed by a Step Functions standard workflow with a Map state for +parallel chunk rendering. 
+ +See [`packages/aws-lambda/README.md`](../../packages/aws-lambda/README.md) +for the Lambda handler architecture and +[`DISTRIBUTED-RENDERING-PLAN.md`](../../DISTRIBUTED-RENDERING-PLAN.md#15-aws-lambda-turnkey-deployment) +§15 for the design context. + +## Prerequisites + +- AWS account with IAM permissions to deploy CloudFormation stacks + containing Lambda, Step Functions, S3, IAM, and CloudWatch resources. +- [`sam` CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html) + installed (≥ 1.100). +- [`bun`](https://bun.sh) installed (≥ 1.3) to build the handler ZIP. + +## One-shot deploy + +```bash +# 1. Build the handler ZIP that `template.yaml`'s CodeUri points at. +bun install # at repo root +bun run --cwd packages/aws-lambda build:zip + +# 2. Deploy. First time: `--guided` to set stack name + region. +cd examples/aws-lambda +sam deploy --guided --resolve-s3 +``` + +`--resolve-s3` lets SAM pick (or create) a per-account bucket to host the +uploaded ZIP. After the first deploy, subsequent updates can omit +`--guided` and `--resolve-s3` — SAM remembers your choices in +`samconfig.toml`. + +## What gets created + +| Resource | Purpose | +| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | +| `Render Lambda` | Single function, handler `handler.handler`. Dispatches on `event.Action`. | +| `Render State Machine` | Step Functions standard workflow. Plan → Map(N) RenderChunk → Assemble. | +| `Render Bucket` | S3 bucket for plan tarballs, chunk outputs, and final mp4. `renders/` prefix expires after 7 days. | +| IAM role for the state machine | Invokes the Lambda; writes CloudWatch logs; X-Ray traces. | +| IAM role for the Lambda (managed by SAM) | S3 CRUD on the render bucket; CloudWatch logs. | +| Runaway-invocation alarm | Fires if RenderChunk runs more than `ChunkInvocationAlarmThreshold` times in an hour. 
| + +## Running a render + +Upload your project as a `.tar.gz` archive to the render bucket, then start a Step +Functions execution: + +```bash +STACK_NAME=hyperframes-render # whatever you picked at deploy +RENDER_BUCKET=$(aws cloudformation describe-stacks \ + --stack-name "$STACK_NAME" \ + --query 'Stacks[0].Outputs[?OutputKey==`RenderBucketName`].OutputValue' \ + --output text) +STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \ + --stack-name "$STACK_NAME" \ + --query 'Stacks[0].Outputs[?OutputKey==`RenderStateMachineArn`].OutputValue' \ + --output text) + +# Tar + upload the project directory. The handler uses `tar` (not +# `unzip`, which Lambda's base image doesn't ship), so the on-the-wire +# archive format is `.tar.gz`. +tar -czf my-project.tar.gz -C ./my-project . +aws s3 cp my-project.tar.gz "s3://${RENDER_BUCKET}/projects/my-project.tar.gz" + +# Start the execution. The input JSON tells the state machine where to +# read inputs and write outputs. +aws stepfunctions start-execution \ + --state-machine-arn "$STATE_MACHINE_ARN" \ + --input "$(cat <- + HyperFrames distributed rendering — Step Functions standard workflow with + one Lambda function handling Plan, RenderChunk (fan-out via Map state), + and Assemble. One S3 bucket, one CloudWatch alarm for runaway concurrency. + + Built from the Phase 6.1 ZIP at packages/aws-lambda/dist/handler.zip. + + See: + - DISTRIBUTED-RENDERING-PLAN.md §15 (Lambda turnkey deployment) + - packages/aws-lambda/README.md (handler architecture) + - examples/aws-lambda/README.md (this directory's deploy guide) + +Parameters: + ProjectName: + Type: String + Default: hyperframes + Description: Name prefix applied to all created resources. + + LambdaMemoryMb: + Type: Number + Default: 10240 + AllowedValues: [2048, 3072, 4096, 5120, 6144, 7168, 8192, 9216, 10240] + Description: >- + Lambda memory in MB. Render workloads are CPU-bound; bumping memory + proportionally bumps the CPU share Lambda gives the function. 
10 GB + (the max) is recommended for 1080p renders. + + LambdaTimeoutSec: + Type: Number + Default: 900 + MinValue: 60 + MaxValue: 900 + Description: >- + Per-invocation Lambda timeout. Render chunks at the default + chunkSize=240 frames complete in seconds; 15 minutes is the Lambda + hard ceiling and the default here to absorb cold-start variance. + + ReservedConcurrency: + Type: Number + Default: -1 + Description: >- + Lambda reserved concurrency cap. Set to a positive integer to bound + simultaneous chunk renders (e.g. 50 to limit cost). -1 means + unreserved (account default). + + ChromeSource: + Type: String + Default: sparticuz + AllowedValues: [sparticuz, chrome-headless-shell] + Description: >- + Which Chrome runtime the bundled ZIP was built with. Must match the + `--source=` flag passed to `build-zip.ts`. The handler reads this + via the HYPERFRAMES_LAMBDA_CHROME_SOURCE env var at boot. + + ChunkInvocationAlarmThreshold: + Type: Number + Default: 1000 + Description: >- + CloudWatch alarm threshold for total RenderChunk invocations per + hour. The runaway-Map state pathology would fan out far more + chunks than expected; an alarm at 10× the typical workload + protects against billing surprises. + +Conditions: + HasReservedConcurrency: !Not [!Equals [!Ref ReservedConcurrency, -1]] + +Globals: + Function: + Runtime: nodejs22.x + MemorySize: !Ref LambdaMemoryMb + Timeout: !Ref LambdaTimeoutSec + Architectures: [x86_64] + Environment: + Variables: + NODE_OPTIONS: "--enable-source-maps" + HYPERFRAMES_LAMBDA_CHROME_SOURCE: !Ref ChromeSource + +Resources: + # ── S3 bucket for plan tarballs, chunk outputs, and final renders ─────── + RenderBucket: + Type: AWS::S3::Bucket + DeletionPolicy: Retain + UpdateReplacePolicy: Retain + Properties: + # BucketName omitted — CloudFormation generates a unique name like + # "-renderbucket-". 
S3 bucket names are capped at + # 63 chars; a static !Sub expression including ProjectName + + # AWS::AccountId + AWS::Region trips that limit when ProjectName + # carries a timestamp (e.g. the smoke script's per-run stack name). + PublicAccessBlockConfiguration: + BlockPublicAcls: true + BlockPublicPolicy: true + IgnorePublicAcls: true + RestrictPublicBuckets: true + VersioningConfiguration: + Status: Suspended + LifecycleConfiguration: + Rules: + - Id: ExpireIntermediates + Status: Enabled + Prefix: renders/ + # Plan tarballs and chunk outputs are intermediate artifacts. + # Users keep the final mp4 (different key prefix); the rest + # can age out after a week to keep storage costs flat. + ExpirationInDays: 7 + + # ── Single Lambda function handling all three roles ────────────────────── + RenderFunction: + Type: AWS::Serverless::Function + Properties: + FunctionName: !Sub "${ProjectName}-render" + Description: >- + HyperFrames distributed render handler. Dispatches on event.Action. + Handler: handler.handler + # Local path is resolved by `sam build` + `sam deploy --resolve-s3` + # (or `--s3-bucket`); the resulting CodeUri rewrites to s3://. + CodeUri: ../../packages/aws-lambda/dist/handler.zip + PackageType: Zip + ReservedConcurrentExecutions: !If + - HasReservedConcurrency + - !Ref ReservedConcurrency + - !Ref AWS::NoValue + EphemeralStorage: + Size: 10240 + Environment: + Variables: + # Lambda's Node 22 runtime sets these by default; explicit for + # clarity + so users can override during local SAM invoke. + TMPDIR: /tmp + Policies: + - S3CrudPolicy: + BucketName: !Ref RenderBucket + # CloudWatch Logs perms are covered by SAM's default + # AWSLambdaBasicExecutionRole — explicit `CloudWatchLogsFullAccess` + # would be overscope (`logs:*` on `*`, including DeleteLogGroup + + # CreateExportTask). Reference templates shouldn't leak overbroad + # IAM into adopters' accounts. 
+ + # ── CloudWatch log group for the state machine ────────────────────────── + # SAM doesn't auto-create one when `LoggingConfiguration` is set, so we + # define it explicitly — that way the IAM grant on the state-machine + # role has a destination to write to. + RenderStateMachineLogGroup: + Type: AWS::Logs::LogGroup + Properties: + LogGroupName: !Sub "/aws/states/${ProjectName}-render" + RetentionInDays: 30 + + # ── Step Functions state machine: Plan → Map(N) RenderChunk → Assemble ── + RenderStateMachine: + Type: AWS::Serverless::StateMachine + Properties: + Name: !Sub "${ProjectName}-render" + Type: STANDARD + Tracing: + Enabled: true + Logging: + # Without this, the `WriteCloudwatchLogs` grant on the state + # machine role would be unused — operators would see zero + # execution history outside the Step Functions console. + # `Level: ERROR` keeps log volume low; bump to `ALL` for + # heavy debugging. + Level: ERROR + IncludeExecutionData: false + Destinations: + - CloudWatchLogsLogGroup: + LogGroupArn: !GetAtt RenderStateMachineLogGroup.Arn + Definition: + Comment: >- + HyperFrames distributed render orchestration. See + DISTRIBUTED-RENDERING-PLAN.md §2.5 for the architecture. + StartAt: Plan + States: + Plan: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: plan + ProjectS3Uri.$: "$.ProjectS3Uri" + PlanOutputS3Prefix.$: "$.PlanOutputS3Prefix" + Config.$: "$.Config" + ResultSelector: + PlanS3Uri.$: "$.Payload.PlanS3Uri" + PlanHash.$: "$.Payload.PlanHash" + ChunkCount.$: "$.Payload.ChunkCount" + Format.$: "$.Payload.Format" + HasAudio.$: "$.Payload.HasAudio" + AudioS3Uri.$: "$.Payload.AudioS3Uri" + ResultPath: $.Plan + Retry: + - ErrorEquals: + # Per §9.3, these are non-retryable plan-time failures. 
+ - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - BROWSER_GPU_NOT_SOFTWARE + - FONT_FETCH_FAILED + - PLAN_TOO_LARGE + - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + Next: BuildChunkList + + BuildChunkList: + # Translate ChunkCount into an array `[0, 1, ..., N-1]` so the + # Map state below has something to iterate. `States.ArrayRange` is the + # idiomatic Step Functions intrinsic for this; no Lambda call + # required. + Type: Pass + Parameters: + ChunkIndexes.$: "States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1)" + ResultPath: $.Iterator + Next: RenderChunks + + RenderChunks: + Type: Map + ItemsPath: $.Iterator.ChunkIndexes + ItemSelector: + ChunkIndex.$: "$$.Map.Item.Value" + PlanS3Uri.$: "$.Plan.PlanS3Uri" + PlanHash.$: "$.Plan.PlanHash" + ChunkOutputS3Prefix.$: "$.PlanOutputS3Prefix" + Format.$: "$.Plan.Format" + # Map fan-out cap derives from the Plan's chunkCount so + # caller-supplied `Config.maxParallelChunks` (which + # `plan()` honours when sizing the chunk list) is the + # single source of truth. A hardcoded value here would + # silently throttle adopters who scale up the chunk count + # in their event payload. 
+ MaxConcurrencyPath: $.Plan.ChunkCount + ResultPath: $.Chunks + ItemProcessor: + ProcessorConfig: + Mode: INLINE + StartAt: RenderChunk + States: + RenderChunk: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: renderChunk + ChunkIndex.$: "$.ChunkIndex" + PlanS3Uri.$: "$.PlanS3Uri" + PlanHash.$: "$.PlanHash" + ChunkOutputS3Prefix.$: "$.ChunkOutputS3Prefix" + Format.$: "$.Format" + ResultSelector: + ChunkS3Uri.$: "$.Payload.ChunkS3Uri" + ChunkIndex.$: "$.Payload.ChunkIndex" + Sha256.$: "$.Payload.Sha256" + Retry: + - ErrorEquals: + - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - BROWSER_GPU_NOT_SOFTWARE + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + End: true + Next: Assemble + + Assemble: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: assemble + PlanS3Uri.$: "$.Plan.PlanS3Uri" + ChunkS3Uris.$: "$.Chunks[*].ChunkS3Uri" + AudioS3Uri.$: "$.Plan.AudioS3Uri" + OutputS3Uri.$: "$.OutputS3Uri" + Format.$: "$.Plan.Format" + ResultSelector: + OutputS3Uri.$: "$.Payload.OutputS3Uri" + FramesEncoded.$: "$.Payload.FramesEncoded" + FileSize.$: "$.Payload.FileSize" + ResultPath: $.Output + Retry: + - ErrorEquals: + # Per §9.3, these are non-retryable failures that + # surface at assemble time too — ffmpeg-driven concat + # picks up version drift, plan-hash re-verification + # at assemble, and format checks. Skip the retry + # storm; fail fast. 
+ - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + End: true + Role: !GetAtt RenderStateMachineRole.Arn + + RenderStateMachineRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: states.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: InvokeRenderFunction + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: lambda:InvokeFunction + Resource: !GetAtt RenderFunction.Arn + - PolicyName: WriteCloudwatchLogs + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - logs:CreateLogDelivery + - logs:GetLogDelivery + - logs:UpdateLogDelivery + - logs:DeleteLogDelivery + - logs:ListLogDeliveries + - logs:PutResourcePolicy + - logs:DescribeResourcePolicies + - logs:DescribeLogGroups + Resource: "*" + - PolicyName: XRayTracing + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - xray:PutTraceSegments + - xray:PutTelemetryRecords + Resource: "*" + + # ── CloudWatch alarm: runaway chunk invocations ───────────────────────── + RenderChunkInvocationAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-runaway-chunk-invocations" + AlarmDescription: >- + Fires if RenderChunk Lambda invocations exceed the configured + threshold in a 1-hour window. The Map state's MaxConcurrency cap + protects against simultaneous fan-out, but a runaway state + machine that triggers many sequential renders would still rack + up cost; this alarm catches that pattern. 
+ Namespace: AWS/Lambda + MetricName: Invocations + Dimensions: + - Name: FunctionName + Value: !Ref RenderFunction + Statistic: Sum + Period: 3600 + EvaluationPeriods: 1 + Threshold: !Ref ChunkInvocationAlarmThreshold + ComparisonOperator: GreaterThanThreshold + TreatMissingData: notBreaching + +Outputs: + RenderBucketName: + Description: S3 bucket for plan tarballs, chunk outputs, and final renders. + Value: !Ref RenderBucket + Export: + Name: !Sub "${AWS::StackName}-RenderBucket" + + RenderFunctionArn: + Description: ARN of the Lambda function. Pass to `aws lambda invoke` for local testing. + Value: !GetAtt RenderFunction.Arn + Export: + Name: !Sub "${AWS::StackName}-RenderFunctionArn" + + RenderStateMachineArn: + Description: ARN of the Step Functions state machine. Pass to `aws stepfunctions start-execution`. + Value: !Ref RenderStateMachine + Export: + Name: !Sub "${AWS::StackName}-RenderStateMachineArn" From 8a66eb3231ca28d57fb4c7729f3f595ae2c5ca45 Mon Sep 17 00:00:00 2001 From: James Date: Sat, 16 May 2026 17:28:29 +0000 Subject: [PATCH 2/3] fix(lambda): address PR 879 review feedback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add CloudWatch alarms for Lambda Errors metric (5min window, threshold 1) and Step Functions ExecutionsFailed metric. The existing runaway- invocations alarm catches too-many-calls but missed silent per-chunk failures and retry-exhaustion. - Document VersioningConfiguration: Suspended tradeoff inline. Adopters treating the final mp4 as user-keepable should bump to Enabled. - Cost-allocation Tags on RenderBucket + Lambda Globals. - Lambda Tracing: Active so X-Ray spans don't terminate at the SF→Lambda boundary (the state machine already had tracing). - State-machine top-level TimeoutSeconds: 3600 as defensive ceiling on the whole choreography — catches Plan-retry storms before they hit individual task budgets. 
- AssertChunkCount Choice state: if Plan ever returns ChunkCount=0 the Map would silently iterate zero times and Assemble would receive an empty ChunkS3Uris[] producing an empty output. Fail-fast with typed PLAN_TOO_LARGE error instead. - Architecture comment: explicit x86_64-only constraint from @sparticuz/chromium so adopters trying Graviton don't get bitten. --- examples/aws-lambda/template.yaml | 108 +++++++++++++++++++++++++++++- 1 file changed, 107 insertions(+), 1 deletion(-) diff --git a/examples/aws-lambda/template.yaml b/examples/aws-lambda/template.yaml index 1c1e5fa10..9838f88b0 100644 --- a/examples/aws-lambda/template.yaml +++ b/examples/aws-lambda/template.yaml @@ -71,7 +71,24 @@ Globals: Runtime: nodejs22.x MemorySize: !Ref LambdaMemoryMb Timeout: !Ref LambdaTimeoutSec + # x86_64 is required for @sparticuz/chromium — its prebuilt + # Chromium ships x86_64-only. Adopters who switch to a custom + # ARM-built chrome-headless-shell can change this to `arm64`, but + # the default ZIP build will fail to launch on Graviton. Architectures: [x86_64] + # Lambda function-level X-Ray tracing. The state machine already + # has Tracing.Enabled: true; without this, X-Ray traces would + # terminate at the Step Functions → Lambda boundary instead of + # following into per-function spans. + Tracing: Active + # Cost-allocation tags. Setting these at the Globals level applies + # to every AWS::Serverless::Function in the template — there's + # only one today, but the contract is portable to multi-function + # variants. Bucket + state-machine carry the same tags resource- + # locally because Globals only covers functions. + Tags: + Project: !Ref ProjectName + HyperFramesComponent: lambda-renderer Environment: Variables: NODE_OPTIONS: "--enable-source-maps" @@ -95,6 +112,13 @@ Resources: IgnorePublicAcls: true RestrictPublicBuckets: true VersioningConfiguration: + # `Suspended` keeps storage costs flat — versions are not + # retained on overwrites. 
Tradeoff: if an adopter writes their + # final rendered mp4 to this bucket and a re-render overwrites + # the same key, the prior version is gone. Adopters who treat + # the final mp4 as user-keepable should set this to `Enabled` + # (intermediates under `renders/` still expire via the + # lifecycle rule below regardless). Status: Suspended LifecycleConfiguration: Rules: @@ -105,6 +129,11 @@ Resources: # Users keep the final mp4 (different key prefix); the rest # can age out after a week to keep storage costs flat. ExpirationInDays: 7 + Tags: + - Key: Project + Value: !Ref ProjectName + - Key: HyperFramesComponent + Value: lambda-renderer # ── Single Lambda function handling all three roles ────────────────────── RenderFunction: @@ -171,6 +200,13 @@ Resources: Comment: >- HyperFrames distributed render orchestration. See DISTRIBUTED-RENDERING-PLAN.md §2.5 for the architecture. + # Defensive 1-hour ceiling on the whole choreography. The + # individual states already have retries + per-task timeouts; + # this catches pathological runaways (Plan-retry storm, + # stuck-state-machine bugs) at the top before per-task budgets + # compound into a multi-hour execution. The longest legitimate + # render observed in PR 880's eval was ~3 minutes. + TimeoutSeconds: 3600 StartAt: Plan States: Plan: @@ -217,7 +253,25 @@ Resources: Parameters: ChunkIndexes.$: "States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1)" ResultPath: $.Iterator - Next: RenderChunks + Next: AssertChunkCount + + AssertChunkCount: + # Defensive gate: `resolveChunkPlan` guarantees ChunkCount ≥ 1, + # but if some future regression let a zero-chunk plan through, + # `RenderChunks` (Map state) would iterate zero times and + # `Assemble` would receive an empty `ChunkS3Uris` array — silently + # producing an empty output. Fail fast instead. 
+ Type: Choice + Choices: + - Variable: $.Plan.ChunkCount + NumericGreaterThan: 0 + Next: RenderChunks + Default: PlanProducedZeroChunks + + PlanProducedZeroChunks: + Type: Fail + Error: PLAN_TOO_LARGE + Cause: Plan returned ChunkCount=0 — non-retryable producer-side invariant violation. RenderChunks: Type: Map @@ -373,6 +427,58 @@ Resources: ComparisonOperator: GreaterThanThreshold TreatMissingData: notBreaching + # ── CloudWatch alarm: Lambda function errors ──────────────────────────── + # Fires on any non-zero error rate. The invocation alarm above catches + # *too many calls*; this catches *calls that failed*. Without it, + # silent per-chunk failures (a non-retryable error inside the + # producer) would only surface by reading Step Functions execution + # history. + RenderFunctionErrorsAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-render-function-errors" + AlarmDescription: >- + Fires if the render Lambda reports any errors in a 5-minute + window. Set EvaluationPeriods=1 so a single failure pages. + Namespace: AWS/Lambda + MetricName: Errors + Dimensions: + - Name: FunctionName + Value: !Ref RenderFunction + Statistic: Sum + Period: 300 + EvaluationPeriods: 1 + Threshold: 1 + ComparisonOperator: GreaterThanOrEqualToThreshold + TreatMissingData: notBreaching + + # ── CloudWatch alarm: Step Functions execution failures ───────────────── + # Fires when a state-machine execution reaches a terminal failure + # state (typed non-retryable, retry-exhausted, or top-level timeout). + # Complementary to the Lambda Errors alarm: SFN failures include + # Choice-state Fail branches (PlanProducedZeroChunks) that bypass + # Lambda entirely, plus retry-exhaustion of transient errors that + # individual Lambda invocations counted as successful "retries". 
+ RenderStateMachineFailedAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-render-state-machine-failed" + AlarmDescription: >- + Fires when the render state machine reports a failed + execution. Catches retry-exhaustion + typed non-retryable + + TimeoutSeconds cases that the Lambda Errors metric misses. + Namespace: AWS/States + MetricName: ExecutionsFailed + Dimensions: + - Name: StateMachineArn + Value: !Ref RenderStateMachine + Statistic: Sum + Period: 300 + EvaluationPeriods: 1 + Threshold: 1 + ComparisonOperator: GreaterThanOrEqualToThreshold + TreatMissingData: notBreaching + Outputs: RenderBucketName: Description: S3 bucket for plan tarballs, chunk outputs, and final renders. From c3d486c33aa813b3d4e251f928ff4484889d0ca4 Mon Sep 17 00:00:00 2001 From: James Date: Sat, 16 May 2026 19:08:52 +0000 Subject: [PATCH 3/3] docs(lambda): drop internal plan-doc refs from SAM example + template --- examples/aws-lambda/README.md | 18 ++++++++---------- examples/aws-lambda/template.yaml | 26 +++++++++++++++----------- 2 files changed, 23 insertions(+), 21 deletions(-) diff --git a/examples/aws-lambda/README.md b/examples/aws-lambda/README.md index fa6ebf701..bc680b9fd 100644 --- a/examples/aws-lambda/README.md +++ b/examples/aws-lambda/README.md @@ -6,9 +6,7 @@ choreographed by a Step Functions standard workflow with a Map state for parallel chunk rendering. See [`packages/aws-lambda/README.md`](../../packages/aws-lambda/README.md) -for the Lambda handler architecture and -[`DISTRIBUTED-RENDERING-PLAN.md`](../../DISTRIBUTED-RENDERING-PLAN.md#15-aws-lambda-turnkey-deployment) -§15 for the design context. +for the Lambda handler architecture. ## Prerequisites @@ -146,7 +144,8 @@ fully tear down. A 60-second 1080p30 composition at default chunkSize=240 (8 chunks) typically costs ~$0.04 in Lambda time + ~$0.001 in Step Functions. -Validate with PR 6.3's real-AWS benchmark once it lands. 
+The eval script under `scripts/eval.sh` produces real per-fixture cost +numbers when you run it against your own AWS account. ## Troubleshooting @@ -161,10 +160,9 @@ Validate with PR 6.3's real-AWS benchmark once it lands. the state machine execution history for an unintended Map fan-out, or raise the threshold if your workload genuinely exceeds it. -## What's NOT in this PR +## What's NOT in this directory -- A real-AWS deploy + benchmark workflow (PR 6.3). -- CDK construct shipping the same topology programmatically (PR 6.4). -- `hyperframes lambda deploy / render / progress / destroy` CLI (PR 6.5). -- Migration guide (PR 6.8). -- Lambda RIE local smoke harness mode (PR 6.6). +- CDK construct shipping the same topology programmatically — follow-up. +- `hyperframes lambda deploy / render / progress / destroy` CLI — follow-up. +- Migration guide — follow-up. +- Lambda RIE local smoke harness mode — follow-up. diff --git a/examples/aws-lambda/template.yaml b/examples/aws-lambda/template.yaml index 9838f88b0..86921eb89 100644 --- a/examples/aws-lambda/template.yaml +++ b/examples/aws-lambda/template.yaml @@ -3,12 +3,12 @@ Transform: AWS::Serverless-2016-10-31 Description: >- HyperFrames distributed rendering — Step Functions standard workflow with one Lambda function handling Plan, RenderChunk (fan-out via Map state), - and Assemble. One S3 bucket, one CloudWatch alarm for runaway concurrency. + and Assemble. One S3 bucket, alarms for runaway concurrency, Lambda + errors, and Step Functions execution failures. - Built from the Phase 6.1 ZIP at packages/aws-lambda/dist/handler.zip. + Built from the handler ZIP at packages/aws-lambda/dist/handler.zip. 
See: - - DISTRIBUTED-RENDERING-PLAN.md §15 (Lambda turnkey deployment) - packages/aws-lambda/README.md (handler architecture) - examples/aws-lambda/README.md (this directory's deploy guide) @@ -198,8 +198,8 @@ Resources: LogGroupArn: !GetAtt RenderStateMachineLogGroup.Arn Definition: Comment: >- - HyperFrames distributed render orchestration. See - DISTRIBUTED-RENDERING-PLAN.md §2.5 for the architecture. + HyperFrames distributed render orchestration: Plan → Map(N) + RenderChunk → Assemble. # Defensive 1-hour ceiling on the whole choreography. The # individual states already have retries + per-task timeouts; # this catches pathological runaways (Plan-retry storm, @@ -229,7 +229,11 @@ Resources: ResultPath: $.Plan Retry: - ErrorEquals: - # Per §9.3, these are non-retryable plan-time failures. + # These error names are thrown by the producer's plan + # stage when retrying can never help — version skew, + # determinism violations, GPU misconfiguration, font + # fetch failures, plan-size cap, unsupported format. + # Fail fast rather than burning ~120s of retry budget. - FFMPEG_VERSION_MISMATCH - PLAN_HASH_MISMATCH - BROWSER_GPU_NOT_SOFTWARE @@ -344,11 +348,11 @@ Resources: ResultPath: $.Output Retry: - ErrorEquals: - # Per §9.3, these are non-retryable failures that - # surface at assemble time too — ffmpeg-driven concat - # picks up version drift, plan-hash re-verification - # at assemble, and format checks. Skip the retry - # storm; fail fast. + # Same non-retryable error names as the Plan state's + # gate — these surface at assemble time too because + # ffmpeg-driven concat picks up version drift and we + # re-verify plan hash + format at assemble. Skip the + # retry storm; fail fast. - FFMPEG_VERSION_MISMATCH - PLAN_HASH_MISMATCH - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED
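
Reviewer note on the retry policy (a sketch, not part of the patch): the Plan, RenderChunk, and Assemble states all use `IntervalSeconds: 2`, `BackoffRate: 2`, `MaxAttempts: 4`, and `MaxDelaySeconds: 60`. Assuming the usual Step Functions semantics, where `MaxAttempts` counts retries after the first attempt and each wait is the previous wait multiplied by `BackoffRate` and capped at `MaxDelaySeconds`, the waits can be tabulated with a small Bash helper. The function names `retry_delays` and `sum_delays` are hypothetical and exist only for this illustration; the arithmetic is integer-only, so fractional backoff rates are out of scope.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the PR) for reasoning about a
# Step Functions Retry block. Integer arithmetic only.

# Print the wait, in seconds, before each retry.
# Args: IntervalSeconds BackoffRate MaxAttempts MaxDelaySeconds
retry_delays() {
  local interval=$1 rate=$2 attempts=$3 cap=$4
  local delay=$interval
  local i
  local out=()
  for ((i = 1; i <= attempts; i++)); do
    # Each wait is capped at MaxDelaySeconds.
    out+=("$(( delay > cap ? cap : delay ))")
    # The next wait grows by the backoff multiplier.
    delay=$(( delay * rate ))
  done
  echo "${out[*]}"
}

# Sum a space-separated list of integers.
sum_delays() {
  local total=0 d
  for d in $1; do
    total=$(( total + d ))
  done
  echo "$total"
}

# Values taken from the template's Retry blocks.
delays=$(retry_delays 2 2 4 60)
echo "waits: $delays"
echo "total: $(sum_delays "$delays")s"
```

For the template's values this prints `waits: 2 4 8 16` and `total: 30s`, so the 60-second cap is never reached with four retries. The figure covers only the waits between attempts; the run time of each Lambda invocation comes on top of it.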