diff --git a/.gitignore b/.gitignore index 76b9858be..93f34ba23 100644 --- a/.gitignore +++ b/.gitignore @@ -67,7 +67,10 @@ packages/producer/src/services/fontData.generated.ts # Local proof / test artifacts qa-artifacts/ my-video/ -examples/ +examples/* +# Tracked OSS examples — negations override the blanket `examples/*` ignore. +!examples/aws-lambda +!examples/aws-lambda/** packages/studio/data/ .desloppify/ .worktrees/ diff --git a/examples/aws-lambda/.gitignore b/examples/aws-lambda/.gitignore new file mode 100644 index 000000000..912214835 --- /dev/null +++ b/examples/aws-lambda/.gitignore @@ -0,0 +1,3 @@ +# SAM CLI state — written by `sam deploy --guided`, contains user choices. +samconfig.toml +.aws-sam/ diff --git a/examples/aws-lambda/README.md b/examples/aws-lambda/README.md new file mode 100644 index 000000000..bc680b9fd --- /dev/null +++ b/examples/aws-lambda/README.md @@ -0,0 +1,168 @@ +# AWS Lambda + Step Functions deployment + +Reference SAM template for deploying HyperFrames distributed rendering on +AWS. One Lambda function, three roles (Plan / RenderChunk / Assemble), +choreographed by a Step Functions standard workflow with a Map state for +parallel chunk rendering. + +See [`packages/aws-lambda/README.md`](../../packages/aws-lambda/README.md) +for the Lambda handler architecture. + +## Prerequisites + +- AWS account with IAM permissions to deploy CloudFormation stacks + containing Lambda, Step Functions, S3, IAM, and CloudWatch resources. +- [`sam` CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html) + installed (≥ 1.100). +- [`bun`](https://bun.sh) installed (≥ 1.3) to build the handler ZIP. + +## One-shot deploy + +```bash +# 1. Build the handler ZIP that `template.yaml`'s CodeUri points at. +bun install # at repo root +bun run --cwd packages/aws-lambda build:zip + +# 2. Deploy. First time: `--guided` to set stack name + region. 
+cd examples/aws-lambda +sam deploy --guided --resolve-s3 +``` + +`--resolve-s3` lets SAM pick (or create) a per-account bucket to host the +uploaded ZIP. After the first deploy, subsequent updates can omit +`--guided` and `--resolve-s3` — SAM remembers your choices in +`samconfig.toml`. + +## What gets created + +| Resource | Purpose | +| ---------------------------------------- | -------------------------------------------------------------------------------------------------- | +| `Render Lambda` | Single function, handler `handler.handler`. Dispatches on `event.Action`. | +| `Render State Machine` | Step Functions standard workflow. Plan → Map(N) RenderChunk → Assemble. | +| `Render Bucket` | S3 bucket for plan tarballs, chunk outputs, and final mp4. `renders/` prefix expires after 7 days. | +| IAM role for the state machine | Invokes the Lambda; writes CloudWatch logs; X-Ray traces. | +| IAM role for the Lambda (managed by SAM) | S3 CRUD on the render bucket; CloudWatch logs. | +| Runaway-invocation alarm | Fires if RenderChunk runs more than `ChunkInvocationAlarmThreshold` times in an hour. | + +## Running a render + +Upload your project as a zip to the render bucket, then start a Step +Functions execution: + +```bash +STACK_NAME=hyperframes-render # whatever you picked at deploy +RENDER_BUCKET=$(aws cloudformation describe-stacks \ + --stack-name "$STACK_NAME" \ + --query 'Stacks[0].Outputs[?OutputKey==`RenderBucketName`].OutputValue' \ + --output text) +STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \ + --stack-name "$STACK_NAME" \ + --query 'Stacks[0].Outputs[?OutputKey==`RenderStateMachineArn`].OutputValue' \ + --output text) + +# Tar + upload the project directory. The handler uses `tar` (not +# `unzip`, which Lambda's base image doesn't ship), so the on-the-wire +# archive format is `.tar.gz`. +tar -czf my-project.tar.gz -C ./my-project . +aws s3 cp my-project.tar.gz "s3://${RENDER_BUCKET}/projects/my-project.tar.gz" + +# Start the execution. 
The input JSON tells the state machine where to +# read inputs and write outputs. +aws stepfunctions start-execution \ + --state-machine-arn "$STATE_MACHINE_ARN" \ + --input "$(cat <- + HyperFrames distributed rendering — Step Functions standard workflow with + one Lambda function handling Plan, RenderChunk (fan-out via Map state), + and Assemble. One S3 bucket, alarms for runaway concurrency, Lambda + errors, and Step Functions execution failures. + + Built from the handler ZIP at packages/aws-lambda/dist/handler.zip. + + See: + - packages/aws-lambda/README.md (handler architecture) + - examples/aws-lambda/README.md (this directory's deploy guide) + +Parameters: + ProjectName: + Type: String + Default: hyperframes + Description: Name prefix applied to all created resources. + + LambdaMemoryMb: + Type: Number + Default: 10240 + AllowedValues: [2048, 3072, 4096, 5120, 6144, 7168, 8192, 9216, 10240] + Description: >- + Lambda memory in MB. Render workloads are CPU-bound; bumping memory + proportionally bumps the CPU share Lambda gives the function. 10 GB + (the max) is recommended for 1080p renders. + + LambdaTimeoutSec: + Type: Number + Default: 900 + MinValue: 60 + MaxValue: 900 + Description: >- + Per-invocation Lambda timeout. Render chunks at the default + chunkSize=240 frames complete in seconds; 15 minutes is the Lambda + hard ceiling and the default here to absorb cold-start variance. + + ReservedConcurrency: + Type: Number + Default: -1 + Description: >- + Lambda reserved concurrency cap. Set to a positive integer to bound + simultaneous chunk renders (e.g. 50 to limit cost). -1 means + unreserved (account default). + + ChromeSource: + Type: String + Default: sparticuz + AllowedValues: [sparticuz, chrome-headless-shell] + Description: >- + Which Chrome runtime the bundled ZIP was built with. Must match the + `--source=` flag passed to `build-zip.ts`. The handler reads this + via the HYPERFRAMES_LAMBDA_CHROME_SOURCE env var at boot. 
+ + ChunkInvocationAlarmThreshold: + Type: Number + Default: 1000 + Description: >- + CloudWatch alarm threshold for total RenderChunk invocations per + hour. The runaway-Map state pathology would fan out far more + chunks than expected; an alarm at 10× the typical workload + protects against billing surprises. + +Conditions: + HasReservedConcurrency: !Not [!Equals [!Ref ReservedConcurrency, -1]] + +Globals: + Function: + Runtime: nodejs22.x + MemorySize: !Ref LambdaMemoryMb + Timeout: !Ref LambdaTimeoutSec + # x86_64 is required for @sparticuz/chromium — its prebuilt + # Chromium ships x86_64-only. Adopters who switch to a custom + # ARM-built chrome-headless-shell can change this to `arm64`, but + # the default ZIP build will fail to launch on Graviton. + Architectures: [x86_64] + # Lambda function-level X-Ray tracing. The state machine already + # has Tracing.Enabled: true; without this, X-Ray traces would + # terminate at the Step Functions → Lambda boundary instead of + # following into per-function spans. + Tracing: Active + # Cost-allocation tags. Setting these at the Globals level applies + # to every AWS::Serverless::Function in the template — there's + # only one today, but the contract is portable to multi-function + # variants. Bucket + state-machine carry the same tags resource- + # locally because Globals only covers functions. + Tags: + Project: !Ref ProjectName + HyperFramesComponent: lambda-renderer + Environment: + Variables: + NODE_OPTIONS: "--enable-source-maps" + HYPERFRAMES_LAMBDA_CHROME_SOURCE: !Ref ChromeSource + +Resources: + # ── S3 bucket for plan tarballs, chunk outputs, and final renders ─────── + RenderBucket: + Type: AWS::S3::Bucket + DeletionPolicy: Retain + UpdateReplacePolicy: Retain + Properties: + # BucketName omitted — CloudFormation generates a unique name like + # "-renderbucket-". 
S3 bucket names are capped at + # 63 chars; a static !Sub expression including ProjectName + + # AWS::AccountId + AWS::Region trips that limit when ProjectName + # carries a timestamp (e.g. the smoke script's per-run stack name). + PublicAccessBlockConfiguration: + BlockPublicAcls: true + BlockPublicPolicy: true + IgnorePublicAcls: true + RestrictPublicBuckets: true + VersioningConfiguration: + # `Suspended` keeps storage costs flat — versions are not + # retained on overwrites. Tradeoff: if an adopter writes their + # final rendered mp4 to this bucket and a re-render overwrites + # the same key, the prior version is gone. Adopters who treat + # the final mp4 as user-keepable should set this to `Enabled` + # (intermediates under `renders/` still expire via the + # lifecycle rule below regardless). + Status: Suspended + LifecycleConfiguration: + Rules: + - Id: ExpireIntermediates + Status: Enabled + Prefix: renders/ + # Plan tarballs and chunk outputs are intermediate artifacts. + # Users keep the final mp4 (different key prefix); the rest + # can age out after a week to keep storage costs flat. + ExpirationInDays: 7 + Tags: + - Key: Project + Value: !Ref ProjectName + - Key: HyperFramesComponent + Value: lambda-renderer + + # ── Single Lambda function handling all three roles ────────────────────── + RenderFunction: + Type: AWS::Serverless::Function + Properties: + FunctionName: !Sub "${ProjectName}-render" + Description: >- + HyperFrames distributed render handler. Dispatches on event.Action. + Handler: handler.handler + # Local path is resolved by `sam build` + `sam deploy --resolve-s3` + # (or `--s3-bucket`); the resulting CodeUri rewrites to s3://. 
+ CodeUri: ../../packages/aws-lambda/dist/handler.zip + PackageType: Zip + ReservedConcurrentExecutions: !If + - HasReservedConcurrency + - !Ref ReservedConcurrency + - !Ref AWS::NoValue + EphemeralStorage: + Size: 10240 + Environment: + Variables: + # Lambda's Node 22 runtime sets these by default; explicit for + # clarity + so users can override during local SAM invoke. + TMPDIR: /tmp + Policies: + - S3CrudPolicy: + BucketName: !Ref RenderBucket + # CloudWatch Logs perms are covered by SAM's default + # AWSLambdaBasicExecutionRole — explicit `CloudWatchLogsFullAccess` + # would be overscope (`logs:*` on `*`, including DeleteLogGroup + + # CreateExportTask). Reference templates shouldn't leak overbroad + # IAM into adopters' accounts. + + # ── CloudWatch log group for the state machine ────────────────────────── + # SAM doesn't auto-create one when `LoggingConfiguration` is set, so we + # define it explicitly — that way the IAM grant on the state-machine + # role has a destination to write to. + RenderStateMachineLogGroup: + Type: AWS::Logs::LogGroup + Properties: + LogGroupName: !Sub "/aws/states/${ProjectName}-render" + RetentionInDays: 30 + + # ── Step Functions state machine: Plan → Map(N) RenderChunk → Assemble ── + RenderStateMachine: + Type: AWS::Serverless::StateMachine + Properties: + Name: !Sub "${ProjectName}-render" + Type: STANDARD + Tracing: + Enabled: true + Logging: + # Without this, the `WriteCloudwatchLogs` grant on the state + # machine role would be unused — operators would see zero + # execution history outside the Step Functions console. + # `Level: ERROR` keeps log volume low; bump to `ALL` for + # heavy debugging. + Level: ERROR + IncludeExecutionData: false + Destinations: + - CloudWatchLogsLogGroup: + LogGroupArn: !GetAtt RenderStateMachineLogGroup.Arn + Definition: + Comment: >- + HyperFrames distributed render orchestration: Plan → Map(N) + RenderChunk → Assemble. + # Defensive 1-hour ceiling on the whole choreography. 
The + # individual states already have retries + per-task timeouts; + # this catches pathological runaways (Plan-retry storm, + # stuck-state-machine bugs) at the top before per-task budgets + # compound into a multi-hour execution. The longest legitimate + # render observed in PR 880's eval was ~3 minutes. + TimeoutSeconds: 3600 + StartAt: Plan + States: + Plan: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: plan + ProjectS3Uri.$: "$.ProjectS3Uri" + PlanOutputS3Prefix.$: "$.PlanOutputS3Prefix" + Config.$: "$.Config" + ResultSelector: + PlanS3Uri.$: "$.Payload.PlanS3Uri" + PlanHash.$: "$.Payload.PlanHash" + ChunkCount.$: "$.Payload.ChunkCount" + Format.$: "$.Payload.Format" + HasAudio.$: "$.Payload.HasAudio" + AudioS3Uri.$: "$.Payload.AudioS3Uri" + ResultPath: $.Plan + Retry: + - ErrorEquals: + # These error names are thrown by the producer's plan + # stage when retrying can never help — version skew, + # determinism violations, GPU misconfiguration, font + # fetch failures, plan-size cap, unsupported format. + # Fail fast rather than burning ~120s of retry budget. + - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - BROWSER_GPU_NOT_SOFTWARE + - FONT_FETCH_FAILED + - PLAN_TOO_LARGE + - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + Next: BuildChunkList + + BuildChunkList: + # Translate ChunkCount into an array `[0, 1, ..., N-1]` so the + # Map state below has something to iterate. Range is the + # idiomatic Step Functions intrinsic for this; no Lambda call + # required. 
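+ # Worked example, assuming ChunkCount = 3: States.MathAdd(3, -1) is 2, and + # States.ArrayRange(0, 2, 1) returns [0, 1, 2] because the end bound is + # inclusive.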
+ Type: Pass + Parameters: + ChunkIndexes.$: "States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1)" + ResultPath: $.Iterator + Next: AssertChunkCount + + AssertChunkCount: + # Defensive gate: `resolveChunkPlan` guarantees ChunkCount ≥ 1, + # but if some future regression let a zero-chunk plan through, + # `RenderChunks` (Map state) would iterate zero times and + # `Assemble` would receive an empty `ChunkS3Uris` array — silently + # producing an empty output. Fail fast instead. + Type: Choice + Choices: + - Variable: $.Plan.ChunkCount + NumericGreaterThan: 0 + Next: RenderChunks + Default: PlanProducedZeroChunks + + PlanProducedZeroChunks: + Type: Fail + Error: PLAN_TOO_LARGE + Cause: Plan returned ChunkCount=0 — non-retryable producer-side invariant violation. + + RenderChunks: + Type: Map + ItemsPath: $.Iterator.ChunkIndexes + ItemSelector: + ChunkIndex.$: "$$.Map.Item.Value" + PlanS3Uri.$: "$.Plan.PlanS3Uri" + PlanHash.$: "$.Plan.PlanHash" + ChunkOutputS3Prefix.$: "$.PlanOutputS3Prefix" + Format.$: "$.Plan.Format" + # Map fan-out cap derives from the Plan's chunkCount so + # caller-supplied `Config.maxParallelChunks` (which + # `plan()` honours when sizing the chunk list) is the + # single source of truth. A hardcoded value here would + # silently throttle adopters who scale up the chunk count + # in their event payload. 
+ MaxConcurrencyPath: $.Plan.ChunkCount + ResultPath: $.Chunks + ItemProcessor: + ProcessorConfig: + Mode: INLINE + StartAt: RenderChunk + States: + RenderChunk: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: renderChunk + ChunkIndex.$: "$.ChunkIndex" + PlanS3Uri.$: "$.PlanS3Uri" + PlanHash.$: "$.PlanHash" + ChunkOutputS3Prefix.$: "$.ChunkOutputS3Prefix" + Format.$: "$.Format" + ResultSelector: + ChunkS3Uri.$: "$.Payload.ChunkS3Uri" + ChunkIndex.$: "$.Payload.ChunkIndex" + Sha256.$: "$.Payload.Sha256" + Retry: + - ErrorEquals: + - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - BROWSER_GPU_NOT_SOFTWARE + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + End: true + Next: Assemble + + Assemble: + Type: Task + Resource: arn:aws:states:::lambda:invoke + Parameters: + FunctionName: !GetAtt RenderFunction.Arn + Payload: + Action: assemble + PlanS3Uri.$: "$.Plan.PlanS3Uri" + ChunkS3Uris.$: "$.Chunks[*].ChunkS3Uri" + AudioS3Uri.$: "$.Plan.AudioS3Uri" + OutputS3Uri.$: "$.OutputS3Uri" + Format.$: "$.Plan.Format" + ResultSelector: + OutputS3Uri.$: "$.Payload.OutputS3Uri" + FramesEncoded.$: "$.Payload.FramesEncoded" + FileSize.$: "$.Payload.FileSize" + ResultPath: $.Output + Retry: + - ErrorEquals: + # Same non-retryable error names as the Plan state's + # gate — these surface at assemble time too because + # ffmpeg-driven concat picks up version drift and we + # re-verify plan hash + format at assemble. Skip the + # retry storm; fail fast. 
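+ # For scale, the catch-all retrier below (IntervalSeconds 2, BackoffRate 2, + # MaxAttempts 4) would wait 2 s, 4 s, 8 s, then 16 s between attempts: + # about 30 s of backoff, plus the duration of each failed invocation.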
+ - FFMPEG_VERSION_MISMATCH + - PLAN_HASH_MISMATCH + - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED + MaxAttempts: 0 + - ErrorEquals: [States.ALL] + IntervalSeconds: 2 + MaxAttempts: 4 + BackoffRate: 2 + MaxDelaySeconds: 60 + End: true + Role: !GetAtt RenderStateMachineRole.Arn + + RenderStateMachineRole: + Type: AWS::IAM::Role + Properties: + AssumeRolePolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Principal: + Service: states.amazonaws.com + Action: sts:AssumeRole + Policies: + - PolicyName: InvokeRenderFunction + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: lambda:InvokeFunction + Resource: !GetAtt RenderFunction.Arn + - PolicyName: WriteCloudwatchLogs + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - logs:CreateLogDelivery + - logs:GetLogDelivery + - logs:UpdateLogDelivery + - logs:DeleteLogDelivery + - logs:ListLogDeliveries + - logs:PutResourcePolicy + - logs:DescribeResourcePolicies + - logs:DescribeLogGroups + Resource: "*" + - PolicyName: XRayTracing + PolicyDocument: + Version: "2012-10-17" + Statement: + - Effect: Allow + Action: + - xray:PutTraceSegments + - xray:PutTelemetryRecords + Resource: "*" + + # ── CloudWatch alarm: runaway chunk invocations ───────────────────────── + RenderChunkInvocationAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-runaway-chunk-invocations" + AlarmDescription: >- + Fires if RenderChunk Lambda invocations exceed the configured + threshold in a 1-hour window. The Map state's MaxConcurrency cap + protects against simultaneous fan-out, but a runaway state + machine that triggers many sequential renders would still rack + up cost; this alarm catches that pattern. 
+ Namespace: AWS/Lambda + MetricName: Invocations + Dimensions: + - Name: FunctionName + Value: !Ref RenderFunction + Statistic: Sum + Period: 3600 + EvaluationPeriods: 1 + Threshold: !Ref ChunkInvocationAlarmThreshold + ComparisonOperator: GreaterThanThreshold + TreatMissingData: notBreaching + + # ── CloudWatch alarm: Lambda function errors ──────────────────────────── + # Fires on any non-zero error rate. The invocation alarm above catches + # *too many calls*; this catches *calls that failed*. Without it, + # silent per-chunk failures (a non-retryable error inside the + # producer) would only surface by reading Step Functions execution + # history. + RenderFunctionErrorsAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-render-function-errors" + AlarmDescription: >- + Fires if the render Lambda reports any errors in a 5-minute + window. Set EvaluationPeriods=1 so a single failure pages. + Namespace: AWS/Lambda + MetricName: Errors + Dimensions: + - Name: FunctionName + Value: !Ref RenderFunction + Statistic: Sum + Period: 300 + EvaluationPeriods: 1 + Threshold: 1 + ComparisonOperator: GreaterThanOrEqualToThreshold + TreatMissingData: notBreaching + + # ── CloudWatch alarm: Step Functions execution failures ───────────────── + # Fires when a state-machine execution reaches a terminal failure + # state (typed non-retryable or retry-exhausted). An execution that + # exceeds the top-level TimeoutSeconds ends as TIMED_OUT, which + # CloudWatch reports under ExecutionsTimedOut rather than + # ExecutionsFailed, so this alarm does not cover that case. + # Complementary to the Lambda Errors alarm: SFN failures include + # Choice-state Fail branches (PlanProducedZeroChunks) that bypass + # Lambda entirely, plus retry-exhaustion of transient errors that + # individual Lambda invocations counted as successful "retries". + RenderStateMachineFailedAlarm: + Type: AWS::CloudWatch::Alarm + Properties: + AlarmName: !Sub "${ProjectName}-render-state-machine-failed" + AlarmDescription: >- + Fires when the render state machine reports a failed + execution. 
Catches retry-exhaustion and typed non-retryable + cases that the Lambda Errors metric misses. Executions that exceed + TimeoutSeconds are counted under ExecutionsTimedOut instead, so + this alarm does not fire for them. + Namespace: AWS/States + MetricName: ExecutionsFailed + Dimensions: + - Name: StateMachineArn + Value: !Ref RenderStateMachine + Statistic: Sum + Period: 300 + EvaluationPeriods: 1 + Threshold: 1 + ComparisonOperator: GreaterThanOrEqualToThreshold + TreatMissingData: notBreaching + +Outputs: + RenderBucketName: + Description: S3 bucket for plan tarballs, chunk outputs, and final renders. + Value: !Ref RenderBucket + Export: + Name: !Sub "${AWS::StackName}-RenderBucket" + + RenderFunctionArn: + Description: ARN of the Lambda function. Pass to `aws lambda invoke` to test the deployed function directly. + Value: !GetAtt RenderFunction.Arn + Export: + Name: !Sub "${AWS::StackName}-RenderFunctionArn" + + RenderStateMachineArn: + Description: ARN of the Step Functions state machine. Pass to `aws stepfunctions start-execution`. + Value: !Ref RenderStateMachine + Export: + Name: !Sub "${AWS::StackName}-RenderStateMachineArn"
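The state machine definition above reads four top-level fields from the execution input: `$.ProjectS3Uri`, `$.PlanOutputS3Prefix`, and `$.Config` in the Plan state, and `$.OutputS3Uri` in Assemble. The following is a minimal sketch of assembling that input in TypeScript. The `buildRenderInput` helper, the `output/` key prefix for the final mp4, the trailing slash on the prefix, and the bucket and render names are all assumptions made for illustration; the contents of `Config` are defined by the handler and are left opaque here.

```typescript
// Sketch only: `buildRenderInput` is a hypothetical helper, not part of the
// HyperFrames packages. The four top-level keys are the ones the state
// machine definition references via JSONPath.
interface RenderExecutionInput {
  ProjectS3Uri: string; // Plan reads $.ProjectS3Uri
  PlanOutputS3Prefix: string; // Plan and RenderChunks read $.PlanOutputS3Prefix
  OutputS3Uri: string; // Assemble reads $.OutputS3Uri
  Config: Record<string, unknown>; // Plan reads $.Config; fields are handler-defined
}

function buildRenderInput(
  bucket: string,
  projectKey: string,
  renderId: string,
  config: Record<string, unknown> = {},
): RenderExecutionInput {
  return {
    ProjectS3Uri: `s3://${bucket}/${projectKey}`,
    // Intermediates go under `renders/`, the prefix the lifecycle rule expires
    // after 7 days. The trailing slash is an assumption about the handler.
    PlanOutputS3Prefix: `s3://${bucket}/renders/${renderId}/`,
    // Assumed location outside `renders/` so the final mp4 is not expired.
    OutputS3Uri: `s3://${bucket}/output/${renderId}.mp4`,
    // Always emitted, even when empty: the Plan state references $.Config, and
    // Step Functions fails a state whose Parameters point at a missing path.
    Config: config,
  };
}

// Hypothetical bucket name and render id, for illustration only.
const input = buildRenderInput(
  "example-render-bucket",
  "projects/my-project.tar.gz",
  "demo-001",
);
const inputJson = JSON.stringify(input);
console.log(inputJson);
```

The resulting JSON string is the kind of value that could be passed as `--input` to `aws stepfunctions start-execution`, alongside the `RenderStateMachineArn` stack output.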