feat(lambda): add SAM template and sample events for AWS deployment #879

Open
jrusso1020 wants to merge 1 commit into 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe from 06-2-feat_lambda_add_sam_template

Conversation

Collaborator

@jrusso1020 jrusso1020 commented May 15, 2026

What

Adds examples/aws-lambda/ — a reference SAM template that deploys the
HyperFrames Lambda handler from PR 6.1 alongside the Step Functions
state machine, S3 bucket, IAM roles, and CloudWatch alarm needed to run
distributed renders end-to-end on AWS.

This is PR 6.2 of the 8-PR Phase 6 stack. Stacked on top of #878
(PR 6.1: handler + ZIP).

Why

PR 6.1 produced a deployable ZIP but had nowhere to put it. This PR is
the deployment surface — what sam deploy --guided creates in a user's
own AWS account. Step Functions is the AWS-native fan-out primitive
(Map state), fits Lambda's 15-minute per-invocation cap, and gives
per-stage visibility into Plan / RenderChunk / Assemble timing.

The template is purely a reference for OSS adopters; PR 6.4 (CDK) and
PR 6.5 (hyperframes lambda deploy CLI) ship the same topology
through alternative deployment surfaces.

How

State machine choreography:

Plan
  ↓ ResultPath: $.Plan
BuildChunkList     (Pass state; expands ChunkCount into [0..N-1])
  ↓ Iterator.ChunkIndexes
Map state (MaxConcurrency=16)
  └── RenderChunk  (one Lambda invocation per chunk)
  ↓ Chunks[]
Assemble
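
The choreography above could be expressed roughly as the following Amazon States Language definition in YAML form. This is a sketch under assumptions, not the template's actual definition: the payload field names (Action, ChunkIndex, Config) and the RenderFunctionArn substitution key are hypothetical, and retry blocks are omitted for brevity.

```yaml
# Sketch only. Payload fields and ${RenderFunctionArn} are assumed names.
StartAt: Plan
States:
  Plan:
    Type: Task
    Resource: "${RenderFunctionArn}"      # assumed DefinitionSubstitutions key
    Parameters:
      Action: plan                        # hypothetical action name
      "Config.$": "$.Config"
    ResultPath: "$.Plan"
    Next: BuildChunkList
  BuildChunkList:
    Type: Pass
    Parameters:
      # Expands ChunkCount into [0..N-1]
      "ChunkIndexes.$": "States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1)"
    ResultPath: "$.Iterator"
    Next: RenderChunks
  RenderChunks:
    Type: Map
    ItemsPath: "$.Iterator.ChunkIndexes"
    MaxConcurrency: 16
    ItemSelector:
      Action: renderChunk                 # hypothetical action name
      "ChunkIndex.$": "$$.Map.Item.Value"
      "Plan.$": "$.Plan"
    ItemProcessor:
      StartAt: RenderChunk
      States:
        RenderChunk:
          Type: Task
          Resource: "${RenderFunctionArn}"
          End: true
    ResultPath: "$.Chunks"
    Next: Assemble
  Assemble:
    Type: Task
    Resource: "${RenderFunctionArn}"
    Parameters:
      Action: assemble                    # hypothetical action name
      "Plan.$": "$.Plan"
      "Chunks.$": "$.Chunks"
    End: true
```

Each ResultPath keeps the original input and attaches the state's output beside it, which is what lets Assemble see both the plan and the chunk results.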

Retry policy: 4 attempts, 2s initial, 2× backoff, max 60s — the policy
spec'd in plan §9.4. Typed non-retryable error codes from §9.3
(FFMPEG_VERSION_MISMATCH, PLAN_HASH_MISMATCH,
BROWSER_GPU_NOT_SOFTWARE, FONT_FETCH_FAILED, PLAN_TOO_LARGE,
FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED) are explicitly opted out of
retry. The Map state's RenderChunk path picks the subset of
non-retryables relevant to chunk workers.
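
As an illustration, a Retry block with this shape could look like the sketch below. It assumes "4 attempts" maps to MaxAttempts: 4 and that the handler throws errors whose names equal the typed codes; neither is confirmed by the PR text, and the template may implement the opt-out differently.

```yaml
# Sketch only: retriers are evaluated in order, so the typed codes are
# matched first and given zero retries.
Retry:
  - ErrorEquals:
      - FFMPEG_VERSION_MISMATCH
      - PLAN_HASH_MISMATCH
      - BROWSER_GPU_NOT_SOFTWARE
      - FONT_FETCH_FAILED
      - PLAN_TOO_LARGE
      - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED
    MaxAttempts: 0            # never retry these
  - ErrorEquals:
      - States.ALL            # must appear alone and in the last retrier
    IntervalSeconds: 2        # delay before the first retry
    BackoffRate: 2            # delays grow 2s, 4s, 8s, 16s
    MaxDelaySeconds: 60       # cap on any single delay
    MaxAttempts: 4
```

With these values the cap of 60s is never reached; it only matters if MaxAttempts or IntervalSeconds is raised.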

Lambda function points at ../../packages/aws-lambda/dist/handler.zip
via SAM's CodeUri local-path resolution. On sam deploy, the local ZIP
is uploaded to SAM's managed staging bucket and CodeUri rewrites to
that S3 URI. Override via the HandlerZipUri parameter for
pre-uploaded ZIPs.

S3 bucket: Retain on stack delete; renders/ prefix expires after 7
days (plan tarballs + chunk outputs are intermediate); user-keepable
artifacts go under a different prefix. Public access fully blocked.
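
A bucket with these properties might be declared along the following lines. The logical ID RenderBucket appears in the reviews of this PR; the lifecycle rule Id is a made-up name, and the real template may differ in detail.

```yaml
# Sketch only: retained bucket, public access blocked, renders/ expires.
RenderBucket:
  Type: AWS::S3::Bucket
  DeletionPolicy: Retain                  # keep the bucket on stack delete
  Properties:
    PublicAccessBlockConfiguration:
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
    LifecycleConfiguration:
      Rules:
        - Id: ExpireIntermediateRenders   # hypothetical rule name
          Status: Enabled
          Prefix: renders/                # only intermediates expire
          ExpirationInDays: 7
```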

CloudWatch alarm fires if RenderChunk invocations exceed
ChunkInvocationAlarmThreshold per hour. The Map state's
MaxConcurrency cap protects against simultaneous fan-out but not
against a runaway state machine that loops; this alarm catches that.
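
One way such an alarm could be written is sketched below. Period, evaluation window, and missing-data handling are assumptions. Note that this sketch counts every invocation of the function, so Plan and Assemble calls are included along with RenderChunk calls, because a single function serves all three roles.

```yaml
# Sketch only: hourly invocation-count alarm on the render function.
RenderChunkInvocationAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Render function invocations exceeded the hourly threshold
    Namespace: AWS/Lambda
    MetricName: Invocations
    Dimensions:
      - Name: FunctionName
        Value: !Ref RenderFunction
    Statistic: Sum
    Period: 3600                          # one hour, in seconds
    EvaluationPeriods: 1
    Threshold: !Ref ChunkInvocationAlarmThreshold
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching        # assumed: idle periods are healthy
```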

Notable parameter knobs:

  • LambdaMemoryMb — 10 GB default; Lambda allocates CPU proportionally
  • LambdaTimeoutSec — 900s default (Lambda hard ceiling)
  • ReservedConcurrency defaults to -1 (unreserved); set to bound cost
  • ChromeSource — must match --source= passed to build-zip.ts
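
For illustration, the first three knobs could be wired as in the fragment below. The condition name HasReservedConcurrency is hypothetical, the defaults are taken from the list above, and required function properties such as Handler and Runtime are omitted.

```yaml
# Sketch only: parameter wiring with an optional concurrency reservation.
Parameters:
  LambdaMemoryMb:
    Type: Number
    Default: 10240            # 10 GB
  LambdaTimeoutSec:
    Type: Number
    Default: 900              # Lambda's 15-minute ceiling
  ReservedConcurrency:
    Type: Number
    Default: -1               # -1 means unreserved

Conditions:
  HasReservedConcurrency: !Not [!Equals [!Ref ReservedConcurrency, -1]]

Resources:
  RenderFunction:
    Type: AWS::Serverless::Function
    Properties:
      MemorySize: !Ref LambdaMemoryMb
      Timeout: !Ref LambdaTimeoutSec
      # AWS::NoValue drops the property entirely when no cap is requested
      ReservedConcurrentExecutions:
        !If [HasReservedConcurrency, !Ref ReservedConcurrency, !Ref "AWS::NoValue"]
```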

What's NOT in this PR (deferred to 6.3+):

  • A real-AWS deploy + benchmark workflow (PR 6.3)
  • CDK construct shipping the same topology (PR 6.4)
  • npx hyperframes lambda deploy CLI (PR 6.5)
  • Lambda RIE local smoke harness mode (PR 6.6)
  • Deployment docs page (PR 6.7) and migration guide (PR 6.8)

How is it tested

sam validate --lint against the template — passes:

/.../examples/aws-lambda/template.yaml is a valid SAM Template

Sample event payloads at examples/aws-lambda/sample-events/ cover all
three handler actions and slot into sam local invoke RenderFunction --event <path> for local dispatch tests.

End-to-end real-AWS validation lands in PR 6.3 — that workflow does
sam deploy → start state machine execution → assert PSNR ≥ 50 dB →
sam delete against a HeyGen test AWS account.

Test plan

  • sam validate --lint passes.
  • YAML parses with CloudFormation intrinsic functions (!Ref, !Sub,
    !GetAtt, !If, !Not, !Equals).
  • All four state machine states (Plan, BuildChunkList,
    RenderChunks, Assemble) are reachable.
  • Retry policies match plan §9.3 + §9.4.
  • bunx oxlint + bunx oxfmt --check clean on the directory.
  • End-to-end deploy → render → teardown — lands in PR 6.3.

🤖 Generated with Claude Code

Collaborator Author

jrusso1020 commented May 15, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@jrusso1020 jrusso1020 force-pushed the 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe branch from 619c204 to 5eabdf4 Compare May 16, 2026 00:17
@jrusso1020 jrusso1020 force-pushed the 06-2-feat_lambda_add_sam_template branch from 43f5015 to 8601666 Compare May 16, 2026 00:17
@jrusso1020 jrusso1020 force-pushed the 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe branch from 5eabdf4 to ef55431 Compare May 16, 2026 01:50
@jrusso1020 jrusso1020 force-pushed the 06-2-feat_lambda_add_sam_template branch from 8601666 to cf8d228 Compare May 16, 2026 01:50
Phase 6.2 of the distributed rendering plan (DISTRIBUTED-RENDERING-PLAN.md
§15). Reference SAM template for deploying HyperFrames distributed
rendering on AWS — one Lambda function in three roles, choreographed by
a Step Functions standard workflow with a Map state for parallel chunk
rendering.

Resources created by the template:
  - Lambda function pointing at the Phase 6.1 ZIP
  - Step Functions state machine: Plan -> Map(N) RenderChunk -> Assemble
  - S3 bucket for plan tarballs, chunk outputs, final mp4
  - IAM role for the state machine
  - CloudWatch alarm guarding against runaway chunk invocations

Retry policy: 4 attempts, 2s initial, 2x backoff, max 60s, with the
typed non-retryable error codes from plan §9.3 explicitly opted out.

CodeUri points at packages/aws-lambda/dist/handler.zip; sam deploy
resolves the local path and uploads to a SAM-managed bucket on first
deploy.

Validated: sam validate --lint passes against the template.

This is part of the 8-PR Phase 6 stack; PR 6.2 of 8.
@jrusso1020 jrusso1020 force-pushed the 05-15-feat_lambda_add_lambda_handler_zip_bundling_and_beginframe_probe branch from ef55431 to a1d2874 Compare May 16, 2026 02:24
@jrusso1020 jrusso1020 force-pushed the 06-2-feat_lambda_add_sam_template branch from cf8d228 to d76e0b3 Compare May 16, 2026 02:24
Collaborator

@miguel-heygen miguel-heygen left a comment

Review: PR #879 — SAM template for AWS Lambda deployment

Reviewed: examples/aws-lambda/template.yaml, sample events, README, .gitignore changes.

Critical (90-100)

CloudWatchLogsFullAccess is overly broad (confidence: 92)
examples/aws-lambda/template.yaml, line ~170 (RenderFunction Policies)

The Lambda function's Policies block includes CloudWatchLogsFullAccess, an AWS-managed policy that grants logs:* on * — including DeleteLogGroup, PutRetentionPolicy, CreateExportTask, etc. The Lambda only needs to write its own logs.

Replace with the SAM shorthand that scopes to just the function's log group:

Policies:
  - S3CrudPolicy:
      BucketName: !Ref RenderBucket
  # SAM auto-creates the log group and attaches a scoped logs policy
  # when you omit CloudWatchLogsFullAccess. If you need explicit control:
  - Statement:
      - Effect: Allow
        Action:
          - logs:CreateLogGroup
          - logs:CreateLogStream
          - logs:PutLogEvents
        Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${ProjectName}-render:*"

Actually, SAM's AWS::Serverless::Function already auto-attaches a scoped CloudWatch Logs policy by default — you can just remove the CloudWatchLogsFullAccess line entirely and it still works.


Everything else looks solid:

  • State machine choreography (Plan -> BuildChunkList -> Map -> Assemble) is correct and idiomatic.
  • States.ArrayRange + States.MathAdd for zero-indexed chunk list is the right intrinsic.
  • Retry policy matches the plan spec (4 attempts, 2s/2x/60s) with typed non-retryable errors.
  • S3 bucket has public access blocked, Retain on delete, 7-day lifecycle on intermediates.
  • IAM role for state machine is properly scoped (invoke only the render function ARN).
  • CloudWatch alarm for runaway invocations is a good cost-protection measure.
  • ReservedConcurrency conditional via AWS::NoValue is correct.
  • Sample events are well-formed for all three actions.
  • README is thorough with deploy, run, troubleshoot, and cost model sections.

Approving — the CloudWatchLogsFullAccess issue is real but non-blocking for an example template (users should tighten it for production, and SAM's default logging policy covers the gap). Flag it in a follow-up or just drop the line.

Collaborator

@vanceingalls vanceingalls left a comment

Reference SAM template for the Phase 6 Lambda + Step Functions deploy. Good shape overall — single function with action dispatch, Map fan-out with MaxConcurrency, retain-on-delete S3 with lifecycle for intermediates, typed non-retryables matching plan §9.3. sam validate --lint clean.

Audited: examples/aws-lambda/template.yaml (end-to-end), examples/aws-lambda/README.md, sample events, .gitignore change.
Trusting: plan §9.3 / §9.4 error code list (cited from PR body; the doc isn't on this branch).
No prior reviews on this PR.

Calibrated strengths

  • examples/aws-lambda/template.yaml:60-67 RenderBucket.PublicAccessBlockConfiguration sets all four flags. Correct for an OSS reference that will be copy-pasted.
  • template.yaml:74-82 LifecycleConfiguration scopes ExpirationInDays: 7 to the renders/ prefix only; the comment explains why intermediates expire and user-keepables don't. Clear contract.
  • template.yaml:140-149 — typed non-retryable list in Plan state matches the PR-body §9.3 enumeration exactly. The BackoffRate: 2, MaxDelaySeconds: 60 envelope matches §9.4.
  • .gitignore:67-72 — the examples/* blacklist + !examples/aws-lambda negation is the right shape and the inline comment names the contract. Future OSS examples slot in cleanly.

Blockers

  • template.yaml:32-39 HandlerZipUri parameter is dead. Declared with a description that says "override if you've pre-uploaded the ZIP," but nothing in Resources references !Ref HandlerZipUri. RenderFunction.CodeUri is hardcoded to the local path ../../packages/aws-lambda/dist/handler.zip. Either wire it via a HasHandlerZipUri condition + CodeUri: !If [HasHandlerZipUri, !Ref HandlerZipUri, ../../packages/aws-lambda/dist/handler.zip] (so the override actually works), or delete the parameter. A documented knob that does nothing is worse than no knob — adopters will set it, deploy, and wonder why their custom ZIP wasn't used.

  • template.yaml:357-360 HandlerZipKey output is mislabeled. Description says "S3 key of the deployed handler ZIP. Useful for diffing across deploys." Value is !Ref RenderFunction, which returns the Lambda function NAME (per CFN docs), not an S3 key. Either fix the value (e.g. expose the actual S3 key via a custom resource or GetAtt RenderFunction.CodeS3Key if available), or drop the output. Misleading output on a reference template propagates into adopters' tooling.

Important

  • template.yaml:122 CloudWatchLogsFullAccess on the Lambda is over-scoped. This is the AWS-managed logs:* on *. The Lambda only needs to write to its own log group, which the default SAM execution role already grants. Drop this policy — AWSLambdaBasicExecutionRole (applied by default for AWS::Serverless::Function) covers the legitimate need. Reference templates leak overpermissive IAM into every adopter's account.

  • template.yaml:202-209 Assemble state has no typed non-retryables. Only States.ALL with 4 retries. Per the PR-body §9.3 list, at least FFMPEG_VERSION_MISMATCH, PLAN_HASH_MISMATCH, and probably FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED can fire at assemble time (ffmpeg-driven concat, plan-hash re-verification). Currently a misversioned ffmpeg or a plan/chunk mismatch will retry 4× with backoff, burning ~120s + Lambda cost before failing. Mirror the typed-error gate from Plan (lines 140-149).

  • template.yaml:121-122 — state-machine WriteCloudwatchLogs policy grants log-delivery perms, but no LoggingConfiguration on the state machine itself. Perms granted, never used. Either add LoggingConfiguration (Level: ERROR minimum) so adopters get useful execution logs, or drop the policy. Right now adopters get the IAM grant with no logs and have to discover the gap by reading docs.

  • template.yaml:177 MaxConcurrency: 16 is hardcoded, but Config.maxParallelChunks is in the event payload. Two different concurrency caps in two different places. Adopter overrides maxParallelChunks: 32 in their event, but Map silently caps at 16. Either parameterize via MaxConcurrencyPath: $.Config.maxParallelChunks to derive it from the event, or rename/document the divergence. Same surprise applies to ReservedConcurrency — set it to 8 and the Map still tries to fan out 16. Cross-link the two caps in the parameter descriptions at minimum.

  • No alarm on render failures. RenderChunkInvocationAlarm only catches runaway invocation count. There's no alarm on Lambda Errors metric or on Step Functions ExecutionsFailed. A reference template should ship at least one error-rate alarm so adopters notice silent failures without polling Step Functions console. Plan-state retries that exhaust will just disappear into CloudWatch logs.
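
To make the last point concrete, a pair of error alarms could be sketched as follows. The logical IDs RenderFunctionErrorAlarm, RenderExecutionsFailedAlarm, and RenderStateMachine are hypothetical, and the periods and thresholds are placeholder assumptions that adopters would tune.

```yaml
# Sketch only: one alarm on Lambda errors, one on failed executions.
RenderFunctionErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Render function reported one or more errors
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: !Ref RenderFunction
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching

RenderExecutionsFailedAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: One or more state machine executions failed
    Namespace: AWS/States
    MetricName: ExecutionsFailed
    Dimensions:
      - Name: StateMachineArn
        Value: !Ref RenderStateMachine    # Ref on a state machine yields its ARN
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
```

Neither alarm notifies anyone until AlarmActions points at an SNS topic or similar target, which is left out here.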

Nits

  • template.yaml:75-82 VersioningConfiguration.Status: Suspended for an artifact bucket is fine for cost, but the final mp4 has no protection from accidental overwrite. Adopters will likely want Status: Enabled for the keepables; worth a comment naming the tradeoff.
  • template.yaml — no Tags block on any resource. Cost-allocation tags (Project: ${ProjectName}) on Lambda + bucket + state machine would help adopters track HyperFrames spend. Single Globals block fixes the Lambda; per-resource for the rest.
  • template.yaml:93-95 RenderFunction has no Tracing: Active. State machine tracing is enabled, but without Lambda tracing the spans terminate at the SF→Lambda boundary. Two-line fix; meaningful for §15 troubleshooting.
  • No top-level state-machine TimeoutSeconds. The choreography is bounded (Plan + Map(N) + Assemble), so it's not unbounded — but a defensive 1h ceiling would catch the runaway-Plan-retry pathology earlier than the invocation alarm's 1h window.
  • template.yaml:160-161 States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1) returns [] when ChunkCount=0. If Plan ever legitimately produces zero chunks, Map runs zero iterations and Assemble receives ChunkS3Uris: []. Worth a Choice state gate or an explicit Plan-level invariant that ChunkCount >= 1.
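
The Choice-state gate from the last nit could look something like this pair of states, placed between Plan and BuildChunkList. The state names CheckChunkCount and NoChunks and the error code PLAN_EMPTY are hypothetical and do not come from the plan's §9.3 list.

```yaml
# Sketch only: Plan's Next would point at CheckChunkCount.
CheckChunkCount:
  Type: Choice
  Choices:
    - Variable: "$.Plan.ChunkCount"
      NumericGreaterThanEquals: 1
      Next: BuildChunkList
  Default: NoChunks                       # zero or missing-range counts land here
NoChunks:
  Type: Fail
  Error: PLAN_EMPTY                       # hypothetical error code
  Cause: Plan produced zero chunks; nothing to render or assemble
```

Failing fast here surfaces the empty plan as a named error in the execution history rather than as an Assemble call with an empty chunk list.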

Notes

  • CI: only Graphite mergeability is pending; required Detect changes / regression / player-perf / preview-regression are green. State is unstable only because some optional shards are skipping by path-filter. Not a verdict-blocker.
  • x86_64 architecture is correct for @sparticuz/chromium; flag in the parameter description so adopters who try to switch to chrome-headless-shell ARM64 don't get bitten silently.

Verdict

Verdict: REQUEST CHANGES
Reasoning: Two correctness bugs in a reference template (dead HandlerZipUri param, mislabeled HandlerZipKey output) will mislead OSS adopters who copy-paste the template; combined with CloudWatchLogsFullAccess overscope and the missing typed-error gate on Assemble, the surface needs to be tightened before this is the canonical example.

Review by Vai


3 participants