feat(lambda): add SAM template and sample events for AWS deployment #879
Phase 6.2 of the distributed rendering plan (DISTRIBUTED-RENDERING-PLAN.md §15).

Reference SAM template for deploying HyperFrames distributed rendering on AWS: one Lambda function in three roles, choreographed by a Step Functions standard workflow with a Map state for parallel chunk rendering.

Resources created by the template:
- Lambda function pointing at the Phase 6.1 ZIP
- Step Functions state machine: Plan -> Map(N) RenderChunk -> Assemble
- S3 bucket for plan tarballs, chunk outputs, final mp4
- IAM role for the state machine
- CloudWatch alarm guarding against runaway chunk invocations

Retry policy: 4 attempts, 2s initial, 2x backoff, max 60s, with the typed non-retryable error codes from plan §9.3 explicitly opted out.

CodeUri points at packages/aws-lambda/dist/handler.zip; sam deploy resolves the local path and uploads to a SAM-managed bucket on first deploy.

Validated: sam validate --lint passes against the template.

This is part of the 8-PR Phase 6 stack; PR 6.2 of 8.
miguel-heygen
left a comment
Review: PR #879 — SAM template for AWS Lambda deployment
Reviewed: examples/aws-lambda/template.yaml, sample events, README, .gitignore changes.
Critical (90-100)
CloudWatchLogsFullAccess is overly broad (confidence: 92)
packages/producer → examples/aws-lambda/template.yaml, line ~170 (RenderFunction Policies)
The Lambda function's Policies block includes CloudWatchLogsFullAccess, an AWS-managed policy that grants logs:* on * — including DeleteLogGroup, PutRetentionPolicy, CreateExportTask, etc. The Lambda only needs to write its own logs.
Replace with the SAM shorthand that scopes to just the function's log group:

    Policies:
      - S3CrudPolicy:
          BucketName: !Ref RenderBucket
      # SAM attaches AWSLambdaBasicExecutionRole by default (log-write actions
      # only), so omitting CloudWatchLogsFullAccess is enough.
      # If you need explicit control:
      - Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${ProjectName}-render:*"

Actually, SAM's AWS::Serverless::Function already attaches AWSLambdaBasicExecutionRole by default, which allows only the log-write actions. You can just remove the CloudWatchLogsFullAccess line entirely and logging still works.
Everything else looks solid:
- State machine choreography (Plan -> BuildChunkList -> Map -> Assemble) is correct and idiomatic.
- `States.ArrayRange` + `States.MathAdd` for zero-indexed chunk list is the right intrinsic.
- Retry policy matches the plan spec (4 attempts, 2s/2x/60s) with typed non-retryable errors.
- S3 bucket has public access blocked, `Retain` on delete, 7-day lifecycle on intermediates.
- IAM role for state machine is properly scoped (invoke only the render function ARN).
- CloudWatch alarm for runaway invocations is a good cost-protection measure.
- `ReservedConcurrency` conditional via `AWS::NoValue` is correct.
- Sample events are well-formed for all three actions.
- README is thorough with deploy, run, troubleshoot, and cost model sections.
Approving — the CloudWatchLogsFullAccess issue is real but non-blocking for an example template (users should tighten it for production, and SAM's default logging policy covers the gap). Flag it in a follow-up or just drop the line.
vanceingalls
left a comment
Reference SAM template for the Phase 6 Lambda + Step Functions deploy. Good shape overall — single function with action dispatch, Map fan-out with MaxConcurrency, retain-on-delete S3 with lifecycle for intermediates, typed non-retryables matching plan §9.3. sam validate --lint clean.
Audited: examples/aws-lambda/template.yaml (end-to-end), examples/aws-lambda/README.md, sample events, .gitignore change.
Trusting: plan §9.3 / §9.4 error code list (cited from PR body; the doc isn't on this branch).
No prior reviews on this PR.
Calibrated strengths
- `examples/aws-lambda/template.yaml:60-67`: `RenderBucket.PublicAccessBlockConfiguration` sets all four flags. Correct for an OSS reference that will be copy-pasted.
- `template.yaml:74-82`: `LifecycleConfiguration` scopes `ExpirationInDays: 7` to `renders/` prefix only; comment explains why intermediates expire and user-keepables don't. Clear contract.
- `template.yaml:140-149`: typed non-retryable list in `Plan` state matches the PR-body §9.3 enumeration exactly. The `BackoffRate: 2`, `MaxDelaySeconds: 60` envelope matches §9.4.
- `.gitignore:67-72`: the `examples/*` blacklist + `!examples/aws-lambda` negation is the right shape and the inline comment names the contract. Future OSS examples slot in cleanly.
Blockers
- `template.yaml:32-39`: `HandlerZipUri` parameter is dead. Declared with a description that says "override if you've pre-uploaded the ZIP," but nothing in `Resources` references `!Ref HandlerZipUri`. `RenderFunction.CodeUri` is hardcoded to the local path `../../packages/aws-lambda/dist/handler.zip`. Either wire it via a `HasHandlerZipUri` condition + `CodeUri: !If [HasHandlerZipUri, !Ref HandlerZipUri, ../../packages/aws-lambda/dist/handler.zip]` (so the override actually works), or delete the parameter. A documented knob that does nothing is worse than no knob: adopters will set it, deploy, and wonder why their custom ZIP wasn't used.
- `template.yaml:357-360`: `HandlerZipKey` output is mislabeled. Description says "S3 key of the deployed handler ZIP. Useful for diffing across deploys." Value is `!Ref RenderFunction`, which returns the Lambda function NAME (per CFN docs), not an S3 key. Either fix the value (e.g. expose the actual S3 key via a custom resource or `GetAtt RenderFunction.CodeS3Key` if available), or drop the output. Misleading output on a reference template propagates into adopters' tooling.
Important
- `template.yaml:122`: `CloudWatchLogsFullAccess` on the Lambda is overscope. This is the AWS-managed `logs:*` on `*`. The Lambda only needs to write to its own log group, which the default SAM execution role already grants. Drop this policy; `AWSLambdaBasicExecutionRole` (applied by default for `AWS::Serverless::Function`) covers the legitimate need. Reference templates leak overpermissive IAM into every adopter's account.
- `template.yaml:202-209`: `Assemble` state has no typed non-retryables. Only `States.ALL` with 4 retries. Per the PR-body §9.3 list, at least `FFMPEG_VERSION_MISMATCH`, `PLAN_HASH_MISMATCH`, and probably `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED` can fire at assemble time (ffmpeg-driven concat, plan-hash re-verification). Currently a misversioned ffmpeg or a plan/chunk mismatch will retry 4× with backoff, burning ~120s + Lambda cost before failing. Mirror the typed-error gate from `Plan` (line 140-149).
- `template.yaml:121-122`: state-machine `WriteCloudwatchLogs` policy grants log-delivery perms, but no `LoggingConfiguration` on the state machine itself. Perms granted, never used. Either add `LoggingConfiguration` (`Level: ERROR` minimum) so adopters get useful execution logs, or drop the policy. Right now adopters get the IAM grant with no logs and have to discover the gap by reading docs.
- `template.yaml:177`: `MaxConcurrency: 16` hardcoded but `Config.maxParallelChunks` is in the event payload. Two different concurrency caps in two different places. Adopter overrides `maxParallelChunks: 32` in their event, but Map silently caps at 16. Either parameterize via `MaxConcurrencyPath: $.Config.maxParallelChunks` to derive it from the event, or rename/document the divergence. Same surprise applies to `ReservedConcurrency`: set it to 8 and the Map still tries to fan out 16. Cross-link the two caps in the parameter descriptions at minimum.
- No alarm on render failures. `RenderChunkInvocationAlarm` only catches runaway invocation count. There's no alarm on Lambda `Errors` metric or on Step Functions `ExecutionsFailed`. A reference template should ship at least one error-rate alarm so adopters notice silent failures without polling Step Functions console. Plan-state retries that exhaust will just disappear into CloudWatch logs.
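A minimal sketch of the kind of error alarms described in the last item, not the template's actual resources. The alarm names, the `RenderStateMachine` logical ID, and the thresholds and periods are assumptions for illustration; the metric names and dimensions are the standard `AWS/Lambda` and `AWS/States` ones.

```yaml
# Sketch only. RenderFunctionErrorAlarm, StateMachineFailedAlarm, and
# RenderStateMachine are hypothetical names; RenderFunction is the function
# logical ID cited in the review.
RenderFunctionErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Render Lambda invocations that ended in an error.
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: !Ref RenderFunction       # !Ref returns the function name
    Statistic: Sum
    Period: 300                          # placeholder window, in seconds
    EvaluationPeriods: 1
    Threshold: 1                         # placeholder; also fires on errors that a retry later recovers
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching

StateMachineFailedAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: State machine executions that failed after retries were exhausted.
    Namespace: AWS/States
    MetricName: ExecutionsFailed
    Dimensions:
      - Name: StateMachineArn
        Value: !Ref RenderStateMachine   # !Ref returns the state machine ARN
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching
```

The second alarm is probably the quieter of the two, since it only counts executions that failed as a whole rather than individual invocations that were retried successfully.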
Nits
- `template.yaml:75-82`: `VersioningConfiguration.Status: Suspended` for an artifact bucket is fine for cost, but the final mp4 has no protection from accidental overwrite. Adopters will likely want `Status: Enabled` for the keepables; worth a comment naming the tradeoff.
- `template.yaml`: no `Tags` block on any resource. Cost-allocation tags (`Project: ${ProjectName}`) on Lambda + bucket + state machine would help adopters track HyperFrames spend. Single `Globals` block fixes the Lambda; per-resource for the rest.
- `template.yaml:93-95`: `RenderFunction` has no `Tracing: Active`. State machine tracing is enabled, but without Lambda tracing the spans terminate at the SF→Lambda boundary. Two-line fix; meaningful for §15 troubleshooting.
- No top-level state-machine `TimeoutSeconds`. The choreography is bounded (Plan + Map(N) + Assemble), so it's not unbounded, but a defensive 1h ceiling would catch the runaway-Plan-retry pathology earlier than the invocation alarm's 1h window.
- `template.yaml:160-161`: `States.ArrayRange(0, States.MathAdd($.Plan.ChunkCount, -1), 1)` returns `[]` when `ChunkCount=0`. If Plan ever legitimately produces zero chunks, Map runs zero iterations and Assemble receives `ChunkS3Uris: []`. Worth a `Choice` state gate or an explicit Plan-level invariant that `ChunkCount >= 1`.
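The last two nits could be sketched roughly as follows, assuming the state machine definition is written as YAML and that `Plan` is re-pointed at the new gate. The state names `CheckChunkCount` and `NoChunks` and the error code `PLAN_EMPTY` are hypothetical; `$.Plan.ChunkCount` and `BuildChunkList` come from the review above.

```yaml
# Fragment of a state machine definition; all other states omitted.
TimeoutSeconds: 3600                 # defensive one-hour ceiling for the whole execution
States:
  # Plan would set "Next: CheckChunkCount" instead of going straight to BuildChunkList.
  CheckChunkCount:                   # hypothetical gate between Plan and BuildChunkList
    Type: Choice
    Choices:
      - Variable: $.Plan.ChunkCount
        NumericGreaterThanEquals: 1
        Next: BuildChunkList
    Default: NoChunks                # zero (or negative) chunk count falls through here
  NoChunks:
    Type: Fail
    Error: PLAN_EMPTY                # hypothetical error code, not part of plan §9.3
    Cause: Plan produced zero chunks.
```

Failing fast here keeps an empty plan from reaching Assemble with an empty chunk list.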
Notes
- CI: only Graphite mergeability is pending; required `Detect changes` / `regression` / `player-perf` / `preview-regression` are green. State is `unstable` only because some optional shards are skipping by path-filter. Not a verdict-blocker.
- `x86_64` architecture is correct for `@sparticuz/chromium`; flag in the parameter description so adopters who try to switch to `chrome-headless-shell` ARM64 don't get bitten silently.
Verdict
Verdict: REQUEST CHANGES
Reasoning: Two correctness bugs in a reference template (dead HandlerZipUri param, mislabeled HandlerZipKey output) will mislead OSS adopters who copy-paste the template; combined with CloudWatchLogsFullAccess overscope and the missing typed-error gate on Assemble, the surface needs to be tightened before this is the canonical example.
Review by Vai

What
Adds `examples/aws-lambda/`, a reference SAM template that deploys the HyperFrames Lambda handler from PR 6.1 alongside the Step Functions state machine, S3 bucket, IAM roles, and CloudWatch alarm needed to run distributed renders end-to-end on AWS.
This is PR 6.2 of the 8-PR Phase 6 stack. Stacked on top of #878
(PR 6.1: handler + ZIP).
Why
PR 6.1 produced a deployable ZIP but had nowhere to put it. This PR is the deployment surface: what `sam deploy --guided` creates in a user's own AWS account. Step Functions is the AWS-native fan-out primitive (`Map` state), fits Lambda's 15-minute per-invocation cap, and gives per-stage visibility into Plan / RenderChunk / Assemble timing.

The template is purely a reference for OSS adopters; PR 6.4 (CDK) and PR 6.5 (`hyperframes lambda deploy` CLI) ship the same topology through alternative deployment surfaces.
How
State machine choreography: Plan -> BuildChunkList -> RenderChunks (Map) -> Assemble.
Retry policy: 4 attempts, 2s initial, 2× backoff, max 60s, the policy spec'd in plan §9.4. Typed non-retryable error codes from §9.3 (`FFMPEG_VERSION_MISMATCH`, `PLAN_HASH_MISMATCH`, `BROWSER_GPU_NOT_SOFTWARE`, `FONT_FETCH_FAILED`, `PLAN_TOO_LARGE`, `FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED`) are explicitly opted out of retry. The Map state's RenderChunk path picks the subset of non-retryables relevant to chunk workers.
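One way the retry policy above could look as a `Retry` block on a Task state, written as YAML. This is a sketch under two assumptions: "4 attempts" maps to `MaxAttempts: 4`, and the handler surfaces the typed codes as the Lambda error name so that `ErrorEquals` can match them. The actual template may express the opt-out differently.

```yaml
# Sketch of a Retry block for a single Task state (surrounding state omitted).
Retry:
  # Typed non-retryable errors from plan §9.3: MaxAttempts 0 means never retry.
  - ErrorEquals:
      - FFMPEG_VERSION_MISMATCH
      - PLAN_HASH_MISMATCH
      - BROWSER_GPU_NOT_SOFTWARE
      - FONT_FETCH_FAILED
      - PLAN_TOO_LARGE
      - FORMAT_NOT_SUPPORTED_IN_DISTRIBUTED
    MaxAttempts: 0
  # Everything else: 2s initial delay, doubling each time, capped at 60s.
  # States.ALL must appear alone and in the last retrier.
  - ErrorEquals:
      - States.ALL
    IntervalSeconds: 2
    MaxAttempts: 4
    BackoffRate: 2
    MaxDelaySeconds: 60
```

With these values the waits between attempts are 2s, 4s, 8s, and 16s, so the 60s cap only comes into play if `MaxAttempts` is raised.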
Lambda function points at `../../packages/aws-lambda/dist/handler.zip` via SAM's CodeUri local-path resolution. On `sam deploy`, the local ZIP is uploaded to SAM's managed staging bucket and CodeUri rewrites to that S3 URI. Override via the `HandlerZipUri` parameter for pre-uploaded ZIPs.
S3 bucket: `Retain` on stack delete; `renders/` prefix expires after 7 days (plan tarballs + chunk outputs are intermediate); user-keepable artifacts go under a different prefix. Public access fully blocked.
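A bucket with those properties might be declared roughly like this. It is a sketch, not a copy of the template: the rule `Id` is a made-up name, and other settings the real resource may carry (bucket name, versioning, tags) are left out.

```yaml
RenderBucket:
  Type: AWS::S3::Bucket
  DeletionPolicy: Retain                 # bucket survives stack deletion
  Properties:
    PublicAccessBlockConfiguration:      # all four flags set: public access fully blocked
      BlockPublicAcls: true
      BlockPublicPolicy: true
      IgnorePublicAcls: true
      RestrictPublicBuckets: true
    LifecycleConfiguration:
      Rules:
        - Id: ExpireIntermediates        # hypothetical rule name
          Status: Enabled
          Prefix: renders/               # only intermediates under this prefix expire
          ExpirationInDays: 7
```

Objects written outside `renders/` are not matched by the rule, which is what lets user-keepable artifacts live under a different prefix.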
CloudWatch alarm fires if RenderChunk invocations exceed `ChunkInvocationAlarmThreshold` per hour. The Map state's MaxConcurrency cap protects against simultaneous fan-out but not against a runaway state machine that loops; this alarm catches that.
Notable parameter knobs:
- `LambdaMemoryMb`: 10 GB default; Lambda allocates CPU proportionally
- `LambdaTimeoutSec`: 900s default (Lambda hard ceiling)
- `ReservedConcurrency`: `-1` (unreserved) by default; set to bound cost
- `ChromeSource`: must match `--source=` passed to `build-zip.ts`

What's NOT in this PR (deferred to 6.3+):
- `npx hyperframes lambda deploy` CLI (PR 6.5)

How is it tested
`sam validate --lint` against the template passes.
Sample event payloads at `examples/aws-lambda/sample-events/` cover all three handler actions and slot into `sam local invoke RenderFunction --event <path>` for local dispatch tests.
End-to-end real-AWS validation lands in PR 6.3: that workflow does `sam deploy` → start state machine execution → assert PSNR ≥ 50 dB → `sam delete` against a HeyGen test AWS account.

Test plan
- `sam validate --lint` passes.
- (`!Ref`, `!Sub`, `!GetAtt`, `!If`, `!Not`, `!Equals`, `!Sub`).
- (`Plan`, `BuildChunkList`, `RenderChunks`, `Assemble`) are reachable.
- `bunx oxlint` + `bunx oxfmt --check` clean on the directory.

🤖 Generated with Claude Code