feat(lambda): add Lambda handler, ZIP bundling, and BeginFrame probe#878
feat(lambda): add Lambda handler, ZIP bundling, and BeginFrame probe#878jrusso1020 wants to merge 1 commit into
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
| // probe so we don't add a dependency surface that could mask a | ||
| // Chrome-side issue. | ||
| const htmlPath = join(tmpdir(), `hf-beginframe-probe-${Date.now()}.html`); | ||
| await fs.writeFile(htmlPath, PROBE_HTML, "utf-8"); |
| // `du -sb` returns "<bytes>\t<path>". macOS coreutils doesn't ship `-b` — | ||
| // build-zip is CI-side anyway, where Linux coreutils is present. | ||
| try { | ||
| const stdout = execSync(`du -sb ${JSON.stringify(dir)}`, { encoding: "utf-8" }); |
5eabdf4 to
ef55431
Compare
Phase 6 of the distributed rendering plan: AWS Lambda turnkey adoption
(see DISTRIBUTED-RENDERING-PLAN.md §11 Phase 6 + §15).
This PR adds the new packages/aws-lambda/ workspace package that wraps
the OSS plan/renderChunk/assemble primitives in an AWS Lambda handler,
plus a build pipeline that bundles the handler + Chromium runtime +
ffmpeg into a deployable ZIP.
Architecture: ZIP deploy (not Docker image), Chrome via @sparticuz/chromium
with chrome-headless-shell fallback, dispatch on event.Action ∈ {plan,
renderChunk, assemble}.
The load-bearing concern — does @sparticuz/chromium's chrome-headless-shell
build honour CDP HeadlessExperimental.beginFrame? — is pinned by the new
scripts/probe-beginframe.ts regression guard. Probe boots the runtime
inside public.ecr.aws/lambda/nodejs:22, navigates to a static page, and
asserts beginFrame returns a PNG buffer. Verified locally + inside the
Docker container; both pass with hasDamage=true.
Sizes (sparticuz source): unzipped 157 MiB, zipped 99 MiB. Well under
the 240 MiB / 150 MiB in-house gates and the Lambda 250 MiB hard ceiling.
This is part of a stack of 8 PRs (3 in Phase 6a, 5 in Phase 6b); this is
PR 6.1.
ef55431 to
a1d2874
Compare
miguel-heygen
left a comment
There was a problem hiding this comment.
Review: feat(lambda) -- Lambda handler, ZIP bundling, BeginFrame probe
Read the full diff (19 files, +2648/-24). Examined all source files, tests, scripts, Dockerfile, and the plan.ts bugfix. CI notes: build/test/lint/typecheck/format all pass; regression-shard failures are Docker infra flakiness (all 8 shards fail at image setup, not test execution), unrelated to this PR's changes.
Architecture
Clean, well-structured adapter package. The handler dispatch pattern (one Lambda function, three roles via event.Action) is the right call for Step Functions ergonomics. The dependency injection via HandlerDeps keeps the handler fully testable without module-mocking gymnastics. Event unwrapping with depth limits prevents infinite loops on malformed Step Functions payloads. Exhaustiveness checks on Action and Format via never assignments catch future additions at compile time.
plan.ts fix
The .plan-work/ cleanup reorder is correct and the root cause analysis in the commit message is precise: freezePlan walked planDir while .plan-work/downloads/ still contained intermediate audio artifacts, baking transient files into the hash. Chunk workers never see those files, so their recomputed hash diverged. Moving rmSync(workDir) before freezePlan() is the right fix. The regression test (recomputePlanHashFromPlanDir(planDir) === result.planHash) is a solid structural invariant that catches the entire class of "planDir-state-at-freeze-time diverges from planDir-state-at-chunk-time" bugs.
Security and input validation
- S3 URI parsing (
parseS3Uri) validates scheme, requires both bucket and key, and throws on malformed input. Good. - Event unwrapping caps depth at 4, preventing unbounded recursion on crafted payloads.
- The
isLambdaActionguard is a strict allowlist; unknown actions are rejected before any I/O. - No user-controlled strings are interpolated into shell commands (tar/zip use the
tarnpm package orspawnSyncwith argument arrays, not string interpolation). HYPERFRAMES_LAMBDA_CHROME_PATHis validated withexistsSyncbefore use.
Test coverage
25 tests for the lambda package covering: Chrome source resolution (env var parsing, case insensitivity, fallback), S3 URI parsing edge cases, tar round-trip with stale-file cleanup, handler dispatch for all three actions + envelope unwrapping + unknown action rejection. The plan.test.ts addition is a targeted regression guard. Coverage is appropriate for thin glue code where the heavy logic lives in the producer primitives.
Minor observations (non-blocking)
-
_setSparticuzChromiumForTestsexported from the public index --packages/aws-lambda/src/index.tsre-exports the test-only_setSparticuzChromiumForTestsseam. The underscore prefix signals "internal" but it still appears in the package's public API surface. Consider excluding it from the barrel export or gating behind a@internalJSDoc + conditional export. Not blocking since this is a0.0.1package. -
build-zip.tsusesrequire()insidestageChromeHeadlessShellandwalkSize-- The file is ESM (usesimport.meta.url) but callsrequire("node:fs")in two function bodies. These work because the script itself injectscreateRequireat runtime, but it is inconsistent with the top-of-fileimport { ... } from "node:fs"pattern. The samereaddirSync/statSyncare already imported at the top. Could just use the existing imports. -
Regression shards -- All 8 regression-shard CI jobs fail at Docker image build, not at test execution. This appears to be pre-existing CI infra flakiness, not caused by this PR.
Approving. The plan.ts fix is correct and well-tested, the Lambda handler is clean thin glue with proper input validation, and the build pipeline is solid with appropriate size gates.
vanceingalls
left a comment
There was a problem hiding this comment.
Phase 6a Lambda adapter — solid bones, but ships with a regression-CI blocker, Windows-CI failure, and several real-AWS-shaped gaps. The handler shape is right (thin glue, DI seam for tests, pure-primitive boundary), chromium.ts source-selection is clean, and the plan.ts cleanup-order fix is correct. The integration is far enough along to be worth iterating, but it's not mergeable in the current state.
Calibrated strengths
packages/aws-lambda/src/handler.ts:80-101—handleris properly a thin dispatcher; theunwrapEventloop withMAX_ENVELOPE_DEPTH=4and the_exhaustive: neverswitch arm are the right shape for a Step Functions entry point.packages/aws-lambda/src/handler.ts:341-369—downloadChunkObjectsuses a pre-sized array indexed byirather than push order; assemble's chunk-ordering invariant is preserved underPromise.all. Easy to get wrong, got it right.packages/producer/src/services/distributed/plan.ts:790-808— thermSync(workDir)move + comment is correct (workDir = join(planDir, ".plan-work"),freezePlanwalksplanDirvialistPlanFilesrecursive walk → including.plan-work/before cleanup → poisonedplanHash). Reasoning in the comment matches the actual hash code infreezePlan.ts:158-181.
Findings
blocker — Dockerfile.test does not COPY packages/aws-lambda/package.json, breaking every regression shard. All 8 regression-shards are failing with bun install --frozen-lockfile: error: lockfile had changes, but lockfile is frozen at Dockerfile.test:76. Lines 64-71 of Dockerfile.test enumerate workspace package.json files individually; the new packages/aws-lambda/ workspace member is in bun.lock but its package.json never makes it into the build context, so bun sees the workspace shape diverging from the lockfile and refuses. Fix: add COPY packages/aws-lambda/package.json packages/aws-lambda/package.json to Dockerfile.test next to the other workspace COPY lines. This is the single root cause of all 8 shard failures — they all fail at the same Dockerfile step, with identical stderr. Verified by reading shard-3's log directly.
blocker — Windows tests fail on module-load of handler.test.ts (ENOENT reading "...packages\\producer\\node_modules\\hono"). chromium.test.ts (9 pass) + s3Transport.test.ts (8 pass) clear, but handler.test.ts (importing @hyperframes/producer/distributed) crashes at module load before any test runs (17 pass / 1 fail / 1 error, exit 1). The Windows runner is required (Tests on windows-latest is in the required-check set per the rollup). Either this is a real Windows incompatibility of the producer's transitive hono dep through the new aws-lambda import edge, or it's a bun-on-Windows workspace-link bug surfacing because aws-lambda is the first package to import @hyperframes/producer/distributed from a sibling workspace. Either way it has to be diagnosed and either fixed or the package opted out of the Windows test matrix with a // not supported on Windows rationale. Right now CI is red on a required check.
blocker — CodeQL javascript:js/shell-command-injected-from-environment is failing on build-zip.ts:411. directorySizeBytes() shells out via execSync(\du -sb ${JSON.stringify(dir)}`). The argument is JSON-quoted (good), but execSyncparses the whole string through/bin/sh, so dircontaining$(…)or backticks is still a shell-eval surface even with the JSON quoting around it (the JSON-quote escapes"but doesn't neutralize$()inside the quoted form —bash -c 'du -sb "$(date)"'runsdate). The trust boundary today is "internal CI dev script over dist/staging", so the practical risk is near zero — but CodeQL is a required check and the rule's own Recommendation is "use a parameterised execution form." Trivial fix: replace execSync(`du -sb ${JSON.stringify(dir)}`)withspawnSync("du", ["-sb", dir], { encoding: "utf-8" })`. Same Linux-only constraint, no shell, CodeQL clears. Apply Rule 10 — don't iterate against the predicate, follow the Recommendation.
important — the new plan.test.ts regression test does not exercise the audio path the PR body says caused the bug. The PR description states many-cuts (audio-bearing) tripped PLAN_HASH_MISMATCH because audio-mix downloaded source mp3s into .plan-work/downloads/, and the hash baked in the bytes. But the new test at plan.test.ts:212-241 uses the test file's existing FIXTURE_HTML (no <audio> elements; composition.audios.length === 0), so runAudioStage short-circuits at audioStage.ts:50 (if (audios.length > 0)) and writes nothing to workDir. After the compile-stage renameSync(compiledDir, finalCompiledDir) at plan.ts:716, .plan-work/ is essentially empty on the no-audio path. On the pre-fix code, freezePlan would walk an empty .plan-work/, see nothing in there to hash, and likely produce the same hash as the post-cleanup state. So this test would NOT have caught the original bug; it pins the contract on a path the bug didn't take. The fix is right; the test as written doesn't lock the regression. Either (a) add an audio-bearing fixture (mp3 in projectDir) so .plan-work/audio-work/ materialises, or (b) seed .plan-work/ with a sentinel file inside the test before calling plan() to simulate the bug surface — option (a) is closer to the wild-caught failure.
important — handler has zero structured logging. No console.log / console.error anywhere in handler.ts — not on event receipt, not per-action, not on error before the throw rethrows. Result payloads carry DurationMs but CloudWatch will see only stack traces on failures (from Lambda's default unhandled-error path). For ops triage on an event-driven Lambda where you can't attach a debugger, you need at minimum: a structured log line at boot with event.Action + the relevant S3 URIs, and an error log before the throw in unwrapEvent and at the catch sites in each handle* function. CloudWatch Logs Insights queries are the only triage tool on Lambda; without grep-able structured lines, ops is blind.
important — RenderChunkEvent.PlanHash is plumbed in the event type (events.ts:55) but never verified in the handler. handleRenderChunk ignores event.PlanHash entirely. The producer's renderChunk does the recompute-and-check at renderChunk.ts:404-413, so this is defense-in-depth, not a correctness hole — but the field exists in the event schema for a reason. If the intent is "trust the producer to enforce it," delete the field from RenderChunkEvent so it doesn't become a confusing contract. If the intent is "verify at the handler boundary before paying the untar cost," add the check before untarDirectory. Right now it's the worst of both: schema bloat with no enforcement.
important — build-time npm install inside staging bypasses bun's resolution + lockfile. build-zip.ts:248-275 writes a one-off package.json into dist/staging listing only puppeteer-core + (optionally) @sparticuz/chromium and runs npm install --no-package-lock inside that staging dir. Since puppeteer-core: ^24.39.1 is a caret range and --no-package-lock is set, the bundle ships whatever npm registry serves at build time, not what bun installed into the workspace. @sparticuz/chromium: 148.0.0 is exact-pinned (good), so the Chrome side is deterministic — but puppeteer-core's caret range means CI builds can ship 24.40+ while tests ran against 24.39, and Lambda binary determinism breaks across consecutive builds. Either (a) read the resolved versions from bun.lock's aws-lambda block and pin them exactly into the staging package.json, or (b) acknowledge the float in the README and add a CI gate that diffs staging/node_modules/<dep>/package.json against the workspace install. Today's behaviour is silently non-reproducible.
important — handler test uses spawnSync("tar", …) but s3Transport uses the npm tar package. handler.test.ts:332 + :359 shell out via spawnSync("tar", ["-czf", …]) to build the fixture tarballs. s3Transport.ts:113-116 uses the npm tar package (correctly — Lambda's Amazon Linux base has no tar binary in /usr/bin). The handler's own untarDirectory is what the test exercises, and that goes through npm tar, so the path-under-test is fine. But the test's fixture-building dep on the system tar binary is a flake surface — Windows ships bsdtar as tar (different invocation), bare Alpine containers don't ship tar at all. Worse, the s3Transport.ts:14 JSDoc says "We shell out to the system tar binary" — that comment is stale (the code uses the npm package). Tighten: use the npm tar package consistently in tests for fixture building, and fix the JSDoc.
nit — verify-zip-size.ts and several PR-body lines conflate MB and MiB. PR body says "245 MiB unzipped (well under the 248 MiB in-house gate, 250 MB Lambda hard cap)" — AWS's actual cap is 250 MiB (per aws lambda get-function-configuration docs), so the in-house 248 MiB is a 2 MiB margin, not a ~5 MB margin as the mixed units imply. Pick one unit (MiB) and stick with it in code, comments, and the README size table.
nit — stageChromeHeadlessShell picks "latest" Chrome cache via string sort. build-zip.ts:392-403 does readdirSync(baseDir).sort().reverse() on Chrome version directories. For 131.0.0, 131.0.1, 99.0.0, lexicographic descending picks 99.0.0 first. Today it works because Chrome versions in the cache are all 3-digit major. Will break the day Chrome ships 141.x and 99.x coexist. Use semver sort or filter for the newest by mtime.
nit — probe-beginframe.ts:65 insecure-tempfile (CodeQL). join(tmpdir(), \hf-beginframe-probe-${Date.now()}.html`)is predictable. CI-side probe, low risk, but trivial swap tomkdtempSync(join(tmpdir(), "hf-beginframe-"))` clears the alert.
nit — comment on s3Transport.ts:14-16 is wrong. Claims "We shell out to the system tar binary" — actual implementation uses the npm tar package. Misleading for the next reader.
Pre-existing reviews
GitHub Advanced Security already filed the two CodeQL inline comments (probe-beginframe.ts:65 insecure tempfile, build-zip.ts:411 shell-from-env). I've covered both above with concrete fixes; the build-zip.ts one is blocker (failing required check), the probe one is nit.
Verdict: REQUEST CHANGES
Reasoning: Three blockers — required-check regression-shards red (Dockerfile.test missing the new package COPY → all 8 shards fail at bun install --frozen-lockfile), required-check Windows tests red (handler.test.ts ENOENT on hono at module load), and required-check CodeQL red (shell-injection via execSync in build-zip.ts). The Lambda handler design and the producer fix are right; ship-ability is gated on CI + the regression-test-doesn't-cover-the-bug gap.
Review by Vai

What
Adds a new
packages/aws-lambda/workspace package — the AWS Lambdaadapter for HyperFrames distributed rendering. One Lambda handler
dispatches on
event.Action ∈ {plan, renderChunk, assemble}to thematching OSS primitive from
@hyperframes/producer/distributed, plusthe ZIP build pipeline and a BeginFrame regression-guard probe.
This is PR 6.1 in the 8-PR Phase 6 stack (see
DISTRIBUTED-RENDERING-PLAN.md§11 Phase 6 + §15). Phase 6a (PRs6.1-6.3) validates the architecture on real AWS; Phase 6b (6.4-6.8)
wraps it in the user-facing CLI + CDK construct + docs.
Why
Phase 5 (low-level CLI) was dropped because the real adopter paths are
either Temporal (HeyGen, internal) or Lambda (OSS users). For the
Lambda path to work, we need a Lambda-deployable artifact that wraps
the merged OSS distributed primitives.
The load-bearing architectural assumption is that
@sparticuz/chromium'schrome-headless-shellbuild honours CDPHeadlessExperimental.beginFramewithscreenshot: true— the onlyrendering path the engine knows about post-Phase-1 cleanup. If it
didn't, the architecture would change (fall back to bundling our own
pinned
chrome-headless-shell).The probe at
scripts/probe-beginframe.tsis a permanent regressionguard, not a one-shot verification. It runs the probe inside
public.ecr.aws/lambda/nodejs:22— the same image real Lambda uses —and asserts a PNG buffer comes back.
How
Architecture (ZIP deploy, not container image; matches Remotion Lambda's
ergonomics without copying their code):
Handler glue is intentionally thin: parse event → S3 download → call
OSS primitive (pure function over local paths) → S3 upload → return
small JSON. Result payloads sized for Step Functions history budgets
(<200 bytes per chunk per §2.4).
Chrome source is selectable at build time via
--source=sparticuz(default)or
--source=chrome-headless-shell(fallback). Handler readsHYPERFRAMES_LAMBDA_CHROME_SOURCEenv var at boot.Producer bug fix bundled
While running the eval against the PR 6.3 stack,
many-cuts(a fixturewith audio) tripped
PLAN_HASH_MISMATCH. Diagnosed locally — repros inthe producer regardless of Lambda. Root cause:
plan()cleans up<planDir>/.plan-work/AFTERfreezePlan()walks the planDir. Whenthe audio mix downloads source mp3s, they land in
.plan-work/downloads/and freezePlan hashes them. Chunk workers seethe cleaned planDir (no
.plan-work/) so their recomputed hashdiffers.
Fix: move the
.plan-work/cleanup to BEFOREfreezePlan().Added regression test in
plan.test.tsthat assertsrecomputePlanHashFromPlanDir(planDir) === result.planHashfor anyplan — catches this whole class of bug going forward.
Sizes (sparticuz source)
ZIP package shape:
tarnorunzipin/usr/bin— we use thetarnpm package overnode:zlibto handle on-the-wire.tar.gzarchives.__requireshim throws "Dynamic require ofpath" for any CJS dep that doesrequire('path')at top level (postcss does). Fixed viacreateRequirebanner in the bundle.Test plan
bun run --filter @hyperframes/aws-lambda test— 25 tests passbun test packages/producer/src/services/distributed/plan.test.ts— 16 tests pass (incl. the new planHash regression test)bunx oxlint+bunx oxfmt --check— cleanbun run --cwd packages/aws-lambda build:zip— produces a 124 MB ZIP / 245 MiB unzippedbun run --cwd packages/aws-lambda verify:zip-size— passeseval.sh) renders 4 fixtures end-to-end on Lambda; PSNR vs baseline 33-52 dB across all fixtures; speedup 1.44-1.72× at N=4 chunks for compositions ≥ 4s.🤖 Generated with Claude Code