Skip to content

fix(dind): resolve sibling bind sources via /proc/<runner-pid>/root#83

Merged
luthermonson merged 2 commits into
mainfrom
fix/dind-bind-mergedview
May 28, 2026
Merged

fix(dind): resolve sibling bind sources via /proc/<runner-pid>/root#83
luthermonson merged 2 commits into
mainfrom
fix/dind-bind-mergedview

Conversation

@luthermonson
Copy link
Copy Markdown
Contributor

Follow-up to #82

After #82 merged, ephpm workflows that use container: got past the
first sh script (/__w/_temp was correctly resolved into the runner's
upperdir) but failed on actions/checkout:

```
exec failed: exec: "/__e/node20/bin/node": stat /__e/node20/bin/node: no such file or directory
```

Root cause

The per-layer resolver in #82 returned the first lowerdir that had
the requested path. The GHA actions/runner image has
/home/runner/externals/ as a directory entry in one layer but
node20/bin/node deep inside it in a different (later) layer. The
walk picked the layer with the empty dir entry, bound an empty tree
into the sibling, and downstream exec couldn't find node20.

Fix

Resolve sources through /proc/<runner-task-pid>/root/X instead.
procfs's magic root link walks the runner's mount namespace, so dind
hands containerd the same merged overlay view the runner sees from the
inside. No layer ordering to get wrong — the kernel does the merge.

The per-layer walk is kept as a fallback for unit tests that don't have
a live PID (the e2e test sets PID=0 explicitly so it exercises the
snapshot-mount path against real overlayfs).

Wiring

  • pkg/dind/dind.go: new runnerTaskPID uint32 field on Server.
    SetRunnerRootfs signature grows by one uint32 argument.
  • pkg/dind/bindtranslate.go::translateBindSource: new resolution
    rule — when runnerTaskPID > 0, use /proc/<pid>/root/<src>.
    Returned rw (writes copy-up into the runner's own upperdir).
  • pkg/dind/containers.go::buildBindMounts: reads runnerTaskPID
    alongside the existing snapshot/bind state.
  • pkg/runtime/runtime.go: SetRunnerRootfs call moves from
    immediately-after-NewContainer to immediately-after-NewTask so
    task.Pid() is available. Sits next to SetRunnerNetNS now.

Tests

pkg/dind/bindtranslate_test.go:

  • TestTranslateBindSource_TaskPIDResolvesViaProcRoot — uses the test
    binary's own PID (os.Getpid()), plants a marker in t.TempDir(),
    asserts translation returns /proc/<pid>/root/<marker> and the file
    round-trips through that path.
  • TestTranslateBindSource_TaskPIDRejectsMissingSource — loud-failure
    regression: PID > 0 + missing source must error, not fall through to
    the layer walk.
  • Both Linux-only (procfs semantics).
  • All 9 pre-existing pure-function tests updated for the new signature
    (PID=0, exercise the snapshot-walk fallback unchanged).
  • 3 buildBindMounts integration tests unchanged (still on
    PID=0/snapshot-walk).

pkg/dind/bindtranslate_e2e_test.go:

  • Both e2e tests pass PID=0 so they exercise the snapshot walk against
    a real overlayfs snapshot (the production proc-path is exercised by
    the in-VM workflow run, not by a hermetic e2e — a live runner task is
    needed for /proc to be meaningful).

Docs

docs/arch/dind-bind-translation.md updated with the new resolution
order, a "why PID resolution beats per-layer walk" subsection
documenting the externals/node20 failure mode, and the security
envelope notes carried forward.

Test plan

  • CI green.
  • Build Windows binary, redeploy ephemerd service on host.
  • Re-run a failing ephpm workflow using container:. Expect
    actions/checkout to find node20 and the rest of the workflow
    to proceed.

The first cut of bind translation walked the runner snapshot's upperdir
then each lowerdir in order, returning the first match. That broke for
paths whose contents span multiple image layers — the GHA actions-runner
image has /home/runner/externals/ as a dir entry in layer 4 but the
actual node20/bin/node binary lives in layer 22. The walk picked layer
4, bound an empty tree, and actions/checkout failed downstream with
'exec: "/__e/node20/bin/node": no such file or directory'.

Resolve sources through /proc/<runner-task-pid>/root instead. procfs's
magic root link walks the runner's mount namespace, so the dind shim
hands containerd the same merged overlay view the runner sees from the
inside. The per-layer walk is kept as a fallback for unit tests that
don't have a live PID.

The runtime now captures task.Pid() after task.NewTask and passes it
into SetRunnerRootfs alongside the existing snapshot key + bind table,
right next to the SetRunnerNetNS call.
…d_test

portforward_test.go already declares an itoa helper. golangci-lint's
typecheck refuses the redeclaration in the same package. Inline
strconv.Itoa at the one call site instead — one fewer indirection,
no collision.
@luthermonson luthermonson merged commit 80e9dba into main May 28, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant