Add Linux arm64 sandbox runtime support #19

Merged
congwang-mk merged 11 commits into multikernel:main from gokwok:arm64_support on Apr 25, 2026
gokwok (Contributor) commented Apr 22, 2026

Summary

This PR adds Linux arm64/aarch64 support for sandlock's seccomp-based sandbox
runtime.

The existing implementation assumed the x86_64 Linux syscall ABI in several
places: syscall numbers, seccomp audit architecture, raw syscall argument
registers, legacy non-*at path syscalls, struct stat layout, ptrace register
capture, and vDSO symbol/stub handling. Those assumptions prevented the runtime
and test suite from working correctly on arm64.

This series makes those paths architecture-aware while preserving the existing
x86_64 behavior.
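
As a rough illustration of what an architecture-aware ABI layer can look like, here is a minimal sketch. The type and function names (`Arch`, `SyscallAbi`, `abi_for`, `registered_syscalls`) are hypothetical and not taken from sandlock; only the numeric values come from the Linux UAPI headers. Legacy path syscalls are modelled as `Option` so that a dispatch table can skip them on arm64.

```rust
/// Architectures covered by this sketch (hypothetical type, not sandlock's).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Arch {
    X86_64,
    Aarch64,
}

/// Per-architecture values that a seccomp filter and dispatch table depend on.
#[derive(Debug, PartialEq, Eq)]
pub struct SyscallAbi {
    /// AUDIT_ARCH_* value checked at the top of a seccomp BPF program.
    pub audit_arch: u32,
    pub openat: i64,
    pub getcwd: i64,
    /// Legacy non-*at path syscalls: present on x86_64, absent on arm64.
    pub open: Option<i64>,
    pub stat: Option<i64>,
    pub access: Option<i64>,
}

pub const fn abi_for(arch: Arch) -> SyscallAbi {
    match arch {
        Arch::X86_64 => SyscallAbi {
            audit_arch: 0xC000_003E, // AUDIT_ARCH_X86_64
            openat: 257,
            getcwd: 79,
            open: Some(2),
            stat: Some(4),
            access: Some(21),
        },
        Arch::Aarch64 => SyscallAbi {
            audit_arch: 0xC000_00B7, // AUDIT_ARCH_AARCH64
            openat: 56,
            getcwd: 17,
            open: None,
            stat: None,
            access: None,
        },
    }
}

/// Syscall numbers to register for interception; optional ones are skipped
/// when the architecture does not define them.
pub fn registered_syscalls(abi: &SyscallAbi) -> Vec<i64> {
    let mut nrs = vec![abi.openat, abi.getcwd];
    nrs.extend([abi.open, abi.stat, abi.access].into_iter().flatten());
    nrs
}
```

With this shape, x86_64 registers five syscalls and arm64 registers two, without any `SYS_open`-style constant having to exist on arm64.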

What Changed

  • Add architecture-specific syscall and seccomp ABI definitions.
  • Make raw syscall helpers support both x86_64 and arm64 calling conventions.
  • Treat legacy path syscalls as optional, since Linux arm64 does not expose
    syscalls such as SYS_open/SYS_stat/SYS_access.
  • Use equivalent raw *at syscall ABIs in tests where arm64 lacks legacy path
    syscalls.
  • Fix COW filesystem handling on arm64:
    • use native libc::stat layout instead of x86_64 hand-packed stat bytes
    • normalize virtual COW paths
    • virtualize getcwd after chdir into COW-only directories
    • support getdents on upper-layer directory fds
  • Add arm64 checkpoint support through PTRACE_GETREGSET.
  • Add arm64 vDSO time syscall stubs for deterministic time handling.
  • Adjust integration and Python tests so arm64 coverage runs instead of being
    skipped for x86_64-only assumptions.
  • Clean up Python test warnings to keep warning-free test runs stable.

Compatibility

The x86_64 behavior is intended to remain unchanged. The arm64 implementation
uses architecture-specific constants and optional syscall registration rather
than hard-coding x86_64-only syscall numbers.

One important ABI difference is that Linux arm64 does not provide several
legacy non-*at path syscalls. For those cases, this PR tests the equivalent raw
*at ABI, for example openat(AT_FDCWD, ...), instead of pretending SYS_open
exists on arm64.
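
The legacy-to-*at equivalence can be sketched as a pure translation step. The types below (`LegacyCall`, `AtCall`, `to_at_form`) are hypothetical and only model the argument mapping; they do not issue syscalls and are not the test helper from this PR. `AT_FDCWD` is -100 on Linux.

```rust
/// Special dirfd meaning "resolve relative to the current working directory".
pub const AT_FDCWD: i64 = -100;

/// Legacy path syscalls that exist on x86_64 but not on arm64.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LegacyCall<'a> {
    Open { path: &'a str, flags: i32, mode: u32 },
    Stat { path: &'a str },
    Access { path: &'a str, mode: i32 },
}

/// The *at forms available on both architectures.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AtCall<'a> {
    Openat { dirfd: i64, path: &'a str, flags: i32, mode: u32 },
    Newfstatat { dirfd: i64, path: &'a str, flags: i32 },
    Faccessat { dirfd: i64, path: &'a str, mode: i32 },
}

/// Map a legacy call to the *at call with the same semantics.
pub fn to_at_form<'a>(call: &LegacyCall<'a>) -> AtCall<'a> {
    match *call {
        // open(path, flags, mode) == openat(AT_FDCWD, path, flags, mode)
        LegacyCall::Open { path, flags, mode } => {
            AtCall::Openat { dirfd: AT_FDCWD, path, flags, mode }
        }
        // stat(path, buf) == newfstatat(AT_FDCWD, path, buf, 0)
        LegacyCall::Stat { path } => {
            AtCall::Newfstatat { dirfd: AT_FDCWD, path, flags: 0 }
        }
        // access(path, mode) == faccessat(AT_FDCWD, path, mode)
        LegacyCall::Access { path, mode } => {
            AtCall::Faccessat { dirfd: AT_FDCWD, path, mode }
        }
    }
}
```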

Test Plan

Verified on arm64 (macOS Apple container / orb Linux machine):

  • cargo integration tests pass
  • Python test suite passes
  • Python test suite also passes with warnings treated as errors

The final patch series keeps the x86_64 assumptions isolated behind the new
architecture layer and keeps the existing x86_64 syscall paths intact.

gokwok added 9 commits April 22, 2026 11:04
Introduce architecture-specific syscall numbers, audit arch values, and optional legacy path syscall constants so the seccomp filters and dispatch tables build on arm64 while preserving x86_64 behavior.

Signed-off-by: gokwok <531504879@qq.com>
Make the rootfs helper compile on architectures that do not expose the legacy non-*at path syscall ABI.

Signed-off-by: gokwok <531504879@qq.com>
Allow tests to reflect arm64 syscall availability and vDSO symbol naming without changing x86_64 expectations.

Signed-off-by: gokwok <531504879@qq.com>
Use libc flag values, architecture-specific memfd syscall numbers, page-safe child string reads, and host-path checks needed for arm64 runtime behavior.

Signed-off-by: gokwok <531504879@qq.com>
Adjust integration fixtures and Python tests so arm64 runs the supported coverage instead of inheriting x86_64-only assumptions.

Signed-off-by: gokwok <531504879@qq.com>
Capture arm64 registers through PTRACE_GETREGSET, patch arm64 vDSO time helpers, and stabilize deterministic getdents caching for parity tests.

Signed-off-by: gokwok <531504879@qq.com>
Use native stat layouts, normalize virtual COW paths, virtualize getcwd after COW-only chdir, and merge directory reads for upper-layer fds.

Signed-off-by: gokwok <531504879@qq.com>
Run the chroot raw path syscall coverage on arm64 by mapping legacy helper commands to equivalent *at syscalls where the legacy ABI is absent.

Signed-off-by: gokwok <531504879@qq.com>
Replace deprecated asyncio loop usage, close test files explicitly, and avoid an extra shell fork in the gather pipeline test.

Signed-off-by: gokwok <531504879@qq.com>
congwang-mk (Contributor) commented Apr 22, 2026

Thanks for the PR!

CowState.virtual_cwds is keyed by pid and entries are never removed. Scope is bounded to a single Sandbox::run() (state is dropped when the sandbox exits), so this isn't a long-lived leak — but within one run it's a latent correctness bug:

  1. Child pid X chdirs into a COW-only dir → virtual_cwds[X] = "/workdir/newdir".
  2. Child X exits. The supervisor has no pid-exit hook that removes the entry (handle_wait at resource.rs:71 only decrements proc_count; it doesn't see which pid was reaped).
  3. Kernel reuses pid X for a new child that never chdir'd.
  4. That child's getcwd / path-resolving COW handlers now return the stale virtual cwd from step 1.

Low probability (short-lived sandboxes, fresh PID namespace, would need to wrap the PID space), but the failure mode is silent and hard to debug if it ever hits. CowState.dir_cache has the same shape and is only cleaned on fd reuse, not pid exit — so a fix should probably cover both.

BTW, the Rust tests job failed on this PR: https://github.com/multikernel/sandlock/actions/runs/24758554033/job/72484409985 — please take a look and get it green before this merges.

gokwok added 2 commits April 22, 2026 18:50
Signed-off-by: gokwok <531504879@qq.com>
Signed-off-by: gokwok <531504879@qq.com>
gokwok (Contributor, Author) commented Apr 22, 2026

Thanks for the detailed review!

I've pushed the fixes to this PR branch.

I fixed both issues in the latest update:

  1. The stale per-pid COW state issue is addressed by keying COW per-process state with a stable pid identity instead of the raw pid alone. The key now includes both pid and the process start time from /proc//stat. This covers both CowState.virtual_cwds and CowState.dir_cache, and stale entries for a reused numeric pid are pruned when the new pid identity is observed.

  2. The Rust test failure was caused by COW path resolution not mapping dirfd targets that already point into the COW upper layer back to the logical workdir path. I added that upper-to-workdir mapping and extended the COW chdir test to cover mkdirat through a directory fd.
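
For readers unfamiliar with the pid-reuse fix described in item 1, here is a minimal sketch of the idea under stated assumptions. `PidKey`, `parse_start_time`, and `VirtualCwds` are hypothetical names, not sandlock's actual types. The start time is taken from field 22 (`starttime`) of the procfs per-process stat file as documented in proc(5), and entries for the same numeric pid with a different start time are pruned when a new identity is seen.

```rust
use std::collections::HashMap;

/// Process identity that survives numeric pid reuse (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PidKey {
    pub pid: i32,
    pub start_time: u64,
}

/// Extract `starttime` (field 22) from the contents of a procfs stat line.
pub fn parse_start_time(stat: &str) -> Option<u64> {
    // Field 2 (comm) may contain spaces and parentheses, so split after the
    // last ')' rather than naively on whitespace.
    let rest = &stat[stat.rfind(')')? + 1..];
    // `rest` starts at field 3, so field 22 is at index 19.
    rest.split_whitespace().nth(19)?.parse().ok()
}

/// Build the key for a live process by reading its procfs stat file.
pub fn pid_key(pid: i32) -> Option<PidKey> {
    let stat = std::fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    Some(PidKey { pid, start_time: parse_start_time(&stat)? })
}

/// Per-process virtual cwd map keyed by the stable identity.
#[derive(Default)]
pub struct VirtualCwds {
    map: HashMap<PidKey, String>,
}

impl VirtualCwds {
    /// Drop entries that share the numeric pid but belong to an older process.
    fn prune_stale(&mut self, key: PidKey) {
        self.map.retain(|k, _| k.pid != key.pid || *k == key);
    }

    pub fn set(&mut self, key: PidKey, cwd: String) {
        self.prune_stale(key);
        self.map.insert(key, cwd);
    }

    pub fn get(&mut self, key: PidKey) -> Option<&str> {
        self.prune_stale(key);
        self.map.get(&key).map(String::as_str)
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }
}
```

With this keying, a new child that reuses pid X gets a different `PidKey`, so a lookup returns `None` instead of the stale virtual cwd, and the old entry is removed as a side effect.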

congwang-mk merged commit 25afe7f into multikernel:main on Apr 25, 2026
4 checks passed