This document is the canonical technical reference for the runtime. It maps the source tree, then walks each subsystem from guest entry to syscall return: memory layout, the EL1 shim and HVC protocol, page-table management, syscall translation, threads, fork/clone, signals, ptrace, dynamic linking, and the GDB stub.
It is aimed at contributors. For a high-level overview see ../README.md; for command-line use see usage.md; for the validation flow see testing.md.
elfuse runs one Linux guest process inside one Hypervisor.framework VM owned
by one macOS host process. Guest code executes at EL0. A small EL1 shim
(src/core/shim.S) handles exceptions, provides the syscall trap path, and
cooperates with the host through a compact HVC protocol.
aarch64 guests execute their own instructions natively on the CPU.
x86_64 guests (both static and dynamic) execute through Apple's
Rosetta translator hosted inside the same VM at link address
0x800000000000 (see x86_64-via-Apple-Rosetta).
Both architectures share the same EL1 shim, syscall surface, and
host-side handlers.
The whole runtime fits into a load -> boot -> run -> translate -> return loop:
- Load.
src/main.cparses options;src/core/elf.cparses the guest ELF and anyPT_INTERP;src/core/guest.creserves a demand-paged guest address space (up to 1 TiB IPA on M3+); the initial Linux stack is built bysrc/core/stack.c. - Boot.
src/core/bootstrap.cinstalls the EL1 shim (src/core/shim.S) in the runtime infrastructure reserve, seeds page tables, and enters EL0. For x86_64 guests, Rosetta is loaded alongside as a co-resident aarch64 binary at0x800000000000. - Run. The guest executes natively on an HVF vCPU. Pure computation never leaves the guest.
- Translate. A guest
SVC #0traps into the shim and is forwarded overHVC #5to the host.src/syscall/syscall.cdispatches into focused domain handlers (mem.c,fs.c,net.c,signal.c, ...) that translate errno values, flag layouts, struct shapes, and socket address formats between Linux and macOS. - Return. The host writes the result back into the vCPU, signals any
required TLBI (see Dynamic Page-Table Extension And
TLBI), and the shim
ERETs back to EL0.
Threads enter the same loop on their own vCPU
(src/runtime/thread.c). fork clones the loop in a new
posix_spawn-ed elfuse process and transfers state through IPC
(src/runtime/forkipc.c). execve reloads the ELF inside the
existing VM and restarts the loop (src/syscall/exec.c).
Boot sequence:
src/main.cparses options and prepares guest bootstrap state.src/core/elf.cparses the ELF image and anyPT_INTERPinterpreter.src/core/stack.cbuilds a Linux-style initial stack (argc,argv,envp,auxv).src/core/guest.creserves guest memory, page tables, and semantic regions.src/core/shim.Senters guest EL0 and forwards traps through HVC exits.src/syscall/syscall.cdispatches Linux syscalls into domain handlers.src/syscall/proc.cowns the main vCPU run loop and stop/exit integration.
Top-level areas:
src/core/: guest memory, ELF loading, bootstrap, initial stack, vDSO, EL1 shim assembly, and the embedded Rosetta translator hostsrc/syscall/: Linux syscall handlers and translation boundariessrc/runtime/: thread table, futexes, procfs helpers, fork/clone IPC, argv/comm rewriting forprctl PR_SET_NAMEsrc/debug/: crash reporting and built-in GDB RSP stub
Key files:
| File | Role |
|---|---|
src/main.c |
CLI, bootstrap, first vCPU creation, debugger startup |
src/core/guest.c |
guest memory reservation, region tracking, page tables |
src/core/elf.c |
ELF parsing, PT_LOAD, PT_INTERP, loader decisions |
src/core/stack.c |
Linux initial stack and auxv construction |
src/core/shim.S |
EL1 shim, exception vectors, HVC protocol |
src/core/shim-globals.c |
EL1-only shim_data cache (identity slots, urandom ring, attention bits) |
src/core/vdso.c |
synthetic vDSO (CNTVCT clock_gettime fast path, SVC trampolines) |
src/core/rosetta.c |
x86_64-via-Rosetta translator host, AOT cache, kbuf alias |
src/core/sysroot.c |
--sysroot / --create-sysroot provisioning |
src/syscall/syscall.c |
syscall dispatch and shared wrapper helpers |
src/syscall/mem.c |
brk, mmap, mprotect, mremap, madvise, msync |
src/syscall/fs.c, fs-stat.c, fs-xattr.c |
filesystem syscalls |
src/syscall/io.c, poll.c, fd.c, fdtable.c |
I/O, polling, FD lifecycle and table |
src/syscall/path.c |
centralized guest-to-host path resolution |
src/syscall/sidecar.c |
case-fold sidecar tokens for case-insensitive macOS volumes |
src/syscall/fuse.c |
guest-internal FUSE transport and minimal VFS |
src/syscall/inotify.c |
inotify via kqueue EVFILT_VNODE |
src/syscall/sysvipc.c |
System V shared memory and semaphores |
src/syscall/signal.c |
signal delivery, rt_sigframe, rt_sigaction |
src/syscall/time.c |
clocks, timers, setitimer, clock-ID translation |
src/syscall/sys.c |
uname, sysinfo, getrandom, prlimit64 |
src/syscall/net.c, net-abi.c, net-absock.c, net-msg.c, net-sockopt.c, netlink.c |
sockets, SCM_RIGHTS, abstract Unix sockets, netlink |
src/syscall/translate.c |
errno and shared AT_* flag translation |
src/syscall/proc.c |
vCPU run loop, wait4, ptrace coordination, HVC #6 routing |
src/syscall/exec.c |
execve: ELF reload, interpreter resolve, vCPU restart |
src/runtime/forkipc.c, fork-state.c |
fork/clone state transfer over the fork IPC channel |
src/runtime/thread.c, futex.c |
guest thread table, futex wait queues |
src/runtime/procemu.c |
/proc, /dev, and selected pseudo-files |
src/runtime/proctitle.c |
argv / comm rewriting for prctl PR_SET_NAME |
src/debug/gdbstub.c, gdbstub-rsp.c, gdbstub-reg.c |
GDB RSP stub |
Apple HVF imposes a handful of constraints that shape the rest of the design:
- W^X is enforced even with
SCTLR.WXN=0. A page-table entry cannot be both writable and executable. Use RW for data and RX for code. - HVF returns
SCTLR=0x0by default;RES1bits must be set explicitly. The shim usesSCTLR_RES1(0x30D00980) plus the desired bits. - The MMU must be enabled from inside the vCPU (via HVC #4 from the shim),
not by the host before
hv_vcpu_run(). SettingSCTLR.M=1from the host before vCPU entry causes permission faults on the first instruction fetch. GUEST_IPA_BASEmust be0. ELF binaries use absolute addresses from their link address (e.g.,0x400000); a non-zero IPA base produces translation faults.- System registers cannot be set via
MSRfrom the guest becauseHCR_EL2.TSC=1traps allMSRwrites. Boot-time sysreg installation (RES1 bits, MMU enable, TTBR0, etc.) goes through HVC #4 from the EL1 shim. Runtime EL0 sysreg traps --MSR TPIDR_EL0and similar -- are handled by the HVC #12 system-instruction trap path. - Only
HV_SYS_REG_*constants from Hypervisor.framework may be used for register IDs. - Cross-thread vCPU register access is unreliable; all register access must happen on the owning thread. This drives the snapshot protocol used by both ptrace and the GDB stub.
- HVF allows only one VM per host process. This is the reason
forkis implemented throughposix_spawnplus IPC state transfer.
Guest memory is identity-mapped (guest VA == guest IPA). Large areas use 2 MiB block descriptors by default; mappings that need mixed permissions are split to 4 KiB L3 pages (see Page Table Splitting).
Low, fixed addresses:
0x00400000 - varies: ELF LOAD segments (PIE_LOAD_BASE for ET_DYN)
0x01000000: brk base (16 MiB)
0x07800000 - 0x07800FFF: Stack guard page (PROT_NONE, dynamic position)
0x07801000 - 0x07FFFFFF: Stack (8 MiB, 4 x 2 MiB blocks, RW, grows down)
0x10000000 - 0x101FFFFF: mmap RX region (initial 2 MiB, pre-mapped RX)
0x10200000 - mmap_limit: mmap RX growth area
0x000200000000 - 0x0002001FFFFF: mmap RW region (initial 2 MiB at 8 GiB, RW)
0x000200200000 - mmap_limit: mmap RW growth area
Within-32-bit values are rendered with 8-digit padding; values that extend above 32 bits use 12-digit padding so a reader can tell the range class at a glance.
High addresses, anchored to interp_base (computed from guest_size, see
below). The runtime infrastructure reserve is a 16 MiB region placed at
[interp_base - INFRA_RESERVE, interp_base) -- in the dead zone above
mmap_limit -- so guest binaries keep the low addresses their link scripts
expect. It is present regardless of --sysroot; offsets are relative to its
base infra = interp_base - 16 MiB (exact values in src/core/guest.h):
infra + 0x000000 - 0x00FFFF: null guard (64 KiB, unmapped)
infra + 0x010000 - 0xDF5FFF: page table pool (~13.9 MiB, RW)
infra + 0xDF6000 - 0xDFFFFF: shim code slot (40 KiB, RX). Shares the PT
pool's tail 2 MiB L2 block, so that block
splits to 4 KiB L3 pages (mixed RX/RW).
infra + 0xE00000 - 0xFFFFFF: shim data + EL1 stack (2 MiB L2 block, RW;
ends at interp_base)
interp_base - varies: Dynamic linker (g->interp_base, --sysroot only)
The reserve is demand-paged (MAP_ANON), so its unused page-table-pool pages
cost no host RAM despite the generous virtual reservation. The pool holds
~3558 L3 pages (~7 GiB of split address space), enough for the many V8
isolates a Node worker_threads pool or cluster spins up.
The guest size is determined by the VM's configured IPA width (capped at 40-bit / 1 TiB):
- 36-bit IPA (64 GiB) -- native AArch64 on Apple M2:
mmap_limit ≈ 56 GiB,interp_base ≈ 60 GiB - 40-bit IPA (1 TiB) -- native AArch64 on Apple M3 and later:
mmap_limit ≈ 1016 GiB,interp_base ≈ 1020 GiB
Both mmap_limit and interp_base are computed at runtime from guest_size
and stored in guest_t. macOS demand-pages physical memory on first touch,
so the reservation costs no RAM until the guest writes to a page.
The mmap RW region starts at 8 GiB to match real Linux kernel address-space
layout, where mmap allocations sit well above text/data/brk.
For address spaces larger than 512 GiB, the L0 page table needs multiple entries (each covers 512 GiB). The page-table builders compute the L0 index from the actual IPA and allocate L1 tables on demand per L0 slot.
A 2 MiB L2 block descriptor cannot mix RW and RX permissions, but real shared
libraries combine .text (RX) and .data (RW) within a single 2 MiB range.
guest_split_block() in src/core/guest.c converts an L2 block descriptor
into a table descriptor pointing to an L3 table of 512 × 4 KiB page entries,
each with independent permissions.
Splitting is triggered by:
sys_mmapwithMAP_FIXED, when the fixed address lands in a block whose permissions differ from the request (typical case: the dynamic linker overlaying.dataRW onto a library.textRX block).sys_mprotect, when changing permissions for a sub-block range, e.g. RELRO finalization.
guest_update_perms() orchestrates the full workflow: it checks whether a
block needs splitting, splits it if so, then updates the affected L3 page
entries. Whole-block permission changes are done in place without splitting.
mmap itself uses a gap-finding allocator that walks the sorted region array
to find free address space. PROT_EXEC requests go to the RX region
(MMAP_RX_BASE = 0x10000000); other requests go to the RW region
(MMAP_BASE = 0x200000000). Address hints are honored when possible. This
arrangement makes .text and .data land in different 2 MiB blocks where
practical, and L3 splitting handles the residual cases where they share a
block.
When sys_mmap or sys_brk needs memory beyond the currently mapped page
tables, the host calls guest_extend_page_tables() to add new L2 entries.
This is safe because the vCPU is paused while servicing HVC #5. The host
accumulates the smallest sufficient TLBI request into a per-vCPU
_Thread_local cpu_tlbi_req slot (see src/core/guest.h) and translates
it into X8, X9, X10, and X11 on return from HVC #5:
X8 == 0 TLBI_NONE no flush; restore GPRs (keep X0); ERET
X8 == 1 TLBI_BROADCAST TLBI VMALLE1IS + DSB ISH + ISB
-> restore GPRs (keep X0); ERET
X8 == 2 drop-frame discard the saved GPR frame
(`add sp, sp, #256`) and ERET on the rebuilt
EL0 register state. Set by `execve` and
`rt_sigreturn` (which write the whole frame
directly into the vCPU) and by
`signal_deliver()` on the syscall-return
path (so handler PC/SP/LR/args installed by
the host are not overwritten by the stale
shim frame on ERET). `execve` additionally
issues `IC IALLU` because the new program
text may live in pages that previously held
the old text.
X8 == 3 TLBI_RANGE loop TLBI VAE1IS over `X9` (start VA),
`X10` (page count); 4 KiB granule. Used for
up to `TLBI_SELECTIVE_MAX_PAGES = 16` pages.
X8 == 4 TLBI_RANGE_LARGE single-shot TLBI RVAE1IS with the encoded
operand in `X9`. Used for 17..64 pages when
`FEAT_TLBIRANGE` is available
(`g_tlbi_range_supported`). Above 64 pages,
or when the feature is absent, the host
upgrades to broadcast (`X8 == 1`).
X11 icache hint Set to `1` when the request transitions a
page to executable (W^X swap); the shim then
issues an `IC` invalidate alongside the
chosen TLBI flavor.
TLBI VAE1IS retires any 2 MiB block entry containing the VA
(ARM ARM B2.2.5.6), so guest_split_block() callers no longer issue a
separate broadcast after the split lands.
X8 == 2 is the generic drop-saved-frame marker: the host has
rebuilt EL0 register state directly into the vCPU and the saved
syscall frame on the EL1 stack is stale, so the shim drops the frame
and ERETs without restoring GPRs. Three call sites use it:
sys_execve(src/syscall/exec.c:785, 1093) after the ELF reload.signal_rt_sigreturn(src/syscall/signal.c:1710) after restoring the saved sigframe.signal_deliver(src/syscall/signal.c:1594) when a signal is delivered on the syscall-return path; without the marker the shim would overwrite the handler PC, SP, LR, and arg-register state with the stale syscall frame onERET.
X8 (the syscall-number register) and X9/X10 are already considered
clobbered by the Linux syscall ABI, so callers never expect them to be
preserved across SVC.
Important: the first two paths (sys_execve and
signal_rt_sigreturn) return SYSCALL_EXEC_HAPPENED to bypass the
normal syscall dispatch epilogue. signal_deliver runs from inside
the epilogue. Any future code path that rebuilds EL0 register state
on the syscall-return path must write X8 = 2 the same way.
Vectors enter EL1 from EL0 traps and forward them to the host through HVC.
DC ZVA is emulated inside the shim (it zeroes 64 bytes at the cache-line-
aligned address from the Rt register); HVF traps DC ZVA via HCR_EL2.TDZ=1.
| HVC # | Purpose | Registers |
|---|---|---|
| #0 | Normal exit | X0 = exit code |
| #2 | Bad exception | X0=ESR, X1=FAR, X2=ELR, X3=SPSR, X5=vector |
| #4 | Set boot system register | X0 = reg ID (0–8), X1 = value (used by the shim during boot to install RES1 bits and enable the MMU) |
| #5 | Syscall forward | X0–X5 = args, X8 = syscall number on entry; on return X8 carries the TLBI kind (0 = none, 1 = broadcast, 3 = selective range with X9 = VA + X10 = page count, 4 = single-shot TLBI RVAE1IS with encoded operand in X9). X8 = 2 is the generic drop-saved-frame marker -- set when the host has rebuilt EL0 state directly (by execve, rt_sigreturn, and signal_deliver() on the syscall-return path) so the shim discards the saved syscall frame on ERET. X11 is the icache-flush hint (set to 1 when the request transitions a page to executable, so the shim issues IC alongside the chosen TLBI) |
| #6 | Embedder extension | X8 = call number, X0–X7 = args; routed to g->hvc6_handler if set, no-op otherwise. Handler may request a vCPU yield via proc_request_hvc6_yield() |
| #7 | MRS trap (read sysreg) | host reads register from ESR ISS; returns value in X0 |
| #9 | W^X toggle | X0 = FAR, X1 = type (0 = exec→RX, 1 = write→RW) |
| #10 | BRK from EL0 | SIGTRAP delivery / ptrace-stop; GPRs in frame |
| #11 | EL0 fault | SIGSEGV/SIGILL delivery; GPRs in frame |
| #12 | EL0 system-instruction trap | cache maintenance logging (DC CVAU, IC IVAU, …) and MSR TPIDR_EL0 emulation |
Vector entry stubs that lead to svc_handler MUST NOT clobber any GPR. The
Linux syscall ABI preserves all registers except X0 across SVC #0, and
musl/glibc rely on this for scratch registers (X9–X15). If a vector entry
writes to any GPR (e.g., mov x5, #offset) before svc_handler saves
registers, the saved value is wrong and the EL0 caller's register state is
corrupted after ERET.
Only bad_exception vectors may clobber X5 (they halt, so preservation is
unnecessary).
Linux user-space compatibility comes from explicit translation at the syscall
boundary. src/syscall/syscall.c routes syscall numbers into focused domain
files. The translation layer is responsible for:
- errno translation
- flag translation
- Linux/macOS structure layout adaptation
- guest memory copying for pointer arguments
- Linux-compatible descriptor and signal semantics
macOS and Linux errno values diverge starting around 35. linux_errno() in
src/syscall/translate.c maps via switch. Notable mappings:
| Linux | macOS |
|---|---|
EAGAIN (11) |
EAGAIN (35) |
ENOSYS (38) |
ENOSYS (78) |
ENAMETOOLONG (36) |
ENAMETOOLONG (63) |
ELOOP (40) |
ELOOP (62) |
AT_* flag bits differ between Linux and macOS:
| Flag | Linux | macOS |
|---|---|---|
AT_SYMLINK_NOFOLLOW |
0x100 |
0x20 |
AT_SYMLINK_FOLLOW |
0x400 |
0x40 |
AT_REMOVEDIR |
0x200 |
0x80 |
Most AT_* flags go through translate_at_flags() in
src/syscall/translate.c before issuing macOS calls. faccessat is a
special case: Linux defines AT_EACCESS at the same bit (0x200) as
AT_REMOVEDIR, so faccessat paths instead use
translate_faccessat_flags() (also in src/syscall/translate.c), which
interprets that bit as AT_EACCESS.
Aarch64-Linux open flags differ from x86_64. From asm-generic/fcntl.h:
| Flag | Value (octal) | Value (hex) |
|---|---|---|
O_DIRECTORY |
040000 |
0x4000 |
O_NOFOLLOW |
0100000 |
0x8000 |
O_DIRECT |
0200000 |
0x10000 |
O_LARGEFILE |
0400000 |
0x20000 (no-op on LP64) |
O_CLOEXEC |
02000000 |
0x80000 |
Linux CLOCK_MONOTONIC = 1, macOS CLOCK_MONOTONIC = 6. See
translate_clockid() in src/syscall/time.c. Other clock IDs are
translated similarly.
Socket syscalls are translated in src/syscall/net.c and friends:
AF_INET6differs: Linux10, macOS30.sockaddrhas nosa_lenbyte on Linux but does on macOS. All conversions go throughlinux_to_mac_sockaddr()andmac_to_linux_sockaddr().- Linux ORs
SOCK_NONBLOCK(0x800) andSOCK_CLOEXEC(0x80000) into the type argument; both bits must be extracted before callingsocket(). SOL_SOCKEToption numbers (SO_TYPE,SO_SNDBUF,SO_RCVBUF, …) differ between platforms and are remapped per option.
The Linux initial stack must have SP 16-byte aligned and pointing directly
at argc. Total 8-byte words on the structured area are
35 + extra + argc + envc, where:
35covers the fixed scaffolding: 15 base auxv entries (30words) + theAT_NULLauxv terminator (2words) + theenvpNULLterminator (1) + theargvNULLterminator (1) +argcitself (1) =35.extrastarts at4becauseAT_EXECFNandAT_BASEare always emitted (+2words each). It adds another2forAT_SYSINFO_EHDRwhen a vDSO is present, and another2forAT_EXECFDwhenbinfmt_miscpasses one.argcandenvcare the user-provided argument and environment counts.
If that total is odd, one padding word is pushed before auxv. Padding
goes above the structured area, never below. Post-push masking
(sp &= ~15) breaks because it inserts a gap between SP and argc. See
build_linux_stack() in src/core/stack.c for the full layout.
Aligned file-backed MAP_SHARED (fixed or non-fixed) installs a real
host mmap(MAP_FIXED|MAP_SHARED, fd) overlay onto the guest slab so
the kernel page cache keeps the mapping coherent with the file (and
with peer overlays). The slab is tracked as a sorted list of
2 MiB-aligned hvf_segment_t entries; each overlay request splits,
unmaps, and re-mmaps the host file at the exact host VA, then
re-maps the segment. Apple Silicon enforces 16 KiB host pages, so the
gap-finder advances to the next host-page boundary after each
allocation.
The snapshot pread emulation (zero first for pages beyond EOF, then
overlay file content) is the fallback for the cases where the overlay
path cannot be used: misaligned MAP_FIXED, MAP_PRIVATE file-backed
mappings, and any time the host slab cannot accept a fresh overlay at
the requested VA.
MAP_SHARED|MAP_ANONYMOUS is promoted before fork to a memfd-style
overlay (mmap_fork_prepare_anon_shared) and reattached in the child
via SCM_RIGHTS (mmap_fork_restore_overlays), so cross-fork shared
anonymous memory stays coherent. Both promotion and overlay are
disabled for Rosetta because HVF caches host VA-to-PA at hv_vm_map
time. Validation: tests/test-msync.c,
tests/test-cross-fork-mapshared.c.
Guest threads map 1:1 to host pthreads. Each guest thread owns one vCPU and
shares the same guest address space. HVF supports multiple vCPUs per VM,
each bound to the host thread that created it; multiple vCPUs share guest
physical memory via hv_vm_map(). Up to MAX_THREADS = 64 guest threads
are supported per VM.
src/runtime/thread.c and thread.h:
thread_entry_tper thread holds the vCPU handle, host pthread, per-thread signal mask,clear_child_tid(forCLONE_CHILD_CLEARTID), and the thread'sSP_EL1exception stack._Thread_local current_threadgives O(1) access from syscall handlers.- Each thread receives a 4 KiB EL1 exception stack carved out of the shim data region.
src/runtime/futex.c implements:
- The classic ops:
FUTEX_WAIT,FUTEX_WAKE,FUTEX_WAIT_BITSET,FUTEX_WAKE_BITSET,FUTEX_REQUEUE,FUTEX_CMP_REQUEUE,FUTEX_WAKE_OP. - A subset of priority-inheritance ops:
FUTEX_LOCK_PI,FUTEX_UNLOCK_PI,FUTEX_TRYLOCK_PI. Priority semantics are not actually inherited -- the ops behave as ordinary mutex acquire/release, which is enough for glibc and musl to make forward progress. futex_waitv(syscall 449) for batch waits across up to 128 futex addresses.- Robust-list cleanup:
set_robust_list/get_robust_listare wired up insrc/syscall/syscall.c, androbust_list_walk()releases owned futexes on thread exit, marking themFUTEX_OWNER_DIED.
Wait queues live in a hash table keyed by guest virtual address, with a
per-bucket mutex. Each waiter has its own condition variable for precise
wakeup. Thread exit calls futex_wake_one() on clear_child_tid, which is
how pthread_join() waits via the TID address.
Atomicity: the bucket mutex is held across the futex word read and the waiter enqueue, so the compare-and-wait is a single critical section.
| Resource | Lock | File |
|---|---|---|
mmap/brk allocators + page tables |
mmap_lock (order 1) |
src/syscall/mem.c |
| FD table | fd_lock (order 3) |
src/syscall/fdtable.c |
| Special FDs (timerfd, eventfd, signalfd) | sfd_lock (order 5a) |
src/syscall/fd.c |
| Thread table | pthread_mutex |
src/runtime/thread.c |
| Futex wait queues | pthread_mutex (per bucket) |
src/runtime/futex.c |
| FUSE (sessions, file/dir state) | global fuse_lock + per-session session->lock |
src/syscall/fuse.c |
| Sysroot snapshot | pthread_mutex |
src/syscall/proc-state.c |
Lock ordering is documented inline in those files
(mmap_lock is order 1, fd_lock is order 3, sfd_lock is order 5a)
so callers that need multiple locks acquire them in the right sequence.
See the lock-ordering comment block in src/syscall/internal.h for the
authoritative list.
Page-table consistency is preserved by the mmap_lock plus TLB broadcasts
via TLBI VMALLE1IS from any vCPU; hardware coherency is verified by
tests/test-multi-vcpu.
exit_group sets a global exit_group_requested flag, calls
thread_for_each(thread_force_exit_cb) which invokes hv_vcpus_exit() on
all worker vCPUs to break them out of hv_vcpu_run(), and joins worker
threads with a timeout so CLEARTID cleanup can complete.
- True priority inheritance for the PI futex ops. The ops are wired up but behave as ordinary mutexes; this matches what glibc and musl need for forward progress, not real RT scheduling.
- CPU affinity (
sched_setaffinity) returns the all-CPUs mask.
macOS HVF allows only one VM per process, so process-style fork cannot
clone the live VM. elfuse implements it through posix_spawn plus IPC
state transfer:
- Parent creates a
socketpair(AF_UNIX, SOCK_STREAM). - Parent
posix_spawns a newelfuse --fork-child <fd>process. - Parent serializes VM state over IPC (see paths below).
- Child receives state, creates its own VM, restores registers directly
into EL0 (bypassing the shim
_startso callee-saved GPRs survive), and enters the vCPU loop withX0 = 0(the child return fromclone). - Parent records the child in the process table and returns the child PID.
When g->shm_fd >= 0 the guest memory is file-backed (mkstemp +
unlink, MAP_SHARED). Fork takes an APFS fclonefileat snapshot of
the backing file and sends the snapshot fd over SCM_RIGHTS:
- The snapshot decouples the parent and child: subsequent parent writes cannot be observed by the child even before the child remaps. APFS copy-on-write keeps the snapshot cheap.
- Parent stays on
MAP_SHAREDand does NOT remap -- HVF caches the host VA->PA mapping fromhv_vm_map, and aMAP_FIXEDremap does not update Stage-2, so a remapping parent would observe stale pages. - Child maps the snapshot fd
MAP_PRIVATE, producing an instant CoW clone with zero data copy. - The IPC header sets
has_shm = 1andnum_regions = 0, skipping memory serialization entirely. - Child calls
guest_init_from_shm()instead ofguest_init(), and must restoreg->ttbr0from the IPC header --guest_init_from_shmzeroes the struct, and withoutttbr0page-table walks fail for all high VAs. MAP_SHARED|MAP_ANONYMOUSregions are promoted to a memfd-style overlay before fork (mmap_fork_prepare_anon_shared) and reattached in the child via SCM_RIGHTS (mmap_fork_restore_overlays) so cross-fork shared-memory coherence is preserved.
This path is roughly 50x faster than the legacy IPC copy path on large guest memories.
The CoW path is disabled when hosting Rosetta because HVF caches the host
VA->PA mapping from hv_vm_map, and Rosetta's translated code touches
the parent's slab in ways the snapshot model cannot intercept. Rosetta
forks fall back to the legacy IPC copy path.
macOS rejects MAP_PRIVATE on shm_open fds (EINVAL), so the backing
file is created via mkstemp + unlink.
When g->shm_fd < 0, the parent serializes used memory regions in 1 MiB
chunks over the socket pair, the child calls guest_init() and receives the
data into fresh guest memory. CLOEXEC semantics follow POSIX: all FDs
(including those marked CLOEXEC) are inherited across fork. CLOEXEC takes
effect at execve (see src/syscall/exec.c step 4).
sys_clone_vm() in src/runtime/forkipc.c handles CLONE_VM without
CLONE_THREAD. Unlike sys_clone_thread(), VM-clone children are waitable
via wait4 and have exit semantics (exit_signal, vm_exit_status).
Unlike the posix_spawn fork path, they share the same guest_t*, the
same guest memory, and the same page tables.
In src/runtime/forkipc.c:
hv_vcpu_create()and per-threadSP_EL1allocation.- Set the child SP,
TPIDR_EL0(TLS), and copy the parent's signal mask. pthread_create()runningvcpu_run_loop()for the child vCPU.- Return the child TID to the parent; the child runs with
X0 = 0.
sys_execve in src/syscall/exec.c reloads the ELF, loads the dynamic
interpreter for dynamically-linked targets via the shared
elf_resolve_interp() helper (also used at startup), rebuilds page tables,
and restarts the vCPU. Signal handlers are reset to SIG_DFL per POSIX:
SIG_IGN stays SIG_IGN, and pending and blocked masks are preserved.
This happens in signal_reset_for_exec() after guest_reset.
Signals are fully implemented in src/syscall/signal.c. The signal frame
matches Linux arch/arm64/kernel/signal.c:setup_rt_frame() so the C
library's __restore_rt → rt_sigreturn (syscall 139) restores state
correctly. Both musl and glibc use this mechanism.
Key points:
signal_deliver()buildslinux_rt_sigframe_ton the guest stack, redirects the vCPU PC to the handler, and setsX0 = signum,X30 = sa_restorer.signal_rt_sigreturn()restores all 31 GPRs,SP,PC, andPSTATEfrom the frame.- SIGPIPE is queued automatically when
write,writev, orpwrite64returnsEPIPE. - Guest
ITIMER_REALis emulated internally rather than being forwarded to hostsetitimer, because macOS sharesalarm()andsetitimer(ITIMER_REAL)as a single timer, andelfusealready needsalarm()for its per-iteration vCPU watchdog. signal_check_timer()is called from the vCPU loop after each syscall.- After
SYSCALL_EXEC_HAPPENED, the vCPU loop verifies thatELR_EL1is non-zero -- a defensive check against HVF register-sync bugs. - Each
thread_entry_tcarries its ownblockedmask.rt_sigprocmaskoperates oncurrent_thread->blocked, and child threads inherit the parent's mask at clone time.
clone(CLONE_VM) produces an inferior that shares guest memory; the tracer
attaches via PTRACE_SEIZE. BRK instructions trigger ptrace-stops, after
which the tracer reads and writes registers through the snapshot protocol.
In src/syscall/proc.c:
PTRACE_SEIZE-- attach without stopping; setsptraced = 1.PTRACE_CONT-- resume the stopped tracee, optionally injecting a signal.PTRACE_INTERRUPT-- force the tracee into ptrace-stop viahv_vcpus_exit().PTRACE_GETREGSET/PTRACE_SETREGSET(NT_PRSTATUS) -- read or write the tracee's register snapshot. Writes are applied on resume.
Cross-thread HVF register access is unreliable, so every actor uses a snapshot:
- Tracee snapshots its own vCPU registers into
ptrace_regsbefore entering ptrace-stop. ptrace_stopped = 1and the tracer is signaled viaptrace_cond.- Tracer reads and writes the snapshot via
GETREGSET/SETREGSET. - Tracer issues
PTRACE_CONT; the tracee clearsptrace_stoppedand is signaled viaresume_cond. - Tracee applies any dirty register changes back to the vCPU and resumes.
- Guest executes
BRK→ shim forwards via HVC #10. - If
current_thread->ptraced, the tracee callsthread_ptrace_stop(SIGTRAP). - Snapshot, broadcast, wait, resume as above.
wait4returnsWIFSTOPPED(status)withWSTOPSIG == SIGTRAPto the tracer.
sys_wait4() first checks for ptraced or VM-clone children via
thread_ptrace_wait() before falling through to the process table for
regular fork children, so ptrace-stops and VM-exit states are reported
without racing the regular wait path.
src/syscall/fuse.c implements /dev/fuse, mount(..., "fuse", ...),
and the minimal VFS dispatch entirely inside the guest VM. Guest libfuse
programs (sshfs, ntfs-3g, AppImage runtimes) run without macFUSE,
FUSE-T, or FSKit on the host.
Key shape:
- A global
fuse_lockplus per-session locks. Sessions are refcounted, so in-flight reads and writes pin the lock against daemon exit. - Per-fd alias bindings for
dup,dup3, andfcntl(F_DUPFD). - Per-file
io_in_progressplusio_condserializesreadandgetdents64against the offset field (matching Linuxf_pos_lock).lseekwaits onio_in_progress;preadskips it. - Synchronous
FUSE_INITinsys_mount; negativehdr.errorpropagates back. Mount tombstones on daemon death and surfaces-ENOTCONN. O_PATHsupport, NAME_MAX-boundedgetdents64, 8 MiBFUSE_FRAME_CAPon daemon writes, and procfs integration (/proc/{mounts,filesystems,self/mountinfo}).fuse_materialize_pathsoexecvecan load FUSE-backed binaries.- The wait path honors
SA_RESTARTand ignored signals via thesignal_pending_interruption(restart_out)helper.
Validation lives in make test-fuse-alpine, which exercises
/dev/fuse plus mount("fuse") against the staged Alpine musl
sysroot fixture.
src/runtime/procemu.c intercepts a focused set of guest-visible paths
under /proc, /dev, and a few Linux-expected compatibility files:
- Many procfs files are synthesized from host-side runtime state.
/proc/self/*is backed by internal region, FD-table, and process data.- Synthetic proc directories support common traversal patterns used by
BusyBox
ps,uptime, andtop. - Guest cwd handling preserves a virtual
/procworking directory even though the host operates on synthetic backing directories.
Related implementation: src/runtime/procemu.c, src/syscall/path.c,
src/syscall/fs.c, src/syscall/proc-state.c.
elfuse supports dynamically linked aarch64-linux ELF binaries via
--sysroot:
elfuse --sysroot /path/to/sysroot ./my-dynamic-programHow it works:
elf_load()parsesPT_INTERPto find the interpreter path (/lib/ld-linux-aarch64.so.1for glibc,/lib/ld-musl-aarch64.so.1for musl).- The interpreter is loaded as
ET_DYNatg->interp_base(computed dynamically: 60 GiB for 36-bit IPA, 1020 GiB for 40-bit IPA). build_linux_stack()passesAT_BASE(interpreter load address) andAT_EXECFN(argv[0]) in the auxiliary vector.- The entry point becomes
interp_entry + load_base; the dynamic linker takes over from there. sys_openat()redirects guest absolute paths through the sysroot: if--sysrootis set, it tries<sysroot>/<path>first.
The sysroot is inherited by fork children via IPC state transfer.
sys_execve also loads the interpreter for dynamically linked targets, so
tools that execve dynamic children (env, nice, nohup) work
correctly. elf_resolve_interp() in src/core/elf.c is shared between
src/main.c and src/syscall/exec.c.
None currently tracked for the aarch64-linux dynamic-linker path.
src/core/rosetta.c hosts Apple's Rosetta Linux translator inside the
same VM that runs aarch64 guests, so statically linked x86_64-linux ELFs
execute through a translator the host process owns rather than an
external translation service. The guest architecture is auto-detected
from the ELF header; opt out via --no-rosetta or ELFUSE_NO_ROSETTA=1.
Address-space layout:
- The translator lives in the primary buffer at a low guest physical
address but is mapped at its link address
0x800000000000via a non-identity page-table entry. This works around the 36-bit Stage-2 IPA cap on M1 / M2. - A 256 MiB kernel buffer (kbuf) at
g->kbuf_gpais aliased atKBUF_VA_BASE = 0xFFFFFFFFF0000000under TTBR1 and atKBUF_USER_VA = KBUF_VA_BASE & 0x0000FFFFFFFFFFFFunder TTBR0, so Rosetta's TaggedPointer extraction resolves both views to the same physical pages. The kbuf is always RW; nothing executable is installed there. - M5 hosts bisect the primary slab from 1 TiB to 256 GiB or 64 GiB on
hv_vm_mapfailure while keeping the IPA width at 48 so the high-VA Stage-2 entries remain reachable.
Runtime integration:
- The VZ ioctls (
CHECK 0x80456125,CAPS 0x80806123,ACTIVATE 0x6124) are trapped wheng->is_rosettais set. /proc/self/exeis redirected toROSETTA_PATH(the binfmt-misc convention Rosetta expects).- The
rosettadSCM_RIGHTS bridge implements an SHA-256-keyed AOT cache under$HOME/.cache/elfuse-rosettad/. First launch warms the cache; subsequent launches reuse translations.
Fork interaction: the CoW shm fast path is disabled for Rosetta because
HVF caches host VA-to-PA at hv_vm_map time. Rosetta forks use the
legacy IPC copy path.
Dynamic linking under Rosetta is supported. For x86_64 guests with a
PT_INTERP header, elfuse does not load the ELF segments or the
interpreter from the host side; bootstrap_prepare deliberately skips
the aarch64 loader path (src/core/bootstrap.c:415-462) and instead
hands control to Rosetta. The translator then opens the guest ELF
through Linux syscalls, reads PT_INTERP, and mmaps
/lib64/ld-linux-x86-64.so.2 (or the musl equivalent) from the
sysroot. The translated dynamic linker loads shared libraries via the
same path. Coverage lives in tests/test-rosetta-glibc.sh, which
exercises direct loader bring-up, explicit ld.so invocation,
ld.so --list, runtime dlopen, initial-exec TLS, general-dynamic
TLS via dlopen, and per-pthread TLS.
Boundaries:
--gdbis rejected because the stub serves the aarch64 view Rosetta produces, not the original x86_64 architectural state.- Two Rosetta-internal divergences are tracked in the acceptance audit
rather than papered over:
SA_RESETHANDis shadowed by Rosetta's own signal-handler state, andclone(..., CLONE_SETTLS, tls=0, ...)can hang.
src/debug/ is split by role:
gdbstub.c-- session lifecycle, stop/resume flow, packet dispatchgdbstub-rsp.c-- RSP packet transport and hex helpersgdbstub-reg.c-- register snapshot layout, restore flow,target.xml
The stub runs in all-stop mode. Because Hypervisor.framework register access must happen on the owning thread, the stopped vCPU snapshots its own state; the GDB-handler thread reads and updates the snapshot, and the owning thread restores the modified state on resume -- the same pattern used by ptrace.
The split mirrors the architectural boundary: transport and encoding are independent of guest execution; register layout is independent of socket I/O; stop/resume sequencing remains tightly coupled to process and thread state.
elfuse uses several layers of validation:
make check-- fast guest tests plus the BusyBox applet smoke suite, followed byscripts/check-syscall-coverage.pyso any newdispatch.tblentry without a direct or aliased test reference fails the build.make test-busybox-- applet coverage in isolation.make test-fuse-alpine-- guest-internal FUSE against the Alpine musl sysroot fixture.make test-gdbstub-- debugger integration.make test-rosetta-all-- the x86_64 acceptance sub-suites (CLI gating, failure modes, statics, Alpine pipelines, audit, JIT, glibc dynamic).make test-matrix-- cross-checks elfuse (aarch64), QEMU (aarch64), and elfuse (x86_64-via-Rosetta) on overlapping corpora, with per-host baselines for the Rosetta branch.
The rule for contributors is simple: match the validation depth to the
subsystem you changed. Procfs, process state, dynamic linking, and
debugging typically warrant more than make check. Touching the
Rosetta path additionally requires make test-rosetta-all so the
acceptance audit catches new divergences from the documented
Rosetta-internal failures. See testing.md for the full
target list, per-host baseline scheme, and the validation-by-change-type
table.