fix(arm64): fix SP_EL1 corruption on IRQ context switch to userspace #164
Merged
- Fix walk_mapped_pages() to iterate L0[0-512] on ARM64 (was 0-256, an
  x86_64 assumption that skipped stack pages at L0[511])
- Unify stack constants in exec_process ARM64 paths to use
  USER_STACK_REGION_START instead of hardcoded 0x0000_FFFF_FF01_0000
- Add DATA_ABORT diagnostic: print TTBR0_EL1 for process identification
- Add fork diagnostic: print child stack mapping and L4 frame address
- Simplify telnetd to use direct socket I/O (skip PTY for now)
- Add EOF handling in init_shell for non-PID-1 instances (telnet sessions)
- Add TCP debug tracing via serial_println for connection debugging
- Enable port forwarding (2323) in run.sh for ARM64
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
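To see why a 0-256 walk misses the stack, the sketch below computes the L0 index of the stack constant quoted above. It assumes a 4 KiB granule with 48-bit virtual addresses, so that the L0 index is bits [47:39]; `l0_index` and `walk_visits` are hypothetical helpers for illustration, not the kernel's walk_mapped_pages().

```rust
/// Entries in one translation table (assumption: 4 KiB granule).
const ENTRIES_PER_TABLE: usize = 512;

/// Stack constant quoted in the commit message.
const STACK_ADDR: u64 = 0x0000_FFFF_FF01_0000;

/// Index into the top-level (L0) table: bits [47:39] of the virtual address.
fn l0_index(va: u64) -> usize {
    ((va >> 39) & 0x1FF) as usize
}

/// Hypothetical stand-in for the walker's outer loop: reports whether a
/// walk over L0 entries `0..limit` would visit the entry covering `va`.
fn walk_visits(va: u64, limit: usize) -> bool {
    (0..limit).any(|i| i == l0_index(va))
}

fn main() {
    println!("L0 index of stack address: {}", l0_index(STACK_ADDR));
    println!("visited with limit 256: {}", walk_visits(STACK_ADDR, 256));
    println!(
        "visited with limit 512: {}",
        walk_visits(STACK_ADDR, ENTRIES_PER_TABLE)
    );
}
```

Under these assumptions the stack address falls in L0[511], which a walk limited to the first 256 entries never reaches.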
The IRQ return path (boot.S) restores SP from user_rsp_scratch before
ERET (in pseudo-assembly: `mov sp, [user_rsp_scratch]; eret`). When a
context switch routes
through restore_userspace_context_arm64 or setup_first_userspace_entry_arm64,
neither function updated user_rsp_scratch. This left SP_EL1 pointing to the
*previous* thread's kernel stack after ERET to EL0.
When the next timer interrupt fired from EL0, the IRQ handler's
`sub sp, sp, #272` allocated its exception frame on the wrong kernel
stack, corrupting memory — including other threads' SVC exception
frames. This caused ELR fields to be overwritten with 0, leading to
instruction aborts at address 0x0 when returning to userspace.
Fix: set user_rsp_scratch = kernel_stack_top in both functions, ensuring
SP_EL1 is correct for the next interrupt from EL0.
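
The failure and the fix can be illustrated with a small host-side model. This is only a sketch: the `Cpu` struct, the stack addresses, and the `fixed` flag are hypothetical stand-ins, and nothing here touches real system registers. It follows the sequence described above: switch to a new thread, restore SP from user_rsp_scratch, ERET, then take an IRQ from EL0.

```rust
// Hypothetical kernel stack tops for two threads.
const THREAD_A_STACK_TOP: u64 = 0x1000_0000;
const THREAD_B_STACK_TOP: u64 = 0x2000_0000;
/// Exception frame size, from the `sub sp, sp, #272` in the IRQ handler.
const IRQ_FRAME_SIZE: u64 = 272;

/// Toy model of one CPU.
struct Cpu {
    /// Value the IRQ return path loads into SP before ERET.
    user_rsp_scratch: u64,
    /// Modeled SP_EL1 while running at EL0.
    sp_el1: u64,
}

impl Cpu {
    /// Models the context-switch path; `fixed` selects post-fix behaviour.
    fn switch_to(&mut self, next_kernel_stack_top: u64, fixed: bool) {
        if fixed {
            // The fix: user_rsp_scratch = kernel_stack_top.
            self.user_rsp_scratch = next_kernel_stack_top;
        }
        // Without the fix, the scratch slot keeps the previous thread's value.
    }

    /// Models the IRQ return path: SP is restored from the scratch slot.
    fn eret_to_el0(&mut self) {
        self.sp_el1 = self.user_rsp_scratch;
    }

    /// Models IRQ entry from EL0: the frame is carved out below SP_EL1.
    fn irq_frame_base(&self) -> u64 {
        self.sp_el1 - IRQ_FRAME_SIZE
    }
}

/// Runs thread A, switches to thread B, returns to EL0, and reports where
/// the next IRQ would place its exception frame.
fn next_irq_frame(fixed: bool) -> u64 {
    let mut cpu = Cpu {
        user_rsp_scratch: THREAD_A_STACK_TOP,
        sp_el1: THREAD_A_STACK_TOP,
    };
    cpu.switch_to(THREAD_B_STACK_TOP, fixed);
    cpu.eret_to_el0();
    cpu.irq_frame_base()
}

fn main() {
    println!("frame base without fix: {:#x}", next_irq_frame(false));
    println!("frame base with fix:    {:#x}", next_irq_frame(true));
}
```

Without the update, the frame lands just below thread A's stack top even though thread B is running; with it, the frame lands on thread B's stack.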
Additional changes in this commit:
- Add PROCESS tracing provider (0x06xx) with lock-free trace events for
fork, exec, CoW faults, data aborts, stack mapping, and process exit
- Replace serial_println! with lock-free trace events in CoW fault
handler, data abort handler, fork/exec paths, and stack mapping
- Add FORK_TOTAL, EXEC_TOTAL, COW_FAULT_TOTAL counters
- Fix fork child x0=0: ARM64 syscall0 uses lateout("x0") so x0 is
undefined at SVC entry; forked child must explicitly get x0=0
- Improve instruction abort handler: identify crashing process via
TTBR0, terminate with SIGSEGV instead of hanging
- Improve init_shell panic handler to print file:line location
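
As a rough illustration of what a lock-free trace event plus a counter can look like, here is a minimal host-side sketch using atomics only. The event ID value, ring size, packing layout, and function names are all assumptions for illustration; the actual PROCESS provider in this commit may be structured differently.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Hypothetical event ID inside the PROCESS provider range (0x06xx).
const PROCESS_FORK: u16 = 0x0601;
/// Hypothetical ring size; old events are overwritten when it wraps.
const RING_SIZE: usize = 64;

/// Counter named after the one added in this commit.
static FORK_TOTAL: AtomicU64 = AtomicU64::new(0);

// A const item lets the array repeat expression build non-Copy atomics.
const EMPTY_SLOT: AtomicU64 = AtomicU64::new(0);
static RING: [AtomicU64; RING_SIZE] = [EMPTY_SLOT; RING_SIZE];
static NEXT: AtomicUsize = AtomicUsize::new(0);

/// Records one event without taking a lock: claim a slot with fetch_add,
/// then publish the packed (event_id, payload) word with a single store.
fn trace(event_id: u16, payload: u32) -> usize {
    let slot = NEXT.fetch_add(1, Ordering::Relaxed) % RING_SIZE;
    let packed = ((event_id as u64) << 32) | payload as u64;
    RING[slot].store(packed, Ordering::Release);
    slot
}

/// Bumps the fork counter and emits a fork event carrying the child PID.
fn trace_fork(child_pid: u32) -> usize {
    FORK_TOTAL.fetch_add(1, Ordering::Relaxed);
    trace(PROCESS_FORK, child_pid)
}

/// Decodes a slot back into (event_id, payload).
fn read_slot(slot: usize) -> (u16, u32) {
    let v = RING[slot].load(Ordering::Acquire);
    ((v >> 32) as u16, v as u32)
}

fn main() {
    let slot = trace_fork(42);
    let (id, pid) = read_slot(slot);
    println!("event {:#06x} child_pid={} slot={}", id, pid, slot);
    println!("FORK_TOTAL={}", FORK_TOTAL.load(Ordering::Relaxed));
}
```

Because each event is a single atomic store, a fault or interrupt handler can record it without waiting on a lock that the interrupted code might already hold.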
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
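
The fork child x0=0 item in the list above can be sketched as follows. The `ExceptionFrame` layout and the `child_frame_from` helper are hypothetical and only model the idea: the child's saved frame is copied from the parent's frame captured at SVC entry, and x0 is then overwritten, because the value it held at entry is not meaningful.

```rust
/// Hypothetical saved exception frame: x0..x30 plus the return address.
#[derive(Clone)]
struct ExceptionFrame {
    x: [u64; 31],
    elr: u64,
}

/// Builds the child's frame from the parent's frame captured at SVC entry.
fn child_frame_from(parent: &ExceptionFrame) -> ExceptionFrame {
    let mut child = parent.clone();
    // syscall0 only declares x0 as an output (lateout), so whatever x0 held
    // at SVC entry is stale. fork must return 0 in the child, so set it.
    child.x[0] = 0;
    child
}

fn main() {
    let mut parent = ExceptionFrame {
        x: [0; 31],
        elr: 0x40_0000, // arbitrary illustrative return address
    };
    parent.x[0] = 0xDEAD_BEEF; // stale value left in x0 at SVC entry
    parent.x[19] = 7; // some other register the child must inherit

    let child = child_frame_from(&parent);
    println!(
        "child x0={} x19={} elr={:#x}",
        child.x[0], child.x[19], child.elr
    );
}
```

Every register other than x0 is inherited unchanged, so the child resumes at the same instruction as the parent but observes a return value of 0.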
Summary
- restore_userspace_context_arm64 and setup_first_userspace_entry_arm64 did not set user_rsp_scratch, leaving SP_EL1 pointing to the wrong kernel stack after ERET to EL0. The next IRQ from EL0 would allocate its exception frame on the wrong stack, corrupting memory and causing instruction aborts at address 0x0.
- syscall0 uses lateout("x0"), so x0 is undefined at SVC entry. Forked children now explicitly get x0=0.
- Replace serial_println! calls in hot paths, which caused lock contention deadlocks, with lock-free trace events.
Test plan
🤖 Generated with Claude Code