fix(arm64): fix SP_EL1 corruption on IRQ context switch to userspace#164

Merged
ryanbreen merged 2 commits into main from fix/arm64-context-switch-sp-corruption
Feb 6, 2026

@ryanbreen
Owner

Summary

  • Fix SP_EL1 corruption: restore_userspace_context_arm64 and setup_first_userspace_entry_arm64 did not set user_rsp_scratch, leaving SP_EL1 pointing to the wrong kernel stack after ERET to EL0. The next IRQ from EL0 would allocate its exception frame on the wrong stack, corrupting memory and causing instruction aborts at address 0x0.
  • Fix fork child x0=0: ARM64 syscall0 uses lateout("x0"), so x0 is undefined at SVC entry. Forked children now explicitly get x0=0.
  • Add PROCESS tracing provider (0x06xx): Lock-free trace events for fork, exec, CoW faults, data aborts, stack mapping, and process exit — replacing serial_println! in hot paths that caused lock contention deadlocks.
  • Improve instruction abort handler: Identifies crashing process via TTBR0 and terminates with SIGSEGV instead of hanging.
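
To illustrate the fork child x0=0 item above: in Rust inline assembly, `lateout("x0")` declares x0 as output-only, so nothing initializes it before `svc` and the frame captured at SVC entry holds a leftover value. The following is a minimal sketch of the child-side fix, assuming a hypothetical `SavedFrame` type and `fork_child_frame` helper; neither name comes from the kernel, and the real frame layout may differ.

```rust
/// Hypothetical saved register frame for an EL0 thread.
/// This is an illustration only, not the kernel's real layout.
#[derive(Clone, Debug, PartialEq)]
pub struct SavedFrame {
    pub x: [u64; 31], // general-purpose registers x0..x30
    pub elr: u64,     // address that ERET returns to
    pub spsr: u64,    // saved program status
}

/// Build the child's frame from the parent's frame captured at SVC entry.
pub fn fork_child_frame(parent_at_svc_entry: &SavedFrame) -> SavedFrame {
    let mut child = parent_at_svc_entry.clone();
    // x0 still holds whatever the parent left there before `svc`.
    // The child must observe 0 as the return value of fork(), so set it explicitly.
    child.x[0] = 0;
    child
}
```

Every other register and the return address are copied unchanged, so the child resumes at the same instruction as the parent and differs only in the value it sees in x0.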

Test plan

  • ARM64 kernel builds with zero warnings
  • 84/84 boot tests pass
  • Full telnet lifecycle works: TELNETD_LISTENING → TCP handshake → TELNETD_CONNECTED → TELNETD_SHELL_FORKED → TELNETD_SESSION_ENDED → TELNETD_LISTENING
  • Zero instruction aborts, zero data aborts

🤖 Generated with Claude Code

ryanbreen and others added 2 commits February 6, 2026 07:23
- Fix walk_mapped_pages() to iterate all 512 L0 entries on ARM64 (was the
  first 256, an x86_64 assumption that skipped stack pages at L0[511])
- Unify stack constants in exec_process ARM64 paths to use
  USER_STACK_REGION_START instead of hardcoded 0x0000_FFFF_FF01_0000
- Add DATA_ABORT diagnostic: print TTBR0_EL1 for process identification
- Add fork diagnostic: print child stack mapping and L4 frame address
- Simplify telnetd to use direct socket I/O (skip PTY for now)
- Add EOF handling in init_shell for non-PID-1 instances (telnet sessions)
- Add TCP debug tracing via serial_println for connection debugging
- Enable port forwarding (2323) in run.sh for ARM64
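
The walk_mapped_pages() item can be checked with a little index arithmetic. The sketch below assumes a 4 KiB translation granule with 48-bit virtual addresses, so the L0 index is bits [47:39] of the address. `l0_index` and `present_l0_indices` are hypothetical stand-ins for illustration, not the kernel's functions.

```rust
const ENTRIES_PER_TABLE: usize = 512;

/// L0 index of a virtual address, assuming a 4 KiB granule and 48-bit VAs.
pub fn l0_index(va: u64) -> usize {
    ((va >> 39) & 0x1ff) as usize
}

/// Hypothetical stand-in for the walker: returns the present L0 indices
/// that a loop over `0..limit` would visit.
pub fn present_l0_indices(present: &[bool; ENTRIES_PER_TABLE], limit: usize) -> Vec<usize> {
    (0..limit.min(ENTRIES_PER_TABLE))
        .filter(|&i| present[i])
        .collect()
}
```

With the previously hardcoded stack address 0x0000_FFFF_FF01_0000, `l0_index` yields 511, so a walk limited to the first 256 entries never reaches the stack mapping, while a walk over all 512 does.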

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The IRQ return path (boot.S) restores SP from user_rsp_scratch before
ERET: `mov sp, [user_rsp_scratch]; eret`. When a context switch routes
through restore_userspace_context_arm64 or setup_first_userspace_entry_arm64,
neither function updated user_rsp_scratch. This left SP_EL1 pointing to the
*previous* thread's kernel stack after ERET to EL0.

When the next timer interrupt fired from EL0, the IRQ handler's
`sub sp, sp, #272` allocated its exception frame on the wrong kernel
stack, corrupting memory — including other threads' SVC exception
frames. This caused ELR fields to be overwritten with 0, leading to
instruction aborts at address 0x0 when returning to userspace.

Fix: set user_rsp_scratch = kernel_stack_top in both functions, ensuring
SP_EL1 is correct for the next interrupt from EL0.
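
The failure and the fix can be modeled in a few lines of ordinary Rust. This is a toy simulation, not kernel code: `Cpu`, `KernelStack`, the method names, and the stack addresses are all made up for illustration, and only the 272-byte frame size and the role of user_rsp_scratch come from the description above.

```rust
/// Size of the IRQ exception frame, from `sub sp, sp, #272`.
const EXCEPTION_FRAME_SIZE: u64 = 272;

/// Hypothetical kernel stack described by its address range.
pub struct KernelStack {
    pub bottom: u64,
    pub top: u64,
}

impl KernelStack {
    pub fn contains(&self, addr: u64) -> bool {
        addr >= self.bottom && addr < self.top
    }
}

/// Toy model of the two pieces of state involved in the bug.
pub struct Cpu {
    pub user_rsp_scratch: u64,
    pub sp_el1: u64,
}

impl Cpu {
    /// Models the context-switch path; `apply_fix` toggles the fix.
    pub fn restore_userspace_context(&mut self, next: &KernelStack, apply_fix: bool) {
        if apply_fix {
            self.user_rsp_scratch = next.top;
        }
    }

    /// Models the IRQ return path: SP is reloaded from the scratch slot before ERET.
    pub fn irq_return_to_el0(&mut self) {
        self.sp_el1 = self.user_rsp_scratch;
    }

    /// Models the next IRQ from EL0: the handler carves its frame below SP_EL1.
    pub fn take_irq_from_el0(&mut self) -> u64 {
        self.sp_el1 -= EXCEPTION_FRAME_SIZE;
        self.sp_el1
    }
}

/// Returns true if the next IRQ frame lands on the incoming thread's stack.
pub fn frame_lands_on_next_stack(apply_fix: bool) -> bool {
    let prev = KernelStack { bottom: 0x1000, top: 0x5000 };
    let next = KernelStack { bottom: 0x8000, top: 0xC000 };
    let mut cpu = Cpu { user_rsp_scratch: prev.top, sp_el1: prev.top };
    cpu.restore_userspace_context(&next, apply_fix);
    cpu.irq_return_to_el0();
    let frame_base = cpu.take_irq_from_el0();
    next.contains(frame_base)
}
```

Without the fix, the scratch slot still holds the previous thread's stack top, so the frame is carved out of the previous thread's stack. With the fix, it lands just below the incoming thread's kernel_stack_top.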

Additional changes in this commit:
- Add PROCESS tracing provider (0x06xx) with lock-free trace events for
  fork, exec, CoW faults, data aborts, stack mapping, and process exit
- Replace serial_println! with lock-free trace events in CoW fault
  handler, data abort handler, fork/exec paths, and stack mapping
- Add FORK_TOTAL, EXEC_TOTAL, COW_FAULT_TOTAL counters
- Fix fork child x0=0: ARM64 syscall0 uses lateout("x0") so x0 is
  undefined at SVC entry; forked child must explicitly get x0=0
- Improve instruction abort handler: identify crashing process via
  TTBR0, terminate with SIGSEGV instead of hanging
- Improve init_shell panic handler to print file:line location
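
For readers unfamiliar with lock-free tracing, the idea behind replacing serial_println! in hot paths can be sketched as a fixed-size ring of atomic slots. The version below is a simplified illustration using `std` atomics. `TraceRing`, the slot count, the event encoding, and the two event IDs are assumptions chosen to fit the 0x06xx range, not the provider's actual definitions.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

const RING_SLOTS: usize = 256;

/// Hypothetical event IDs inside the 0x06xx PROCESS range.
pub const PROCESS_FORK: u16 = 0x0601;
pub const PROCESS_COW_FAULT: u16 = 0x0603;

pub struct TraceRing {
    next: AtomicUsize,
    slots: [AtomicU64; RING_SLOTS],
}

impl TraceRing {
    pub fn new() -> Self {
        Self {
            next: AtomicUsize::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Record an event without taking a lock: one fetch_add claims a slot,
    /// one store fills it. Old entries are overwritten on wraparound.
    pub fn record(&self, event_id: u16, payload: u32) {
        let slot = self.next.fetch_add(1, Ordering::Relaxed) % RING_SLOTS;
        let word = ((event_id as u64) << 32) | payload as u64;
        self.slots[slot].store(word, Ordering::Release);
    }

    /// Decode the event currently stored in `slot`.
    pub fn read(&self, slot: usize) -> (u16, u32) {
        let word = self.slots[slot % RING_SLOTS].load(Ordering::Acquire);
        ((word >> 32) as u16, word as u32)
    }

    /// Number of events recorded so far, including overwritten ones.
    pub fn total_recorded(&self) -> usize {
        self.next.load(Ordering::Relaxed)
    }
}
```

Because `record` never blocks, it can be called from a fault handler or an interrupt path without the lock contention that a serial port mutex introduces. The trade-off is that events are decoded later rather than printed immediately.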

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryanbreen ryanbreen merged commit 4563526 into main Feb 6, 2026
1 of 2 checks passed
@ryanbreen ryanbreen deleted the fix/arm64-context-switch-sp-corruption branch February 6, 2026 15:15