
fix(runtime): fix probabilistic hang in AICPU-AICore handshake #2

Open
zhusy54 wants to merge 1 commit into main from a2a3sim-stuck


zhusy54 commented Mar 11, 2026

Summary

  • Fix memory ordering bug in AICPU-AICore handshake that caused probabilistic hangs in multi-round execution (e.g., multi-round-paged-attention with 10 rounds)
  • Add release barrier (dcci) on AICore side between writing physical_core_id/core_type and signaling aicore_done, ensuring AICPU reads fresh values
  • Add acquire barrier (__sync_synchronize) on AICPU side after observing aicore_done, preventing stale reads of handshake data fields
  • Remove racy DATA_MAIN_BASE clear in AICore that could overwrite EXIT_SIGNAL written by AICPU
  • Make shutdown_aicore() unconditional so late-arriving threads (e.g., orchestrator) still send exit signals to their assigned cores

Root Cause

Without memory barriers, AICPU could observe aicore_done != 0 while physical_core_id was still stale (e.g., 0). This caused multiple cores to map to the same register address. When AICPU sent EXIT_SIGNAL via platform_deinit_aicore_regs(), it wrote to the wrong register block, leaving the actual AICore thread spinning forever.
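The ordering requirement described above can be illustrated with a minimal host-side sketch. It is not the runtime's code: the `Handshake` struct layout and the function names are hypothetical, and C++ `std::atomic` release/acquire operations stand in for the roles that `dcci` (AICore side) and `__sync_synchronize` (AICPU side) play in this PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for one per-core handshake slot. Field names follow
// the PR description; the actual struct layout is an assumption.
struct Handshake {
    uint32_t physical_core_id = 0;         // plain data written by the core
    uint32_t core_type = 0;                // plain data written by the core
    std::atomic<uint32_t> aicore_done{0};  // flag polled by the AICPU
};

// AICore side: write the data fields first, then publish the flag with
// release ordering so the data writes cannot become visible after the flag.
void aicore_publish(Handshake& hs, uint32_t core_id, uint32_t type) {
    hs.physical_core_id = core_id;
    hs.core_type = type;
    hs.aicore_done.store(1, std::memory_order_release);
}

// AICPU side: spin on the flag with acquire ordering, so the read of
// physical_core_id below cannot be satisfied with a value older than the flag.
uint32_t aicpu_wait_for_core_id(Handshake& hs) {
    while (hs.aicore_done.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield();
    }
    return hs.physical_core_id;
}

// Runs one handshake per simulated core and reports whether every core id
// observed by the waiting side matches the id that core published.
bool run_handshake_round(uint32_t num_cores) {
    assert(num_cores > 0);
    std::vector<Handshake> slots(num_cores);
    std::vector<std::thread> cores;
    for (uint32_t i = 0; i < num_cores; ++i) {
        cores.emplace_back([&slots, i] { aicore_publish(slots[i], i + 1, 0); });
    }
    bool ok = true;
    for (uint32_t i = 0; i < num_cores; ++i) {
        ok = (aicpu_wait_for_core_id(slots[i]) == i + 1) && ok;
    }
    for (auto& t : cores) {
        t.join();
    }
    return ok;
}
```

With relaxed ordering on both the store and the load, the waiting side would be allowed to see `aicore_done == 1` together with the default `physical_core_id` of 0, which is the stale-read pattern the root cause describes.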

Testing

  • 25/25 runs of multi-round-paged-attention pass (previously ~50% hang rate)
  • All 11 simulation tests pass (./ci.sh -p a2a3sim)
  • Code review passed

…d hang

In multi-round execution the DeviceRunner singleton keeps AICore threads
alive across rounds. DATA_MAIN_BASE retains the EXIT_SIGNAL from the
previous round and must be cleared before the next round begins.

Previously the clear happened after the handshake wait, creating a race:
if the AICore thread is descheduled between the wait and the clear,
AICPU may complete all tasks and write a new EXIT_SIGNAL, which the
belated clear then overwrites with 0, hanging the AICore thread forever.

Move the clear to before the handshake wait in all runtimes that use
register-based dispatch (tensormap_and_ringbuffer, host_build_graph).
Also add the missing clear to a5/host_build_graph.
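
The race in the commit message can be replayed deterministically. The sketch below is an illustration only: the `CoreRegs` struct, the `EXIT_SIGNAL` value, and `replay_round` are hypothetical, and the events are executed sequentially in the worst-case interleaving (AICore descheduled right after the handshake wait) rather than on real threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical value; the real EXIT_SIGNAL constant is not shown in this PR text.
constexpr uint32_t EXIT_SIGNAL = 0xFFFFFFFFu;

// Hypothetical stand-in for the per-core register block.
struct CoreRegs {
    std::atomic<uint32_t> data_main_base{0};
};

// Replays the start of a new round in the worst-case order and returns the
// register value the AICore sees when it finally polls for work.
uint32_t replay_round(bool clear_before_wait) {
    CoreRegs regs;
    regs.data_main_base.store(EXIT_SIGNAL);  // left over from the previous round

    if (clear_before_wait) {
        regs.data_main_base.store(0);        // fixed order: clear, then handshake
    }

    // Handshake wait completes here; the AICore thread is then descheduled.

    regs.data_main_base.store(EXIT_SIGNAL);  // AICPU finishes all tasks, signals exit

    if (!clear_before_wait) {
        regs.data_main_base.store(0);        // old order: belated clear wipes the signal
    }
    return regs.data_main_base.load();
}
```

In this replay, `replay_round(false)` returns 0, so the core never sees the exit signal and keeps spinning, while `replay_round(true)` returns `EXIT_SIGNAL`.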