fix(runtime): fix probabilistic hang in AICPU-AICore handshake #2
Open
…d hang

In multi-round execution the DeviceRunner singleton keeps AICore threads alive across rounds. DATA_MAIN_BASE retains the EXIT_SIGNAL from the previous round and must be cleared before the next round begins.

Previously the clear happened after the handshake wait, creating a race: if the AICore thread is descheduled between the wait and the clear, AICPU may complete all tasks and write a new EXIT_SIGNAL, which the belated clear then overwrites with 0, hanging the AICore thread forever.

Move the clear to before the handshake wait in all runtimes that use register-based dispatch (tensormap_and_ringbuffer, host_build_graph). Also add the missing clear to a5/host_build_graph.
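The ordering problem described in the commit message can be replayed deterministically. The following is a minimal single-threaded sketch, not the runtime's actual code: the `Regs` struct, the `Order` enum, the `aicore_sees_exit` function, and the numeric value of `EXIT_SIGNAL` are all hypothetical stand-ins, and the "AICPU" step is simply invoked at the point where the AICore thread is assumed to be descheduled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed value for illustration only; the real constant is not shown in the PR.
constexpr uint32_t EXIT_SIGNAL = 0xFFFFFFFFu;

// Hypothetical model of the register that carries DATA_MAIN_BASE.
struct Regs {
    std::atomic<uint32_t> data_main_base{0};
};

enum class Order { ClearAfterWait, ClearBeforeWait };

// Replays one round start under the worst-case interleaving: AICPU finishes
// all tasks and writes EXIT_SIGNAL right after the handshake completes.
// Returns true if the AICore side would still observe EXIT_SIGNAL afterwards.
inline bool aicore_sees_exit(Order order) {
    Regs regs;
    regs.data_main_base.store(EXIT_SIGNAL);  // stale value left by the previous round

    auto aicpu_finishes_round = [&regs] { regs.data_main_base.store(EXIT_SIGNAL); };

    if (order == Order::ClearBeforeWait) {
        regs.data_main_base.store(0);  // clear the stale signal first
        // (handshake wait would happen here)
        aicpu_finishes_round();        // new EXIT_SIGNAL lands after the clear
    } else {
        // (handshake wait would happen here)
        aicpu_finishes_round();        // AICPU runs while AICore is descheduled
        regs.data_main_base.store(0);  // belated clear wipes the new EXIT_SIGNAL
    }
    return regs.data_main_base.load() == EXIT_SIGNAL;
}
```

With the clear placed before the wait, the exit signal written by AICPU survives; with the clear placed after the wait, it is lost and the AICore side would spin indefinitely.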
Summary
- … multi-round-paged-attention with 10 rounds)
- … physical_core_id/core_type and signaling aicore_done, ensuring AICPU reads fresh values
- … __sync_synchronize) on AICPU side after observing aicore_done, preventing stale reads of handshake data fields
- … DATA_MAIN_BASE clear in AICore that could overwrite EXIT_SIGNAL written by AICPU
- … shutdown_aicore() unconditional so late-arriving threads (e.g., orchestrator) still send exit signals to their assigned cores

Root Cause
Without memory barriers, AICPU could observe aicore_done != 0 while physical_core_id was still stale (e.g., 0). This caused multiple cores to map to the same register address. When AICPU sent EXIT_SIGNAL via platform_deinit_aicore_regs(), it wrote to the wrong register block, leaving the actual AICore thread spinning forever.

Testing
- … multi-round-paged-attention pass (previously ~50% hang rate)
- … ./ci.sh -p a2a3sim)
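The publish-then-signal ordering described under Root Cause can be sketched in portable C++. This is an illustration under stated assumptions, not the PR's code: it uses standard `std::atomic_thread_fence` calls in place of the `__sync_synchronize` builtin named in the PR (a full barrier that plays the same role), and the `Handshake` layout, the `CoreInfo` struct, and the `aicore_publish` / `aicpu_wait_for_core` functions are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical handshake block shared between the AICore and AICPU sides.
struct Handshake {
    uint32_t physical_core_id = 0;
    uint32_t core_type = 0;
    std::atomic<uint32_t> aicore_done{0};
};

struct CoreInfo {
    uint32_t physical_core_id;
    uint32_t core_type;
};

// AICore side: write the data fields, then a barrier, then raise the flag.
inline void aicore_publish(Handshake& hs, uint32_t core_id, uint32_t core_type) {
    hs.physical_core_id = core_id;
    hs.core_type = core_type;
    // Release fence: the field writes above cannot be reordered past the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    hs.aicore_done.store(1, std::memory_order_relaxed);
}

// AICPU side: spin on the flag, then a barrier, then read the data fields.
inline CoreInfo aicpu_wait_for_core(Handshake& hs) {
    while (hs.aicore_done.load(std::memory_order_relaxed) == 0) {
        // busy-wait until the AICore side signals completion
    }
    // Acquire fence: the field reads below cannot be hoisted above the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return CoreInfo{hs.physical_core_id, hs.core_type};
}
```

In the scenario the PR describes, the two functions would run on different processors. Without the fences, the reader could see `aicore_done != 0` while `physical_core_id` still held its initial value, which is the stale read identified as the root cause.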