Refactor: reorganize examples and tests by architecture (a2a3/a5) #1
Open
hw-native-sys-bot wants to merge 3 commits into main from
Conversation
…w-native-sys#256)

- Standardize ALL_CASES to 3 identical cases in paged_attention, batch_paged_attention, and paged_attention_unroll for fair comparison
- Case1: QHEADS=16, HEADDIM=128, BLOCKSIZE=128, batch=256
- Case2: QHEADS=64, HEADDIM=128, BLOCKSIZE=64, batch=64
- Case3: QHEADS=64, HEADDIM=256, BLOCKSIZE=64, batch=64
- All cases: KVHEADS=1, context_len=8192, query_seqlen=1
- Remove CaseVarSeq from batch_paged_attention (not needed for benchmark)
- Add dtype field to paged_attention_unroll cases and parameterize generate_inputs/paged_attention to read dtype from params
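The standardized case table above can be sketched as a small struct array. This is illustrative only: the struct and array names (`BenchCase`, `ALL_CASES`) and field layout are assumptions based on the commit message, not the project's actual definitions.

```cpp
#include <cstddef>

// Hypothetical shape of the standardized benchmark case table
// (names and layout are illustrative, not the project's actual code).
struct BenchCase {
    int q_heads;      // QHEADS
    int kv_heads;     // KVHEADS (1 in every case)
    int head_dim;     // HEADDIM
    int block_size;   // BLOCKSIZE
    int batch;
    int context_len;  // 8192 in every case
    int query_seqlen; // 1 in every case
};

// The 3 identical cases shared by paged_attention, batch_paged_attention,
// and paged_attention_unroll, per the commit message.
constexpr BenchCase ALL_CASES[] = {
    {16, 1, 128, 128, 256, 8192, 1}, // Case1
    {64, 1, 128,  64,  64, 8192, 1}, // Case2
    {64, 1, 256,  64,  64, 8192, 1}, // Case3
};
constexpr std::size_t NUM_CASES = sizeof(ALL_CASES) / sizeof(ALL_CASES[0]);
```

Keeping one shared table (rather than per-benchmark case lists) is what makes the three benchmarks directly comparable.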
…tive-sys#249)

* Add: MixedKernels type and resource shape definitions
  - Add pto_submit_types.h with MixedKernels struct, PTO2ResourceShape enum, PTO2SubtaskSlot enum, and active_mask/shape conversion helpers
  - Remove PTO2WorkerType enum from pto_runtime2_types.h (superseded by resource shapes)

* Refactor: submit API from (kernel_id, worker_type) to MixedKernels
  - Change the submit_task signature to take MixedKernels& instead of (kernel_id, worker_type), enabling multi-kernel mixed-task submission
  - Add pto2_rt_submit_aic_task / pto2_rt_submit_aiv_task convenience wrappers for single-kernel tasks
  - Implement pto2_submit_mixed_task with active_mask computation, AIV normalization (aiv1-only → aiv0 slot), and shape-based queue routing
  - Add mixed_task_id and subslot fields to PTO2DispatchPayload
  - Migrate all orchestration call sites to the new API

* Refactor: two-stage completion and shape-based ready queues in scheduler
  - Change ready queues from worker-type indexing to shape-based indexing (PTO2_NUM_RESOURCE_SHAPES queues instead of PTO2_NUM_WORKER_TYPES)
  - Add on_subtask_complete() for per-core subtask done-bit tracking
  - Rename on_task_complete to on_mixed_task_complete (fires only when all subtasks of a mixed task finish)
  - Route release_fanin_and_check_ready enqueues through the shape-based queue using pto2_active_mask_to_shape()
  - Remove stale extern declarations left over from the self-consumed check move

* Refactor: cluster-based dispatch and core assignment in executor
  - Add a Cluster struct (1 AIC + 2 AIVs) and extend CoreStateTracker with clusters[], core_idle[], and find_cluster_for_shape()
  - Add a shape_resource_count() constexpr lookup and get_dispatch_order() with even/odd thread differentiation for queue probe order
  - Extract pop_ready_task() and dispatch_subtask_to_core() helpers
  - Replace 5 duplicated dispatch blocks with a unified table-driven loop
  - Adapt local-first dispatch to the cluster model (find_cluster_for_shape instead of a per-type idle pool, overflow to the shape-based global queue)
  - Rewrite assign/reassign_cores_to_threads for cluster-aligned assignment
  - Wire the completion path through on_subtask_complete/on_mixed_task_complete
  - Fix completed_tasks_ to increment only on mixed-task completion, not per subtask, preventing early scheduler termination

* Add: mixed_example covering all 5 resource shapes
  - AIC_AIV_X2 (matmul + add + mul), AIC_ONLY (matmul), AIV_X1 (add), AIV_X2 (add + mul), AIC_AIV_X1 (matmul + add) per iteration
  - 5 kernels: matmul, add, mul, add_standalone, mul_standalone
  - 9 output tensors with golden verification (4 iterations × 5 shapes)

* Docs: submit-by-cluster docs

* Fix review comment
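The active_mask → shape routing described above can be sketched as follows. The bit layout, constant names, and enumerator ordering are assumptions for illustration; only the five shape names, the aiv1-only → aiv0 normalization, and the mask-to-shape routing come from the commit message.

```cpp
#include <cstdint>

// Illustrative reconstruction of the mask -> shape routing; the bit
// layout and names are assumptions, not the actual pto_submit_types.h.
enum PTO2ResourceShape : std::uint8_t {
    SHAPE_AIC_ONLY,
    SHAPE_AIV_X1,
    SHAPE_AIV_X2,
    SHAPE_AIC_AIV_X1,
    SHAPE_AIC_AIV_X2,
    PTO2_NUM_RESOURCE_SHAPES, // one ready queue per shape
};

constexpr std::uint8_t MASK_AIC  = 1u << 0;
constexpr std::uint8_t MASK_AIV0 = 1u << 1;
constexpr std::uint8_t MASK_AIV1 = 1u << 2;

// AIV normalization per the commit message: an aiv1-only submission
// is moved to the aiv0 slot so a single-AIV task always occupies slot 0.
constexpr std::uint8_t normalize_mask(std::uint8_t m) {
    return (m == MASK_AIV1) ? MASK_AIV0 : m;
}

// Map an active mask (which of the cluster's 1 AIC + 2 AIV cores a
// mixed task uses) to its resource shape; assumes a non-empty mask.
constexpr PTO2ResourceShape active_mask_to_shape(std::uint8_t m) {
    m = normalize_mask(m);
    const bool aic  = (m & MASK_AIC) != 0;
    const int  aivs = !!(m & MASK_AIV0) + !!(m & MASK_AIV1);
    if (aic && aivs == 2) return SHAPE_AIC_AIV_X2;
    if (aic && aivs == 1) return SHAPE_AIC_AIV_X1;
    if (aic)              return SHAPE_AIC_ONLY;
    return aivs == 2 ? SHAPE_AIV_X2 : SHAPE_AIV_X1;
}
```

The scheduler can then index its ready queues by the returned shape, and the executor's find_cluster_for_shape() only needs to check whether the shape's cores are idle within one cluster.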
Force-pushed from 78c4ebb to c8e53f0 (Compare)
- Move examples from runtime-first layout (host_build_graph/, aicpu_build_graph/, tensormap_and_ringbuffer/) to arch-first layout (a2a3/<runtime>/, a5/<runtime>/)
- Move device tests to matching tests/device_tests/<arch>/ layout
- Update ci.sh to extract arch from path and track per-task platforms, replacing global HW_PLATFORM/SIM_PLATFORM variables
- Add print_log_on_fail param to run_task() and fix attempt number display (off-by-one) in summary output
- Update benchmark_rounds.sh with -p/--platform flag to derive arch from platform name
- Update CLAUDE.md example path to new layout
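The arch-from-path extraction the commit adds to ci.sh can be illustrated with a small helper: in the new layout, the first path component under examples/ (or tests/device_tests/) names the architecture. The real logic lives in shell; this C++ sketch, with an assumed `arch_from_path` name, mirrors only the idea.

```cpp
#include <string>

// Given "examples/a2a3/host_build_graph/..." and root "examples",
// return the architecture component ("a2a3"), or "" if the path is
// not under the root. Name and signature are illustrative.
std::string arch_from_path(const std::string& path, const std::string& root) {
    const std::string prefix = root + "/";
    if (path.rfind(prefix, 0) != 0) return ""; // path does not start with root/
    const std::size_t start = prefix.size();
    const std::size_t slash = path.find('/', start);
    return path.substr(start, slash == std::string::npos ? std::string::npos
                                                         : slash - start);
}
```

Deriving the platform per task from its path is what lets ci.sh drop the global HW_PLATFORM/SIM_PLATFORM variables: each discovered example or test carries its own architecture.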
Force-pushed from c8e53f0 to 83537f9 (Compare)
Summary
- Reorganize examples/ and tests/device_tests/ by architecture: move existing runtime dirs (aicpu_build_graph, host_build_graph, tensormap_and_ringbuffer) under an a2a3/ subdirectory
- Add a5/paged_attention example and device test, copied from tensormap_and_ringbuffer/paged_attention with a5-specific adaptations: Stride → pto::Stride, pipe_barrier(PIPE_V) calls
- Update ci.sh discovery logic to filter by architecture prefix instead of runtime, matching the platform name (with the sim suffix stripped) to the top-level directory

Testing
- ./ci.sh -p a2a3sim
- ./ci.sh -p a2a3
- ./ci.sh -p a5
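The platform-name matching described in the summary (a sim suffix is stripped so both a2a3 and a2a3sim select the a2a3/ top-level directory) can be sketched as below. The helper name `arch_from_platform` is an assumption for illustration; ci.sh itself does this in shell.

```cpp
#include <string>

// Strip a trailing "sim" from a platform name to get the arch
// directory: "a2a3sim" -> "a2a3", "a5" -> "a5". Illustrative only.
std::string arch_from_platform(std::string platform) {
    const std::string suffix = "sim";
    if (platform.size() > suffix.size() &&
        platform.compare(platform.size() - suffix.size(),
                         suffix.size(), suffix) == 0) {
        platform.erase(platform.size() - suffix.size());
    }
    return platform;
}
```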