
Refactor: reorganize examples and tests by architecture (a2a3/a5) #1

Open

hw-native-sys-bot wants to merge 3 commits into main from refactor-arch-case-folders

Conversation

@hw-native-sys-bot (Owner) commented:

Summary

  • Reorganize examples/ and tests/device_tests/ by architecture: move the existing runtime directories (aicpu_build_graph, host_build_graph, tensormap_and_ringbuffer) under an a2a3/ subdirectory
  • Add a5/paged_attention example and device test, copied from tensormap_and_ringbuffer/paged_attention with a5-specific adaptations:
    • Rename Stride to pto::Stride
    • Remove all pipe_barrier(PIPE_V) calls
  • Update ci.sh discovery logic to filter by architecture prefix instead of by runtime, matching the platform name (with any sim suffix stripped) against the top-level directory

Testing

  • Simulation tests pass (./ci.sh -p a2a3sim)
  • Hardware tests pass (./ci.sh -p a2a3)
  • A5 platform tests pass (./ci.sh -p a5)

ChaoWao and others added 2 commits March 11, 2026 15:30
…w-native-sys#256)

- Standardize ALL_CASES to 3 identical cases in paged_attention,
  batch_paged_attention, and paged_attention_unroll for fair comparison
- Case1: QHEADS=16, HEADDIM=128, BLOCKSIZE=128, batch=256
- Case2: QHEADS=64, HEADDIM=128, BLOCKSIZE=64, batch=64
- Case3: QHEADS=64, HEADDIM=256, BLOCKSIZE=64, batch=64
- All cases: KVHEADS=1, context_len=8192, query_seqlen=1
- Remove CaseVarSeq from batch_paged_attention (not needed for benchmark)
- Add dtype field to paged_attention_unroll cases and parameterize
  generate_inputs/paged_attention to read dtype from params
…tive-sys#249)

* Add: MixedKernels type and resource shape definitions

- Add pto_submit_types.h with MixedKernels struct, PTO2ResourceShape enum,
  PTO2SubtaskSlot enum, and active_mask/shape conversion helpers
- Remove PTO2WorkerType enum from pto_runtime2_types.h (superseded by
  resource shapes)
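The active_mask/shape conversion helpers mentioned above can be sketched roughly as follows. This is an illustrative guess, not the contents of pto_submit_types.h: the bit layout, enum values, and function signature are assumptions based on the commit notes (one cluster = 1 AIC + 2 AIV cores, five resource shapes).

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout: one bit per core slot in a cluster (1 AIC + 2 AIV).
constexpr uint32_t PTO2_MASK_AIC  = 1u << 0;
constexpr uint32_t PTO2_MASK_AIV0 = 1u << 1;
constexpr uint32_t PTO2_MASK_AIV1 = 1u << 2;

// The five resource shapes named in the mixed_example commit.
enum PTO2ResourceShape {
    PTO2_SHAPE_AIC_ONLY = 0,
    PTO2_SHAPE_AIV_X1,
    PTO2_SHAPE_AIV_X2,
    PTO2_SHAPE_AIC_AIV_X1,
    PTO2_SHAPE_AIC_AIV_X2,
    PTO2_NUM_RESOURCE_SHAPES
};

// Map an active mask to a shape-based ready-queue index. Assumes
// aiv1-only masks were already normalized to the aiv0 slot at submit time.
PTO2ResourceShape pto2_active_mask_to_shape(uint32_t mask) {
    bool aic = mask & PTO2_MASK_AIC;
    int aivs = !!(mask & PTO2_MASK_AIV0) + !!(mask & PTO2_MASK_AIV1);
    if (aic && aivs == 2) return PTO2_SHAPE_AIC_AIV_X2;
    if (aic && aivs == 1) return PTO2_SHAPE_AIC_AIV_X1;
    if (aic)              return PTO2_SHAPE_AIC_ONLY;
    return aivs == 2 ? PTO2_SHAPE_AIV_X2 : PTO2_SHAPE_AIV_X1;
}
```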

* Refactor: submit API from (kernel_id, worker_type) to MixedKernels

- Change submit_task signature to take MixedKernels& instead of
  (kernel_id, worker_type), enabling multi-kernel mixed-task submission
- Add pto2_rt_submit_aic_task / pto2_rt_submit_aiv_task convenience
  wrappers for single-kernel tasks
- Implement pto2_submit_mixed_task with active_mask computation, AIV
  normalization (aiv1-only → aiv0 slot), and shape-based queue routing
- Add mixed_task_id and subslot fields to PTO2DispatchPayload
- Migrate all orchestration call sites to new API
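A minimal sketch of the MixedKernels submission path described above, including the AIV normalization step (aiv1-only → aiv0 slot). Field names and the -1 "slot unused" convention are hypothetical; the real struct and submit API live in pto_submit_types.h.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout matching the active_mask helpers.
constexpr uint32_t PTO2_MASK_AIC  = 1u << 0;
constexpr uint32_t PTO2_MASK_AIV0 = 1u << 1;
constexpr uint32_t PTO2_MASK_AIV1 = 1u << 2;

// Hypothetical shape of the MixedKernels struct: one kernel id per
// cluster slot, -1 meaning the slot is unused.
struct MixedKernels {
    int aic_kernel  = -1;
    int aiv0_kernel = -1;
    int aiv1_kernel = -1;
};

// Compute the active mask for a mixed task, normalizing an
// aiv1-only submission into the aiv0 slot first.
uint32_t pto2_compute_active_mask(MixedKernels& mk) {
    if (mk.aiv0_kernel < 0 && mk.aiv1_kernel >= 0) {
        mk.aiv0_kernel = mk.aiv1_kernel;   // AIV normalization
        mk.aiv1_kernel = -1;
    }
    uint32_t mask = 0;
    if (mk.aic_kernel  >= 0) mask |= PTO2_MASK_AIC;
    if (mk.aiv0_kernel >= 0) mask |= PTO2_MASK_AIV0;
    if (mk.aiv1_kernel >= 0) mask |= PTO2_MASK_AIV1;
    return mask;
}
```

The single-kernel convenience wrappers (pto2_rt_submit_aic_task / pto2_rt_submit_aiv_task) would then just fill one slot and delegate to the mixed-task path.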

* Refactor: two-stage completion and shape-based ready queues in scheduler

- Change ready queues from worker-type indexed to shape-based indexed
  (PTO2_NUM_RESOURCE_SHAPES queues instead of PTO2_NUM_WORKER_TYPES)
- Add on_subtask_complete() for per-core subtask done-bit tracking
- Rename on_task_complete to on_mixed_task_complete (fires only when
  all subtasks of a mixed task finish)
- Route release_fanin_and_check_ready enqueue through shape-based
  queue using pto2_active_mask_to_shape()
- Remove stale extern declarations left from self-consumed check move
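The two-stage completion described above (per-core done bits, mixed-task completion only when all subtasks finish) can be sketched like this. Struct and function names mirror the commit message, but the done-bit representation is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-task completion state: which subtask slots are
// active, and which have reported done so far.
struct MixedTaskState {
    uint32_t active_mask;     // bits set for the AIC/AIV0/AIV1 subtasks
    uint32_t done_mask = 0;   // bits set as each core reports done
};

// Stage 1: record a per-core subtask completion. Returns true exactly
// once, when the last outstanding subtask finishes -- the point at
// which on_mixed_task_complete fires and completed_tasks_ increments.
bool on_subtask_complete(MixedTaskState& t, uint32_t core_bit) {
    t.done_mask |= core_bit;
    return (t.done_mask & t.active_mask) == t.active_mask;
}
```

Incrementing the completed-task counter only on this final transition is what prevents the early scheduler termination fixed in the executor commit.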

* Refactor: cluster-based dispatch and core assignment in executor

- Add Cluster struct (1 AIC + 2 AIV) and extend CoreStateTracker with
  clusters[], core_idle[], and find_cluster_for_shape()
- Add shape_resource_count() constexpr lookup and get_dispatch_order()
  with even/odd thread differentiation for queue probe order
- Extract pop_ready_task() and dispatch_subtask_to_core() helpers
- Replace 5 duplicated dispatch blocks with unified table-driven loop
- Adapt local-first dispatch to cluster model (find_cluster_for_shape
  instead of per-type idle pool, overflow to shape-based global queue)
- Rewrite assign/reassign_cores_to_threads for cluster-aligned assignment
- Wire completion path through on_subtask_complete/on_mixed_task_complete
- Fix completed_tasks_ to increment only on mixed-task completion, not
  per-subtask, preventing early scheduler termination

* Add: mixed_example covering all 5 resource shapes

- AIC_AIV_X2 (matmul + add + mul), AIC_ONLY (matmul), AIV_X1 (add),
  AIV_X2 (add + mul), AIC_AIV_X1 (matmul + add) per iteration
- 5 kernels: matmul, add, mul, add_standalone, mul_standalone
- 9 output tensors with golden verification (4 iterations × 5 shapes)

* Docs: submit by cluster docs

* Fix review comment
@ChaoWao force-pushed the refactor-arch-case-folders branch 2 times, most recently from 78c4ebb to c8e53f0, on March 11, 2026 12:37
- Move examples from runtime-first layout (host_build_graph/,
  aicpu_build_graph/, tensormap_and_ringbuffer/) to arch-first
  layout (a2a3/<runtime>/, a5/<runtime>/)
- Move device tests to matching tests/device_tests/<arch>/ layout
- Update ci.sh to extract arch from path and track per-task
  platforms, replacing global HW_PLATFORM/SIM_PLATFORM variables
- Add print_log_on_fail param to run_task() and fix attempt
  number display (off-by-one) in summary output
- Update benchmark_rounds.sh with -p/--platform flag to derive
  arch from platform name
- Update CLAUDE.md example path to new layout
@ChaoWao force-pushed the refactor-arch-case-folders branch from c8e53f0 to 83537f9 on March 11, 2026 12:44