
Refactor: reorganize examples and tests by architecture (a2a3/a5) #1

Open

hw-native-sys-bot wants to merge 3 commits into main from refactor-arch-case-folders

Conversation

@hw-native-sys-bot (Owner) commented:

Summary

  • Reorganize examples/ and tests/device_tests/ by architecture: move the existing runtime directories (aicpu_build_graph, host_build_graph, tensormap_and_ringbuffer) under an a2a3/ subdirectory
  • Add a5/paged_attention example and device test, copied from tensormap_and_ringbuffer/paged_attention with a5-specific adaptations:
    • Rename Stride to pto::Stride
    • Remove all pipe_barrier(PIPE_V) calls
  • Update ci.sh discovery logic to filter by architecture prefix instead of by runtime, matching the platform name (with any sim suffix stripped) against the top-level directory

Testing

  • Simulation tests pass (./ci.sh -p a2a3sim)
  • Hardware tests pass (./ci.sh -p a2a3)
  • A5 platform tests pass (./ci.sh -p a5)

ChaoWao and others added 2 commits March 11, 2026 15:30
…w-native-sys#256)

- Standardize ALL_CASES to 3 identical cases in paged_attention,
  batch_paged_attention, and paged_attention_unroll for fair comparison
- Case1: QHEADS=16, HEADDIM=128, BLOCKSIZE=128, batch=256
- Case2: QHEADS=64, HEADDIM=128, BLOCKSIZE=64, batch=64
- Case3: QHEADS=64, HEADDIM=256, BLOCKSIZE=64, batch=64
- All cases: KVHEADS=1, context_len=8192, query_seqlen=1
- Remove CaseVarSeq from batch_paged_attention (not needed for benchmark)
- Add dtype field to paged_attention_unroll cases and parameterize
  generate_inputs/paged_attention to read dtype from params
…tive-sys#249)

* Add: MixedKernels type and resource shape definitions

- Add pto_submit_types.h with MixedKernels struct, PTO2ResourceShape enum,
  PTO2SubtaskSlot enum, and active_mask/shape conversion helpers
- Remove PTO2WorkerType enum from pto_runtime2_types.h (superseded by
  resource shapes)
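The active_mask/shape conversion helpers mentioned above can be sketched roughly as follows. This is an illustrative guess, not the contents of pto_submit_types.h: the bit layout, enum values, and function signature are assumptions based on the commit notes (one cluster = 1 AIC + 2 AIV cores, five resource shapes).

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout: one bit per core slot in a cluster (1 AIC + 2 AIV).
constexpr uint32_t PTO2_MASK_AIC  = 1u << 0;
constexpr uint32_t PTO2_MASK_AIV0 = 1u << 1;
constexpr uint32_t PTO2_MASK_AIV1 = 1u << 2;

// The five resource shapes named in the mixed_example commit.
enum PTO2ResourceShape {
    PTO2_SHAPE_AIC_ONLY = 0,
    PTO2_SHAPE_AIV_X1,
    PTO2_SHAPE_AIV_X2,
    PTO2_SHAPE_AIC_AIV_X1,
    PTO2_SHAPE_AIC_AIV_X2,
    PTO2_NUM_RESOURCE_SHAPES
};

// Map an active mask to a shape-based ready-queue index. Assumes
// aiv1-only masks were already normalized to the aiv0 slot at submit time.
PTO2ResourceShape pto2_active_mask_to_shape(uint32_t mask) {
    bool aic = mask & PTO2_MASK_AIC;
    int aivs = !!(mask & PTO2_MASK_AIV0) + !!(mask & PTO2_MASK_AIV1);
    if (aic && aivs == 2) return PTO2_SHAPE_AIC_AIV_X2;
    if (aic && aivs == 1) return PTO2_SHAPE_AIC_AIV_X1;
    if (aic)              return PTO2_SHAPE_AIC_ONLY;
    return aivs == 2 ? PTO2_SHAPE_AIV_X2 : PTO2_SHAPE_AIV_X1;
}
```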

* Refactor: submit API from (kernel_id, worker_type) to MixedKernels

- Change submit_task signature to take MixedKernels& instead of
  (kernel_id, worker_type), enabling multi-kernel mixed-task submission
- Add pto2_rt_submit_aic_task / pto2_rt_submit_aiv_task convenience
  wrappers for single-kernel tasks
- Implement pto2_submit_mixed_task with active_mask computation, AIV
  normalization (aiv1-only → aiv0 slot), and shape-based queue routing
- Add mixed_task_id and subslot fields to PTO2DispatchPayload
- Migrate all orchestration call sites to new API
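A minimal sketch of the MixedKernels submission path described above, including the AIV normalization step (aiv1-only → aiv0 slot). Field names and the -1 "slot unused" convention are hypothetical; the real struct and submit API live in pto_submit_types.h.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout matching the active_mask helpers.
constexpr uint32_t PTO2_MASK_AIC  = 1u << 0;
constexpr uint32_t PTO2_MASK_AIV0 = 1u << 1;
constexpr uint32_t PTO2_MASK_AIV1 = 1u << 2;

// Hypothetical shape of the MixedKernels struct: one kernel id per
// cluster slot, -1 meaning the slot is unused.
struct MixedKernels {
    int aic_kernel  = -1;
    int aiv0_kernel = -1;
    int aiv1_kernel = -1;
};

// Compute the active mask for a mixed task, normalizing an
// aiv1-only submission into the aiv0 slot first.
uint32_t pto2_compute_active_mask(MixedKernels& mk) {
    if (mk.aiv0_kernel < 0 && mk.aiv1_kernel >= 0) {
        mk.aiv0_kernel = mk.aiv1_kernel;   // AIV normalization
        mk.aiv1_kernel = -1;
    }
    uint32_t mask = 0;
    if (mk.aic_kernel  >= 0) mask |= PTO2_MASK_AIC;
    if (mk.aiv0_kernel >= 0) mask |= PTO2_MASK_AIV0;
    if (mk.aiv1_kernel >= 0) mask |= PTO2_MASK_AIV1;
    return mask;
}
```

The single-kernel convenience wrappers (pto2_rt_submit_aic_task / pto2_rt_submit_aiv_task) would then just fill one slot and delegate to the mixed-task path.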

* Refactor: two-stage completion and shape-based ready queues in scheduler

- Change ready queues from worker-type indexed to shape-based indexed
  (PTO2_NUM_RESOURCE_SHAPES queues instead of PTO2_NUM_WORKER_TYPES)
- Add on_subtask_complete() for per-core subtask done-bit tracking
- Rename on_task_complete to on_mixed_task_complete (fires only when
  all subtasks of a mixed task finish)
- Route release_fanin_and_check_ready enqueue through shape-based
  queue using pto2_active_mask_to_shape()
- Remove stale extern declarations left from self-consumed check move
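The two-stage completion described above (per-core done bits, mixed-task completion only when all subtasks finish) can be sketched like this. Struct and function names mirror the commit message, but the done-bit representation is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-task completion state: which subtask slots are
// active, and which have reported done so far.
struct MixedTaskState {
    uint32_t active_mask;     // bits set for the AIC/AIV0/AIV1 subtasks
    uint32_t done_mask = 0;   // bits set as each core reports done
};

// Stage 1: record a per-core subtask completion. Returns true exactly
// once, when the last outstanding subtask finishes -- the point at
// which on_mixed_task_complete fires and completed_tasks_ increments.
bool on_subtask_complete(MixedTaskState& t, uint32_t core_bit) {
    t.done_mask |= core_bit;
    return (t.done_mask & t.active_mask) == t.active_mask;
}
```

Incrementing the completed-task counter only on this final transition is what prevents the early scheduler termination fixed in the executor commit.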

* Refactor: cluster-based dispatch and core assignment in executor

- Add Cluster struct (1 AIC + 2 AIV) and extend CoreStateTracker with
  clusters[], core_idle[], and find_cluster_for_shape()
- Add shape_resource_count() constexpr lookup and get_dispatch_order()
  with even/odd thread differentiation for queue probe order
- Extract pop_ready_task() and dispatch_subtask_to_core() helpers
- Replace 5 duplicated dispatch blocks with unified table-driven loop
- Adapt local-first dispatch to cluster model (find_cluster_for_shape
  instead of per-type idle pool, overflow to shape-based global queue)
- Rewrite assign/reassign_cores_to_threads for cluster-aligned assignment
- Wire completion path through on_subtask_complete/on_mixed_task_complete
- Fix completed_tasks_ to increment only on mixed-task completion, not
  per-subtask, preventing early scheduler termination

* Add: mixed_example covering all 5 resource shapes

- AIC_AIV_X2 (matmul + add + mul), AIC_ONLY (matmul), AIV_X1 (add),
  AIV_X2 (add + mul), AIC_AIV_X1 (matmul + add) per iteration
- 5 kernels: matmul, add, mul, add_standalone, mul_standalone
- 9 output tensors with golden verification (4 iterations × 5 shapes)

* Docs: submit by cluster docs

* Fix review comment
@ChaoWao force-pushed the refactor-arch-case-folders branch 2 times, most recently from 78c4ebb to c8e53f0, on March 11, 2026 12:37
- Move examples from runtime-first layout (host_build_graph/,
  aicpu_build_graph/, tensormap_and_ringbuffer/) to arch-first
  layout (a2a3/<runtime>/, a5/<runtime>/)
- Move device tests to matching tests/device_tests/<arch>/ layout
- Update ci.sh to extract arch from path and track per-task
  platforms, replacing global HW_PLATFORM/SIM_PLATFORM variables
- Add print_log_on_fail param to run_task() and fix attempt
  number display (off-by-one) in summary output
- Update benchmark_rounds.sh with -p/--platform flag to derive
  arch from platform name
- Update CLAUDE.md example path to new layout
@ChaoWao force-pushed the refactor-arch-case-folders branch from c8e53f0 to 83537f9 on March 11, 2026 12:44