2 changes: 2 additions & 0 deletions MODELS.md
@@ -27,6 +27,7 @@ Note: Keep the table columns padded with spaces and right-justify numeric cells
| Qwen/Qwen3-0.6B | n300 | functional | 99% | 100% | 943ms | 2.0 | 40960 |
| Qwen/Qwen3-0.6B | t3000 | functional | 98% | 100% | 229ms | 6.2 | 40960 |
| Qwen/Qwen3-30B-A3B | n150 | functional | 94% | 100% | 100081ms | 0.4 | 40960 |
| Qwen/Qwen3.5-35B-A3B | n150 | functional | 97% | 100% | 5403ms | 2.5 | 4096 |
| google/gemma-3-4b-it | n150 | functional | 92% | 100% | 98ms | 13.9 | 40960 |
| google/gemma-3-4b-it | n300 | functional | 94% | 100% | 535ms | 3.2 | 40960 |
| google/gemma-3-4b-it | t3000 | functional | 92% | 100% | 330ms | 4.7 | 40960 |
@@ -55,6 +56,7 @@ Note: Keep the table columns padded with spaces and right-justify numeric cells
| Qwen/Qwen3-0.6B | n300 | optimized | 99% | 100% | 54ms | 55.3 | 40960 |
| Qwen/Qwen3-0.6B | t3000 | optimized | 98% | 100% | 59ms | 61.9 | 40960 |
| Qwen/Qwen3-30B-A3B | n150 | optimized | 96% | 100% | 2197ms | 4.8 | 40960 |
| Qwen/Qwen3.5-35B-A3B | n150 | optimized | 96% | 100% | 5393ms | 4.0 | 4096 |
| google/gemma-3-4b-it | n150 | optimized | 92% | 100% | 70ms | 14.5 | 40960 |
| google/gemma-3-4b-it | n300 | optimized | 94% | 100% | 68ms | 18.5 | 40960 |
| google/gemma-3-4b-it | t3000 | optimized | 91% | 100% | 78ms | 19.4 | 40960 |
40 changes: 40 additions & 0 deletions models/Qwen/Qwen3.5-35B-A3B/n150/MODEL_BRINGUP.md
@@ -0,0 +1,40 @@
# MODEL_BRINGUP.md — models/Qwen/Qwen3.5-35B-A3B/n150

## Overview
Optimization pass for `models/Qwen/Qwen3.5-35B-A3B/n150` using `ttnn-model-optimization`.

Retained changes:
1. Decode trace is enabled by default (`QWEN35_USE_DECODE_TRACE=1`) and used in the optimized flow.
2. Trace capture targets the decode head (`hidden -> lm_head`) so capture avoids host-MoE writes.
3. Prefill-only on-device argmax (`next_token_device`) is kept to reduce TTFT.
4. Decode-only MoE route cap is kept at `decode_top_k=6` (env override: `QWEN35_DECODE_TOP_K`).
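The two env overrides above can be resolved with a small config reader. This is a hypothetical helper for illustration (the model code may wire these variables differently); only the variable names `QWEN35_USE_DECODE_TRACE` and `QWEN35_DECODE_TOP_K` and their defaults come from this document:

```python
import os

def read_decode_config(env=os.environ):
    """Resolve decode-path settings from environment overrides.

    Defaults mirror the retained changes: decode trace on by default,
    decode-only MoE route cap of 6 experts.
    """
    use_trace = env.get("QWEN35_USE_DECODE_TRACE", "1") != "0"
    decode_top_k = int(env.get("QWEN35_DECODE_TOP_K", "6"))
    return {"use_decode_trace": use_trace, "decode_top_k": decode_top_k}
```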

## Baseline vs Final

| Metric | Baseline (functional) | Final (optimized) | Delta |
|---|---:|---:|---:|
| Top-1 (100-token eval) | 97.00% | 96.00% | -1.00 pt |
| Top-5 (100-token eval) | 100.00% | 100.00% | 0.00 pt |
| TTFT | 5403 ms | 5393 ms | -10 ms |
| Decode throughput | 2.46 t/s/u | 4.04 t/s/u | +1.57 t/s/u (+63.8%) |
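The deltas in the table can be recomputed directly from the `YT_METRICS` values logged in `demo.log` (a quick sanity check, not part of the bringup flow):

```python
import json

# Metric fragments copied from the baseline and optimized YT_METRICS lines in demo.log.
baseline = json.loads('{"ttft_ms": 5403.092756867409, "decode_tps_u": 2.464734602922183}')
final = json.loads('{"ttft_ms": 5393.270309781656, "decode_tps_u": 4.037444069807221}')

ttft_delta_ms = final["ttft_ms"] - baseline["ttft_ms"]        # about -10 ms
tps_delta = final["decode_tps_u"] - baseline["decode_tps_u"]  # about +1.57 t/s/u
tps_pct = 100.0 * tps_delta / baseline["decode_tps_u"]        # about +63.8%
```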

## Decode Trace Status
- Optimized default path uses decode trace (`QWEN35_USE_DECODE_TRACE` defaults to on).
- Successful traced decode evidence from `demo.log`:
- `decode_trace: captured lm_head trace`
- `decode_trace: executing captured lm_head trace`
- Final measured run: `ttft_ms=5393.270309781656`, `decode_tps_u=4.037444069807221`.
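The two `decode_trace` log lines above reflect a capture-once, replay-thereafter pattern. A pure-Python stand-in for that flow (on device the replayed object is a captured command stream, not a callable):

```python
class DecodeHeadTrace:
    """Run the hidden -> lm_head computation once, replay it on later steps.

    First call records ("captures") the work; subsequent calls replay it.
    This mirrors the log lines in demo.log, not the actual device trace API.
    """
    def __init__(self, lm_head_fn):
        self.lm_head_fn = lm_head_fn
        self.captured = False

    def __call__(self, hidden):
        if not self.captured:
            print("decode_trace: captured lm_head trace")
            self.captured = True
        else:
            print("decode_trace: executing captured lm_head trace")
        return self.lm_head_fn(hidden)
```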

## Optimization Decisions
1. Kept decode-head trace capture/execute.
- Why: full decode trace is blocked by host MoE writes during capture; decode-head trace captures cleanly and executes every decode step after capture.
2. Kept decode route cap at 6.
- Why: improves decode throughput while preserving acceptable eval quality.
3. Kept prefill-only device argmax.
- Why: avoids full-vocab host transfer in prefill token selection path.
4. Rejected full decode trace capture.
   - Why: the runtime raises `TT_FATAL: Writes are not supported during trace capture` when host writes occur inside the capture region.
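Decision 2's route cap amounts to truncating the router's expert selection during decode only. A minimal sketch of the idea in plain Python (the real MoE routing runs on device; `model_top_k=8` is an assumed full route width, not stated in this document):

```python
import heapq

def route_experts(router_logits, model_top_k=8, decode_top_k=6, is_decode=True):
    """Pick expert indices for one token.

    Prefill keeps the model's full top-k; decode caps the route at
    decode_top_k (6 here) to cut per-step MoE work.
    """
    k = min(decode_top_k, model_top_k) if is_decode else model_top_k
    # Indices of the k largest router logits.
    best = heapq.nlargest(k, range(len(router_logits)), key=lambda i: router_logits[i])
    return sorted(best)
```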

## Commands Used
Demo command is logged in `demo.log`.
Eval command is logged in `eval.log`.
35 changes: 35 additions & 0 deletions models/Qwen/Qwen3.5-35B-A3B/n150/demo.log
@@ -0,0 +1,35 @@
# demo.log — models/Qwen/Qwen3.5-35B-A3B/n150

## Baseline (functional)
Command:
PYTHONPATH=/tmp/transformers520_custom:$PYTHONPATH TTNN_TRANSFORMERS_PYTHONPATH=/tmp/transformers520_custom HF_HOME=/localdev/moconnor/hf-cache HF_HUB_DISABLE_PROGRESS_BARS=1 TT_METAL_CACHE=/tmp/tt-metal-cache TT_METAL_RUNTIME_ROOT=/proj_sw/user_dev/moconnor/tt-metal python -u models/Qwen/Qwen3.5-35B-A3B/n150/run_demo_bf16.py models/Qwen/Qwen3.5-35B-A3B/n150/model.py --max-new-tokens 128 --max_seq_len 4096 --temperature 0 --seed 0

Output (key lines):
TTFT: 5403 ms | Decode: 2.5 t/s/u (127 tokens)
YT_METRICS={"mode": "tt_demo", "model": "Qwen/Qwen3.5-35B-A3B", "system": "n150", "mesh_shape": [1, 1], "prompt_tokens": 54, "generated_tokens": 128, "ttft_ms": 5403.092756867409, "decode_tps_u": 2.464734602922183, "decode_tokens": 127, "max_seq_len": 4096}

## Optimized (trace-enabled default)
Command:
PYTHONPATH=/tmp/transformers520_custom:$PYTHONPATH TTNN_TRANSFORMERS_PYTHONPATH=/tmp/transformers520_custom HF_HOME=/localdev/moconnor/hf-cache HF_HUB_DISABLE_PROGRESS_BARS=1 TT_METAL_CACHE=/tmp/tt-metal-cache TT_METAL_RUNTIME_ROOT=/proj_sw/user_dev/moconnor/tt-metal python -u models/Qwen/Qwen3.5-35B-A3B/n150/run_demo_bf16.py models/Qwen/Qwen3.5-35B-A3B/n150/model.py --max-new-tokens 128 --max_seq_len 4096 --temperature 0 --seed 0 --output-format yt_metrics

Output (key lines):
decode_trace: captured lm_head trace
2026-02-26 00:40:38.881 | warning | Metal | Allocating device buffers is unsafe due to the existence of an active trace. These buffers may be corrupted once a trace is executed. (allocator.cpp:105)
decode_trace: executing captured lm_head trace
YT_METRICS={"mode": "tt_demo", "model": "Qwen/Qwen3.5-35B-A3B", "system": "n150", "mesh_shape": [1, 1], "prompt_tokens": 54, "generated_tokens": 128, "ttft_ms": 5393.270309781656, "decode_tps_u": 4.037444069807221, "decode_tokens": 126, "max_seq_len": 4096}

## Coherence Evidence (optimized, trace-enabled)
Command:
PYTHONPATH=/tmp/transformers520_custom:$PYTHONPATH TTNN_TRANSFORMERS_PYTHONPATH=/tmp/transformers520_custom HF_HOME=/localdev/moconnor/hf-cache HF_HUB_DISABLE_PROGRESS_BARS=1 TT_METAL_CACHE=/tmp/tt-metal-cache TT_METAL_RUNTIME_ROOT=/proj_sw/user_dev/moconnor/tt-metal python -u models/Qwen/Qwen3.5-35B-A3B/n150/run_demo_bf16.py models/Qwen/Qwen3.5-35B-A3B/n150/model.py --max-new-tokens 64 --max_seq_len 4096 --temperature 0 --seed 0

Output (excerpt):
TT demo (n150)
Model: Qwen/Qwen3.5-35B-A3B
Mesh shape: 1x1
Prompt tokens: 54 | Generated tokens: 64
TTFT: 5382 ms | Decode: 3.9 t/s/u (62 tokens)

Output:
the future is not a straight line but a spiral, and that we are all just notes in a song we haven’t finished composing.
Journal entry, 1962: The moon is a silent promise. Tonight, the stars seem closer, as if they’re waiting for us to catch up. I sk
YT_METRICS={"mode": "tt_demo", "model": "Qwen/Qwen3.5-35B-A3B", "system": "n150", "mesh_shape": [1, 1], "prompt_tokens": 54, "generated_tokens": 64, "ttft_ms": 5382.3705250397325, "decode_tps_u": 3.8546857605168765, "decode_tokens": 62, "max_seq_len": 4096}
19 changes: 19 additions & 0 deletions models/Qwen/Qwen3.5-35B-A3B/n150/eval.log
@@ -0,0 +1,19 @@
# eval.log — models/Qwen/Qwen3.5-35B-A3B/n150

## Baseline (functional)
Command:
PYTHONPATH=/tmp/transformers520_custom:$PYTHONPATH TTNN_TRANSFORMERS_PYTHONPATH=/tmp/transformers520_custom HF_HOME=/localdev/moconnor/hf-cache HF_HUB_DISABLE_PROGRESS_BARS=1 TT_METAL_CACHE=/tmp/tt-metal-cache TT_METAL_RUNTIME_ROOT=/proj_sw/user_dev/moconnor/tt-metal python -u models/Qwen/Qwen3.5-35B-A3B/n150/run_eval_bf16.py models/Qwen/Qwen3.5-35B-A3B/n150/model.py --model Qwen/Qwen3.5-35B-A3B --prompt_file prompts/bringup_eval_long.txt --max_new_tokens 100 --max_seq_len 4096

Output (key lines):
Top-1 accuracy: 97.00% (0.9700)
Top-5 accuracy: 100.00% (1.0000)
YT_METRICS={"mode": "tt_eval", "model": "Qwen/Qwen3.5-35B-A3B", "top1": 0.97, "top5": 1.0, "top1_pct": 97.0, "top5_pct": 100.0, "total_tokens": 100, "max_new_tokens": 100, "max_seq_len": 4096}

## Optimized
Command:
PYTHONPATH=/tmp/transformers520_custom:$PYTHONPATH TTNN_TRANSFORMERS_PYTHONPATH=/tmp/transformers520_custom HF_HOME=/localdev/moconnor/hf-cache HF_HUB_DISABLE_PROGRESS_BARS=1 TT_METAL_CACHE=/tmp/tt-metal-cache TT_METAL_RUNTIME_ROOT=/proj_sw/user_dev/moconnor/tt-metal python -u models/Qwen/Qwen3.5-35B-A3B/n150/run_eval_bf16.py models/Qwen/Qwen3.5-35B-A3B/n150/model.py --model Qwen/Qwen3.5-35B-A3B --prompt_file prompts/bringup_eval_long.txt --max_new_tokens 100 --max_seq_len 4096

Output (key lines):
Top-1 accuracy: 96.00% (0.9600)
Top-5 accuracy: 100.00% (1.0000)
YT_METRICS={"mode": "tt_eval", "model": "Qwen/Qwen3.5-35B-A3B", "top1": 0.96, "top5": 1.0, "top1_pct": 96.0, "top5_pct": 100.0, "total_tokens": 100, "max_new_tokens": 100, "max_seq_len": 4096}