Commit f4cd713 ("Add files via upload")
Parent: a65f06c

60 files changed: 10007 additions & 1073 deletions

README.md

Lines changed: 59 additions & 5 deletions
@@ -2,7 +2,7 @@

 **A storage-aware MoE stack built for SSD+DRAM hybrid inference, with a full six-stage training pipeline.**

-[![PyPI](https://img.shields.io/pypi/v/project-chronos)](https://pypi.org/project/project-chronos/)
+[![PyPI](https://img.shields.io/pypi/v/Project_Chronos)](https://pypi.org/project/Project_Chronos/)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

@@ -104,7 +104,7 @@ Even the worst case does not hard-stop generation:
 output = avail[i] * expert_output + (1.0 - avail[i]) * shared_expert_output
 ```

-The shared expert is always resident, so generation continues while the missing expert finishes loading in the background. Quality degrades smoothly and recovers automatically once the expert becomes available.
+The shared expert is always resident, so generation continues while the missing expert finishes loading in the background. For exact lazy/offload comparison modes, Chronos synchronously materializes only the selected missing expert and evicts low-LRU experts to stay inside the resident budget; it does not silently full-load all experts. Quality degrades smoothly only when fallback mode is explicitly enabled.

 ---

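A minimal sketch of the materialize-then-evict policy described in the changed paragraph, assuming a hypothetical `ExpertCache` helper and `load_fn` loader (illustrative names, not the Chronos API):

```python
from collections import OrderedDict

class ExpertCache:
    """Sketch of the documented policy, not the Chronos implementation."""

    def __init__(self, budget, load_fn):
        self.budget = budget            # max experts resident at once
        self.load_fn = load_fn          # e.g. reads one expert's weights from SSD
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        # Exact lazy/offload comparison mode: synchronously materialize only
        # the selected missing expert, never the full expert set.
        weights = self.load_fn(expert_id)
        while len(self.resident) >= self.budget:
            self.resident.popitem(last=False)      # evict least-recently-used
        self.resident[expert_id] = weights
        return weights
```
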
@@ -187,6 +187,14 @@ flowchart LR
 | 5 GRPO | `train_chronos_grpo.py` | `PG * A - beta * KL` with `ToyReward` or pluggable `LMRewardModel` | 0.10 |
 | 6 Distill | `train_chronos_distill.py` | `alpha * T^2 * KL(student || teacher) + (1 - alpha) * CE` | 0.05 |

+Training dtype and resource policy:
+
+- `--dtype auto` is the default. MPS/MLX resolve to BF16-first for training stability, CUDA/XPU resolve to FP16, and CPU resolves to FP32 unless `--dtype float16` or `--dtype bfloat16` is set explicitly.
+- CPU training configures PyTorch to use physical cores by default. Override with `--cpu_threads` or `--cpu_budget_percent`.
+- On macOS, MPS/MLX training forces DataLoader workers to `0` by default to avoid Metal command-buffer crashes from multiprocessing. CPU/CUDA still use worker processes; advanced users can override the guard with `CHRONOS_ALLOW_METAL_DATALOADER_WORKERS=1`.
+- Native MLX training pushes UI logs, scalar readouts, and chart points every `log_interval` steps, and Web UI Stop is checked at each batch boundary.
+- The Web UI writes a warning-only `<checkpoint>.verify.json` after each stage. It checks no-mask vs all-available MoE parity and, on Apple Silicon, MLX prefill logits against the PyTorch CPU baseline.
+
 The full six-stage comparison harness lives in `tools/compare_minimind_chronos_v3.py`.

 ---
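
As a reading aid, the `--dtype auto` rule from the added policy list condenses to a few lines; `resolve_dtype` is an illustrative name, not the actual Chronos function:

```python
import torch

def resolve_dtype(requested: str, device_type: str) -> torch.dtype:
    """Sketch of the documented `--dtype auto` resolution policy."""
    explicit = {"float16": torch.float16, "bfloat16": torch.bfloat16,
                "float32": torch.float32}
    if requested in explicit:            # an explicit --dtype always wins
        return explicit[requested]
    if device_type in ("mps", "mlx"):    # BF16-first for training stability
        return torch.bfloat16
    if device_type in ("cuda", "xpu"):   # FP16 on CUDA/XPU
        return torch.float16
    return torch.float32                 # CPU default
```
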
@@ -219,6 +227,7 @@ d.describe() # human-readable capability summary
 - **First-class backends for training and inference**: `cpu`, `mps`, `cuda`, `mlx`
 - **Inference-only / experimental**: `vulkan` when PyTorch was custom-built with `USE_VULKAN=ON`
 - **Third-party extension hook**: `opencl`, via `chronos/backend/ext/opencl.py:PROBE()`
+- **Apple Silicon policy**: inference auto still prefers MLX; training keeps MLX on the native `chronos.mlx.*` path instead of calling `model.to("mlx")` on the PyTorch model.

 Honest note: upstream PyTorch does not ship a real OpenCL backend, and Vulkan support is still niche. Chronos provides a dispatcher seam so external integrations can plug in cleanly without touching core code.

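For the extension hook, a third-party probe might look roughly like the sketch below. Only the `PROBE()` entry point is documented; the return shape and the `pyopencl` dependency are assumptions for illustration:

```python
def PROBE():
    """Hypothetical probe for chronos/backend/ext/opencl.py.

    The dict shape is assumed; consult the real dispatcher seam
    for the actual contract.
    """
    try:
        import pyopencl as cl  # assumes an OpenCL binding is installed
        platforms = cl.get_platforms()
    except Exception:
        return {"available": False, "devices": []}
    return {
        "available": bool(platforms),
        "devices": [d.name for p in platforms for d in p.get_devices()],
    }
```
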
@@ -304,7 +313,7 @@ All lambda terms are searchable with Optuna TPE, together with structural hyperp
 ## Installation

 ```bash
-pip install project-chronos
+pip install Project_Chronos
 ```

 Or from source:
@@ -318,7 +327,7 @@ pip install -e ".[dev]"
 **MLX (Apple Silicon):**

 ```bash
-pip install "project-chronos[mlx]"
+pip install "Project_Chronos[mlx]"
 ```

 **vLLM serving (optional, Linux + CUDA only):**
@@ -337,7 +346,7 @@ pip install vllm

 ## Quick start

-### Web UI (M6: 7 tabs, 4 languages)
+### Web UI (M6: 8 tabs, 4 languages)

 ```bash
 chronos-ui
@@ -351,12 +360,31 @@ Tabs included:
 - `Train` with its own `data_path`
 - `6-Stage Pipeline` with per-stage dataset paths
 - `Inference`
+- `Export` for FP16/Q8_0 safetensors and GGUF deployment artifacts
 - `Benchmark` with Markdown table + bar plot
 - `Auto-Tune` with persistent logs and one-click `Apply Best -> Config`
 - `IO Monitor`

 Built-in i18n: `zh-Hans`, `zh-Hant`, `en`, `ja`

+### Deployment export
+
+```bash
+chronos export \
+  --model_path ./out/sft_384_moe.pth \
+  --output_dir ./exports/sft_384 \
+  --formats fp16-safetensors q8_0-safetensors fp16-gguf q8_0-gguf
+```
+
+Exports include `config.json`, `chronos_export_manifest.json`, and Chronos
+metadata for MoE top-k, shared fallback experts, lookahead router, hybrid
+attention, and optional expert-cache layout. Chronos can load exported
+`safetensors`/`GGUF` artifacts through its native lazy expert loader.
+
+Compatibility note: the GGUF files use `general.architecture=chronos`. Stock
+Ollama/llama.cpp builds need a Chronos architecture adapter to execute them
+correctly; Chronos is not a LLaMA tensor-layout clone.
+
 ### Stage 1: pretrain

 ```bash
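
A quick sanity check on an export directory is to read the manifest back. The snippet assumes only the two file names given above; the keys inside them are not specified here:

```python
import json
from pathlib import Path

# Inspect an export produced by `chronos export` (paths from the example above).
export_dir = Path("./exports/sft_384")
manifest = json.loads((export_dir / "chronos_export_manifest.json").read_text())
config = json.loads((export_dir / "config.json").read_text())

print(sorted(manifest))  # top-level manifest keys (contents unspecified here)
print(sorted(config))    # exported model config keys
```
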
@@ -386,6 +414,32 @@ python train_chronos_distill.py \
   --alpha 0.7 --temperature 4.0
 ```

+### Checkpoint and offload diagnostics
+
+Every new `.pth` checkpoint writes a sibling `*.config.json` with the MoE
+topology that cannot be recovered from tensor shapes, including
+`num_experts_per_tok`. Use the diagnostic command to verify chat-template
+generation, no-mask vs all-available masked drift, cold shared fallback,
+LookaheadRouter prediction quality, and SSD/RAM/VRAM offload stats.
+
+```bash
+python diagnose_checkpoint.py \
+  --model_path ./out/sft_384_moe.pth \
+  --config_path ./chronos_config.json \
+  --sft_data ./Dataset/sft_t2t.jsonl \
+  --mlx_parity \
+  --device cpu
+
+# or through the unified CLI:
+chronos diagnose --model_path ./out/sft_384_moe.pth --config_path ./chronos_config.json
+```
+
+For backend speed and dtype sanity checks:
+
+```bash
+python benchmark_training_backends.py --backends cpu mps mlx --dtypes auto bfloat16 float16 --steps 2
+```
+
 ### End-to-end comparison (minimind vs Chronos)

 ```bash
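
Because each checkpoint now ships a sibling `*.config.json`, the routing topology can be checked before running full diagnostics. A minimal sketch, assuming the sibling file is named `<stem>.config.json` next to the checkpoint and using only the documented `num_experts_per_tok` field:

```python
import json
from pathlib import Path

ckpt = Path("./out/sft_384_moe.pth")
# Assumed naming: the sibling topology file sits next to the checkpoint.
topology = json.loads(ckpt.with_name(ckpt.stem + ".config.json").read_text())
# `num_experts_per_tok` is documented; other fields are not specified here.
print("routed experts per token:", topology["num_experts_per_tok"])
```
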

README_zh.md

Lines changed: 55 additions & 6 deletions
@@ -2,7 +2,7 @@

 **An MoE framework with architecture-level native support for SSD+DRAM hybrid-loading inference, paired with a complete 6-stage training pipeline.**

-[![PyPI](https://img.shields.io/pypi/v/project-chronos)](https://pypi.org/project/project-chronos/)
+[![PyPI](https://img.shields.io/pypi/v/Project_Chronos)](https://pypi.org/project/Project_Chronos/)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

@@ -98,7 +98,7 @@ flowchart TB
 output = avail[i] * expert_output + (1.0 - avail[i]) * shared_expert_output
 ```

-The shared expert (resident in VRAM) is blended in proportionally: generation is **never interrupted**, quality degrades smoothly, and it recovers automatically once the expert finishes loading in the background
+The shared expert stays resident; in exact lazy/offload comparison modes, Chronos synchronously materializes only the missing expert currently selected by routing and evicts cold experts by LRU to stay inside the resident budget, never silently full-loading all experts. Quality degrades smoothly along the shared-expert path only when fallback mode is explicitly enabled

 ---

@@ -181,6 +181,14 @@ flowchart LR
 | 5 GRPO | `train_chronos_grpo.py` | PG·A − β·KL (with ToyReward / pluggable LMRewardModel) | 0.10 |
 | 6 Distill | `train_chronos_distill.py` | α·T²·KL(s‖t) + (1−α)·CE | 0.05 |

+Training precision and resource policy:
+
+- `--dtype auto` is the default. MPS/MLX training resolves to BF16 first for stability and CUDA/XPU resolve to FP16; CPU defaults to FP32, and CPU autocast is enabled only when `--dtype float16` or `--dtype bfloat16` is passed explicitly.
+- CPU training uses physical cores by default; override with `--cpu_threads` or `--cpu_budget_percent`.
+- On macOS, MPS/MLX training forces DataLoader workers to `0` by default to avoid Metal command-buffer crashes triggered by multiprocessing. CPU/CUDA can still use worker processes; advanced users can override this guard with `CHRONOS_ALLOW_METAL_DATALOADER_WORKERS=1`.
+- Native MLX training syncs UI logs, scalar readouts, and chart points every `log_interval` steps, and the Web UI Stop is checked at each batch boundary.
+- After each stage checkpoint is saved, the Web UI writes a warning-only `<checkpoint>.verify.json` that checks no-mask vs all-available MoE parity and, on Apple Silicon, compares MLX prefill logits against the PyTorch CPU baseline.
+
 See `tools/compare_minimind_chronos_v3.py` for the complete 6-stage end-to-end comparison.

 ---
@@ -212,6 +220,7 @@ d.describe() # human-readable capability overview
 - **First-class citizens (training + inference)**: `cpu`, `mps`, `cuda`, `mlx`
 - **Inference-only / experimental**: `vulkan` (present only in custom PyTorch builds with `USE_VULKAN=ON`)
 - **Third-party plugin hook**: `opencl` (replace `chronos/backend/ext/opencl.py:PROBE()`)
+- **Apple Silicon policy**: inference auto still prefers MLX; training goes through the native `chronos.mlx.*` path and never incorrectly calls `.to("mlx")` on the PyTorch model.

 Honest disclosure: upstream PyTorch has no OpenCL backend, and Vulkan is available only in custom builds. Chronos provides a dispatcher seam so third-party plugins can hook in without touching the core code.

@@ -296,7 +305,7 @@ D_{\mathrm{KL}}
 ## Installation (Not Ready in PyPI Yet)

 ```bash
-pip install project-chronos
+pip install Project_Chronos
 ```

 Or from source:
@@ -309,7 +318,7 @@ pip install -e ".[dev]"

 **MLX (Apple Silicon):**
 ```bash
-pip install "project-chronos[mlx]"
+pip install "Project_Chronos[mlx]"
 ```

 **vLLM serving (optional, Linux+CUDA only):**
@@ -327,15 +336,33 @@ pip install vllm

 ## Quick start

-### Web UI (M6: 7 tabs, 4 languages)
+### Web UI (M6: 8 tabs, 4 languages)

 ```bash
 chronos-ui
 #
 python chronos_app.py
 ```

-Includes: ⚙️ Config (live parameter-estimation panel on the right, Designer merged in) / 🏋️ Train (owns data_path) / 🧪 6-Stage Pipeline (per-stage dataset paths) / 💬 Inference / 📊 Benchmark (Markdown table + BarPlot) / 🔬 Auto-Tune (persistent logs + one-click Apply Best → Config) / 📡 IO Monitor. i18n support: zh-Hans / zh-Hant / en / ja.
+Includes: ⚙️ Config (live parameter-estimation panel on the right, Designer merged in) / 🏋️ Train (owns data_path) / 🧪 6-Stage Pipeline (per-stage dataset paths) / 💬 Inference (with lazy-load vs full-DRAM comparison) / 📦 Export (FP16/Q8_0 safetensors and GGUF) / 📊 Benchmark (Markdown table + BarPlot) / 🔬 Auto-Tune (persistent logs + one-click Apply Best → Config) / 📡 IO Monitor. i18n support: zh-Hans / zh-Hant / en / ja.
+
+### Deployment export
+
+```bash
+chronos export \
+  --model_path ./out/sft_384_moe.pth \
+  --output_dir ./exports/sft_384 \
+  --formats fp16-safetensors q8_0-safetensors fp16-gguf q8_0-gguf
+```
+
+Export artifacts include `config.json` and `chronos_export_manifest.json`, and carry Chronos metadata for MoE top-k, shared fallback experts, the lookahead router, hybrid attention, and the optional expert-cache layout. The native Chronos loader can run the lazy expert-loading path directly from exported `safetensors` / `GGUF` artifacts.
+
+Compatibility note: the GGUF files use `general.architecture=chronos`. Stock Ollama/llama.cpp builds that do not implement the Chronos architecture cannot execute this model correctly; Chronos is not a simple rename of the LLaMA tensor layout.

 ### Stage 1: pretrain

@@ -366,6 +393,28 @@ python train_chronos_distill.py \
   --alpha 0.7 --temperature 4.0
 ```

+### Checkpoint and offload diagnostics
+
+Every new `.pth` checkpoint also writes a sibling `*.config.json` with the MoE topology fields that cannot be recovered from weight shapes, such as `num_experts_per_tok`. The diagnostic command checks chat-template generation, `no_mask` vs `all_available` masked drift, fully cold shared fallback, LookaheadRouter prediction quality, and SSD/RAM/VRAM offload statistics.
+
+```bash
+python diagnose_checkpoint.py \
+  --model_path ./out/sft_384_moe.pth \
+  --config_path ./chronos_config.json \
+  --sft_data ../Dataset/sft_t2t.jsonl \
+  --mlx_parity \
+  --device cpu
+
+# or via the unified CLI:
+chronos diagnose --model_path ./out/sft_384_moe.pth --config_path ./chronos_config.json
+```
+
+Backend speed and dtype sanity checks:
+
+```bash
+python benchmark_training_backends.py --backends cpu mps mlx --dtypes auto bfloat16 float16 --steps 2
+```
+
 ### End-to-end comparison (minimind vs Chronos)

 ```bash
