Commit f4cd713 ("Add files via upload")
Parent: a65f06c

60 files changed: 10007 additions & 1073 deletions

README.md

Lines changed: 59 additions & 5 deletions
@@ -2,7 +2,7 @@

 **A storage-aware MoE stack built for SSD+DRAM hybrid inference, with a full six-stage training pipeline.**

-[![PyPI](https://img.shields.io/pypi/v/project-chronos)](https://pypi.org/project/project-chronos/)
+[![PyPI](https://img.shields.io/pypi/v/Project_Chronos)](https://pypi.org/project/Project_Chronos/)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

@@ -104,7 +104,7 @@ Even the worst case does not hard-stop generation:
 output = avail[i] * expert_output + (1.0 - avail[i]) * shared_expert_output
 ```

-The shared expert is always resident, so generation continues while the missing expert finishes loading in the background. Quality degrades smoothly and recovers automatically once the expert becomes available.
+The shared expert is always resident, so generation continues while the missing expert finishes loading in the background. For exact lazy/offload comparison modes, Chronos synchronously materializes only the selected missing expert and evicts low-LRU experts to stay inside the resident budget; it does not silently full-load all experts. Quality degrades smoothly only when fallback mode is explicitly enabled.

 ---

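A minimal sketch of the materialize-then-evict policy described in the changed paragraph, assuming a hypothetical `ExpertCache` helper and `load_fn` loader (illustrative names, not the Chronos API):

```python
from collections import OrderedDict

class ExpertCache:
    """Sketch of the documented policy, not the Chronos implementation."""

    def __init__(self, budget, load_fn):
        self.budget = budget            # max experts resident at once
        self.load_fn = load_fn          # e.g. reads one expert's weights from SSD
        self.resident = OrderedDict()   # expert_id -> weights, in LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        # Exact lazy/offload comparison mode: synchronously materialize only
        # the selected missing expert, never the full expert set.
        weights = self.load_fn(expert_id)
        while len(self.resident) >= self.budget:
            self.resident.popitem(last=False)      # evict least-recently-used
        self.resident[expert_id] = weights
        return weights
```
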
@@ -187,6 +187,14 @@ flowchart LR
 | 5 GRPO | `train_chronos_grpo.py` | `PG * A - beta * KL` with `ToyReward` or pluggable `LMRewardModel` | 0.10 |
 | 6 Distill | `train_chronos_distill.py` | `alpha * T^2 * KL(student || teacher) + (1 - alpha) * CE` | 0.05 |

+Training dtype and resource policy:
+
+- `--dtype auto` is the default. MPS/MLX resolve to BF16-first for training stability, CUDA/XPU resolve to FP16, and CPU resolves to FP32 unless `--dtype float16` or `--dtype bfloat16` is set explicitly.
+- CPU training configures PyTorch to use physical cores by default. Override with `--cpu_threads` or `--cpu_budget_percent`.
+- On macOS, MPS/MLX training forces DataLoader workers to `0` by default to avoid Metal command-buffer crashes from multiprocessing. CPU/CUDA still use worker processes; advanced users can override the guard with `CHRONOS_ALLOW_METAL_DATALOADER_WORKERS=1`.
+- Native MLX training pushes UI logs, scalar readouts, and chart points every `log_interval` steps, and Web UI Stop is checked at each batch boundary.
+- The Web UI writes a warning-only `<checkpoint>.verify.json` after each stage. It checks no-mask vs all-available MoE parity and, on Apple Silicon, MLX prefill logits against the PyTorch CPU baseline.
+
 The full six-stage comparison harness lives in `tools/compare_minimind_chronos_v3.py`.

 ---
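
As a reading aid, the `--dtype auto` rule from the added policy list condenses to a few lines; `resolve_dtype` is an illustrative name, not the actual Chronos function:

```python
import torch

def resolve_dtype(requested: str, device_type: str) -> torch.dtype:
    """Sketch of the documented `--dtype auto` resolution policy."""
    explicit = {"float16": torch.float16, "bfloat16": torch.bfloat16,
                "float32": torch.float32}
    if requested in explicit:            # an explicit --dtype always wins
        return explicit[requested]
    if device_type in ("mps", "mlx"):    # BF16-first for training stability
        return torch.bfloat16
    if device_type in ("cuda", "xpu"):   # FP16 on CUDA/XPU
        return torch.float16
    return torch.float32                 # CPU default
```
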
@@ -219,6 +227,7 @@ d.describe() # human-readable capability summary
 - **First-class backends for training and inference**: `cpu`, `mps`, `cuda`, `mlx`
 - **Inference-only / experimental**: `vulkan` when PyTorch was custom-built with `USE_VULKAN=ON`
 - **Third-party extension hook**: `opencl`, via `chronos/backend/ext/opencl.py:PROBE()`
+- **Apple Silicon policy**: inference auto still prefers MLX; training keeps MLX on the native `chronos.mlx.*` path instead of calling `model.to("mlx")` on the PyTorch model.

 Honest note: upstream PyTorch does not ship a real OpenCL backend, and Vulkan support is still niche. Chronos provides a dispatcher seam so external integrations can plug in cleanly without touching core code.

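For the extension hook, a third-party probe might look roughly like the sketch below. Only the `PROBE()` entry point is documented; the return shape and the `pyopencl` dependency are assumptions for illustration:

```python
def PROBE():
    """Hypothetical probe for chronos/backend/ext/opencl.py.

    The dict shape is assumed; consult the real dispatcher seam
    for the actual contract.
    """
    try:
        import pyopencl as cl  # assumes an OpenCL binding is installed
        platforms = cl.get_platforms()
    except Exception:
        return {"available": False, "devices": []}
    return {
        "available": bool(platforms),
        "devices": [d.name for p in platforms for d in p.get_devices()],
    }
```
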
@@ -304,7 +313,7 @@ All lambda terms are searchable with Optuna TPE, together with structural hyperp
 ## Installation

 ```bash
-pip install project-chronos
+pip install Project_Chronos
 ```

 Or from source:
@@ -318,7 +327,7 @@ pip install -e ".[dev]"
 **MLX (Apple Silicon):**

 ```bash
-pip install "project-chronos[mlx]"
+pip install "Project_Chronos[mlx]"
 ```

 **vLLM serving (optional, Linux + CUDA only):**
@@ -337,7 +346,7 @@ pip install vllm

 ## Quick start

-### Web UI (M6: 7 tabs, 4 languages)
+### Web UI (M6: 8 tabs, 4 languages)

 ```bash
 chronos-ui
@@ -351,12 +360,31 @@ Tabs included:
 - `Train` with its own `data_path`
 - `6-Stage Pipeline` with per-stage dataset paths
 - `Inference`
+- `Export` for FP16/Q8_0 safetensors and GGUF deployment artifacts
 - `Benchmark` with Markdown table + bar plot
 - `Auto-Tune` with persistent logs and one-click `Apply Best -> Config`
 - `IO Monitor`

 Built-in i18n: `zh-Hans`, `zh-Hant`, `en`, `ja`

+### Deployment export
+
+```bash
+chronos export \
+  --model_path ./out/sft_384_moe.pth \
+  --output_dir ./exports/sft_384 \
+  --formats fp16-safetensors q8_0-safetensors fp16-gguf q8_0-gguf
+```
+
+Exports include `config.json`, `chronos_export_manifest.json`, and Chronos
+metadata for MoE top-k, shared fallback experts, lookahead router, hybrid
+attention, and optional expert-cache layout. Chronos can load exported
+`safetensors`/`GGUF` artifacts through its native lazy expert loader.
+
+Compatibility note: the GGUF files use `general.architecture=chronos`. Stock
+Ollama/llama.cpp builds need a Chronos architecture adapter to execute them
+correctly; Chronos is not a LLaMA tensor-layout clone.
+
 ### Stage 1: pretrain

 ```bash
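
A quick sanity check on an export directory is to read the manifest back. The snippet assumes only the two file names given above; the keys inside them are not specified here:

```python
import json
from pathlib import Path

# Inspect an export produced by `chronos export` (paths from the example above).
export_dir = Path("./exports/sft_384")
manifest = json.loads((export_dir / "chronos_export_manifest.json").read_text())
config = json.loads((export_dir / "config.json").read_text())

print(sorted(manifest))  # top-level manifest keys (contents unspecified here)
print(sorted(config))    # exported model config keys
```
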
@@ -386,6 +414,32 @@ python train_chronos_distill.py \
   --alpha 0.7 --temperature 4.0
 ```

+### Checkpoint and offload diagnostics
+
+Every new `.pth` checkpoint writes a sibling `*.config.json` with the MoE
+topology that cannot be recovered from tensor shapes, including
+`num_experts_per_tok`. Use the diagnostic command to verify chat-template
+generation, no-mask vs all-available masked drift, cold shared fallback,
+LookaheadRouter prediction quality, and SSD/RAM/VRAM offload stats.
+
+```bash
+python diagnose_checkpoint.py \
+  --model_path ./out/sft_384_moe.pth \
+  --config_path ./chronos_config.json \
+  --sft_data ./Dataset/sft_t2t.jsonl \
+  --mlx_parity \
+  --device cpu
+
+# or through the unified CLI:
+chronos diagnose --model_path ./out/sft_384_moe.pth --config_path ./chronos_config.json
+```
+
+For backend speed and dtype sanity checks:
+
+```bash
+python benchmark_training_backends.py --backends cpu mps mlx --dtypes auto bfloat16 float16 --steps 2
+```
+
 ### End-to-end comparison (minimind vs Chronos)

 ```bash
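
Because each checkpoint now ships a sibling `*.config.json`, the routing topology can be checked before running full diagnostics. A minimal sketch, assuming the sibling file is named `<stem>.config.json` next to the checkpoint and using only the documented `num_experts_per_tok` field:

```python
import json
from pathlib import Path

ckpt = Path("./out/sft_384_moe.pth")
# Assumed naming: the sibling topology file sits next to the checkpoint.
topology = json.loads(ckpt.with_name(ckpt.stem + ".config.json").read_text())
# `num_experts_per_tok` is documented; other fields are not specified here.
print("routed experts per token:", topology["num_experts_per_tok"])
```
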

README_zh.md

Lines changed: 55 additions & 6 deletions
@@ -2,7 +2,7 @@

 **An MoE framework with architecture-level native support for SSD+DRAM hybrid-loading inference, paired with a complete 6-stage training pipeline.**

-[![PyPI](https://img.shields.io/pypi/v/project-chronos)](https://pypi.org/project/project-chronos/)
+[![PyPI](https://img.shields.io/pypi/v/Project_Chronos)](https://pypi.org/project/Project_Chronos/)
 [![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
 [![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

@@ -98,7 +98,7 @@ flowchart TB
 output = avail[i] * expert_output + (1.0 - avail[i]) * shared_expert_output
 ```

-The shared expert (resident in VRAM) is blended in proportionally: generation is **never interrupted**, quality degrades smoothly, and it recovers automatically once the expert finishes loading in the background
+The shared expert stays resident; in exact lazy/offload comparison modes, Chronos synchronously materializes only the missing expert currently selected by routing and evicts cold experts by LRU to stay inside the resident budget, never silently full-loading all experts. Quality degrades smoothly along the shared-expert path only when fallback mode is explicitly enabled

 ---

@@ -181,6 +181,14 @@ flowchart LR
 | 5 GRPO | `train_chronos_grpo.py` | PG·A − β·KL (with ToyReward / pluggable LMRewardModel) | 0.10 |
 | 6 Distill | `train_chronos_distill.py` | α·T²·KL(s‖t) + (1−α)·CE | 0.05 |

+Training precision and resource policy:
+
+- `--dtype auto` is the default. MPS/MLX training resolves to BF16 first for stability and CUDA/XPU resolve to FP16; CPU defaults to FP32, and CPU autocast is enabled only when `--dtype float16` or `--dtype bfloat16` is passed explicitly.
+- CPU training uses physical cores by default; override with `--cpu_threads` or `--cpu_budget_percent`.
+- On macOS, MPS/MLX training forces DataLoader workers to `0` by default to avoid Metal command-buffer crashes triggered by multiprocessing. CPU/CUDA can still use worker processes; advanced users can override this guard with `CHRONOS_ALLOW_METAL_DATALOADER_WORKERS=1`.
+- Native MLX training syncs UI logs, scalar readouts, and chart points every `log_interval` steps, and the Web UI Stop is checked at each batch boundary.
+- After each stage checkpoint is saved, the Web UI writes a warning-only `<checkpoint>.verify.json` that checks no-mask vs all-available MoE parity and, on Apple Silicon, compares MLX prefill logits against the PyTorch CPU baseline.
+
 See `tools/compare_minimind_chronos_v3.py` for the complete 6-stage end-to-end comparison.

 ---
@@ -212,6 +220,7 @@ d.describe() # human-readable capability overview
 - **First-class citizens (training + inference)**: `cpu`, `mps`, `cuda`, `mlx`
 - **Inference-only / experimental**: `vulkan` (present only in custom PyTorch builds with `USE_VULKAN=ON`)
 - **Third-party plugin hook**: `opencl` (replace `chronos/backend/ext/opencl.py:PROBE()`)
+- **Apple Silicon policy**: inference auto still prefers MLX; training goes through the native `chronos.mlx.*` path and never incorrectly calls `.to("mlx")` on the PyTorch model.

 Honest disclosure: upstream PyTorch has no OpenCL backend, and Vulkan is available only in custom builds. Chronos provides a dispatcher seam so third-party plugins can hook in without touching the core code.

@@ -296,7 +305,7 @@ D_{\mathrm{KL}}
 ## Installation (Not Ready in PyPI Yet)

 ```bash
-pip install project-chronos
+pip install Project_Chronos
 ```

 Or from source:
@@ -309,7 +318,7 @@ pip install -e ".[dev]"

 **MLX (Apple Silicon):**
 ```bash
-pip install "project-chronos[mlx]"
+pip install "Project_Chronos[mlx]"
 ```

 **vLLM serving (optional, Linux+CUDA only):**
@@ -327,15 +336,33 @@ pip install vllm

 ## Quick start

-### Web UI (M6: 7 tabs, 4 languages)
+### Web UI (M6: 8 tabs, 4 languages)

 ```bash
 chronos-ui
 #
 python chronos_app.py
 ```

-Includes: ⚙️ Config (live parameter-estimation panel on the right, Designer merged in) / 🏋️ Train (owns data_path) / 🧪 6-Stage Pipeline (per-stage dataset paths) / 💬 Inference / 📊 Benchmark (Markdown table + BarPlot) / 🔬 Auto-Tune (persistent logs + one-click Apply Best → Config) / 📡 IO Monitor. i18n support: zh-Hans / zh-Hant / en / ja.
+Includes: ⚙️ Config (live parameter-estimation panel on the right, Designer merged in) / 🏋️ Train (owns data_path) / 🧪 6-Stage Pipeline (per-stage dataset paths) / 💬 Inference (with lazy-load vs full-DRAM comparison) / 📦 Export (FP16/Q8_0 safetensors and GGUF) / 📊 Benchmark (Markdown table + BarPlot) / 🔬 Auto-Tune (persistent logs + one-click Apply Best → Config) / 📡 IO Monitor. i18n support: zh-Hans / zh-Hant / en / ja.
+
+### Deployment export
+
+```bash
+chronos export \
+  --model_path ./out/sft_384_moe.pth \
+  --output_dir ./exports/sft_384 \
+  --formats fp16-safetensors q8_0-safetensors fp16-gguf q8_0-gguf
+```
+
+Export artifacts include `config.json` and `chronos_export_manifest.json`, and carry Chronos metadata for MoE top-k, shared fallback experts, the lookahead router, hybrid attention, and the optional expert-cache layout. The native Chronos loader can run the lazy expert-loading path directly from exported `safetensors` / `GGUF` artifacts.
+
+Compatibility note: the GGUF files use `general.architecture=chronos`. Stock Ollama/llama.cpp builds that do not implement the Chronos architecture cannot execute this model correctly; Chronos is not a simple rename of the LLaMA tensor layout.

 ### Stage 1: pretrain

@@ -366,6 +393,28 @@ python train_chronos_distill.py \
   --alpha 0.7 --temperature 4.0
 ```

+### Checkpoint and offload diagnostics
+
+Every new `.pth` checkpoint also writes a sibling `*.config.json` with the MoE topology fields that cannot be recovered from weight shapes, such as `num_experts_per_tok`. The diagnostic command checks chat-template generation, `no_mask` vs `all_available` masked drift, fully cold shared fallback, LookaheadRouter prediction quality, and SSD/RAM/VRAM offload statistics.
+
+```bash
+python diagnose_checkpoint.py \
+  --model_path ./out/sft_384_moe.pth \
+  --config_path ./chronos_config.json \
+  --sft_data ../Dataset/sft_t2t.jsonl \
+  --mlx_parity \
+  --device cpu
+
+# or via the unified CLI:
+chronos diagnose --model_path ./out/sft_384_moe.pth --config_path ./chronos_config.json
+```
+
+Backend speed and dtype sanity checks:
+
+```bash
+python benchmark_training_backends.py --backends cpu mps mlx --dtypes auto bfloat16 float16 --steps 2
+```
+
 ### End-to-end comparison (minimind vs Chronos)

 ```bash
