ddlBoJack · ame700 · May 23, 2026 · May 23, 2026
diff --git a/README.md b/README.md
@@ -18,6 +18,7 @@
 </div>
 
 # News
+- [May. 2026] 🚀 Added ONNX export support with a hybrid inference recipe (backbone + extracted classifier head), int8 quantization, and a FunASR-free runtime example. See [`scripts/onnx/`](./scripts/onnx/README.md).
 - [Oct. 2024] 🔧 We update the usage in the FunASR interface with source selection. "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users. **We recommend using FunASR interface for a smooth landing.**
 - [Jun. 2024] 🔧 We fix a bug in emotion2vec+. Please re-pull the latest code. 
 - [May. 2024] 🔥 Speech emotion recognition foundation model: **emotion2vec+**, with 9-class emotions has been released on [Model Scope](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary) and [Hugging Face](https://huggingface.co/emotion2vec). Check out a series of emotion2vec+ (seed, base, large) models for SER with high performance **(We recommend this release instead of the Jan. 2024 release)**. 

diff --git a/scripts/onnx/README.md b/scripts/onnx/README.md
@@ -0,0 +1,150 @@
+# ONNX export workflow for emotion2vec
+
+End-to-end recipe for converting emotion2vec models (including the fine-tuned
+`emotion2vec_plus_*` classifiers) to **ONNX**, running them with
+`onnxruntime`, and validating the output against FunASR's `generate()`.
+
+## Background
+
+FunASR [PR #2359](https://github.com/modelscope/FunASR/pull/2359) (merged
+January 2025, shipped in `funasr >= 1.2.3`) added a `model.export()` path
+that traces the SSL backbone to ONNX. However:
+
+- The exported `forward` returns the **backbone output only** — shape
+  `[batch, sequence_length, embed_dim]` — i.e. the *features*, not the
+  9-class emotion probabilities.
+- For the fine-tuned classifier variants (`emotion2vec_plus_seed`,
+  `emotion2vec_plus_base`, `emotion2vec_plus_large`), the classification
+  head — a single `Linear(embed_dim, num_classes)` named `proj` — must be
+  **extracted separately from `model.pt`** and applied at inference time.
+- The exported file is named `emotion2vec` (no extension) — rename to
+  `*.onnx` for clarity.
+
+This directory provides the missing scripts plus a corrected int8
+quantization workflow.
+
+## The hybrid inference recipe
+
+```
+raw 16 kHz Float32 waveform   shape: [1, num_samples]
+            │
+            ▼  ONNX backbone (in onnxruntime)
+features                       shape: [1, T, embed_dim]
+            │
+            ▼  mean-pool over the time axis
+pooled                         shape: [embed_dim]
+            │
+            ▼  proj head (extracted from model.pt):  logits = W · pooled + b
+logits                         shape: [num_classes]
+            │
+            ▼  softmax
+probabilities                  shape: [num_classes]
+```
+
+The waveform-normalization step (`(x - mean) / sqrt(var + 1e-5)`) is
+**folded into the exported ONNX graph** by FunASR's `export_forward`, so
+the caller does not need to normalize the audio in Python or JavaScript:
+feed the raw 16 kHz waveform straight in.
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `export_backbone.py`   | Wraps `AutoModel(...).export(type='onnx', ...)`. |
+| `extract_head.py`      | Pulls `proj.weight`, `proj.bias`, and label names from `model.pt` + `tokens.txt` into a small JSON. |
+| `quantize.py`          | Dynamic int8 quantization, **with two refinements**: per-channel weight scales, and skipping activation×activation MatMul nodes (the attention's Q·Kᵀ and softmax·V, which quantize poorly). |
+| `validate.py`          | Runs FunASR `generate()` and the ONNX-hybrid path on the same audio and reports per-emotion drift. |
+| `inference_example.py` | Minimal standalone runtime — WAV in, emotion out, **no FunASR or PyTorch at runtime**. |
+| `requirements.txt`     | Python dependencies. |
+
+## Usage
+
+Install dependencies (a fresh venv is recommended):
+
+```bash
+pip install -r requirements.txt
+```
+
+### Step 1 — export the backbone
+
+```bash
+python export_backbone.py --model iic/emotion2vec_plus_large
+```
+
+The exported file lands in the ModelScope cache directory, typically
+`~/.cache/modelscope/hub/models/<model_id>/`. It is named `emotion2vec`
+(no extension). Rename it:
+
+```bash
+# Linux / macOS
+mv ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/emotion2vec \
+   emotion2vec.onnx
+```
+
+### Step 2 — extract the classifier head
+
+```bash
+python extract_head.py \
+  --checkpoint ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/model.pt \
+  --tokens     ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/tokens.txt \
+  --output     emotion2vec_head.json
+```
+
+Produces a ~160 KB JSON: `{"labels": [...], "weight": [[...]], "bias": [...]}`.
+
+### Step 3 (optional) — int8-quantize the ONNX
+
+```bash
+python quantize.py --input emotion2vec.onnx --output emotion2vec.int8.onnx
+```
+
+Typical size reduction: ~3× (e.g. 649 MB → 195 MB).
+
+### Step 4 — validate numerically against FunASR
+
+```bash
+python validate.py --model iic/emotion2vec_plus_large \
+                   --onnx  emotion2vec.onnx \
+                   --head  emotion2vec_head.json
+```
+
+On `emotion2vec_plus_large`, the fp32 ONNX per-class probabilities match
+FunASR `generate()` within ~3e-05 (fp32 rounding noise). The int8 build
+(step 3) drifts on the order of 1e-04 on confident inputs.
+
+### Step 5 — minimal runtime example
+
+```bash
+python inference_example.py --onnx emotion2vec.onnx \
+                            --head emotion2vec_head.json \
+                            --wav  some_clip_16k_mono.wav
+```
+
+This runs the entire hybrid pipeline using only `onnxruntime` + `numpy` +
+the head JSON — no `funasr` or `torch` at runtime. Useful for porting
+inference to other languages: the recipe (`session.run` → mean-pool →
+linear → softmax) is a handful of lines.
+
+## Notes
+
+- **`extract_features` vs full forward** — FunASR's `export_meta.py` wires
+  the export's `forward` to call `_original_forward(features_only=True)`,
+  which is equivalent to `extract_features`. The classifier `proj` is
+  applied *outside* this forward in `inference()`, which is why it's
+  absent from the ONNX.
+- **`emotion2vec_base` (representation model)** — has no `proj` head. The
+  ONNX backbone is the whole story; use the features directly.
+  `extract_head.py` will exit with a clear error if `proj.weight` isn't
+  found in the checkpoint.
+- **int8 quantization drift** — naive `quantize_dynamic` with
+  `op_types_to_quantize=['MatMul']` quantizes *every* MatMul including
+  the attention's activation×activation matmuls (Q·Kᵀ, softmax·V), which
+  drifts heavily (worst-case ~0.17 of probability mass on uncertain
+  inputs). `quantize.py` excludes those nodes by inspecting which MatMul
+  inputs are graph initializers (i.e. weights). This mirrors what
+  `torch.quantization.quantize_dynamic(model, {nn.Linear})` does naturally:
+  those matmuls aren't `nn.Linear` modules, so torch leaves them alone.
+- **Per-channel weights** — `per_channel=True` in `quantize_dynamic`
+  gives one scale per output channel rather than one per tensor;
+  standard practice for transformer weights and a meaningful drift
+  reduction.
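The node-selection idea from the int8 note above can be sketched without any ONNX dependency. This is a minimal illustration and not the contents of `quantize.py`, which is not shown in this diff: `Node`, `activation_matmuls`, and the sample node names are hypothetical. It also assumes weights appear as direct initializer inputs; a real graph may route them through `Transpose` or `Constant` nodes, which this simple check would misclassify.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Node:
    """Hypothetical stand-in for an ONNX node (name, op_type, input names)."""
    name: str
    op_type: str
    inputs: List[str]


def activation_matmuls(nodes: List[Node], initializers: Set[str]) -> List[str]:
    """Return names of MatMul nodes whose inputs are all activations.

    A MatMul with at least one initializer input is a weight matmul
    (reasonable to quantize); one with none is activation-by-activation.
    """
    return [
        n.name
        for n in nodes
        if n.op_type == "MatMul" and not any(i in initializers for i in n.inputs)
    ]


# Toy attention block: two weight projections and two activation matmuls.
initializers = {"Wq", "Wk"}
nodes = [
    Node("q_proj", "MatMul", ["x", "Wq"]),
    Node("k_proj", "MatMul", ["x", "Wk"]),
    Node("qk", "MatMul", ["q", "k_t"]),
    Node("softmax", "Softmax", ["scores"]),
    Node("attn_v", "MatMul", ["attn", "v"]),
]
exclude = activation_matmuls(nodes, initializers)
print(exclude)  # ['qk', 'attn_v']
```

With a real model, the same check would read `node.input` and `graph.initializer` from the loaded ONNX graph, and the resulting names could be passed to `onnxruntime.quantization.quantize_dynamic` through its `nodes_to_exclude` argument, alongside `per_channel=True`.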
diff --git a/scripts/onnx/export_backbone.py b/scripts/onnx/export_backbone.py
@@ -0,0 +1,69 @@
+"""
+Export the emotion2vec backbone to ONNX via FunASR's built-in exporter.
+
+This uses the model.export() path added in FunASR PR #2359 ("Make Emotion2vec
+support onnx", merged January 2025, shipped in funasr >= 1.2.3).
+
+The exported ONNX represents the SSL backbone only:
+    input   float32  [batch, num_samples]    (raw 16 kHz waveform)
+    output  float32  [batch, T, embed_dim]   (frame-level features)
+
+For fine-tuned classifier variants (emotion2vec_plus_*), the proj head is NOT
+in the exported graph - extract it separately with extract_head.py.
+
+Usage:
+    python export_backbone.py --model iic/emotion2vec_plus_large
+
+The file is written to the ModelScope cache directory and is named
+"emotion2vec" with no extension - rename to *.onnx for clarity.
+"""
+
+import argparse
+import os
+import sys
+
+try:
+    sys.stdout.reconfigure(encoding="utf-8")
+except Exception:
+    pass
+
+from funasr import AutoModel
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--model", default="iic/emotion2vec_plus_large",
+                    help="ModelScope model id (default: iic/emotion2vec_plus_large)")
+    ap.add_argument("--opset", type=int, default=13,
+                    help="ONNX opset version (default: 13, matches PR #2359)")
+    ap.add_argument("--quantize", action="store_true",
+                    help="Apply FunASR's built-in quantization during export. "
+                         "Not recommended - use quantize.py for a tuned int8 build.")
+    args = ap.parse_args()
+
+    print(f"Loading {args.model} ...")
+    model = AutoModel(model=args.model, disable_update=True)
+
+    print(f"Exporting to ONNX  (opset={args.opset}, quantize={args.quantize}) ...")
+    result = model.export(type="onnx", quantize=args.quantize, opset_version=args.opset)
+    print(f"export() returned: {result}")
+
+    paths = [result] if isinstance(result, (str, os.PathLike)) else list(result or [])
+    for p in paths:
+        p = str(p)
+        if os.path.isdir(p):
+            print(f"\nDIR  {p}")
+            for f in sorted(os.listdir(p)):
+                fp = os.path.join(p, f)
+                if os.path.isfile(fp):
+                    size = os.path.getsize(fp) / 1e6
+                    print(f"     {f}  ({size:.1f} MB)")
+        elif os.path.isfile(p):
+            print(f"\nFILE {p}  ({os.path.getsize(p) / 1e6:.1f} MB)")
+
+    print("\nNote: the exported ONNX is named 'emotion2vec' with no extension.")
+    print("      Rename it to 'emotion2vec.onnx' before passing to the next steps.")
+
+
+if __name__ == "__main__":
+    main()
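The rename that the script prints as a note could also be automated. The following is a sketch only: `copy_with_onnx_suffix` is a hypothetical helper, and it assumes the export directory contains a file literally named `emotion2vec`, as the docstring above describes. The demo runs against a throwaway directory rather than the real ModelScope cache.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_onnx_suffix(export_dir: str, dest: str = "emotion2vec.onnx") -> Path:
    """Copy the extension-less export to a *.onnx path and return that path.

    Copying (rather than moving) leaves the cache directory untouched.
    """
    src = Path(export_dir) / "emotion2vec"
    if not src.is_file():
        raise FileNotFoundError(f"no exported graph at {src}")
    dest_path = Path(dest)
    shutil.copyfile(src, dest_path)
    return dest_path


# Demo with placeholder bytes standing in for the exported graph.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "emotion2vec").write_bytes(b"fake-onnx-bytes")
    out = copy_with_onnx_suffix(tmp, dest=str(Path(tmp) / "emotion2vec.onnx"))
    copied_ok = out.read_bytes() == b"fake-onnx-bytes"
    print(out.name, copied_ok)
```

Raising `FileNotFoundError` early keeps a missing export from surfacing later as a confusing `onnxruntime` load failure.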
diff --git a/scripts/onnx/extract_head.py b/scripts/onnx/extract_head.py
@@ -0,0 +1,105 @@
+"""
+Extract the classification head (proj layer + labels) from a fine-tuned
+emotion2vec_plus_* checkpoint into a JSON file.
+
+FunASR's model.export() exports the SSL backbone only. For the fine-tuned
+classifier variants the model architecture is:
+
+    backbone(waveform) -> features [T, embed_dim]
+    pooled = features.mean(time)                    # mean-pool over frames
+    logits = proj(pooled)                           # Linear(embed_dim, num_classes)
+    probs  = softmax(logits)
+
+The proj layer (`Linear(embed_dim, num_classes)`) lives in the checkpoint
+under the keys `proj.weight` and `proj.bias`. We dump those plus the label
+names (read from tokens.txt) into a small JSON, so the classifier can be
+applied at inference time in any language - it's just a matmul and a softmax.
+
+Usage:
+    python extract_head.py \\
+        --checkpoint ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/model.pt \\
+        --tokens     ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/tokens.txt \\
+        --output     emotion2vec_head.json
+
+For the SSL representation models (e.g. emotion2vec_base) there is no proj
+head; the script exits with a clear error in that case.
+"""
+
+import argparse
+import json
+import os
+import sys
+
+try:
+    sys.stdout.reconfigure(encoding="utf-8")
+except Exception:
+    pass
+
+import torch
+
+
+def normalize_label(raw: str) -> str:
+    """Map a raw token to a clean English label.
+
+    FunASR's tokens.txt entries look like "<chinese>/english" (e.g. "生气/angry"),
+    plus a special "<unk>" token which we surface as "unknown".
+    """
+    if not raw or raw.strip() == "<unk>":
+        return "unknown"
+    if "/" in raw:
+        return raw.split("/")[-1].strip().lower()
+    return raw.strip().lower()
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    ap.add_argument("--checkpoint", required=True, help="path to model.pt")
+    ap.add_argument("--tokens", required=True, help="path to tokens.txt")
+    ap.add_argument("--output", default="emotion2vec_head.json",
+                    help="output JSON path (default: emotion2vec_head.json)")
+    args = ap.parse_args()
+
+    print(f"Loading checkpoint: {args.checkpoint}")
+    ck = torch.load(args.checkpoint, map_location="cpu")
+    # FunASR / fairseq checkpoints are dicts with a 'model' sub-dict; some are
+    # plain state_dicts. Handle both.
+    if isinstance(ck, dict) and "model" in ck:
+        sd = ck["model"]
+    else:
+        sd = ck
+
+    if "proj.weight" not in sd or "proj.bias" not in sd:
+        sys.exit(
+            "ERROR: proj.weight / proj.bias not found in checkpoint.\n"
+            "       This is likely an SSL/representation model (e.g. emotion2vec_base)\n"
+            "       with no classification head. The ONNX backbone alone is the\n"
+            "       complete inference graph for that variant - use its features\n"
+            "       directly. This script is for fine-tuned classifier variants\n"
+            "       (emotion2vec_plus_seed / _base / _large)."
+        )
+
+    W = sd["proj.weight"]
+    B = sd["proj.bias"]
+    print(f"  proj.weight {tuple(W.shape)}   proj.bias {tuple(B.shape)}")
+
+    with open(args.tokens, encoding="utf-8") as f:
+        raw_labels = [line.strip() for line in f if line.strip()]
+    labels = [normalize_label(lab) for lab in raw_labels]
+
+    if len(labels) != W.shape[0]:
+        sys.exit(f"ERROR: label count ({len(labels)}) != proj output dim ({W.shape[0]})")
+
+    print(f"  labels: {labels}")
+
+    out = {
+        "labels": labels,
+        "weight": W.tolist(),  # shape [num_classes, embed_dim]
+        "bias": B.tolist(),    # shape [num_classes]
+    }
+    with open(args.output, "w", encoding="utf-8") as f:
+        json.dump(out, f)
+    print(f"Wrote {args.output} ({os.path.getsize(args.output)} bytes)")
+
+
+if __name__ == "__main__":
+    main()
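To show how the JSON written by this script might be consumed, here is a dependency-free sketch of the mean-pool, linear, softmax steps. It is an illustration under stated assumptions, not the contents of `inference_example.py`: `classify` is a hypothetical name, the toy head values are made up, and `features` stands in for the backbone output of one utterance with the batch axis already removed.

```python
import json
import math
from typing import Dict, List


def classify(features: List[List[float]], head: Dict) -> Dict[str, float]:
    """Apply mean-pool -> linear -> softmax to frame-level features.

    features: T frames of embed_dim floats (one utterance).
    head:     dict with "labels", "weight" [num_classes][embed_dim], "bias".
    """
    t = len(features)
    dim = len(features[0])
    pooled = [sum(frame[d] for frame in features) / t for d in range(dim)]
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(head["weight"], head["bias"])
    ]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(head["labels"], exps)}


# Toy head in the same layout as the JSON written above (values are made up).
head = json.loads(
    '{"labels": ["angry", "happy"],'
    ' "weight": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],'
    ' "bias": [0.0, 0.0]}'
)
features = [[2.0, 0.0, 1.0], [0.0, 0.0, 3.0]]  # T=2 frames, embed_dim=3
probs = classify(features, head)
print(max(probs, key=probs.get))  # angry
```

In a real runtime the nested lists would typically be replaced by array operations, but the arithmetic is the same, which is what makes the head easy to port to other languages.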