1 change: 1 addition & 0 deletions README.md
@@ -18,6 +18,7 @@
</div>

# News
- [May. 2026] 🚀 Added ONNX export support with a hybrid inference recipe (backbone + extracted classifier head), int8 quantization, and a FunASR-free runtime example. See [`scripts/onnx/`](./scripts/onnx/README.md).
- [Oct. 2024] 🔧 We update the usage in the FunASR interface with source selection. "ms" or "modelscope" for China mainland users; "hf" or "huggingface" for other overseas users. **We recommend using FunASR interface for a smooth landing.**
- [Jun. 2024] 🔧 We fix a bug in emotion2vec+. Please re-pull the latest code.
- [May. 2024] 🔥 Speech emotion recognition foundation model: **emotion2vec+**, with 9-class emotions has been released on [Model Scope](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary) and [Hugging Face](https://huggingface.co/emotion2vec). Check out a series of emotion2vec+ (seed, base, large) models for SER with high performance **(We recommend this release instead of the Jan. 2024 release)**.
150 changes: 150 additions & 0 deletions scripts/onnx/README.md
@@ -0,0 +1,150 @@
# ONNX export workflow for emotion2vec

End-to-end recipe for converting emotion2vec models (including the fine-tuned
`emotion2vec_plus_*` classifiers) to **ONNX**, running them with
`onnxruntime`, and validating the output against FunASR's `generate()`.

## Background

FunASR [PR #2359](https://github.com/modelscope/FunASR/pull/2359) (merged
January 2025, shipped in `funasr >= 1.2.3`) added a `model.export()` path
that traces the SSL backbone to ONNX. However:

- The exported `forward` returns the **backbone output only** — shape
`[batch, sequence_length, embed_dim]` — i.e. the *features*, not the
9-class emotion probabilities.
- For the fine-tuned classifier variants (`emotion2vec_plus_seed`,
`emotion2vec_plus_base`, `emotion2vec_plus_large`), the classification
head — a single `Linear(embed_dim, num_classes)` named `proj` — must be
**extracted separately from `model.pt`** and applied at inference time.
- The exported file is named `emotion2vec` (no extension) — rename to
`*.onnx` for clarity.

This directory provides the missing scripts plus a corrected int8
quantization workflow.

## The hybrid inference recipe

```
raw 16 kHz Float32 waveform    shape: [1, num_samples]
      ▼  ONNX backbone (in onnxruntime)
features                       shape: [1, T, embed_dim]
      ▼  mean-pool over the time axis
pooled                         shape: [embed_dim]
      ▼  proj head (extracted from model.pt): logits = W · pooled + b
logits                         shape: [num_classes]
      ▼  softmax
probabilities                  shape: [num_classes]
```
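
The post-backbone half of the diagram above can be sketched in a few lines. This is a minimal illustration, not the code in `inference_example.py`: the function name `classify` is hypothetical, and it is written with plain Python lists (no `numpy`) only so that it has no dependencies.

```python
import math


def classify(features, weight, bias, labels):
    """Mean-pool -> linear head -> softmax (hypothetical helper).

    features: T rows of embed_dim floats (backbone output for one clip)
    weight:   num_classes rows of embed_dim floats (proj.weight)
    bias:     num_classes floats (proj.bias)
    """
    num_frames = len(features)
    # Mean-pool over the time axis: one embed_dim vector per clip.
    pooled = [sum(column) / num_frames for column in zip(*features)]
    # Linear head: logits = W · pooled + b.
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weight, bias)
    ]
    # Softmax, shifted by the max logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy example: 2 frames, embed_dim = 2, 2 classes.
probs = classify(
    features=[[1.0, 0.0], [3.0, 2.0]],
    weight=[[1.0, 0.0], [0.0, 1.0]],
    bias=[0.0, 0.0],
    labels=["a", "b"],
)
```

In a real run, `features` would be the `[T, embed_dim]` slice of the ONNX backbone output and `weight` / `bias` / `labels` would come from the head JSON.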

The waveform-normalization step (`(x - mean) / sqrt(var + 1e-5)`) is
**folded into the exported ONNX graph** by FunASR's `export_forward`, so
no JS/Python preprocessing of the audio is required — feed the raw
waveform straight in.
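
For reference, the formula above corresponds to the following per-clip computation. This sketch only illustrates what the graph already does internally; you should not apply it yourself before calling the ONNX model. The function name is hypothetical, and the use of the population variance (dividing by `N`) is an assumption, since the text above only says `var`.

```python
import math


def normalize_waveform(samples, eps=1e-5):
    """(x - mean) / sqrt(var + eps) over one clip (illustrative only)."""
    n = len(samples)
    mean = sum(samples) / n
    # Assumption: population variance (divide by n, not n - 1).
    var = sum((s - mean) ** 2 for s in samples) / n
    scale = math.sqrt(var + eps)
    return [(s - mean) / scale for s in samples]


normalized = normalize_waveform([0.0, 1.0, 2.0, 3.0])
```

Because the result is centered and rescaled, the overall gain of the input recording has little effect on what the backbone sees.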

## Files

| File | Purpose |
|------|---------|
| `export_backbone.py` | Wraps `AutoModel(...).export(type='onnx', ...)`. |
| `extract_head.py` | Pulls `proj.weight`, `proj.bias`, and label names from `model.pt` + `tokens.txt` into a small JSON. |
| `quantize.py` | Dynamic int8 quantization, **with two refinements**: per-channel weight scales, and skipping activation×activation MatMul nodes (the attention's Q·Kᵀ and softmax·V, which quantize poorly). |
| `validate.py` | Runs FunASR `generate()` and the ONNX-hybrid path on the same audio and reports per-emotion drift. |
| `inference_example.py` | Minimal standalone runtime — WAV in, emotion out, **no FunASR or PyTorch at runtime**. |
| `requirements.txt` | Python dependencies. |

## Usage

Install dependencies (a fresh venv is recommended):

```bash
pip install -r requirements.txt
```

### Step 1 — export the backbone

```bash
python export_backbone.py --model iic/emotion2vec_plus_large
```

The exported file lands in the ModelScope cache directory, typically
`~/.cache/modelscope/hub/models/<model_id>/`. It is named `emotion2vec`
(no extension). Rename it:

```bash
# Linux / macOS
mv ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/emotion2vec \
   emotion2vec.onnx
```
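
On platforms without `mv`, the same step can be done from Python. This is a small sketch with a hypothetical helper name; it copies rather than moves, so the cache directory is left untouched, and it assumes the exported file is called `emotion2vec` as described above.

```python
import shutil
from pathlib import Path


def copy_exported_backbone(model_dir, dest="emotion2vec.onnx"):
    """Copy <model_dir>/emotion2vec to a file with an .onnx extension."""
    src = Path(model_dir).expanduser() / "emotion2vec"
    if not src.is_file():
        raise FileNotFoundError(f"no exported backbone at {src}")
    return Path(shutil.copyfile(src, dest))
```

For example, `copy_exported_backbone("~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large")` writes `emotion2vec.onnx` into the current directory, provided the cache lives at that typical location.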

### Step 2 — extract the classifier head

```bash
python extract_head.py \
    --checkpoint ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/model.pt \
    --tokens ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/tokens.txt \
    --output emotion2vec_head.json
```

Produces a ~160 KB JSON: `{labels: [...], weight: [[...]], bias: [...]}`.
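
A consumer of this JSON may want to check its shape before use. The loader below is a sketch with a hypothetical name (`load_head`); it relies only on the three keys listed above.

```python
import json


def load_head(path):
    """Load the head JSON and verify that its shapes are consistent."""
    with open(path, encoding="utf-8") as f:
        head = json.load(f)
    labels, weight, bias = head["labels"], head["weight"], head["bias"]
    # One weight row and one bias value per label.
    if not (len(labels) == len(weight) == len(bias)):
        raise ValueError(
            f"inconsistent head: {len(labels)} labels, "
            f"{len(weight)} weight rows, {len(bias)} bias values"
        )
    # Every weight row must have the same length (embed_dim).
    embed_dim = len(weight[0])
    if any(len(row) != embed_dim for row in weight):
        raise ValueError("weight rows have differing lengths")
    return labels, weight, bias
```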

### Step 3 (optional) — int8-quantize the ONNX

```bash
python quantize.py --input emotion2vec.onnx --output emotion2vec.int8.onnx
```

Typical size reduction: ~3× (e.g. 649 MB → 195 MB).

### Step 4 — validate numerically against FunASR

```bash
python validate.py --model iic/emotion2vec_plus_large \
    --onnx emotion2vec.onnx \
    --head emotion2vec_head.json
```

On `emotion2vec_plus_large`, the fp32 ONNX matches FunASR `generate()`
within ~3e-05 (numerical fp32 noise). The int8 build (step 3) drifts on
the order of 1e-04 on confident inputs.
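
One simple way to express such drift is the absolute difference per emotion between the two probability vectors. The helper below is a sketch under that assumption; the exact metric `validate.py` reports is not spelled out here, and the function name and sample numbers are made up for illustration.

```python
def per_label_drift(reference, candidate):
    """Absolute per-label difference between two {label: probability} dicts.

    Returns (drift_by_label, label_with_largest_drift).
    """
    if reference.keys() != candidate.keys():
        raise ValueError("label sets differ")
    drift = {label: abs(reference[label] - candidate[label]) for label in reference}
    worst = max(drift, key=drift.get)
    return drift, worst


# Made-up probabilities, purely to show the call shape.
drift, worst = per_label_drift(
    {"angry": 0.70, "happy": 0.20, "sad": 0.10},
    {"angry": 0.60, "happy": 0.25, "sad": 0.15},
)
```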

### Step 5 — minimal runtime example

```bash
python inference_example.py --onnx emotion2vec.onnx \
    --head emotion2vec_head.json \
    --wav some_clip_16k_mono.wav
```

This runs the entire hybrid pipeline using only `onnxruntime` + `numpy` +
the head JSON — no `funasr` or `torch` at runtime. Useful for porting
inference to other languages: the recipe (`session.run` → mean-pool →
linear → softmax) is a handful of lines.
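
The "WAV in" half can also be done without extra dependencies. The reader below is a sketch that assumes 16-bit mono PCM at 16 kHz and scales samples to `[-1.0, 1.0)`; whether `inference_example.py` reads audio this way is not shown here, and the function name is hypothetical.

```python
import array
import sys
import wave


def read_wav_mono16(path, expected_rate=16000):
    """Read a 16-bit mono PCM WAV file into a list of floats in [-1.0, 1.0)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wav.getframerate()} Hz"
            )
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h")
    pcm.frombytes(raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in pcm]
```

The returned list, wrapped as a `[1, num_samples]` float32 array, is what the backbone expects as input.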

## Notes

- **`extract_features` vs full forward** — FunASR's `export_meta.py` wires
the export's `forward` to call `_original_forward(features_only=True)`,
which is equivalent to `extract_features`. The classifier `proj` is
applied *outside* this forward in `inference()`, which is why it's
absent from the ONNX.
- **`emotion2vec_base` (representation model)** — has no `proj` head. The
ONNX backbone is the whole story; use the features directly.
`extract_head.py` will exit with a clear error if `proj.weight` isn't
found in the checkpoint.
- **int8 quantization drift** — naive `quantize_dynamic` with
`op_types_to_quantize=['MatMul']` quantizes *every* MatMul including
the attention's activation×activation matmuls (Q·Kᵀ, softmax·V), which
drifts heavily (worst-case ~0.17 of probability mass on uncertain
inputs). `quantize.py` excludes those nodes by inspecting which MatMul
inputs are graph initializers (i.e. weights). This mirrors what
`torch.quantization.quantize_dynamic(model, {nn.Linear})` does naturally:
those matmuls aren't `nn.Linear` modules, so torch leaves them alone.
- **Per-channel weights** — `per_channel=True` in `quantize_dynamic`
gives one scale per output channel rather than one per tensor;
standard practice for transformer weights and a meaningful drift
reduction.
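
The node-selection idea from the quantization notes above can be sketched as follows. This is an illustration of the described approach, not the contents of `quantize.py`: both function names are hypothetical, and it assumes that every MatMul node has a non-empty name and that weights are stored as graph initializers (not as `Constant` nodes).

```python
def activation_matmul_names(graph):
    """Names of MatMul nodes with no initializer (weight) among their inputs."""
    weight_names = {init.name for init in graph.initializer}
    return [
        node.name
        for node in graph.node
        if node.op_type == "MatMul"
        and not any(name in weight_names for name in node.input)
    ]


def quantize_weight_matmuls(src_path, dst_path):
    """Dynamic int8 quantization that skips activation x activation MatMuls."""
    # Imported lazily so the helper above works without these packages.
    import onnx
    from onnxruntime.quantization import QuantType, quantize_dynamic

    graph = onnx.load(src_path).graph
    quantize_dynamic(
        src_path,
        dst_path,
        op_types_to_quantize=["MatMul"],
        per_channel=True,  # one scale per output channel
        weight_type=QuantType.QInt8,
        nodes_to_exclude=activation_matmul_names(graph),
    )
```

Keeping the selection logic separate from the `quantize_dynamic` call makes it easy to print the excluded node names and check that only the attention matmuls were skipped.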
69 changes: 69 additions & 0 deletions scripts/onnx/export_backbone.py
@@ -0,0 +1,69 @@
"""
Export the emotion2vec backbone to ONNX via FunASR's built-in exporter.

This uses the model.export() path added in FunASR PR #2359 ("Make Emotion2vec
support onnx", merged January 2025, shipped in funasr >= 1.2.3).

The exported ONNX represents the SSL backbone only:
    input   float32 [batch, num_samples]    (raw 16 kHz waveform)
    output  float32 [batch, T, embed_dim]   (frame-level features)

For fine-tuned classifier variants (emotion2vec_plus_*), the proj head is NOT
in the exported graph - extract it separately with extract_head.py.

Usage:
    python export_backbone.py --model iic/emotion2vec_plus_large

The file is written to the ModelScope cache directory and is named
"emotion2vec" with no extension - rename to *.onnx for clarity.
"""

import argparse
import os
import sys

try:
    sys.stdout.reconfigure(encoding="utf-8")
except Exception:
    pass

from funasr import AutoModel


def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--model", default="iic/emotion2vec_plus_large",
                    help="ModelScope model id (default: iic/emotion2vec_plus_large)")
    ap.add_argument("--opset", type=int, default=13,
                    help="ONNX opset version (default: 13, matches PR #2359)")
    ap.add_argument("--quantize", action="store_true",
                    help="Apply FunASR's built-in quantization during export. "
                         "Not recommended - use quantize.py for a tuned int8 build.")
    args = ap.parse_args()

    print(f"Loading {args.model} ...")
    model = AutoModel(model=args.model, disable_update=True)

    print(f"Exporting to ONNX (opset={args.opset}, quantize={args.quantize}) ...")
    result = model.export(type="onnx", quantize=args.quantize, opset_version=args.opset)
    print(f"export() returned: {result}")

    # export() may return a single path or a collection of paths; handle both.
    paths = [result] if isinstance(result, (str, os.PathLike)) else list(result or [])
    for p in paths:
        p = str(p)
        if os.path.isdir(p):
            print(f"\nDIR {p}")
            for f in sorted(os.listdir(p)):
                fp = os.path.join(p, f)
                if os.path.isfile(fp):
                    size = os.path.getsize(fp) / 1e6
                    print(f" {f} ({size:.1f} MB)")
        elif os.path.isfile(p):
            print(f"\nFILE {p} ({os.path.getsize(p) / 1e6:.1f} MB)")

    print("\nNote: the exported ONNX is named 'emotion2vec' with no extension.")
    print(" Rename it to 'emotion2vec.onnx' before passing to the next steps.")


if __name__ == "__main__":
    main()
105 changes: 105 additions & 0 deletions scripts/onnx/extract_head.py
@@ -0,0 +1,105 @@
"""
Extract the classification head (proj layer + labels) from a fine-tuned
emotion2vec_plus_* checkpoint into a JSON file.

FunASR's model.export() exports the SSL backbone only. For the fine-tuned
classifier variants the model architecture is:

    backbone(waveform) -> features [T, embed_dim]
    pooled = features.mean(time)    # mean-pool over frames
    logits = proj(pooled)           # Linear(embed_dim, num_classes)
    probs  = softmax(logits)

The proj layer (`Linear(embed_dim, num_classes)`) lives in the checkpoint
under the keys `proj.weight` and `proj.bias`. We dump those plus the label
names (read from tokens.txt) into a small JSON, so the classifier can be
applied at inference time in any language - it's just a matmul and a softmax.

Usage:
    python extract_head.py \\
        --checkpoint ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/model.pt \\
        --tokens ~/.cache/modelscope/hub/models/iic/emotion2vec_plus_large/tokens.txt \\
        --output emotion2vec_head.json

For the SSL representation models (e.g. emotion2vec_base) there is no proj
head; the script exits with a clear error in that case.
"""

import argparse
import json
import os
import sys

try:
    sys.stdout.reconfigure(encoding="utf-8")
except Exception:
    pass

import torch


def normalize_label(raw: str) -> str:
    """Map a raw token to a clean english label.

    FunASR's tokens.txt entries look like "<chinese>/english" (e.g. "生气/angry"),
    plus a special "<unk>" token which we surface as "unknown".
    """
    if not raw or raw.strip() == "<unk>":
        return "unknown"
    if "/" in raw:
        return raw.split("/")[-1].strip().lower()
    return raw.strip().lower()


def main() -> None:
    ap = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--checkpoint", required=True, help="path to model.pt")
    ap.add_argument("--tokens", required=True, help="path to tokens.txt")
    ap.add_argument("--output", default="emotion2vec_head.json",
                    help="output JSON path (default: emotion2vec_head.json)")
    args = ap.parse_args()

    print(f"Loading checkpoint: {args.checkpoint}")
    ck = torch.load(args.checkpoint, map_location="cpu")
    # FunASR / fairseq checkpoints are dicts with a 'model' sub-dict; some are
    # plain state_dicts. Handle both.
    if isinstance(ck, dict) and "model" in ck:
        sd = ck["model"]
    else:
        sd = ck

    if "proj.weight" not in sd or "proj.bias" not in sd:
        sys.exit(
            "ERROR: proj.weight / proj.bias not found in checkpoint.\n"
            " This is likely an SSL/representation model (e.g. emotion2vec_base)\n"
            " with no classification head. The ONNX backbone alone is the\n"
            " complete inference graph for that variant - use its features\n"
            " directly. This script is for fine-tuned classifier variants\n"
            " (emotion2vec_plus_seed / _base / _large)."
        )

    W = sd["proj.weight"]
    B = sd["proj.bias"]
    print(f" proj.weight {tuple(W.shape)} proj.bias {tuple(B.shape)}")

    with open(args.tokens, encoding="utf-8") as f:
        raw_labels = [line.strip() for line in f if line.strip()]
    labels = [normalize_label(lab) for lab in raw_labels]

    if len(labels) != W.shape[0]:
        sys.exit(f"ERROR: label count ({len(labels)}) != proj output dim ({W.shape[0]})")

    print(f" labels: {labels}")

    out = {
        "labels": labels,
        "weight": W.tolist(),  # shape [num_classes, embed_dim]
        "bias": B.tolist(),    # shape [num_classes]
    }
    with open(args.output, "w") as f:
        json.dump(out, f)
    print(f"Wrote {args.output} ({os.path.getsize(args.output)} bytes)")


if __name__ == "__main__":
    main()