Reduce mish error by an alternative without softplus op #2618
ChinChangYang wants to merge 3 commits into apple:main
Conversation
```python
inputs = _get_inputs(context, node, expected=1)
x = inputs[0]

softplus = mb.softplus(x=x)
```
Looking at the PyTorch documentation, it seems the existing implementation is correct:
https://docs.pytorch.org/docs/stable/generated/torch.nn.Mish.html
If the existing (software) implementation is correct, it must be a hardware precision issue in the Neural Engine. This PR provides a (software) workaround to circumvent the precision issue. I anticipate that Apple’s low-level (hardware) developers will investigate this issue.
JiwaniZakir
left a comment
The algebraic derivation is correct — x * tanh(ln(1+eˣ)) simplifies to x·eˣ·(eˣ+2) / (e²ˣ+2eˣ+2), which is equivalent to the new formulation. However, there is a numerical stability concern for large negative x: as x → -∞, e = exp(x) → 0, so emep2 = e*(e+2) → 0 and tdemep2 = 2/emep2 overflows to infinity. The final real_div(x, inf) does produce the correct limit of 0, but this intermediate overflow may behave inconsistently across backends or hardware, which ironically trades one source of numerical error for another.
The original three-op path (softplus → tanh → mul) avoids this by computing softplus(x) = ln(1+eˣ) ≈ 0 directly for large negative x, never producing an overflow. It would strengthen this PR to include explicit test cases covering the large-negative-x regime (e.g., x = -30, -100) and to document which backends/targets exhibited the original softplus error, so reviewers can assess whether this tradeoff is worthwhile. The intermediate variable names (emep2, tdemep2, optdemep2) in ops.py are also difficult to parse; expanding the comment to label each step with the full subexpression (e.g., # 1 + 2/(e*(e+2))) would make the code far more maintainable.
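Both the algebraic equivalence and the large-negative-x failure mode described above can be checked with plain Python floats. This is only a sketch — `mish_ref` and `mish_exp` are hypothetical names, not the PR's actual MIL code:

```python
import math

def mish_ref(x):
    # Original three-op path: x * tanh(softplus(x)), with softplus
    # computed stably as log1p(exp(x)).
    return x * math.tanh(math.log1p(math.exp(x)))

def mish_exp(x):
    # The PR's reformulation: x / (1 + 2/(e*(e+2))) with e = exp(x).
    e = math.exp(x)
    emep2 = e * (e + 2.0)    # e*(e+2)
    tdemep2 = 2.0 / emep2    # 2/(e*(e+2)) -- blows up as x -> -inf
    return x / (1.0 + tdemep2)

# The two forms agree closely for moderate x.
for x in (-5.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(mish_ref(x) - mish_exp(x)) < 1e-12

# For very negative x, exp(x) underflows to exactly 0, so the
# intermediate 2/(e*(e+2)) divides by zero (ZeroDivisionError in
# Python; inf on IEEE hardware backends).
try:
    mish_exp(-1000.0)
    overflowed = False
except ZeroDivisionError:
    overflowed = True
```

The reference path never overflows because log1p(exp(x)) simply underflows toward 0 for large negative x, which is why the intermediate-overflow concern is specific to the new formulation.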
…d inputs Test uses a Conv2d+Mish+Flatten+Linear model with explicit uniform weights (conv=1.0, bias=0.0) and linspace inputs at three scales (0.1, 3.5, 11.0), producing known mish input intervals (~[-0.9,0.9], ~[-31.5,31.5], ~[-99,99]) to demonstrate stability across large negative and positive values on Neural Engine. This addresses the PR apple#2618 review feedback requesting deterministic test coverage of the large-negative-x regime (x=-30, x=-100). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the detailed review, @JiwaniZakir! I've added the requested tests.

Test Design

The test uses a Conv2d(1,16,3)+Mish+Flatten+Linear model (the minimal model size for Core ML to route to the Neural Engine) with fixed uniform weights and fixed input values. With kernel_size=3 and weight=1.0, each interior conv output pixel ≈ 9 × the local input value, so the mish input interval is ≈ [-9×scale, 9×scale]. Three scales are tested: 0.1, 3.5, and 11.0.
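The interval arithmetic behind the three scales can be sanity-checked in a few lines (illustrative only, not part of the PR's test code):

```python
# kernel_size=3 with weight=1.0 and bias=0.0 means each interior conv
# output sums 9 neighboring inputs, so inputs drawn from
# linspace(-scale, scale) give mish inputs spanning roughly
# [-9*scale, 9*scale].
intervals = {scale: 9 * scale for scale in (0.1, 3.5, 11.0)}

for scale, bound in [(0.1, 0.9), (3.5, 31.5), (11.0, 99.0)]:
    assert abs(intervals[scale] - bound) < 1e-9
```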
Results

With the original softplus-based mish (main branch), the CPU_AND_NE runs fail; with the new exp-based mish (this PR), they pass.

This confirms the original softplus error is Neural Engine specific — CPU produces correct results with both implementations. The error manifests once mish inputs reach the ±30 range on NE with FP16, and the new formulation resolves it. Note: the default test configuration uses …
The NaN at x = -inf is a real concern.
I justify my approach as follows:
The algebraic identity breaks down at x = -inf.
You are right. I've clamped the mish input to [-100, inf] to handle x = -inf.
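A plain-Python sketch of why the clamp fixes the edge case (a hypothetical scalar model of the formulation, not the PR's actual MIL code):

```python
import math

def mish_exp(x):
    # Exp-based formulation: x / (1 + 2/(e*(e+2))) with e = exp(x).
    e = math.exp(x)
    return x / (1.0 + 2.0 / (e * (e + 2.0)))

def mish_exp_clamped(x):
    # Clamping to [-100, inf) is safe: mish(-100) is already 0 to full
    # precision, so accuracy is unaffected, while the division by zero
    # at x = -inf (and the resulting -inf/inf = NaN) is avoided.
    return mish_exp(max(x, -100.0))

# Clamped, very negative and -inf inputs return ~0 as expected instead
# of hitting the degenerate division.
assert abs(mish_exp_clamped(-1000.0)) < 1e-30
assert abs(mish_exp_clamped(float("-inf"))) < 1e-30
```

Unclamped, `mish_exp(-1000.0)` already fails (exp underflows to 0, so the inner division is 2/0), which is the behavior the clamp guards against.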
+1 on cutting a new release — this fix makes a significant difference in practice (NE error dropping from ~2.95 to ~0.0017 is substantial). Given CI is passing and the code has been reviewed, it would be worth prioritizing a patch release so users aren't stuck working around the mish instability on NE hardware.
@ChinChangYang - Please rebase your changes on top of the latest main branch.
mish(-inf) is mathematically 0, but the exp-based formula produces NaN because exp(-inf)=0 leads to division by zero and -inf/finite=-inf. Clamping x to [-100, inf] before computation avoids this since mish(-100) ≈ 0 to full precision. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(force-pushed 56419eb to bd96119)
Rebased on top of the latest main branch.
I'm still not convinced there is an issue here. Your new unit test passes without your fix. Can you create a unit test which fails without the fix?
Reproduction steps (collapsed sections):
- Setup Conda
- Clone
- Build
- Run test

Post the output that is generated by the following commands, plus your hardware model and software versions.
Pulled from origin/mish-stability-baseline (9850e2c) while reproducing the steps in apple#2618 (comment 4248090200). Adds a Conv2d+Mish+Flatten+Linear model with fixed weights and linspace inputs at three scales to cover mish input intervals approximately [-0.9,0.9], [-31.5,31.5], and [-99,99] — exercising the large-negative-x regime where the softplus-based decomposition shows numerical error on Apple Neural Engine. https://claude.ai/code/session_011rzEeksHFoyTyUVPQL5DDQ
Rebased on top of the latest main branch.
I get "Exception: Unable to load CoreML.framework. Cannot make predictions." errors in your test:
Added the coremlpython make target (required for the prediction proxy on macOS; without it, pytest fails with "Unable to load CoreML.framework. Cannot make predictions.").
The algebraic reformulation is sound — factoring out the softplus avoids the log(1 + exp(x)) accumulation error on NE hardware. The -100 clamp is a reasonable practical bound given mish(-100) underflows to zero in float32, though it's worth a comment in the code explaining why that specific value was chosen rather than, say, log(FLT_EPSILON). Would also be good to verify the formula's numerical behavior near x=0 on NE, since that's where the division chain (e*(e+2)) is smallest and rounding could still bite.
Independent reproduction on M4 Pro (Mac mini, macOS 26.4.1)

Followed the reproduction steps exactly. On mish-stability-baseline (no fix), 6 failures with COMPUTE_UNITS=CPU_AND_NE:

FAILED test_mish_stability[...mlprogram, fp16, scale=3.5]

(plus 4 fp16 small-scale / fp32 cases that also failed — happy to share full output). After checking out reduce-mish-error: 6 passed.

Hardware: Apple M4 Pro, 24 GB, macOS 26.4.1 (build 25E253), Xcode CLT, Python 3.11.15, coremltools 9.0, torch 2.4.1.

Note for anyone reproducing: cmake defaulted to building x86_64 dylibs on my system, causing RuntimeError: BlobWriter not loaded.
The formula derivation is correct — factoring out the softplus is algebraically sound.
Numeric results were the same as AdamGibbons1982's.
The algebraic reformulation is clever — expressing mish without softplus avoids the precision loss that accumulates when computing log(1 + exp(x)) in low precision on Neural Engine hardware.
Fix the high numerical error in mish activation #2359.
Algorithm:
The input is clamped to [-100, inf] because mish(-inf) is mathematically 0, but the exp-based formula produces NaN when exp(-inf) = 0 leads to division by zero and -inf / finite = -inf. Since mish(-100) ≈ 0 to full precision, clamping at -100 avoids this edge case without affecting accuracy.

Evaluation:
In the following experiments, the mean absolute errors are evaluated by the method in #2359 (comment).
Before this change, NE generates high numerical error:
With the new algorithm, NE generates low numerical error:
Test Coverage:
Added test_mish_stability with fixed Conv2d weights (1.0) and fixed linspace inputs at three scales, producing known mish input intervals.

Results with the original softplus-based mish + CPU_AND_NE: failures.
Results with the new exp-based mish + CPU_AND_NE: all 6 tests PASS.

This confirms the error is Neural Engine specific and manifests once mish inputs reach the ±30 range on NE with FP16.
Conclusion:
Overall, the change enhances the accuracy and reliability of the mish activation in Core ML models.