Skip to content

libm: constant time fpclassifyf#1784

Open
xinitrcn1 wants to merge 1 commit into
managarm:masterfrom
xinitrcn1:patch-1
Open

libm: constant time fpclassifyf#1784
xinitrcn1 wants to merge 1 commit into
managarm:masterfrom
xinitrcn1:patch-1

Conversation

@xinitrcn1

@xinitrcn1 xinitrcn1 commented May 13, 2026

Copy link
Copy Markdown

Not tested with real world data.

tested over entire 2^32 range and it matches the original.

of course branchless != good

gcc 16.1

"__fpclassifyf_old":
        vmovd   edx, xmm0
        mov     eax, edx
        shr     eax, 23
        and     eax, 255
        je      .L8
        mov     ecx, 4
        cmp     eax, 255
        je      .L9
        mov     eax, ecx
        ret
.L9:
        sal     edx, 9
        mov     ecx, 1
        cmp     edx, 1
        sbb     ecx, -1
        mov     eax, ecx
        ret
.L8:
        add     edx, edx
        cmp     edx, 1
        sbb     ecx, ecx
        and     ecx, 8
        add     ecx, 8
        mov     eax, ecx
        ret
"__fpclassifyf_new":
        vmovd   ecx, xmm0
        mov     edx, 3
        mov     eax, ecx
        shr     eax, 23
        inc     eax
        cmp     al, dl
        cmovbe  edx, eax
        and     ecx, 8388607
        setne   al
        sal     eax, 2
        or      eax, edx
        mov     edx, 573645376
        movzx   eax, al
        sal     eax, 2
        sarx    edx, edx, eax
        mov     eax, 1
        and     edx, 15
        shlx    eax, eax, edx
        ret

clangarm

__fpclassifyf_old:
        fmov    w1, s0
        ubfx    x2, x1, 23, 8
        cbz     w2, .L8
        mov     w0, 4
        cmp     w2, 255
        beq     .L9
        ret
.L9:
        cmp     wzr, w1, lsl 9
        cset    w0, ne
        add     w0, w0, 1
        ret
.L8:
        cmp     wzr, w1, lsl 1
        mov     w0, 16
        mov     w1, 8
        csel    w0, w0, w1, eq
        ret
__fpclassifyf_new:
        fmov    w2, s0
        mov     w3, 1
        mov     w4, 3
        mov     w0, 8768
        movk    w0, 0x2231, lsl 16
        add     w1, w3, w2, lsr 23
        and     w1, w1, 255
        cmp     w1, 3
        csel    w1, w1, w4, ls
        tst     x2, 8388607
        cset    w2, ne
        orr     w1, w1, w2, lsl 2
        ubfiz   w1, w1, 2, 8
        asr     w0, w0, w1
        and     w0, w0, 15
        lsl     w0, w3, w0
        ret

@xinitrcn1

xinitrcn1 commented May 13, 2026

Copy link
Copy Markdown
Author

addendum:

isfinite(x)

can just be

INFINITY > fabs(x)

and

bool isfinite(float x) {
  union {float f; uint32_t i;} u = {x};
  u.i &= 0x7fffffff;
  return INFINITY > u.f;
}

should also dis-require the use of fpclassify() in the respective macro, unknown if x87 or m68k is cared about

@no92

no92 commented May 13, 2026

Copy link
Copy Markdown
Member

I'm currently thinking about how to integrate this; a couple of preliminary thoughts:

  • this function, much like most others in our libm, are imported from musl - if we add our own implementation, it would be best to move it outside of the musl-generic-math subdirectory.
  • The algorithm is quite clever, but could probably use some explanation as to what is going on. A few comments would probably suffice.
  • This looks like it could be generalized to also cover the double and even long double case using a templated function, but that'd require selecting a trait based on the type + width (for long double). Something like template <std::floating_point T, int MantDigits = std::numeric_limits<T>::digits> (untested)
  • I benchmarked this on my machine using different -march settings, and the (small) performance benefit of this requires x86-64-v3. Maybe this should be selected using Function multiversioning using attributes like __attribute__((target("avx,bmi2")))?
  • The lookup table should preferably be generated using a consteval function instead of hardcoding it - not only is that less brittle, but also better documentation for how it works :^)

@xinitrcn1

xinitrcn1 commented May 13, 2026

Copy link
Copy Markdown
Author

the algorithm:

have c be composed of bits "mii" (1 m bit, 2 i bits)

if exponent = 0x00 -> ii = 1
if exponent = 0xff -> ii = 0
accomplished via ii + 1 = [0x00 + 1 = 0x01, 0xff + 1 = 0x00]
if neither 0x00 or 0xff -> ii = 2 (this usually translates into cmovbe)

insert m bit simply checking that mantissa != 0

shift [by nibbles] a magic constant where each of the nibbles correspond to log2() of the corresponding FP classify define macro

issue is making this with log2() automagically with macros is quite painful, the easiest I could see is it being

#define X(x) ((x)==16?4:(x)==8?3:(x)==4?2:(x)==2?1:(x)==1?0:0)

then just construct by

    return 1 << ((0x22312240 >> (c * 4)) & 0xf);

#define X(x, s) ((((x)==16?4:(x)==8?3:(x)==4?2:(x)==2?1:(x)==1?0:0)) << (s*4))
    return 1 << (((X(1,0) | X(16,1) | X(4,2) | X(4,3) | X(2,4) | X(8, 5) | X(4,6) | X(4,7)) >> (c * 4)) & 0xf);
#undef x

I considered for double and long double but I worried if ia64 80-bit long double was to be accounted for

@no92

no92 commented May 13, 2026

Copy link
Copy Markdown
Member

I get how it works, it's just that it would be best to explain it in-band with comments in the code for posterity.

What I'm wondering is if this is worth the hassle to integrate - the improvement over the existing version is not that large (~20 % on my machine with appropriate -march), but on pre-x86_64-v3 it is more than 100 % slower. Switching between implementations would be possible with FMVs or ifuncs, but is the maintenance cost worth that?

@no92

no92 commented May 13, 2026

Copy link
Copy Markdown
Member

I have typed out an untested generic version of this earlier today: https://gist.github.com/no92/42903883ff18fa8ec92b7c1d643192d4

@xinitrcn1

Copy link
Copy Markdown
Author

there is extra uneeded movzx due to type widening on that implementation
additionally the arithmethic shift at the end isnt used, in this case shouldn't be an issue
either way

template <std::floating_point T>
constexpr int branchless_fpclassify(T x) {
	using traits = detail::fp_traits<T>;
	using U = typename traits::uint_type;
	U u = std::bit_cast<U>(x);
	uint8_t exp = (u >> traits::exp_shift) & traits::exp_mask;
	uint8_t c = (exp + 1) & traits::exp_mask;
	c = (c > 2) ? 2 : c;
	c |= ((u & traits::frac_mask) != 0) << 2;
	constexpr int32_t magic_table = detail::generate_fpclassify_table();
	return 1 << ((magic_table >> uint8_t(c * 4)) & 0xF);
}

would be a more faithful generic func

of course the question of "is this worth integrating for" is uh....

@xinitrcn1

Copy link
Copy Markdown
Author

Be of note that if FP_CLASSIFY constants were just 0,1,2,3 it would be far cheaper to do the last part (which mind you takes a lot of instructions to pull off)

@xinitrcn1

Copy link
Copy Markdown
Author

@no92

no92 commented May 13, 2026

Copy link
Copy Markdown
Member

Mind you that FP_ZERO does not fit in 4 bit. Changing FP_* macro values would be an ABI break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants