Conversation
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for these changes. cc @rust-lang/miri, @RalfJung, @oli-obk, @lcnr

Some changes occurred to the platform-builtins intrinsics. Make sure the codegen backends are adapted for these changes. cc @antoyo, @GuillaumeGomez, @bjorn3, @calebzulawski, @programmerjake
use crate::num::Wrapping;

// i.e. 0b100010001...0001 in binary.
const MASK: u128 = 0x1111_1111_1111_1111_1111_1111_1111_1111;
Suggested change:

-const MASK: u128 = 0x1111_1111_1111_1111_1111_1111_1111_1111;
+const MASK: u128 = !0 / 0xF;

seems easier to read
Is it? When I look at that expression I'd have to trust the comment that it does what it says. The 0x1111 literal is long but to me it's clearer.
My understanding was that
The tuple returning method is a possible future expansion, not intended to be currently implemented. I don't know the best way to implement the widening operation, but we can get the high-order bits using clmul as a primitive. In LLVM's arbitrary precision integer library:

clmulr(a, b) = bitreverse(clmul(bitreverse(a), bitreverse(b)))
clmulh(a, b) = clmulr(a, b) >> 1

https://github.com/llvm/llvm-project/blob/66ba9c9475c7628893f387f1b9d9d3056b05d995/llvm/include/llvm/ADT/APInt.h#L2457-L2476
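A minimal sketch of those identities in Rust, with a naive stand-in for the scalar clmul (the helper names are illustrative, not an existing API):

// Naive truncating carryless multiply: the low 64 bits of the 127-bit product.
fn clmul(a: u64, b: u64) -> u64 {
    let mut r = 0u64;
    for i in 0..64 {
        if (b >> i) & 1 != 0 {
            r ^= a << i;
        }
    }
    r
}

// "Reversed" carryless multiply, per the first identity above.
fn clmulr(a: u64, b: u64) -> u64 {
    clmul(a.reverse_bits(), b.reverse_bits()).reverse_bits()
}

// High 64 bits of the full carryless product, per the second identity.
fn clmulh(a: u64, b: u64) -> u64 {
    clmulr(a, b) >> 1
}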
Right, it's just the first widening function that would have that signature. If that is the plan then it's pretty simple: you just widen first and then do the non-widening multiplication.

This is running into CI sometimes using earlier LLVM releases without the new intrinsic. I asked in #t-compiler/llvm > Adding an LLVM 22 intrinsic how/if we can work around that.
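As a rough illustration of "widen first, then do the non-widening multiply" for the u16 case (the free function name is made up; carryless_mul is the unstable method this PR adds):

fn widening_carryless_mul_u16(a: u16, b: u16) -> u32 {
    // The full 16x16 carryless product has at most 31 bits, so it fits in a u32
    // and the truncating 32-bit carryless multiply already is the widening result.
    (a as u32).carryless_mul(b as u32)
}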
actually I expect you can probably just use:

; cleaned up slightly
define void @u128_carrying_mul(ptr sret([32 x i8]) align 16 %_0, i128 noundef %a, i128 noundef %b, i128 noundef %c) unnamed_addr #0 {
start:
%0 = zext i128 %a to i256
%1 = zext i128 %b to i256
%2 = zext i128 %c to i256
%3 = mul nuw i256 %1, %0
%4 = add nuw i256 %3, %2
%5 = trunc i256 %4 to i128
%6 = lshr i256 %4, 128
%7 = trunc nuw i256 %6 to i128
store i128 %5, ptr %_0, align 16
%_0.repack1 = getelementptr inbounds nuw i8, ptr %_0, i64 16
store i128 %7, ptr %_0.repack1, align 16
ret void
}
/// This approach uses a bitmask of the form `0b100010001...0001` to avoid carry spilling.
/// When carries do occur, they wind up in a "hole" of zeros and are subsequently masked
/// out of the result.
This approach with 4-bit digits works up to integers with 4 * 15 = 60 bits. Past that, one digit can overflow to the next.
For u64, it does actually work for this "non-widening" operation, since the top digit may be computed as 16, but there is no next digit that would be affected. The wide result would be erroneous however. E.g. x.carryless_mul(x) with x = MASK as u64 as u128.
The impl for u128::carryless_mul is currently incorrect for that reason. You could probably extend the approach to use 5-bit digits, but it's likely better to just implement it in terms of u64::carryless_mul.
some tests against a naive impl: playground
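As an illustration of the mask trick at a width where it is sound, here is a sketch for u32 (illustrative code, not the PR's actual fallback). Each partial product a_i * b_j only produces bits at positions congruent to i + j (mod 4), and each 4-bit digit of such a product is at most 8, so the carries stay inside the masked-off holes:

fn clmul_u32(a: u32, b: u32) -> u32 {
    const MASK: u32 = 0x1111_1111; // 0b0001_0001...0001
    let (a0, a1, a2, a3) = (a & MASK, a & (MASK << 1), a & (MASK << 2), a & (MASK << 3));
    let (b0, b1, b2, b3) = (b & MASK, b & (MASK << 1), b & (MASK << 2), b & (MASK << 3));
    // Partial products grouped by the residue class (mod 4) of their bit positions.
    let r0 = a0.wrapping_mul(b0) ^ a1.wrapping_mul(b3) ^ a2.wrapping_mul(b2) ^ a3.wrapping_mul(b1);
    let r1 = a0.wrapping_mul(b1) ^ a1.wrapping_mul(b0) ^ a2.wrapping_mul(b3) ^ a3.wrapping_mul(b2);
    let r2 = a0.wrapping_mul(b2) ^ a1.wrapping_mul(b1) ^ a2.wrapping_mul(b0) ^ a3.wrapping_mul(b3);
    let r3 = a0.wrapping_mul(b3) ^ a1.wrapping_mul(b2) ^ a2.wrapping_mul(b1) ^ a3.wrapping_mul(b0);
    // The carry bits land in positions that are masked out here.
    (r0 & MASK) | (r1 & (MASK << 1)) | (r2 & (MASK << 2)) | (r3 & (MASK << 3))
}

At u64 width each digit can reach 16, which is exactly the overflow described above.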
unsafe { a.unchecked_funnel_shr(b, shift) }
}

/// Carryless multiply.
Is this meant to be a self-contained description? To me these words mean nothing.^^
#[doc = concat!("let a = ", $clmul_lhs, stringify!($SelfT), ";")]
#[doc = concat!("let b = ", $clmul_rhs, stringify!($SelfT), ";")]
///
#[doc = concat!("assert_eq!(a.carryless_mul(b), ", $clmul_result, ");")]
The only example here uses huge numbers which means the example is entirely useless if one doesn't already know what this operation does. I for one still have no clue after reading all this.^^
Which carry is discarded? All the ones that show up when adding up the results of the first steps of long multiplication? Or also all the carries that occur within the long multiplication steps? This needs way more detail. Doesn't that mean the result depends on the base? No base is mentioned in the first 2 paragraphs, and even in the third paragraph it's only mentioned in an obscure way.
maybe describe it as equivalent to:
impl uN {
pub fn carryless_mul(self, other: Self) -> Self {
let mut retval = 0;
for i in 0..Self::BITS {
if (other >> i) & 1 != 0 {
retval ^= self << i;
}
}
retval
}
}
That is a good specification. I haven't really found/been able to come up with a good intuitive explanation, and even if you understand the mechanics, how/when to apply this function is still hard to see.
The only example here uses huge numbers
So in part that is to test that the implementation doesn't just mask off the upper bits. Because of how these functions are generated with a macro, getting examples to work with type inference and without overflowing literals is a bit finicky.
I'll try again to come up with something more insightful.
Regular long multiplication would be using += instead of ^=, right?
I have no idea why that would be called "carryless", but that does seem like a reasonable way to explain this, yeah. And maybe we should also do the const-eval/Miri implementation that way to avoid relying on the fallback impl that seems to want to be clever. We generally don't want to be clever for intrinsics in Miri.
Regular long multiplication would be using += instead of ^=, right?
Exactly.
It is carryless in the sense that the sum of two bits (x,y) produces a carry bit = x & y and the wrapped sum = x ^ y, so using ^ instead of + is just throwing out all the carries of bitwise addition. That is, consider the equivalence x + y = (x ^ y) + ((x & y) << 1).
The connection to polynomial multiplication is that there you naturally don't carry between the coefficients: (x + 1)(x + 1) = x^2 + 2x + 1, and the 2x doesn't just turn into x^2. Rather, if you're computing the coefficients modulo 2, you get 2x = 0x. Indeed, 0b11.carryless_mul(0b11) == 0b101, where the polynomials are represented by their evaluation at x = 2.
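To make the += vs ^= point concrete, a small illustrative comparison (not code from the PR):

fn long_mul(a: u32, b: u32) -> u32 {
    let mut r = 0u32;
    for i in 0..u32::BITS {
        if (b >> i) & 1 != 0 {
            r = r.wrapping_add(a << i); // ordinary shift-and-add: carries propagate
        }
    }
    r
}

fn carryless_long_mul(a: u32, b: u32) -> u32 {
    let mut r = 0u32;
    for i in 0..u32::BITS {
        if (b >> i) & 1 != 0 {
            r ^= a << i; // same loop, but every addition discards its carries
        }
    }
    r
}

fn main() {
    assert_eq!(long_mul(0b11, 0b11), 0b1001); // 3 * 3 = 9: the 2x term carries into x^2
    assert_eq!(carryless_long_mul(0b11, 0b11), 0b101); // coefficients taken mod 2
}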
maybe look at the latest C++ proposal for a bunch of good examples: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3642r3.html
fn carryless_mul(self, rhs: Self) -> Self {
// For u128 the 0b100010001...0001 trick above does not work, so we use an implementation
// that uses 64-bit carryless multiplication.
karatsuba_u128(self, rhs).1
you should be able to have a much more efficient implementation if you also have a carryless_mul_high (which could also be a useful intrinsic itself):
fixed version in #152132 (comment)
original broken version
(edit: the below carryless_mul_high has the overflow bug: https://github.com/rust-lang/rust/pull/152132/changes#r2766586213)
impl const CarrylessMul for u128 {
#[inline]
fn carryless_mul(self, rhs: Self) -> Self {
let l = u64::carryless_mul(self as u64, rhs as u64);
let lh = u64::carryless_mul(self as u64, (rhs >> 64) as u64);
let hl = u64::carryless_mul((self >> 64) as u64, rhs as u64);
let h = lh ^ hl ^ carryless_mul_high(self as u64, rhs as u64);
((h as u128) << 64) | l as u128
}
}
const fn carryless_mul_high(x: u64, y: u64) -> u64 {
const fn mul_h(x: u64, y: u64) -> u64 {
x.carrying_mul(y, 0).1
}
// i.e. 0b100010001...0001 in binary.
const MASK: u64 = 0x1111_1111_1111_1111u64;
const M0: u64 = MASK;
const M1: u64 = M0 << 1;
const M2: u64 = M1 << 1;
const M3: u64 = M2 << 1;
let x0 = x & M0;
let x1 = x & M1;
let x2 = x & M2;
let x3 = x & M3;
let y0 = y & M0;
let y1 = y & M1;
let y2 = y & M2;
let y3 = y & M3;
let z0 = mul_h(x0, y0) ^ mul_h(x1, y3) ^ mul_h(x2, y2) ^ mul_h(x3, y1);
let z1 = mul_h(x0, y1) ^ mul_h(x1, y0) ^ mul_h(x2, y3) ^ mul_h(x3, y2);
let z2 = mul_h(x0, y2) ^ mul_h(x1, y1) ^ mul_h(x2, y0) ^ mul_h(x3, y3);
let z3 = mul_h(x0, y3) ^ mul_h(x1, y2) ^ mul_h(x2, y1) ^ mul_h(x3, y0);
(z0 & M0) | (z1 & M1) | (z2 & M2) | (z3 & M3)
}
I think this won't quite work for u64 (or larger) because of the reasons in #152132 (comment)
Specifically, if in some mul_h(xi, yj) both inputs have 16 bits set (so xi == Mi and yj == Mj), then the polynomial product will have a coefficient of 16 at index 64 + i + j, which will carry a one to the coefficient at 68 + i + j. There might be some trick to make it work, but that sounds complex. I'd first consider implementing the widening version u32 X u32 -> u64, and then extending that to larger sizes with karatsuba.
edit: Oh, you had already noticed, and GH just doesn't automatically reload threads.
Anyway, I do agree that having either carryless_mul_high or widening_carryless_mul or some such at least internally is a good idea, since that is then easily extended to larger sizes.
ok, here's a fixed version that incorporates a check for whether the multiply would overflow the 4-bit parts:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2024&gist=e24ec0122afcca4b8b05239aa7f7c3c4
impl const CarrylessMul for u128 {
#[inline]
fn carryless_mul(self, rhs: Self) -> Self {
let l = u64::carryless_mul(self as u64, rhs as u64);
let lh = u64::carryless_mul(self as u64, (rhs >> 64) as u64);
let hl = u64::carryless_mul((self >> 64) as u64, rhs as u64);
let h = lh ^ hl ^ carryless_mul_high(self as u64, rhs as u64);
((h as u128) << 64) | l as u128
}
}
const fn carryless_mul_high(x: u64, y: u64) -> u64 {
// i.e. 0b100010001...0001 in binary.
const MASK: u64 = 0x1111_1111_1111_1111u64;
const M0: u64 = MASK;
const M1: u64 = M0 << 1;
const M2: u64 = M1 << 1;
const M3: u64 = M2 << 1;
macro_rules! mul {
($x_mask_shift:literal, $y_mask_shift:literal) => {{
let x = x & (MASK << $x_mask_shift);
let y = y & (MASK << $y_mask_shift);
if x == MASK << $x_mask_shift && y == MASK << $y_mask_shift {
// only case where the multiply overflows the 4-bit parts
0x0101_0101_0101_0101u64 << ($x_mask_shift + $y_mask_shift)
} else {
x.carrying_mul(y, 0).1
}
}};
}
let z0 = mul!(0, 0) ^ mul!(1, 3) ^ mul!(2, 2) ^ mul!(3, 1);
let z1 = mul!(0, 1) ^ mul!(1, 0) ^ mul!(2, 3) ^ mul!(3, 2);
let z2 = mul!(0, 2) ^ mul!(1, 1) ^ mul!(2, 0) ^ mul!(3, 3);
let z3 = mul!(0, 3) ^ mul!(1, 2) ^ mul!(2, 1) ^ mul!(3, 0);
(z0 & M0) | (z1 & M1) | (z2 & M2) | (z3 & M3)
}
Huh, that should indeed work. I do wonder how much the codegen suffers from the extra checks, but I suppose that's something to examine with actual benchmarks.
if the ifs are replaced with select_unpredictable, it seems pretty reasonable: https://rust.godbolt.org/z/9q9oTf9T3
The PR uses this version now; the karatsuba construction might still be useful if we do want the full 256-bit result (which apparently is what crypto libraries actually use?)
Yes, for cryptography GHASH/POLYVAL compute a 256-bit product then reduce it to 128-bits.
However, give us a clmul+clmul2/pmull+pmull2-style API where we can extract the low and high 128-bit parts of the product, and we're fine with that and can do Karatsuba ourselves. (We actually break Karatsuba down into two steps for each of our CLMUL/PMULL backends so we can defer the final product assembly and the subsequent Montgomery reduction as part of a powers-of-H optimization for long messages.)
Edit: apparently there's also an alternative to Karatsuba when using CLMUL/PMULL I need to look into
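For reference, a rough sketch of the Karatsuba step for a full 128x128 -> 256-bit carryless product, assuming a widening 64-bit helper (all names are illustrative, not the PR's karatsuba_u128). Since "addition" of carryless products is XOR, the usual middle-term subtraction can never borrow:

// Assumed helper: naive 64x64 -> 128-bit carryless product, returned as (low, high).
fn widening_clmul_u64(x: u64, y: u64) -> (u64, u64) {
    let mut acc: u128 = 0;
    for i in 0..64 {
        if (y >> i) & 1 != 0 {
            acc ^= (x as u128) << i;
        }
    }
    (acc as u64, (acc >> 64) as u64)
}

// Karatsuba over GF(2): three half-width products instead of four.
fn widening_clmul_u128(a: u128, b: u128) -> (u128, u128) {
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);
    let (ll_lo, ll_hi) = widening_clmul_u64(a0, b0);
    let (hh_lo, hh_hi) = widening_clmul_u64(a1, b1);
    let (mm_lo, mm_hi) = widening_clmul_u64(a0 ^ a1, b0 ^ b1);
    let ll = (ll_lo as u128) | ((ll_hi as u128) << 64);
    let hh = (hh_lo as u128) | ((hh_hi as u128) << 64);
    // Middle term: (a0 + a1)(b0 + b1) - a0*b0 - a1*b1, with + and - both being XOR.
    let mid = ((mm_lo as u128) | ((mm_hi as u128) << 64)) ^ ll ^ hh;
    // Assemble the 256-bit result as (low 128 bits, high 128 bits).
    (ll ^ (mid << 64), hh ^ (mid >> 64))
}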
@traviscross you suggested adding carrying_carryless_mul, what signature did you intend it to have? Today we have:
pub const fn carrying_mul(self, rhs: u16, carry: u16) -> (u16, u16)
pub const fn widening_mul(self, rhs: u16) -> (u16, u16)
Now this PR adds widening_carryless_mul with a non-standard signature, at T-libs-api's request:
pub const fn widening_carryless_mul(self, rhs: u16) -> u32
I'm not sure how much sense carrying_carryless_mul(self, rhs: u16, carry: u16) -> (u16, u16) makes? Generally that carry would just be 0? Maybe it does have applications I'm missing though.
carrying_carryless_mul is for when you want a bigint carryless-mul:
fn carryless_mul_u1024_by_u64(a: &mut [u64; 16], b: u64) {
let mut carry = 0;
for a in a {
(*a, carry) = a.carrying_carryless_mul(b, carry);
}
}

it's just like:
fn mul_u1024_by_u64(a: &mut [u64; 16], b: u64) {
let mut carry = 0;
for a in a {
(*a, carry) = a.carrying_mul(b, carry);
}
}
ah, the carry is just a convenience really. Makes sense, I added an implementation now. u128 needs some special care, and currently LLVM will not optimize it to use a 128-bit or 256-bit clmul. So we might still want the -> (Self, Self) standard widening variant as an intrinsic.
}

#[rustc_const_unstable(feature = "core_intrinsics_fallbacks", issue = "none")]
const fn karatsuba_u128(h: u128, y: u128) -> (u128, u128)
Firstly, this is returning (high, low), whereas e.g. uN::carrying_mul uses (low, high), which I find preferable. Either way, a comment noting that wouldn't hurt.
Secondly, I think this could be improved by outlining the definition of the widening clmul for the half-sized integer. Let's say
const fn widening_clmul_u64(x: u64, y: u64) -> (u64, u64) {
let lo = x.carryless_mul(y);
let xr = x.reverse_bits();
let yr = y.reverse_bits();
let hi = xr.carryless_mul(yr).reverse_bits() >> 1;
(lo, hi)
}

Finally, computing the high bits of CLMUL this way may be quite slow on targets without a fast bit-reversal.
Finally, computing the high bits of CLMUL this way may be quite slow on targets without a fast bit-reversal.
FWIW BearSSL also implemented its own SWAR-style bit reversal:
(See also rev32)
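For illustration, a generic SWAR-style 32-bit bit reversal (not BearSSL's actual code; it is functionally the same as u32::reverse_bits):

fn rev32(mut x: u32) -> u32 {
    x = ((x & 0x5555_5555) << 1) | ((x >> 1) & 0x5555_5555); // swap adjacent bits
    x = ((x & 0x3333_3333) << 2) | ((x >> 2) & 0x3333_3333); // swap 2-bit pairs
    x = ((x & 0x0F0F_0F0F) << 4) | ((x >> 4) & 0x0F0F_0F0F); // swap nibbles
    x.swap_bytes() // finally reverse the byte order
}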
Finally, computing the high bits of CLMUL this way may be quite slow on targets without a fast bit-reversal.
There isn't that much we can do about that though, right? We already let LLVM decide the codegen in most cases; for miri/const evaluation we're using the simpler implementation, so it's really just gcc and cranelift where this implementation would even be used.
I added
tracking issue: #152080
ACP: rust-lang/libs-team#738
This defers to LLVM's llvm.clmul when available, and otherwise falls back to a method from the polyval crate (link). Some things are missing, which I think we can defer:

- const (I think I ran into a bootstrapping issue). That is fine for now, I think.
- in stdarch we can't really use this intrinsic at the moment; we'd only want the scalar version to replace some riscv intrinsics.